
Tell HN: We should snapshot a mostly AI output free version of the web

136 points by jacquesm about 1 year ago
While we can, and if it isn't too late already. The web is overrun with AI-generated drivel; I've been searching for information on some widely varying subjects and I keep landing in recently auto-generated junk. Unfortunately, most search engines associate 'recency' with 'quality' or 'relevance', and that is very much no longer true.

While there is still a chance, I think we should snapshot a version of the web and make it publicly available. That can serve as something to calibrate various information sources against, to get an idea of whether or not they should be used. I'm pretty sure Google, OpenAI and Facebook all have such snapshots stashed away that they train their AIs on, and such data will rapidly become as precious as 'low-background steel'.

https://en.wikipedia.org/wiki/Low-background_steel

27 comments

simonw about 1 year ago
Sounds like you want Common Crawl - they have snapshots going back to 2013, take your pick: https://data.commoncrawl.org/crawl-data/index.html

(A semi-ironic detail: Common Crawl is one of the most common sources used as part of the training data for LLMs.)
vitovito about 1 year ago
2024 might already be too late, since this sentiment has been shared since at least 2021:

2021: https://twitter.com/jackclarkSF/status/1376304266667651078

2022: https://twitter.com/william_g_ray/status/1583574265513017344

2022: https://twitter.com/mtrc/status/1599725875280257024

Common Crawl and the Internet Archive crawls are probably the two most ready sources for this; you just have to define where you want to draw the line.

Common Crawl's first crawl of 2020 contains 3.1B pages and is around 100TB: https://data.commoncrawl.org/crawl-data/CC-MAIN-2020-05/index.html with their previous and subsequent crawls listed in the dropdown here: https://commoncrawl.org/overview

Internet Archive's crawls are here: https://archive.org/details/web organized by source. Wide Crawl 18 is from mid-2021 and is 68.5TB: https://archive.org/details/wide00018. Wide Crawl 17 was from late 2018 and is 644.4TB: https://archive.org/details/wide00017
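For readers who want to actually pull pre-cutoff captures out of a specific crawl like CC-MAIN-2020-05, Common Crawl exposes a per-crawl CDX index over HTTP that returns one JSON object per line. Here is a minimal, hedged sketch; the helper names are my own invention, and the field handling assumes the commonly documented CDX output fields (`url`, `timestamp`, `filename`, `offset`, `length`):

```python
import json
from urllib.parse import urlencode

# Base host for Common Crawl's CDX index service.
CDX_HOST = "https://index.commoncrawl.org"

def cdx_query_url(crawl_id, url_pattern):
    """Build a CDX index query URL for one crawl snapshot.

    Pinning a crawl ID such as "CC-MAIN-2020-05" is an easy way to
    query only captures taken before a chosen cutoff date.
    """
    qs = urlencode({"url": url_pattern, "output": "json"})
    return f"{CDX_HOST}/{crawl_id}-index?{qs}"

def parse_cdx_lines(body):
    """Parse the JSON-lines response body, keeping the fields needed
    to later fetch the capture bytes from the WARC archives."""
    records = []
    for line in body.strip().splitlines():
        rec = json.loads(line)
        records.append({
            "url": rec["url"],
            "timestamp": rec["timestamp"],
            "warc": rec["filename"],
            "offset": int(rec["offset"]),
            "length": int(rec["length"]),
        })
    return records
```

The `warc`/`offset`/`length` triple is what lets you do a ranged HTTP GET against the crawl data bucket to retrieve just that one capture, rather than downloading a 100TB crawl.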
talldayo about 1 year ago
> I'm pretty sure Google, OpenAI and Facebook all have such snapshots stashed away that they train their AIs on

They probably just use publicly available resources like The Pile. If newer training material becomes unusable for whatever reason, the old stuff still exists.

Paradoxically, I think a lot of research is showing that synthetic training information can be just as good as the real stuff. We may stumble upon an even stranger scenario where AI-generated content is more conducive to training than human content is.
uyzstvqs about 1 year ago
I've posted this recently on another post as well, but before AI-generated spam there was content-farm spam. This has been increasing in search results and on social networking sites for years now.

The solution is sticking to the websites you trust. And LLMs and RAG can actually make for a really good, very relevant search engine.
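The "LLMs and RAG over trusted sites" idea hinges on the retrieval step: rank a hand-picked whitelist of pages against the query, then hand the top hits to the model. A toy, stdlib-only sketch of that ranking step, using TF-IDF scoring (all document IDs and texts below are made up for illustration):

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

class TinyIndex:
    """Toy TF-IDF index over a hand-picked set of trusted documents,
    standing in for the retrieval half of a RAG search engine."""

    def __init__(self, docs):
        # docs: {doc_id: text}
        self.docs = docs
        self.tfs = {d: Counter(tokenize(t)) for d, t in docs.items()}
        n = len(docs)
        df = Counter()
        for tf in self.tfs.values():
            df.update(tf.keys())
        # Rarer terms across the corpus get higher weight.
        self.idf = {w: math.log(n / c) + 1.0 for w, c in df.items()}

    def score(self, query, doc_id):
        tf = self.tfs[doc_id]
        return sum(tf[w] * self.idf.get(w, 0.0) for w in tokenize(query))

    def search(self, query, k=3):
        ranked = sorted(self.docs, key=lambda d: self.score(query, d), reverse=True)
        return ranked[:k]

# Hypothetical whitelist of trusted pages.
trusted = {
    "lwn.net/123": "kernel scheduler changes in the latest release",
    "arxiv.org/abs/0000": "synthetic data for training language models",
    "commoncrawl.org/overview": "common crawl publishes monthly web snapshots",
}
index = TinyIndex(trusted)
```

A real system would swap the TF-IDF scorer for embeddings and feed the retrieved text into an LLM prompt, but the trust property comes entirely from curating `trusted`, not from the model.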
potatoman22 about 1 year ago
I feel like archive.org and The Pile have this covered, no?
Zenzero about 1 year ago
This implies that the pre-AI internet wasn't already overrun with SEO-optimized junk. Much of the internet is not worth preserving.
skybrian about 1 year ago
SEO content farms have been publishing for decades now.
signaru about 1 year ago
Alternatively, searching has to change. The non-AI content doesn't necessarily disappear, but it is gradually becoming "hidden gems". Something like Marginalia, which does this for SEO noise, would be nice.
jdswain about 1 year ago
At least I think I can tell when I'm reading AI-generated content, and stop reading and go somewhere else. Eventually, though, it'll get better to the point where it'll be hard to tell, but maybe then it's also good enough to be worth reading?
anigbrowl about 1 year ago
I don't really have this problem because I habitually use the Tools option on Google (or the equivalent on other search engines like DDG) to only return information from before a certain date. It's not flawless, as some media companies use a more or less static URL that they update frequently, but SEO-optimizers like this are generally pretty easy to screen out.

That said, it's a problem, even if it's just the latest iteration of an older problem like content farming, article spinners and so on. I've said for years that spam is the ultimate cancer and that the tech community's general indifference to spam and scams will be its downfall.
aaronblohowiak about 1 year ago
Internet archive?
neilk about 1 year ago
Using "before:2023" in your Google query helps. For now.

A few months ago, Lispi314 made a very interesting suggestion: an index of the ad-free internet. If you can filter ads and affiliate links, then spam is harder to monetize.

https://udongein.xyz/notice/AcwmRcIzxOLmrSamum

There are some obvious problems with it, but I think I'd still like to see what that would look like.
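The "before:2023" trick can also be reproduced offline when working with archived captures: web-archive tooling commonly stamps each capture with a 14-digit YYYYMMDDhhmmss timestamp, so a date cutoff is a one-line comparison. A small sketch (the record shape here is hypothetical):

```python
from datetime import datetime

def before(records, cutoff):
    """Keep only captures taken strictly before the cutoff date,
    mimicking a search engine's 'before:' query operator.

    records: iterable of dicts with a 14-digit "timestamp" field
    cutoff:  an ISO date string, e.g. "2023-01-01"
    """
    cut = datetime.strptime(cutoff, "%Y-%m-%d")
    kept = []
    for rec in records:
        ts = datetime.strptime(rec["timestamp"], "%Y%m%d%H%M%S")
        if ts < cut:
            kept.append(rec)
    return kept
```

Because both timestamp forms sort lexicographically, a plain string comparison against "20230101000000" would work too; parsing to `datetime` just makes the intent explicit and validates the input.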
giantg2 about 1 year ago
Sure, we can take a snapshot of our bot-filled web today before it goes true AI. Not sure what the real benefit would be.
dudus about 1 year ago
I have a sliver of hope AI generated content will actually be good one day. Just like I believe automated cars will be better than humans. I have nothing against reading content that was written by AI, for some of my reading.
ccgreg about 1 year ago
I've been giving talks about Common Crawl for the last year with a slide about exactly this, using low-background steel as an example.
greyzor7 about 1 year ago
That's what archive.org already does, but if you want to re-implement it, you would have to crawl the whole web and eventually save thumbnails of pages with ScreenshotOne (https://microlaunch.net/p/screenshotone)
wseqyrku about 1 year ago
> recently auto-generated junk

This would only apply to the pre-AGI era, though.
MattGaiser about 1 year ago
Is this really all that different from the procedurally generated drivel or the offshore-freelance copy/paste drivel?

I find that I get a lot more AI content, but it mostly displaced the original freelancer/procedurally generated spam.
metadat about 1 year ago
Reality is a mess in a lot of ways. Unfortunately, in this case, it's a bit late.

Wouldn't it be nice if Elgoog, OpenAI, or Character.ai published this dataset, considering they definitely have it, and they also caused this issue.

I'm not holding my breath.
jamesy0ung about 1 year ago
Internet Archive exists for webpages
acheron about 1 year ago
The web has been overrun by drivel for over two decades now.
mceoin about 1 year ago
Isn’t this common crawl?
RecycledEle about 1 year ago
It's way too late.
LorenDB about 1 year ago
r/Datahoarder probably already has you covered.
fuzztester about 1 year ago
The same seems to have been happening on HN for the last several months.

I had actually posted a question about this around that time, but the only reply I got was from a guy saying it's not likely, because the HN hive mind would drive down such posts.

Not sure if he was right, because I still see evidence of such stuff.
alpenbazi about 1 year ago
yes
keepamovin about 1 year ago
Embrace it. Stop living in the past, Gatsby. Just ask ChatGPT for the answers you seek. Hahaha! :)

What are you searching for anyway??