While we can, and if it isn't too late already: the web is overrun with AI-generated drivel. I've been searching for information on some widely varying subjects and I keep landing in recently auto-generated junk. Unfortunately most search engines equate 'recency' with 'quality' or 'relevance', and that is very much no longer true.<p>While there is still a chance, I think we should snapshot a version of the web and make it publicly available. That could serve as a baseline to calibrate various information sources against, to get an idea of whether or not they are worth using. I'm pretty sure Google, OpenAI and Facebook all have such snapshots stashed away that they train their AIs on, and such data will rapidly become as precious as 'low-background steel'.<p>https://en.wikipedia.org/wiki/Low-background_steel
Sounds like you want Common Crawl - they have snapshots going back to 2013, take your pick: <a href="https://data.commoncrawl.org/crawl-data/index.html" rel="nofollow">https://data.commoncrawl.org/crawl-data/index.html</a><p>(A semi-ironic detail: Common Crawl is one of the most common sources used as part of the training data for LLMs)
2024 might already be too late, since this sentiment has been shared since at least 2021:<p>2021: <a href="https://twitter.com/jackclarkSF/status/1376304266667651078" rel="nofollow">https://twitter.com/jackclarkSF/status/1376304266667651078</a><p>2022: <a href="https://twitter.com/william_g_ray/status/1583574265513017344" rel="nofollow">https://twitter.com/william_g_ray/status/1583574265513017344</a><p>2022: <a href="https://twitter.com/mtrc/status/1599725875280257024" rel="nofollow">https://twitter.com/mtrc/status/1599725875280257024</a><p>Common Crawl and the Internet Archive crawls are probably the two most ready sources for this; you just have to decide where you want to draw the line.<p>Common Crawl's first crawl of 2020 contains 3.1B pages and is around 100TB: <a href="https://data.commoncrawl.org/crawl-data/CC-MAIN-2020-05/index.html" rel="nofollow">https://data.commoncrawl.org/crawl-data/CC-MAIN-2020-05/inde...</a> with their previous and subsequent crawls listed in the dropdown here: <a href="https://commoncrawl.org/overview" rel="nofollow">https://commoncrawl.org/overview</a><p>Internet Archive's crawls are here: <a href="https://archive.org/details/web" rel="nofollow">https://archive.org/details/web</a> organized by source. Wide Crawl 18 is from mid-2021 and is 68.5TB: <a href="https://archive.org/details/wide00018" rel="nofollow">https://archive.org/details/wide00018</a>. Wide Crawl 17 was from late 2018 and is 644.4TB: <a href="https://archive.org/details/wide00017" rel="nofollow">https://archive.org/details/wide00017</a>
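For anyone who wants to actually pull a slice of one of these crawls, here's a minimal sketch. The `warc.paths.gz` URL layout follows Common Crawl's published naming scheme, but the CDX sample line below is illustrative, not a real record; verify the exact field format against their documentation.

```python
import json

CC_BASE = "https://data.commoncrawl.org"

def warc_paths_url(crawl_id: str) -> str:
    """URL of the gzipped list of WARC file paths for one crawl,
    per Common Crawl's layout: crawl-data/<CRAWL-ID>/warc.paths.gz"""
    return f"{CC_BASE}/crawl-data/{crawl_id}/warc.paths.gz"

def parse_cdx_line(line: str) -> dict:
    """Parse one line of Common Crawl's CDX index output.

    Each line is '<surt-url> <timestamp> <json-blob>'; the JSON blob
    carries the WARC filename plus byte offset/length, which let you
    fetch a single page with an HTTP Range request."""
    _surt, _ts, blob = line.split(" ", 2)
    return json.loads(blob)

# Illustrative CDX line -- shape per Common Crawl's index docs,
# values made up for this example:
sample = ('org,example)/ 20200118000000 '
          '{"url": "https://example.org/", "filename": '
          '"crawl-data/CC-MAIN-2020-05/segments/x/warc/y.warc.gz", '
          '"offset": "1234", "length": "5678"}')

record = parse_cdx_line(sample)
print(warc_paths_url("CC-MAIN-2020-05"))
print(record["offset"], record["length"])
```

From there it's one ranged GET per record, so you can sample a crawl without downloading all 100TB.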
> I'm pretty sure Google, OpenAI and Facebook all have such snapshots stashed away that they train their AIs on<p>They probably just use publicly available resources like The Pile. If newer training material becomes unusable for whatever reason, the old stuff still exists.<p>Paradoxically, a lot of research seems to be showing that synthetic training data can be just as good as the real stuff. We may stumble into an even stranger scenario where AI-generated content is more conducive to training than human-written content.
I've posted this recently on another post as well, but before AI-generated spam there was content-farm spam, and it has been piling up in search results and on social networking sites for years.<p>The solution is sticking to the websites you trust. And LLMs plus RAG can actually make for a really good, very relevant search engine.
Alternatively, searching has to change. The non-AI content doesn't necessarily disappear, but it is gradually becoming a collection of "hidden gems". Something like Marginalia, which already does this for SEO noise, would be nice.
For now I think I can tell when I'm reading AI-generated content, and I stop reading and go somewhere else. Eventually it'll get good enough that it's hard to tell, but maybe by then it will also be good enough to be worth reading?
I don't really have this problem because I habitually use the Tools option on Google (or the equivalent on other search engines like DDG) to only return information from before a certain date. It's not flawless, as some media companies use a more or less static URL that they update frequently, but SEO-optimizers like this are generally pretty easy to screen out.<p>That said, it's a problem, even if it's just the latest iteration of an older problem like content farming, article spinners and so on. I've said for years that spam is the ultimate cancer and that the tech community's general indifference to spam and scams will be its downfall.
Using "before:2023" in your Google query helps. For now.<p>A few months ago, Lispi314 made a very interesting suggestion: an index of the ad-free internet. If you can filter ads and affiliate links then spam is harder to monetize.<p><a href="https://udongein.xyz/notice/AcwmRcIzxOLmrSamum" rel="nofollow">https://udongein.xyz/notice/AcwmRcIzxOLmrSamum</a><p>There are some obvious problems with it, but I think I'd still like to see what that would look like.
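The same cutoff idea is trivial to apply programmatically to any list of dated results; a minimal sketch (the result structure and URLs here are invented for illustration):

```python
from datetime import date

def before_cutoff(results, cutoff):
    """Keep only results dated strictly before the cutoff,
    mimicking Google's 'before:' operator on local data."""
    return [r for r in results if r["date"] < cutoff]

results = [
    {"url": "https://example.org/old-post", "date": date(2019, 4, 2)},
    {"url": "https://example.org/ai-slop", "date": date(2024, 6, 1)},
]

kept = before_cutoff(results, date(2023, 1, 1))
print([r["url"] for r in kept])  # → ['https://example.org/old-post']
```

The caveat from upthread applies: this only works when the date attached to a page is honest, which frequently-updated static URLs defeat.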
I have a sliver of hope that AI-generated content will actually be good one day, just as I believe automated cars will one day be better than human drivers. I have nothing against reading content written by an AI, at least for some of my reading.
That's what archive.org already does, but if you want to re-implement it you would have to crawl the entire web yourself, and perhaps save thumbnails of pages with a service like ScreenshotOne (<a href="https://microlaunch.net/p/screenshotone" rel="nofollow">https://microlaunch.net/p/screenshotone</a>).
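The crawl step itself can be sketched with nothing but the standard library. This only shows frontier expansion (extracting the links to follow next) on a static snippet; fetching, politeness/robots.txt, and thumbnailing are all left out:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect absolute hrefs from <a> tags -- the frontier-expansion
    step at the core of any web crawler."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's URL.
                    self.links.append(urljoin(self.base_url, value))

html = '<p><a href="/about">About</a> <a href="https://example.net/">Ext</a></p>'
parser = LinkExtractor("https://example.org/")
parser.feed(html)
print(parser.links)  # → ['https://example.org/about', 'https://example.net/']
```

Each extracted link goes back into the queue; the hard part isn't this loop, it's doing it at web scale with deduplication and storage.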
Is this really all that different from the procedurally generated drivel, or the copy/paste drivel from offshore freelancers?<p>I find that I get a lot more AI content now, but it has mostly displaced the original freelancer and procedurally generated spam.
Reality is a mess in a lot of ways.
Unfortunately, in this case it's a bit late.<p>Wouldn't it be nice if Elgoog, OpenAI, or Character.ai published such a dataset, considering they definitely have it, and they caused this problem in the first place.<p>I'm not holding my breath.
The same seems to have been happening on HN for the last several months.<p>I actually posted a question about this around that time, but the only reply I got was from someone saying it was unlikely, because the HN hive mind would downvote such posts.<p>I'm not sure he was right, because I still see evidence of such content.