
Tell HN: We should snapshot a mostly AI output free version of the web

136 points by jacquesm, about 1 year ago

While we can, and if it isn't too late already. The web is overrun with AI-generated drivel. I've been searching for information on some widely varying subjects and I keep landing in recently auto-generated junk. Unfortunately, most search engines associate 'recency' with 'quality' or 'relevance', and that is very much no longer true.

While there is still a chance, I think we should snapshot a version of the web and make it publicly available. That can serve as something to calibrate various information sources against, to get an idea of whether or not they are to be used. I'm pretty sure Google, OpenAI and Facebook all have such snapshots stashed away that they train their AIs on, and such data will rapidly become as precious as 'low-background steel'.

https://en.wikipedia.org/wiki/Low-background_steel

27 comments

simonw, about 1 year ago

Sounds like you want Common Crawl. They have snapshots going back to 2013; take your pick: https://data.commoncrawl.org/crawl-data/index.html

(A semi-ironic detail: Common Crawl is one of the most common sources used as part of the training data for LLMs.)
vitovito, about 1 year ago

2024 might already be too late, since this sentiment has been shared since at least 2021:

2021: https://twitter.com/jackclarkSF/status/1376304266667651078

2022: https://twitter.com/william_g_ray/status/1583574265513017344

2022: https://twitter.com/mtrc/status/1599725875280257024

Common Crawl and the Internet Archive crawls are probably the two most ready sources for this; you just have to define where you want to draw the line.

Common Crawl's first crawl of 2020 contains 3.1B pages and is around 100TB: https://data.commoncrawl.org/crawl-data/CC-MAIN-2020-05/index.html with their previous and subsequent crawls listed in the dropdown here: https://commoncrawl.org/overview

Internet Archive's crawls are here: https://archive.org/details/web organized by source. Wide Crawl 18 is from mid-2021 and is 68.5TB: https://archive.org/details/wide00018. Wide Crawl 17 was from late 2018 and is 644.4TB: https://archive.org/details/wide00017
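For anyone who wants to check whether a page is present in one of these snapshots, each Common Crawl crawl exposes a CDX index endpoint (e.g. https://index.commoncrawl.org/CC-MAIN-2020-05-index) that returns JSON-lines records. A minimal sketch of building such a query and parsing a record; the sample record below is illustrative rather than fetched, and the exact field set may vary by crawl:

```python
import json
from urllib.parse import urlencode

def cdx_query_url(crawl: str, url_pattern: str) -> str:
    """Build a CDX index query for one Common Crawl snapshot."""
    base = f"https://index.commoncrawl.org/{crawl}-index"
    return base + "?" + urlencode({"url": url_pattern, "output": "json"})

def parse_cdx_record(line: str) -> dict:
    """Each response line is one JSON object describing a capture."""
    return json.loads(line)

query = cdx_query_url("CC-MAIN-2020-05", "example.com/*")

# Illustrative record shape (not a real fetched response):
sample = ('{"urlkey": "com,example)/", "timestamp": "20200119000000", '
          '"url": "https://example.com/", "status": "200", '
          '"filename": "crawl-data/CC-MAIN-2020-05/segments/.../foo.warc.gz", '
          '"offset": "1234", "length": "5678"}')
record = parse_cdx_record(sample)
print(query)
print(record["timestamp"], record["url"])
```

The `filename`/`offset`/`length` fields let you range-request the individual WARC record out of the crawl data without downloading a whole segment.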
talldayo, about 1 year ago

> I'm pretty sure Google, OpenAI and Facebook all have such snapshots stashed away that they train their AIs on

They probably just use publicly-available resources like The Pile. If newer training material becomes unusable for whatever reason, the old stuff still exists.

Paradoxically, I think a lot of research is showing that synthetic training data can be just as good as the real stuff. We may stumble upon an even stranger scenario where AI-generated content is more conducive to training than human content is.
uyzstvqs, about 1 year ago

I've posted this recently on another post as well, but before AI-generated spam there was content-farm spam. This has been increasing in search results and on social networking sites for years now.

The solution is sticking to the websites you trust. And LLMs and RAG can actually make for a really good, very relevant search engine.
potatoman22, about 1 year ago

I feel like archive.org and The Pile have this covered, no?
Zenzero, about 1 year ago

This implies that the pre-AI internet wasn't already overrun with SEO-optimized junk. Much of the internet is not worth preserving.
skybrian, about 1 year ago

SEO content farms have been publishing for decades now.
signaru, about 1 year ago

Alternatively, searching has to change. The non-AI content doesn't necessarily disappear, but it is gradually becoming "hidden gems". Something like Marginalia, which does this for SEO noise, would be nice.
jdswain, about 1 year ago

At least I think I can tell when I'm reading AI-generated content, and stop reading and go somewhere else. Eventually, though, it'll get good enough that it's hard to tell, but maybe then it's also good enough to be worth reading?
anigbrowl, about 1 year ago

I don't really have this problem, because I habitually use the Tools option on Google (or the equivalent on other search engines like DDG) to only return information from before a certain date. It's not flawless, as some media companies use a more or less static URL that they update frequently, but SEO optimizers like this are generally pretty easy to screen out.

That said, it's a problem, even if it's just the latest iteration of an older problem like content farming, article spinners and so on. I've said for years that spam is the ultimate cancer, and that the tech community's general indifference to spam and scams will be its downfall.
aaronblohowiak, about 1 year ago

Internet Archive?
neilk, about 1 year ago

Using "before:2023" in your Google query helps. For now.

A few months ago, Lispi314 made a very interesting suggestion: an index of the ad-free internet. If you can filter out ads and affiliate links, then spam is harder to monetize.

https://udongein.xyz/notice/AcwmRcIzxOLmrSamum

There are some obvious problems with it, but I think I'd still like to see what that would look like.
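The filtering half of that idea can be prototyped cheaply: score a page by the fraction of its outbound links that match known ad or affiliate URL patterns. A rough sketch; the pattern list here is a hypothetical, far-from-complete example, not a vetted blocklist:

```python
import re

# Hypothetical, far-from-complete ad/affiliate URL patterns.
AFFILIATE_PATTERNS = [
    re.compile(r"amazon\.[a-z.]+/.*[?&]tag="),  # Amazon Associates tag
    re.compile(r"[?&]aff(_?id|iliate)?="),      # generic affiliate params
    re.compile(r"doubleclick\.net"),            # ad-network domain
    re.compile(r"[?&]utm_medium=affiliate"),
]

def affiliate_link_ratio(links: list) -> float:
    """Fraction of links that match an ad/affiliate pattern."""
    if not links:
        return 0.0
    hits = sum(1 for link in links
               if any(p.search(link) for p in AFFILIATE_PATTERNS))
    return hits / len(links)

links = [
    "https://example.org/article",
    "https://www.amazon.com/dp/B000?tag=somesite-20",
    "https://blog.example.net/post?aff_id=42",
]
ratio = affiliate_link_ratio(links)
print(f"{ratio:.2f}")  # 2 of 3 sample links look like affiliate links
```

An indexer could drop or down-rank pages above some threshold ratio; the obvious problems neilk alludes to (legitimate sites carrying ads, spammers cloaking their links) live in choosing that threshold and keeping the pattern list honest.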
giantg2, about 1 year ago

Sure, we can take a snapshot of our bot-filled web today before it goes true AI. Not sure what the real benefit would be.
dudus, about 1 year ago

I have a sliver of hope that AI-generated content will actually be good one day, just like I believe automated cars will be better than human drivers. I have nothing against reading content written by AI, for some of my reading.
ccgreg, about 1 year ago

I've been giving talks about Common Crawl for the last year with a slide about exactly this, using low-background steel as an example.
greyzor7, about 1 year ago

That's what archive.org already does, but if you want to re-implement it, you would have to crawl the whole web and eventually save thumbnails of pages with ScreenshotOne (https://microlaunch.net/p/screenshotone).
wseqyrku, about 1 year ago

> recently auto-generated junk

This would only apply to the pre-AGI era, though.
MattGaiser, about 1 year ago

Is this really all that different from the procedurally generated drivel or the offshore freelance copy/paste drivel?

I find that I get a lot more AI content, but it mostly displaced the original freelancer/procedurally generated spam.
metadat, about 1 year ago

Reality is a mess in a lot of ways. Unfortunately, in this case it's a bit late.

Wouldn't it be nice if Elgoog, OpenAI, or Character.ai published this dataset, considering they definitely have it, and they also caused this issue?

I'm not holding my breath.
jamesy0ung, about 1 year ago

Internet Archive exists for webpages.
acheron, about 1 year ago

The web has been overrun by drivel for over two decades now.
mceoin, about 1 year ago

Isn't this Common Crawl?
RecycledEle, about 1 year ago

It's way too late.
LorenDB, about 1 year ago

r/DataHoarder probably already has you covered.
fuzztester, about 1 year ago

The same seems to have been happening on HN for the last several months.

I had actually posted a question about this around that time, but the only reply I got was from a guy saying it was not likely, because the HN hive mind would drive down such posts.

Not sure if he was right, because I still see evidence of such stuff.
alpenbazi, about 1 year ago

Yes.
keepamovin, about 1 year ago

Embrace it. Stop living in the past, Gatsby. Just ask ChatGPT for the answers you seek. Hahaha! :)

What are you searching for anyway??