I agree in general, but the web was already polluted by Google's unwritten SEO rules. Single-sentence paragraphs, keyword repetition, and a focus on "indexability" instead of readability made the web a less than ideal source for such analysis long before LLMs.<p>It also made the web a less than ideal source for training. And yet LLMs were still fed articles written for Googlebot, not humans. ML/LLM is the second iteration of writing pollution. The first was humans writing for corporate bots, not other humans.
I created <a href="https://lowbackgroundsteel.ai/" rel="nofollow">https://lowbackgroundsteel.ai/</a> in 2023 as a place to gather references to unpolluted datasets. I'll add wordfreq. Please submit stuff to the Tumblr.
I regret that the situation led the OP to feel discouraged about the NLP community, to which I belong, and I just want to say "we're not all like that", even though it is a trend and we're close to peak hype (slightly past it, even?).<p>The complaint about pollution of the Web with artificial content is timely, and it's not even the first time, due to spam farms intended to game PageRank, among other nonsense. This may just mean there is new value in hand-curated lists of high-quality Web sites (some people use the term "small Web").<p>Each generation of the Web needs techniques to overcome its particular generation of adversarial mechanisms, and the current Web stage is no exception.<p>When Eric Arthur Blair wrote 1984 (under his pen name "George Orwell"), he anticipated people consuming auto-generated content to keep the masses away from critical thinking. This is now happening (he even anticipated auto-generated porn in the novel), but the technologies criticized can also be used for good, and that is what I try to do in my NLP research team. Good <i>will</i> prevail in the end.
I'm going to call it: The Web is dead. Thanks to "AI" I spend more time now digging through searches trying to find something useful than I did back in 2005. And the sites you do find are largely garbage.<p>As a random example: just trying to find a particular popular set of wireless earbuds takes me at least 10 minutes, when I already know the company, the company's website, other vendors that sell the company's goods, etc. It's just buried under tons of dreck. And my laptop is "old" (an 8-core i7 processor with 16GB of RAM) so it struggles to push through graphics-intense "modern" websites like the vendor's. Their old website was plain and worked great, letting me quickly search through their products and quickly purchase them. Last night I literally struggled to add things to cart and check out; it was actually harrowing.<p>Fuck the web, fuck web browsers, web design, SEO, searching, advertising, and all the schlock that comes with it. I'm done. If I can in any way purchase something without the web, I'mma do that. I don't hate technology (entirely...) but the web is just a rotten egg now.
<i>"I don't think anyone has reliable information about post-2021 language usage by humans."</i><p>We've been past the tipping point when it comes to text for some time, but for video I feel we are living through the watershed moment right now.<p>Especially smaller children don't have a good intuition on what is real and what is not. When I get asked if the person in a video is real, I still feel pretty confident to answer but I get less and less confident every day.<p>The technology is certainly there, but the majority of video content is still not affected by it. I expect this to change very soon.
> Now the Web at large is full of slop generated by large language models, written by no one to communicate nothing.<p>Fair and accurate. In the best cases the person running the model didn't write this stuff and word salad doesn't communicate whatever they meant to say. In many cases though, content is simply pumped out for SEO with no intention of being valuable to anyone.
Somewhat related: paper books from before 2020 could become a valuable commodity in a decade or two, when the Internet will be full of slop and even contemporary paper books will be treated with suspicion. And there will be human talking heads posing as the authors of books written by very smart AIs. God, why are we doing this????
I feel so conflicted about this.<p>On the one hand, I completely agree with Robyn Speer. The open web is dead, and the web is in a really sad state. The other day I decided to publish my personal blog on gopher. Just cause, there's a lot less crap on gopher (and no, gopher is not the answer).<p>But...<p>A couple of weeks ago, I had to send a video file to my wife's grandfather, who is 97, lives in another country, and doesn't use computers or mobile phones. Eventually we determined that he has a DVD player, so I turned to x264 to convert this modern 4K HDR video into a form that can be played by any ancient DVD player, while preserving as much visual fidelity as possible.<p>The thing about x264 is, it doesn't have any docs. Unlike x265 which had a corporate sponsor who could spend money on writing proper docs, x264 was basically developed through trial and error by members of the doom9 forum. There are hundreds of obscure flags, some of which now operate differently to what they did 20 years ago. I could spend hours going through dozens of 20 year old threads on doom9 to figure out what each flag did, or I could do what I did and ask a LLM (in this case Claude).<p>Claude wasn't perfect. It mixed up a few ffmpeg flags with x264 ones (easy mistake), but combined with some old fashioned searching and some trial and error, I could get the job done in about half an hour. I was quite happy with the quality of the end product, and the video did play on that very old DVD player.<p>Back in pre-LLM days, it's not like I would have hired a x264 expert to do this job for me. I would have either had to spend hours more on this task, or more likely, this 97 year old man would never have seen his great granddaughter's dance, which apparently brought a massive smile to his face.<p>Like everything before them, LLMs are just tools. Neither inherently good nor bad. It's what we do with them and how we use them that matters.
Did we (the humans) somehow manage to pollute the internet so much with AI that it's now barely usable?<p>In my opinion the internet can be considered the equivalent of a natural environment like the Earth: it's a space where people share, meet, talk, etc.<p>I find it astonishing that after polluting our natural environment we have now polluted the internet.
All those writers who'll soon be out of a job (or already are, and basically unhireable for their previous tasks) should be paid by the AI hyperscalers to write anything at all, on one condition: not a single sentence in their works should be created with AI.<p>(I initially wanted to say "paid by the government", but that'd be socialising losses and we've had quite enough of that in the past.)
This is one of the vanguards warning of the changes coming in the post-AI world.<p>>> Generative AI has polluted the data<p>Just like low-background steel marks the break in history from before and after the nuclear age, these types of data mark the distinction from before and after AI.<p>Future models will continue to amplify certain statistical properties from their training, and that amplified data will continue to pollute the public space from which future training data is drawn. Meanwhile certain low-frequency data will be selected by these models less and less and will become suppressed and possibly eliminated. We know from classic NLP techniques that low-frequency words are often among the highest in information content and descriptive power.<p>Bitrot will continue to act as the agent of Entropy, further reducing pre-AI datasets.<p>These feedback loops will persist, language will be ground down, neologisms will be prevented, and society, no longer possessing the mental tools to describe changing circumstances, its new thoughts unable to be realized, will cease to advance and then regress.<p>Soon there will be no new low-frequency ideas being removed from the data, only old low-frequency ideas. Language's descriptive power is further eliminated, and only the AIs seem able to produce anything that might represent the shadow of novelty. But it ends when the machines can only produce unintelligible pages of particles and articles; language is lost, and civilization is lost when we no longer know what to call its downfall.<p>The glimmer of hope is that humanity figured out how to rise from the dreamstate of the world of animals once. Future humans will be able to climb from the ashes again. There used to be a word, the name of a bird, that encoded this ability to die and return again, but that name is already lost to the machines that will take our tongues.
I wonder if anyone will fork the project. Apart from anything else, the data may still be useful given that we know it is polluted. In fact, it could act as a means of judging the impact of LLMs via that very pollution.
I think this person has too high a view of pre-2021, probably for ego reasons. In fact, their attitude seems very ego-driven. AI didn't just appear in 2021. Nobody knows how much text was machine-generated prior to 2021; it was much harder, if not impossible, to detect. If anything, it's probably easier now, since people are all using the same AIs, which use words like "delve" so much it becomes obvious.
I have been noticing this trend increasingly myself. It's getting more and more difficult to use tools like Google search to find relevant content.<p>Many of my searches nowadays include suffixes like "site:reddit.com" (or similar havens of, hopefully, still mostly human-generated content) to produce reasonably useful results. There's so much spam pollution by sites like Medium.com that it's disheartening. It feels as if the Internet humanity is already on the retreat into their last comely homes, which are more closed than open to the outside.<p>On the positive side:<p>1. Self-managed blogs (like: not on Substack or Medium) by individuals have become a strong indicator for interesting content. If the blog runs on Hugo, Zola, Astro, you-name-it, there's hope.<p>2. As a result of (1), I have started to use an RSS reader again. Who would have thought!<p>I am still torn about what to make of Discord. On the one hand, the closed-by-design nature of the thousands of Discord servers, where content is locked in forever without a chance of being indexed by a search engine, has many downsides in my opinion. On the other hand, the servers I do frequent are populated by humans, not content-generating bots camouflaged as users.
We will soon face another kind of bit-rot: one where so much text is generated by LLMs that it pollutes the human natural-language corpus available for training on the web.<p>Maybe we actually need to preserve all the old movies / documentaries / books in all languages and mark them as pre-LLM / non-LLM.<p>But I hazard a guess this won't happen, as it's a common good that could only be funded by left-leaning taxation policies - no one can make money doing this, unlike burning carbon chains to power LLMs.
If it is (apparently) easy for humans to tell when content is AI-generated slop, then it should be possible to develop an AI to distinguish human-created content.<p>As mentioned, we have heuristics like frequency of the word "delve", and simple techniques such as measuring perplexity. I'd like to see a GAN style approach to this problem. It could potentially help improve the "humanness" of AI-generated content.
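A minimal sketch of the perplexity heuristic mentioned above, using only a unigram model built from a reference word-count table. This is purely illustrative: real detectors use neural language models, and the toy corpus here is made up; only the shape of the technique is shown.

```python
import math
from collections import Counter

def unigram_perplexity(text, reference_counts, vocab_size=50000):
    """Perplexity of `text` under a unigram model estimated from
    `reference_counts` (word -> count), with add-one smoothing."""
    total = sum(reference_counts.values())
    words = text.lower().split()
    log_prob = 0.0
    for w in words:
        # add-one smoothing so unseen words don't zero out the product
        p = (reference_counts.get(w, 0) + 1) / (total + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / max(len(words), 1))

# Toy "human reference" corpus; in practice this would be a large sample.
reference = Counter("the cat sat on the mat the dog sat on the rug".split())

# In-distribution text scores lower perplexity than out-of-distribution text.
print(unigram_perplexity("the cat sat", reference))
print(unigram_perplexity("zyx qqq flurble", reference))
```

The GAN-style idea would pit a generator against exactly this kind of discriminator, just with far more capable models on both sides.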
Not to be too dismissive, but is there a worthwhile direction of research to pursue in NLP that is not LLMs?<p>If we add linguistics to NLP I can see an argument, but if we define NLP as the research of enabling a computer to process language, then it seems to me that LLMs / generative AI are the only research that an NLP practitioner should focus on, and everything else is moot. Is there any other paradigm that we think can enable a computer to understand language, other than training a large deep-learning model on a lot of data?
It could be used to spot LLM-generated text.<p>Compare the frequency of words to those used in natural human writing, and you spot the computer from the human.
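A rough sketch of that comparison: the relative rate of a marker word like "delve" in a candidate text versus a human reference sample. The sample texts are invented for illustration; a real detector would need large samples and many marker words, not one.

```python
def rate(text, word):
    """Fraction of tokens in `text` equal to `word` (whitespace tokenized)."""
    words = text.lower().split()
    return words.count(word) / max(len(words), 1)

def frequency_ratio(candidate, reference, word, eps=1e-9):
    """How many times more often `word` appears in the candidate
    than in the human reference sample (eps avoids division by zero)."""
    return (rate(candidate, word) + eps) / (rate(reference, word) + eps)

human = "we looked into the archives and dug through the old files"
suspect = "let us delve into the archives and delve into the files"

# A ratio far above 1 flags overuse of the marker word.
print(frequency_ratio(suspect, human, "delve"))
```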
I guess a manageable, still-useful alternative would be to curate a whitelist of sources that don't use AI and, without making that list public, derive the word frequencies from only those sources. How to compile that list is left as an exercise for the reader. The result would not be as accurate as a broad sample of the web, but in a world where it's impossible to trust a broad sample of the web, it's the option you are left with. And I have no reason to doubt that it could be done at a useful scale.<p>I'm sure this has occurred to them already. Apart from the near-impossibility of continuing the task in the same way they've always done it, it seems like the other reason they're not updating wordfreq is to stick a thumb in the eye of OpenAI and Google. While I appreciate the sentiment, I recognize that those corporations' eyes will never be sufficiently thumbed to satisfy anybody, so I would not let that anger change the course of my life's work, personally.
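The whitelist idea can be sketched in a few lines: derive relative word frequencies only from a hand-curated set of trusted documents. The documents below are placeholders; in practice they would be fetched from a private list of verified human-written sources.

```python
from collections import Counter

# Hypothetical curated corpus standing in for a private whitelist
# of verified human-written sources.
TRUSTED_DOCS = [
    "the quick brown fox jumps over the lazy dog",
    "she sells sea shells by the sea shore",
]

def curated_word_freqs(docs):
    """Relative word frequencies computed over the curated corpus only."""
    counts = Counter(w for doc in docs for w in doc.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

freqs = curated_word_freqs(TRUSTED_DOCS)
print(freqs["the"])  # the most frequent word in this tiny sample
```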
Web before 2021 was still polluted by content farms. The articles were written by humans, but still, they were rubbish. Not compared to current rate of generation, but the web was already dominated by them.
It might be fun to collect the same data if not for any other reason than to note the changes but adding the caveat that it doesn’t represent human output.<p>Might even change the tool name.
The year 2021 is to wordfreq what 1945 was to carbon-14 dating.<p>I guess that, the same way scientists had to account for the bomb pulse in order to provide accurate carbon-14 dating, wordfreq would need a magic way to account for non-human content.<p>I say magic because, unfortunately, it was much easier to detect nuclear testing in the atmosphere than it will be to detect AI-generated content.
Reading through this entire thread, I suspect that somehow generative AI actually became a political issue. Polarized politics is like a vortex sucking all kinds of unrelated things in.<p>In case that doesn't get my comment completely buried, I will go ahead and say honestly that even though "AI slop" and paywalled content are a problem, I don't think that generative AI in itself is a negative at all. And I also think that part of this person's reaction is that LLMs have made previous NLP techniques, such as those based on simple usage counts, largely irrelevant.<p>What was/is wordfreq used for, and can those tasks not actually be done more effectively with a cutting-edge language model of some sort these days? Maybe even a really small one for some things.
I hear this complaint often, but in reality I have encountered fairly little content in my day-to-day that has felt fully AI-generated. AI-assisted, sure, but is that a problem if a human is in the mix, curating?<p>I certainly have not encountered enough straight drivel to think it would have a significant effect on overall word statistics.<p>I suspect there may be some over-identification of AI content happening, a sort of Baader-Meinhof-effect cognitive bias. People have their eye out for it, and suddenly everything that reads a little weird logically "must be AI generated" and isn't just a bad human writer.<p>Maybe I am biased: about a decade ago I worked for an SEO company with a team of copywriters who pumped out mountains of the most inane keyword-packed text, designed for literally no one but Google to read. It would rot your brain if you tried, and it was written by hand by a team of human beings. This existed WELL before generative AI.
Science publications until 1955 may be the last ones not contaminated by calculators.<p><a href="https://news.ycombinator.com/item?id=34966335">https://news.ycombinator.com/item?id=34966335</a><p>We will all get used to it.
One of the examples is the increased usage of "delve", which Google Trends confirms has increased since 2022 (the initial ChatGPT release): <a href="https://trends.google.com/trends/explore?date=all&q=delve&hl=en" rel="nofollow">https://trends.google.com/trends/explore?date=all&q=delve&hl...</a><p>It seems, however, that usage increased most in just the last few months; maybe people are talking more about "delve" specifically because of the increase in usage. A usage recursion of sorts.
Okay but how big of a sample size do we even actually need for word frequencies? Like what’s the goal here? It looks like the initial project isn’t even stratified per year/decade
It's just inevitable. Imagine a world where we get a cheap and accessible AGI. Most work in the world will be done by it. Certainly, it will organise the work the way it finds most preferable. Humans (and other AIs) will find it much harder to learn from example, as most of the work is performed in the same uniform way.
The AI revolution should start with the field closest to its roots.
<a href="https://trends.google.com/trends/explore?date=all&geo=US&q=delve&hl=en" rel="nofollow">https://trends.google.com/trends/explore?date=all&geo=US&q=d...</a><p>The funny thing: it doesn't result in an increase in searches for "delve".
We need a vintage data/handmade data service. A service that can provide text and images for training that are guaranteed to have either been produced by a human or produced before 2021.<p>Someone should start scanning all those microfiche archives in local libraries and sell the data.
Enshittification is accelerating. A good 70% of my Facebook feed is now obviously AI generated images with AI generated text blurbs that have nothing to do with the accompanying images likely posted by overseas bot farms. I'm also noticing more and more "books" on Amazon that are clearly AI generated and self published.
If generative AI has a significantly different word frequency from humans, then it also shouldn't be hard to detect text written by generative AI. However, my last information is that tools to detect text written by generative AI are not that great.
Most of the "random" bot content pre-2021 was low-quality Markov-generated text. If anything, these generative AI tools would improve the accuracy of scraping large corpora of text from the web.
I've wondered from time to time why I collect history books, keep my encyclopedias, when I could just google it. Now I know why. They predate AI and are unpolluted by generated bilge.
I agree with the general ethos of the piece (albeit a few of the details are puzzling and unnecessarily partisan - content on X isn't invariably worthless drivel, nor does what Reddit is doing make much intellectual as opposed to economic [IPO-influenced] sense), but this line:<p>'OpenAI and Google can collect their own damn data. I hope they have to pay a very high price for it, and I hope they're constantly cursing the mess that they made themselves.'<p>really does betray some real naivete. OpenAI and Google could literally burn $10 million per day (okay, maybe not OpenAI - but Google surely could) and reasonably fail to notice. Whatever costs those companies have to pay to collect training data will be well worth it to them. Any messes made in the course of obtaining that data will be dealt with by an army of employees either manually cleaning up the data, or by algorithms Google has its own LLM write for itself.<p>I do find the general sense of impending dystopian inhumanity arising out of the explosion of LLMs to be super fascinating (and completely understandable).
Called it (unfortunately): <a href="https://news.ycombinator.com/item?id=34301852">https://news.ycombinator.com/item?id=34301852</a>
Man the AI folks really wrecked everything. Reminds me of when those scooter companies started just dumping their scooters everywhere without asking anybody if they wanted this.
A few years ago I began an effort to write a new tech book. I originally planned to do as much of it as I could across a series of commits in a public GitHub repo of mine.<p>I then changed course. Why? I had read increasing reports of human e-book pirates (copying your book's content, then repackaging it for sale under a different title, byline, cover, and possibly at a much lower or even much higher price).<p>And then came the rise of LLMs and their ravenous training-ingest bots -- plagiarism at scale, and potentially even easier to disguise.<p>"Not gonna happen." - Bush Sr., via Dana Carvey<p>Now I keep the bulk of my book material non-public during development. I'm sure I'll share a chapter candidate or so at some point before final release, for feedback and publicity. But the bulk will debut all together at once, and only once polished and behind a paywall.
Sad to see wordfreq halted, it was a real party for linguistics enthusiasts.
"Multi-script languages<p>Two of the languages we support, Serbian and Chinese, are written in multiple scripts. To avoid spurious differences in word frequencies, we automatically transliterate the characters in these languages when looking up their words.<p>Serbian text written in Cyrillic letters is automatically converted to Latin letters, using standard Serbian transliteration, when the requested language is sr or sh."<p>I'd support keeping both scripts (српска ћирилица and latin script) , similarly to hiragana (ひらがな) and katakana (カタカナ) in Japanese.
I think the main reason for sunsetting the project is hinted at near the bottom:<p>> <i>The field I know as "natural language processing" is hard to find these days. It's all being devoured by generative AI. Other techniques still exist but generative AI sucks up all the air in the room and gets all the money.</i><p>Traditional NLP has been surpassed by transformers, making this project obsolete. The rest of the post reads like rationalization and sour grapes.
Has anyone taken a look at a random sample of web data? It's mostly crap. I was thinking of making my own search engine, knowledge database etc based on a random sample of web pages, but I found that almost all of them were drivel. Flame wars, asinine blog posts, and most of all, advertising. Forget spam, most of the legit pages are trying to sell something too!<p>The conclusion I arrived at was that making my own crawler actually is feasible (and given my goals, necessary!) because I'm only interested in a very, very small fraction of what's out there.
Wow there is so much vitriol both in this post and in the comments here. I understand that there are many ethical and practical problems with generative AI, but when did we stop being hopeful and start seeing the darkest side of everything? Is it just that the average HN reader is now past the age where a new technological development is an exciting opportunity and on to the age where it is a threat? Remember, the Luddites were not opposed to looms, they just wanted to own them.
> the Web at large is full of slop generated by large language models, written by no one to communicate nothing<p>That’s neither fair nor accurate. That slop is ultimately generated by the humans who run those models; they are attempting (perhaps poorly) to communicate <i>something</i>.<p>> two companies that I already despise<p>Life’s too short to go through it hating others.<p>> it's very likely because they are creating a plagiarism machine that will claim your words as its own<p>That begs the question. Plagiarism has a particular definition. It is not at all clear that a machine learning from text should be treated any differently from a human being learning from text: i.e., duplicating exact phrases or failing to credit ideas may in some circumstances be plagiarism, but no-one is required to append a statement crediting every text he has ever read to every document he ever writes.<p>Credits: every document I have ever read <i>grin</i>
Ok, so the post author is an AI skeptic and this is their retaliation, likely because their work is affected. I believe governments should address the problem with welfare, but being against technical advances is always being on the wrong side of history.
> It's rare to see NLP research that doesn't have a dependency on closed data controlled by OpenAI and Google, two companies that I already despise.<p>The dependency on closed data combined with the cost of compute to do anything interesting with LLMs has made individual contributions to NLP research extremely difficult if one is not associated with a very large tech company. It's super unfortunate, makes the subject area much less approachable, and makes the people doing research in the subject area much more homogeneous.
I really like the fact that the conventional user-content internet is being willfully polluted and made ever more useless by the incessant influx of "ai"-garbage. At some point all of this will become so awful that nerds will create new and quiet corners of real people and real information, while the idiot rabble has to use new and expensive tools peddled by scammy tech bros to handle the stench of automated manure that flows out of stagnant llms digesting themselves.
>"Now Twitter is gone anyway, its public APIs have shut down, and the site has been replaced with an oligarch's plaything, a spam-infested right-wing cesspool called X. Even if X made its raw data feed available (which it doesn't), there would be no valuable information to be found there.<p>>Reddit also stopped providing public data archives, and now they sell their archives at a price that only OpenAI will pay.<p>>And given what's happening to the field, I don't blame them."<p>What beautiful doublethink.
> Now Twitter is gone anyway, its public APIs have shut down, and the site has been replaced with an oligarch's plaything, a spam-infested right-wing cesspool called X<p>God I hate this dystopic timeline we live in.
This has to be the most annoying Hacker News comment section I've ever seen. It's just the same ~4 viewpoints rehashed again, and again, and again. Why don't folks just upvote other comments that say the same thing instead of repeating the same things?<p>And now a hopefully new comment: having a word frequency measure of the internet as we're going into AI being more used would be IMMENSELY useful specifically _because_ more of the internet is being AI generated! I could see such a dataset being immensely useful to researchers who are looking for the impacts of AI on language, and to test empirically a lot of claims the author has made in this very post! What a shame that they stopped measuring.<p>Also: as to the claims that AI will cause stagnation and a reduction in the variance of English vocabulary used, this is a trend in English that's been happening for over 100 years ( <a href="https://evoandproud.blogspot.com/2019/09/why-is-vocabulary-shrinking.html?m=1" rel="nofollow">https://evoandproud.blogspot.com/2019/09/why-is-vocabulary-s...</a> ). I believe the opposite will happen: AI will increase the average person's vocabulary, since chat AIs tend to be more professionally written than a lot of the internet. It's like being able to chat with someone that has an infinite vocabulary. It also makes it possible for people to read complicated documents well out of their domain, since they can ask not just for definitions but for more in-depth explanations of what words/sections mean.<p>Here's to a comment that will never be read because of all the noise in this thread :/
I understand the frustration shared in this post but I wholeheartedly disagree with the overall sentiment that comes with it.<p>The web isn't dead, (Gen)AI, SEO, spam and pollution didn't kill anything.<p>The world is chaotic and net entropy (degree of disorder) of any isolated or closed system will always increase. Same goes for the web. We just have to embrace it and overcome the challenges that come with it.