I agree in general, but the web was already polluted by Google's unwritten SEO rules. Single-sentence paragraphs, keyword repetition, and a focus on "indexability" instead of readability made the web a less than ideal source for such analysis long before LLMs.<p>It also made the web a less than ideal source for training. And yet LLMs were still fed articles written for Googlebot, not humans. ML/LLM is the second iteration of writing pollution. The first was humans writing for corporate bots, not other humans.
I created <a href="https://lowbackgroundsteel.ai/" rel="nofollow">https://lowbackgroundsteel.ai/</a> in 2023 as a place to gather references to unpolluted datasets. I'll add wordfreq. Please submit stuff to the Tumblr.
I regret that the situation led the OP to feel discouraged about the NLP community, to which I belong, and I just want to say "we're not all like that", even though it is a trend and we're close to peak hype (slightly past it, even?).<p>The complaint about pollution of the Web with artificial content is timely, and it's not even the first time, due to spam farms intended to game PageRank, among other nonsense. This may just mean there is new value in hand-curated lists of high-quality Web sites (some people use the term "small Web").<p>Each generation of the Web needs techniques to overcome its particular generation of adversarial mechanisms, and the current Web stage is no exception.<p>When Eric Arthur Blair wrote 1984 (under his pen name "George Orwell"), he anticipated people consuming auto-generated content to keep the masses away from critical thinking. This is now happening (he even anticipated auto-generated porn in the novel), but the technologies criticized can also be used for good, and that is what I try to do in my NLP research team. Good <i>will</i> prevail in the end.
I'm going to call it: The Web is dead. Thanks to "AI" I spend more time now digging through searches trying to find something useful than I did back in 2005. And the sites you do find are largely garbage.<p>As a random example: just trying to find a particular popular set of wireless earbuds takes me at least 10 minutes, when I already know the company, the company's website, other vendors that sell the company's goods, etc. It's just buried under tons of dreck. And my laptop is "old" (an 8-core i7 processor with 16GB of RAM) so it struggles to push through graphics-intense "modern" websites like the vendor's. Their old website was plain and worked great, letting me quickly search through their products and quickly purchase them. Last night I literally struggled to add things to cart and check out; it was actually harrowing.<p>Fuck the web, fuck web browsers, web design, SEO, searching, advertising, and all the schlock that comes with it. I'm done. If I can in any way purchase something without the web, I'mma do that. I don't hate technology (entirely...) but the web is just a rotten egg now.
<i>"I don't think anyone has reliable information about post-2021 language usage by humans."</i><p>We've been past the tipping point when it comes to text for some time, but for video I feel we are living through the watershed moment right now.<p>Especially smaller children don't have a good intuition on what is real and what is not. When I get asked if the person in a video is real, I still feel pretty confident to answer but I get less and less confident every day.<p>The technology is certainly there, but the majority of video content is still not affected by it. I expect this to change very soon.
> Now the Web at large is full of slop generated by large language models, written by no one to communicate nothing.<p>Fair and accurate. In the best cases the person running the model didn't write this stuff and word salad doesn't communicate whatever they meant to say. In many cases though, content is simply pumped out for SEO with no intention of being valuable to anyone.
Somewhat related: paper books from before 2020 could become a valuable commodity in a decade or two, when the Internet will be full of slop and even contemporary paper books will be treated with suspicion. And there will be human talking heads posing as the authors of books written by very smart AIs. God, why are we doing this????
I feel so conflicted about this.<p>On the one hand, I completely agree with Robyn Speer. The open web is dead, and the web is in a really sad state. The other day I decided to publish my personal blog on gopher. Just cause, there's a lot less crap on gopher (and no, gopher is not the answer).<p>But...<p>A couple of weeks ago, I had to send a video file to my wife's grandfather, who is 97, lives in another country, and doesn't use computers or mobile phones. Eventually we determined that he has a DVD player, so I turned to x264 to convert this modern 4K HDR video into a form that can be played by any ancient DVD player, while preserving as much visual fidelity as possible.<p>The thing about x264 is, it doesn't have any docs. Unlike x265 which had a corporate sponsor who could spend money on writing proper docs, x264 was basically developed through trial and error by members of the doom9 forum. There are hundreds of obscure flags, some of which now operate differently to what they did 20 years ago. I could spend hours going through dozens of 20 year old threads on doom9 to figure out what each flag did, or I could do what I did and ask a LLM (in this case Claude).<p>Claude wasn't perfect. It mixed up a few ffmpeg flags with x264 ones (easy mistake), but combined with some old fashioned searching and some trial and error, I could get the job done in about half an hour. I was quite happy with the quality of the end product, and the video did play on that very old DVD player.<p>Back in pre-LLM days, it's not like I would have hired a x264 expert to do this job for me. I would have either had to spend hours more on this task, or more likely, this 97 year old man would never have seen his great granddaughter's dance, which apparently brought a massive smile to his face.<p>Like everything before them, LLMs are just tools. Neither inherently good nor bad. It's what we do with them and how we use them that matters.
Did we (the humans) somehow manage to pollute the internet so much with AI that it's now barely usable?<p>In my opinion the internet can be considered the equivalent of a natural environment like the Earth: it's a space where people share, meet, talk, etc.<p>I find it astonishing that after polluting our natural environment we have now polluted the internet.
All those writers who'll soon be out of a job (or already are, and basically unhireable for their previous tasks) should be paid by the AI hyperscalers to write anything at all, on one condition: not a single sentence in their works should be created with AI.<p>(I initially wanted to say "paid by the government", but that'd be socialising losses and we've had quite enough of that in the past.)
This is one of the vanguards warning of the changes coming in the post-AI world.<p>>> Generative AI has polluted the data<p>Just like low-background steel marks the break in history from before and after the nuclear age, these types of data mark the distinction from before and after AI.<p>Future models will continue to amplify certain statistical properties from their training, and that amplified data will continue to pollute the public space from which future training data is drawn. Meanwhile certain low-frequency data will be selected by these models less and less and will become suppressed and possibly eliminated. We know from classic NLP techniques that low-frequency words are often among the highest in information content and descriptive power.<p>Bitrot will continue to act as the agent of Entropy, further reducing pre-AI datasets.<p>These feedback loops will persist, language will be ground down, neologisms will be prevented, and society, no longer possessing the mental tools to describe changing circumstances, its new thoughts unable to be realized, will cease to advance and then regress.<p>Soon there will be no new low-frequency ideas being removed from the data, only old low-frequency ideas. Language's descriptive power is further eliminated, and only the AIs seem able to produce anything that might represent the shadow of novelty. But it ends when the machines can only produce unintelligible pages of particles and articles; language is lost, and civilization is lost when we no longer know what to call its downfall.<p>The glimmer of hope is that humanity figured out how to rise from the dreamstate of the world of animals once. Future humans will be able to climb from the ashes again. There used to be a word, the name of a bird, that encoded this ability to die and return again, but that name is already lost to the machines that will take our tongues.
I wonder if anyone will fork the project. Apart from anything else, the data may still be useful given that we know it is polluted. In fact, it could act as a means of judging the impact of LLMs via that very pollution.
I think this person has too high a view of pre-2021, probably for ego reasons. In fact, their attitude seems very ego-driven. AI didn't just appear in 2021. Nobody knows how much text was machine-generated prior to 2021; it was much harder, if not impossible, to detect. If anything, it's probably easier now, since people are all using the same AIs, which use words like "delve" so much it becomes obvious.
I have been noticing this trend increasingly myself. It's getting more and more difficult to use tools like Google search to find relevant content.<p>Many of my searches nowadays include suffixes like "site:reddit.com" (or similar havens of, hopefully, still mostly human-generated content) to produce reasonably useful results. There's so much spam pollution by sites like Medium.com that it's disheartening. It feels as if the Internet humanity is already on the retreat into their last comely homes, which are more closed than open to the outside.<p>On the positive side:<p>1. Self-managed blogs (like: not on Substack or Medium) by individuals have become a strong indicator for interesting content. If the blog runs on Hugo, Zola, Astro, you-name-it, there's hope.<p>2. As a result of (1), I have started to use an RSS reader again. Who would have thought!<p>I am still torn about what to make of Discord. On the one hand, the closed-by-design nature of the thousands of Discord servers, where content is locked in forever without a chance of being indexed by a search engine, has many downsides in my opinion. On the other hand, the servers I do frequent are populated by humans, not content-generating bots camouflaged as users.
We will soon face another kind of bit-rot: one where so much text is generated by LLMs that it pollutes the human natural-language corpus available for training on the web.<p>Maybe we actually need to preserve all the old movies / documentaries / books in all languages and mark them as pre-LLM / non-LLM.<p>But I hazard a guess this won't happen, as it's a common good that could only be funded by left-leaning taxation policies - no one can make money doing this, unlike burning carbon chains to power LLMs.
If it is (apparently) easy for humans to tell when content is AI-generated slop, then it should be possible to develop an AI to distinguish human-created content.<p>As mentioned, we have heuristics like frequency of the word "delve", and simple techniques such as measuring perplexity. I'd like to see a GAN style approach to this problem. It could potentially help improve the "humanness" of AI-generated content.
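A minimal sketch of the perplexity heuristic mentioned above, using only a unigram model built from a reference word-count table. This is purely illustrative: real detectors use neural language models, and the toy corpus here is made up; only the shape of the technique is shown.

```python
import math
from collections import Counter

def unigram_perplexity(text, reference_counts, vocab_size=50000):
    """Perplexity of `text` under a unigram model estimated from
    `reference_counts` (word -> count), with add-one smoothing."""
    total = sum(reference_counts.values())
    words = text.lower().split()
    log_prob = 0.0
    for w in words:
        # add-one smoothing so unseen words don't zero out the product
        p = (reference_counts.get(w, 0) + 1) / (total + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / max(len(words), 1))

# Toy "human reference" corpus; in practice this would be a large sample.
reference = Counter("the cat sat on the mat the dog sat on the rug".split())

# In-distribution text scores lower perplexity than out-of-distribution text.
print(unigram_perplexity("the cat sat", reference))
print(unigram_perplexity("zyx qqq flurble", reference))
```

The GAN-style idea would pit a generator against exactly this kind of discriminator, just with far more capable models on both sides.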
Not to be too dismissive, but is there a worthwhile direction of research to pursue in NLP that is not LLMs?<p>If we add linguistics to NLP I can see an argument, but if we define NLP as the research of enabling a computer to process language, then it seems to me that LLMs / generative AI are the only research that an NLP practitioner should focus on, and everything else is moot. Is there any other paradigm that we think can enable a computer to understand language, other than training a large deep-learning model on a lot of data?
It could be used to spot LLM-generated text.<p>Compare the frequency of words to those used in natural human writing, and you spot the computer from the human.
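A rough sketch of that comparison: the relative rate of a marker word like "delve" in a candidate text versus a human reference sample. The sample texts are invented for illustration; a real detector would need large samples and many marker words, not one.

```python
def rate(text, word):
    """Fraction of tokens in `text` equal to `word` (whitespace tokenized)."""
    words = text.lower().split()
    return words.count(word) / max(len(words), 1)

def frequency_ratio(candidate, reference, word, eps=1e-9):
    """How many times more often `word` appears in the candidate
    than in the human reference sample (eps avoids division by zero)."""
    return (rate(candidate, word) + eps) / (rate(reference, word) + eps)

human = "we looked into the archives and dug through the old files"
suspect = "let us delve into the archives and delve into the files"

# A ratio far above 1 flags overuse of the marker word.
print(frequency_ratio(suspect, human, "delve"))
```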
I guess a manageable, still-useful alternative would be to curate a whitelist of sources that don't use AI and, without making that list public, derive the word frequencies from only those sources. How to compile that list is left as an exercise for the reader. The result would not be as accurate as a broad sample of the web, but in a world where it's impossible to trust a broad sample of the web, it's the option you are left with. And I have no reason to doubt that it could be done at a useful scale.<p>I'm sure this has occurred to them already. Apart from the near-impossibility of continuing the task in the same way they've always done it, it seems like the other reason they're not updating wordfreq is to stick a thumb in the eye of OpenAI and Google. While I appreciate the sentiment, I recognize that those corporations' eyes will never be sufficiently thumbed to satisfy anybody, so I would not let that anger change the course of my life's work, personally.
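The whitelist idea can be sketched in a few lines: derive relative word frequencies only from a hand-curated set of trusted documents. The documents below are placeholders; in practice they would be fetched from a private list of verified human-written sources.

```python
from collections import Counter

# Hypothetical curated corpus standing in for a private whitelist
# of verified human-written sources.
TRUSTED_DOCS = [
    "the quick brown fox jumps over the lazy dog",
    "she sells sea shells by the sea shore",
]

def curated_word_freqs(docs):
    """Relative word frequencies computed over the curated corpus only."""
    counts = Counter(w for doc in docs for w in doc.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

freqs = curated_word_freqs(TRUSTED_DOCS)
print(freqs["the"])  # the most frequent word in this tiny sample
```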
Web before 2021 was still polluted by content farms. The articles were written by humans, but still, they were rubbish. Not compared to current rate of generation, but the web was already dominated by them.
It might be fun to collect the same data if not for any other reason than to note the changes but adding the caveat that it doesn’t represent human output.<p>Might even change the tool name.
The year 2021 is to wordfreq what 1945 was to carbon-14 dating.<p>I guess that, the same way scientists had to account for the bomb pulse in order to provide accurate carbon-14 dating, wordfreq would need a magic way to account for non-human content.<p>I say magic because, unfortunately, it was much easier to detect nuclear testing in the atmosphere than it will be to detect AI-generated content.
Reading through this entire thread, I suspect that somehow generative AI actually became a political issue. Polarized politics is like a vortex sucking all kinds of unrelated things in.<p>In case that doesn't get my comment completely buried, I will go ahead and say honestly that even though "AI slop" and paywalled content are a problem, I don't think that generative AI in itself is a negative at all. And I also think that part of this person's reaction is that LLMs have made previous NLP techniques, such as those based on simple usage counts, largely irrelevant.<p>What was/is wordfreq used for, and can those tasks not actually be done more effectively with a cutting-edge language model of some sort these days? Maybe even a really small one for some things.
I hear this complaint often, but in reality I have encountered fairly little content in my day-to-day that has felt fully AI-generated. AI-assisted, sure, but is that a problem if a human is in the mix, curating?<p>I certainly have not encountered enough straight drivel to think it would have a significant effect on overall word statistics.<p>I suspect there may be some over-identification of AI content happening, a sort of Baader-Meinhof-effect cognitive bias. People have their eye out for it, and suddenly everything that reads a little weird logically "must be AI generated" and isn't just a bad human writer.<p>Maybe I am biased: about a decade ago I worked for an SEO company with a team of copywriters who pumped out mountains of the most inane keyword-packed text, designed for literally no one but Google to read. It would rot your brain if you tried, and it was written by hand by a team of human beings. This existed WELL before generative AI.
Science publications until 1955 may be the last ones not contaminated by calculators.<p><a href="https://news.ycombinator.com/item?id=34966335">https://news.ycombinator.com/item?id=34966335</a><p>We will all get used to it.
One of the examples is the increased usage of "delve", which Google Trends confirms has increased since 2022 (the initial ChatGPT release): <a href="https://trends.google.com/trends/explore?date=all&q=delve&hl=en" rel="nofollow">https://trends.google.com/trends/explore?date=all&q=delve&hl...</a><p>It seems, however, that usage increased most in just the last few months; maybe people are talking more about "delve" specifically because of the increase in usage. A usage recursion of sorts.
Okay but how big of a sample size do we even actually need for word frequencies? Like what’s the goal here? It looks like the initial project isn’t even stratified per year/decade
It's just inevitable. Imagine a world where we get a cheap and accessible AGI. Most work in the world will be done by it. Certainly, it will organise the work the way it finds most preferable. Humans (and other AIs) will find it much harder to learn from example, as most of the work is performed in the same uniform way.
The AI revolution should start with the field closest to its roots.
<a href="https://trends.google.com/trends/explore?date=all&geo=US&q=delve&hl=en" rel="nofollow">https://trends.google.com/trends/explore?date=all&geo=US&q=d...</a><p>The funny thing: it doesn't result in an increase in searches for "delve".
We need a vintage data/handmade data service. A service that can provide text and images for training that are guaranteed to have either been produced by a human or produced before 2021.<p>Someone should start scanning all those microfiche archives in local libraries and sell the data.
Enshittification is accelerating. A good 70% of my Facebook feed is now obviously AI generated images with AI generated text blurbs that have nothing to do with the accompanying images likely posted by overseas bot farms. I'm also noticing more and more "books" on Amazon that are clearly AI generated and self published.
If generative AI has a significantly different word frequency from humans, then it also shouldn't be hard to detect text written by generative AI. However, my last information is that tools to detect text written by generative AI are not that great.
Most of the "random" bot content pre-2021 was low-quality Markov-generated text. If anything, these generative AI tools would improve the accuracy of scraping large corpora of text from the web.
I've wondered from time to time why I collect history books, keep my encyclopedias, when I could just google it. Now I know why. They predate AI and are unpolluted by generated bilge.
I agree with the general ethos of the piece (albeit a few of the details are puzzling and unnecessarily partisan - content on X isn't invariably worthless drivel, nor does what Reddit is doing make much intellectual as opposed to economic [IPO-influenced] sense), but this line:<p>'OpenAI and Google can collect their own damn data. I hope they have to pay a very high price for it, and I hope they're constantly cursing the mess that they made themselves.'<p>really does betray some real naivete. OpenAI and Google could literally burn $10 million per day (okay, maybe not OpenAI - but Google surely could) and reasonably fail to notice. Whatever costs those companies have to pay to collect training data will be well worth it to them. Any messes made in the course of obtaining that data will be dealt with by an army of employees either manually cleaning up the data, or by algorithms Google has its own LLM write for itself.<p>I do find the general sense of impending dystopian inhumanity arising out of the explosion of LLMs to be super fascinating (and completely understandable).
Called it (unfortunately): <a href="https://news.ycombinator.com/item?id=34301852">https://news.ycombinator.com/item?id=34301852</a>
Man the AI folks really wrecked everything. Reminds me of when those scooter companies started just dumping their scooters everywhere without asking anybody if they wanted this.
A few years ago I began an effort to write a new tech book. I originally planned to do as much of it as I could across a series of commits in a public GitHub repo of mine.<p>I then changed course. Why? I had read increasing reports of human e-book pirates (copying your book's content, then repackaging it for sale under a different title, byline, cover, and possibly at a much lower or even much higher price).<p>And then came the rise of LLMs and their ravenous training-ingest bots -- plagiarism at scale, and potentially even easier to disguise.<p>"Not gonna happen." - Bush Sr., via Dana Carvey<p>Now I keep the bulk of my book material non-public during development. I'm sure I'll share a chapter candidate or so at some point before final release, for feedback and publicity. But the bulk will debut all together at once, and only once polished and behind a paywall.
Sad to see wordfreq halted, it was a real party for linguistics enthusiasts.
"Multi-script languages<p>Two of the languages we support, Serbian and Chinese, are written in multiple scripts. To avoid spurious differences in word frequencies, we automatically transliterate the characters in these languages when looking up their words.<p>Serbian text written in Cyrillic letters is automatically converted to Latin letters, using standard Serbian transliteration, when the requested language is sr or sh."<p>I'd support keeping both scripts (српска ћирилица and latin script) , similarly to hiragana (ひらがな) and katakana (カタカナ) in Japanese.
I think the main reason for sunsetting the project is hinted at near the bottom:<p>> <i>The field I know as "natural language processing" is hard to find these days. It's all being devoured by generative AI. Other techniques still exist but generative AI sucks up all the air in the room and gets all the money.</i><p>Traditional NLP has been surpassed by transformers, making this project obsolete. The rest of the post reads like rationalization and sour grapes.
Has anyone taken a look at a random sample of web data? It's mostly crap. I was thinking of making my own search engine, knowledge database etc based on a random sample of web pages, but I found that almost all of them were drivel. Flame wars, asinine blog posts, and most of all, advertising. Forget spam, most of the legit pages are trying to sell something too!<p>The conclusion I arrived at was that making my own crawler actually is feasible (and given my goals, necessary!) because I'm only interested in a very, very small fraction of what's out there.
Wow there is so much vitriol both in this post and in the comments here. I understand that there are many ethical and practical problems with generative AI, but when did we stop being hopeful and start seeing the darkest side of everything? Is it just that the average HN reader is now past the age where a new technological development is an exciting opportunity and on to the age where it is a threat? Remember, the Luddites were not opposed to looms, they just wanted to own them.
> the Web at large is full of slop generated by large language models, written by no one to communicate nothing<p>That’s neither fair nor accurate. That slop is ultimately generated by the humans who run those models; they are attempting (perhaps poorly) to communicate <i>something</i>.<p>> two companies that I already despise<p>Life’s too short to go through it hating others.<p>> it's very likely because they are creating a plagiarism machine that will claim your words as its own<p>That begs the question. Plagiarism has a particular definition. It is not at all clear that a machine learning from text should be treated any differently from a human being learning from text: i.e., duplicating exact phrases or failing to credit ideas may in some circumstances be plagiarism, but no-one is required to append a statement crediting every text he has ever read to every document he ever writes.<p>Credits: every document I have ever read <i>grin</i>
Ok, so the post author is an AI skeptic and this is their retaliation, likely because their work is affected. I believe governments should address the problem with welfare, but being against technical advances is always being on the wrong side of history.
> It's rare to see NLP research that doesn't have a dependency on closed data controlled by OpenAI and Google, two companies that I already despise.<p>The dependency on closed data combined with the cost of compute to do anything interesting with LLMs has made individual contributions to NLP research extremely difficult if one is not associated with a very large tech company. It's super unfortunate, makes the subject area much less approachable, and makes the people doing research in the subject area much more homogeneous.
I really like the fact that the conventional user-content internet is being willfully polluted and made ever more useless by the incessant influx of "ai"-garbage. At some point all of this will become so awful that nerds will create new and quiet corners of real people and real information, while the idiot rabble has to use new and expensive tools peddled by scammy tech bros to handle the stench of automated manure that flows out of stagnant llms digesting themselves.
>"Now Twitter is gone anyway, its public APIs have shut down, and the site has been replaced with an oligarch's plaything, a spam-infested right-wing cesspool called X. Even if X made its raw data feed available (which it doesn't), there would be no valuable information to be found there.<p>>Reddit also stopped providing public data archives, and now they sell their archives at a price that only OpenAI will pay.<p>>And given what's happening to the field, I don't blame them."<p>What beautiful doublethink.
> Now Twitter is gone anyway, its public APIs have shut down, and the site has been replaced with an oligarch's plaything, a spam-infested right-wing cesspool called X<p>God I hate this dystopic timeline we live in.
This has to be the most annoying Hacker News comment section I've ever seen. It's just the same ~4 viewpoints rehashed again, and again, and again. Why don't folks just upvote other comments that say the same thing instead of repeating the same things?<p>And now a hopefully new comment: having a word frequency measure of the internet as we're going into AI being more used would be IMMENSELY useful specifically _because_ more of the internet is being AI generated! I could see such a dataset being immensely useful to researchers who are looking for the impacts of AI on language, and to test empirically a lot of claims the author has made in this very post! What a shame that they stopped measuring.<p>Also: as to the claims that AI will cause stagnation and a reduction in the variance of English vocabulary used, this is a trend in English that's been happening for over 100 years ( <a href="https://evoandproud.blogspot.com/2019/09/why-is-vocabulary-shrinking.html?m=1" rel="nofollow">https://evoandproud.blogspot.com/2019/09/why-is-vocabulary-s...</a> ). I believe the opposite will happen: AI will increase the average person's vocabulary, since chat AIs tend to be more professionally written than a lot of the internet. It's like being able to chat with someone that has an infinite vocabulary. It also makes it possible for people to read complicated documents well out of their domain, since they can ask not just for definitions but for more in-depth explanations of what words/sections mean.<p>Here's to a comment that will never be read because of all the noise in this thread :/
I understand the frustration shared in this post but I wholeheartedly disagree with the overall sentiment that comes with it.<p>The web isn't dead, (Gen)AI, SEO, spam and pollution didn't kill anything.<p>The world is chaotic and net entropy (degree of disorder) of any isolated or closed system will always increase. Same goes for the web. We just have to embrace it and overcome the challenges that come with it.