It's a neat idea, though not what I expected from the title talking about "smart" :)<p>You might want to replace the single-page format with showing just one question at a time, giving instant feedback after each answer.<p>First, it'd be more engaging. Even the small version of the quiz is a bit long for something where you don't know what the payoff will be. Second, you'd get to see the correct answer while still having the context on why you replied the way you did.
I made a little game/quiz where you try to guess the next word in a bunch of Hacker News comments and compete against various language models. I used llama2 to generate three alternative completions for each comment, creating a multiple-choice question. For the local language models that you are competing against, I consider them to have picked the answer with the lowest total perplexity of prompt + answer. I am able to replicate this behavior with the OpenAI models by setting a logit_bias that limits the LLM to picking only one of the allowed answers. I tried just giving the full multiple-choice question as a prompt and having it pick an answer, but that led to really poor results. So I'm not able to compare with Claude or any online LLMs that don't support logit_bias.<p>I wouldn't call the quiz fun exactly. After playing with it a lot, I think I've been able to consistently get above 50% of questions right. I have slowed down a lot answering each question, which I think LLMs have trouble doing.
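For anyone curious what the perplexity-based selection might look like, here is a minimal sketch using Hugging Face transformers; the model name and scoring details are my assumptions, not the author's exact code. The logit_bias trick would be the OpenAI-side analogue: bias the logits so only the candidate answers remain available.<p><pre><code>
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed model; any causal LM works the same way.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf").eval()

def sequence_loss(text: str) -> float:
    """Average per-token cross-entropy of the sequence under the model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels=input_ids makes the model return its LM loss.
        return model(ids, labels=ids).loss.item()

def pick_answer(prompt: str, choices: list[str]) -> str:
    # Lowest loss == lowest perplexity, since perplexity = exp(loss).
    return min(choices, key=lambda c: sequence_loss(prompt + " " + c))
</code></pre>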
<p><pre><code> you: 4/15
gpt-4o: 0/15
gpt-4: 1/15
gpt-4o-mini: 2/15
llama-2-7b: 2/15
llama-3-8b: 3/15
mistral-7b: 4/15
unigram: 1/15
</code></pre>
Seems like none of us is really better than random guessing (with four choices, chance works out to about 4/15), so I'd wager that you cannot accurately predict the next word from the given information.<p>If one could instead sort the answers by likelihood and get scored based on how highly one ranked the correct answer, things would probably look better than random.<p>Also, I wonder how these LLMs were prompted. Were they just used to complete the text, or were they put in a "mood" where they would try to complete the text in the original author's voice?<p>Obviously, as a human, I'd try to put myself in the author's head and emulate their way of speaking, whereas an LLM might just complete things in its default voice.
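A minimal sketch of that rank-based scoring idea (my illustration, not something the site implements): order the choices by how likely you think they are, then score by where the truth lands.<p><pre><code>
def rank_score(ranked_choices: list[str], correct: str) -> int:
    """3 points if the correct word is ranked first, 0 if ranked last."""
    return len(ranked_choices) - 1 - ranked_choices.index(correct)

# Ranked the correct choice "c" second out of four -> 2 points.
print(rank_score(["a", "c", "b", "d"], "c"))
</code></pre>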
Nice. I found you can beat this by picking the word least likely to be selected by a language model, because it seems like the alternative choices are generated by an LLM. "Pick the outlier" is the best strategy.<p>This is presumably also a simple strategy for detecting AI content in general - see how many "high temperature" choices it makes.
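If the distractors really are LLM samples, the strategy could even be automated as the mirror image of the perplexity trick; a sketch under that assumption, with gpt2 as a stand-in scorer.<p><pre><code>
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # small stand-in model
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def loss(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        return lm(ids, labels=ids).loss.item()

def pick_outlier(prompt: str, choices: list[str]) -> str:
    # Take the HIGHEST loss, i.e. the word the model finds least likely.
    return max(choices, key=lambda c: loss(prompt + " " + c))
</code></pre>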
> You scored 11/15. The best language model, llama-2-7b, scored 10/15.<p>I see that you get a random quiz every time, so results aren't comparable between people. I think I got an easy one. Neat game! If you could find a corpus that makes it easy for average humans to beat the LLMs, and add some nice design, maybe Wordle-style daily challenge plus social sharing etc, I could see it going viral just as a way for people to "prove" that they are "smarter" than AI.
Got 8/15, best AI model got 7/15, and unigram got 1/15.<p>Finally a use for all the wasted hours I’ve spent on HN — my next word prediction is marginally better than that of the AI.
This is the best interactive website about LLMs at a meta level (so excluding prompt interfaces for actual AIs) that I've seen so far.<p>Quizzes can be magical.<p>Haven't seen any cooler new language-related interactive fun-project on the web since:<p><a href="https://wikispeedruns.com/" rel="nofollow">https://wikispeedruns.com/</a><p>It would be great if the quiz included an intro or note about the training data, but as-is it also succeeds because it's obvious from the quiz prompts/questions that they're related to HN comments.<p>Sharing this with a general audience could spark funny discussions about bubbles and biases :)
I don't quite understand: what makes "Okay I've" more correct than "Okay so"? No meaningful context was provided here, so how do we know "Okay I've" was at all meaningfully correct?<p>For the longer comments I understand, but for the ones where the prompt is one or two words and many of the options are correct English phrases, I don't understand why there's a bias towards one. Wouldn't we need a prompt here?<p>Also, I got bored halfway through and selected "D" for all of them.
If the samples came from HN, I wonder how likely it is that the text is already part of a dataset (i.e., a Common Crawl snapshot), so that the LLMs have already seen them?<p>edit: judging from the comments I saw, they were all quite recent, so I guess this isn't happening. Though I do know that ChatGPT can sometimes use a Bing search tool during chats, which can actually link to recently indexed text, but I highly doubt that the gpt-4o-mini API model is doing that.
Related:<p><i>Who's Smarter: AI or a 5-Year-Old?</i><p><a href="https://nautil.us/whos-smarter-ai-or-a-5-year-old-776799/" rel="nofollow">https://nautil.us/whos-smarter-ai-or-a-5-year-old-776799/</a><p>(<a href="https://news.ycombinator.com/item?id=41263363">https://news.ycombinator.com/item?id=41263363</a>)
This is just a test of how likely you are to generate the same word <i>as the LLM</i>. The LLM does not produce the "correct" next word as there are multiple correct words that fit grammatically and can be used to continue the sentence while maintaining context.<p>I don't see what this has to do with being "smarter" than anything. Example:<p>1. I see a business decision here.
Arm cores have licensing fees attached to them.
Arm is becoming ____<p>a) ether<p>b) a<p>c) the<p>d) more<p>But who's to say which is "correct"? Arm is becoming a household name. Arm is becoming the premier choice for new CPU architectures. Arm is becoming more valuable by the day. Any of b), c), or d) are equally good choices. What is there to be gained in divining which one the LLM would pick?
For anyone else daring the full 100-question quiz: you need to get at least a third right to be considered better than guessing by traditional statistical standards. (You'd need more than half to be better than the LLMs.)
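A quick back-of-the-envelope check of that threshold (my own calculation, using a one-sided binomial test at p &lt; 0.05):<p><pre><code>
from scipy.stats import binom

n, p = 100, 0.25  # 100 questions, four choices each
for k in range(n + 1):
    # binom.sf(k - 1, n, p) is P(X >= k) under pure guessing.
    if binom.sf(k - 1, n, p) < 0.05:
        print(k)  # 33, i.e. about a third
        break
</code></pre>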
You scored 6/15. The best language model, gpt-4o, scored 6/15. The unigram model, which just picks the most common word without reading the prompt, scored 2/15.<p>Keep in mind that you took 204 seconds to answer the questions, whereas the slowest language model, llama-3-8b, took only 10 seconds!
Was mine broken? One of my prompts was just '>'. So of course I guessed a random word. The answer key showed I got it wrong, but showed the right answer inserted into a longer prompt. Or is that how it's supposed to work?
This isn't really the challenge (loss function) that language models are trained on. It's not a simple next-word challenge; they get more context. See how BERT was trained for a reference.
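For contrast, here is what BERT-style masked prediction looks like: the model sees context on both sides of the blank, unlike this quiz (an illustration using the transformers fill-mask pipeline, not anything from the site):<p><pre><code>
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
# BERT conditions on text BOTH before and after the masked word.
for pred in fill("The capital of France is [MASK], which I visited last year."):
    print(pred["token_str"], round(pred["score"], 3))
</code></pre>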
Like an ML model, I would prefer being scored with cross-entropy and not right/wrong. Like, I might guess wrong, but it might not be that far off in likelihood.
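A sketch of that scoring scheme (my illustration): charge each player the negative log of the probability they assigned to the true word, so a confident near-miss costs little and a confident blunder costs a lot.<p><pre><code>
import math

def cross_entropy_score(probs: dict[str, float], correct: str) -> float:
    """Lower is better; zero only for full confidence in the truth."""
    return -math.log(probs[correct])

# Guessed "a" but still put 30% on the true answer "b": a mild penalty.
print(cross_entropy_score({"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}, "b"))
</code></pre>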
So... If I picked the same results, in the same timeframe... And I don't think glue should go on pizza... Does that mean LLMs are completely useless to me?
I like the website, but it could be a bit more explicit about the point it's trying to make. Given that a lot of people tend to think of an LLM as somehow a thinking entity rather than a statistical model for guessing the most likely next word, most will probably look at these questions and think the website is broken.
Of course not, but that does not mean LLMs will lead to AGI. We might never build AGI in fact: <a href="https://www.lycee.ai/blog/why-no-agi-openai" rel="nofollow">https://www.lycee.ai/blog/why-no-agi-openai</a>
>the quintessential language model task of predicting the next word?<p>Based on what? The whole test is flawed because of this. Even different LLMs would choose different answers and there's no objective argument to make for which one is the best.
> 8. All of local politics in the muni I live in takes place in a forum like this, on Facebook[.]
The electeds in our muni post on it; I've gotten two different local laws done by posting there (and I'm working on a bigger third); I met someone whose campaign I funded and helped run who is now a local elected. It is crazy to think you can HN-effortpost your way to changing the laws of the place you live in but I'm telling you right now that you can.<p>This is a magical experience. I've done something similar in my university's CS department when I pointed out how the learning experience in the first programming course varies too much depending upon who the professor is.<p>I've never experienced this anywhere else. American politicians at all levels don't appear to be the least bit responsive to the needs and issues of anyone but the wealthy and powerful.
7/15, 90 seconds.
I'll blame it on the fact that I'm not a native English speaker, right? Right?<p>On a more serious note, it was a cool thing to go through! It seemed like something that should have been so easy at first glance.
I think this is a good joke on the naysayers. But if the author is here, I would like clarification: is the user picking the next token or the next word? Because if it's the latter, I think this test is invalid.
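The distinction matters because one word is often several tokens. A quick illustration with tiktoken (my sketch; I don't know which tokenizer the site actually uses):<p><pre><code>
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4-family encoding
for word in ["the", "perplexity", "effortpost"]:
    ids = enc.encode(word)
    print(word, "->", len(ids), "token(s):", ids)
</code></pre>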
Everything I picked was grammatically correct, so I don't see the point. Is the point of a "language model" just to recall people's comments from the internet now?
5/15, so the same as choosing the most common word.<p>I think I did worse when the prompt was shorter. It just becomes a guessing game then, and I find myself thinking more like a language model.
The LLMs are better than me at knowing the finer probabilities of next words, and worse than me at guessing the points being made and reasoning about that.
Is this with the “temperature” parameter set to 0? Most LLM chatbots set it to something higher.<p>It would be interesting to try varying it, as well as the seed.
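For reference, temperature just rescales the logits before the softmax, so T=0 collapses to argmax while higher T flattens the distribution (a toy sketch, not the site's code):<p><pre><code>
import math

def softmax_with_temperature(logits: list[float], t: float) -> list[float]:
    if t == 0:  # deterministic: all mass on the argmax
        probs = [0.0] * len(logits)
        probs[max(range(len(logits)), key=lambda i: logits[i])] = 1.0
        return probs
    exps = [math.exp(x / t) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

print(softmax_with_temperature([2.0, 1.0, 0.5], 1.0))  # spread out
print(softmax_with_temperature([2.0, 1.0, 0.5], 0.0))  # [1.0, 0.0, 0.0]
</code></pre>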
This is a nonsense test. There is no context, so the 'next' word after the single word 'The' is effectively random.<p>I'm pretty certain that LLMs are unable to work at all without context.
7/10. This is more about set shattering than 'smarts'.<p>LLMs are effectively DAGs: they literally have to unroll infinite possibilities, in the absence of larger context, into finite options.<p>You can unroll a cyclic graph into a DAG, but you constrict the solution space.<p>Take the spoken sentence:<p>"I never said she stole my money"<p>and say it multiple times with emphasis on a different word each time, and notice how the meaning changes.<p>That is text being a forgetful functor.<p>Since you can describe PAC learning as compression, which is exactly equivalent to the finite set shattering above, you can assign probabilities to next tokens.<p>But that is existential quantification, limited by your corpus and based on pattern matching and finding.<p>I guess if "smart" is defined as pattern matching and finding, it would apply.<p>But this is exactly why there was a split between symbolic AI, which targeted universal quantification, and statistical learning, which targets existential quantification.<p>Even if ML had never been invented, I would assume that there were mechanical methods to stack-rank next tokens from a corpus (see the sketch below).<p>This isn't a case of 'smarter', just different. Whether that difference is meaningful depends on context.
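That "mechanical method" does exist: a plain bigram count over a corpus stack-ranks next words with no learning at all (my sketch):<p><pre><code>
from collections import Counter, defaultdict

def bigram_ranker(corpus: str):
    """Rank next words by raw co-occurrence counts in the corpus."""
    words = corpus.lower().split()
    following = defaultdict(Counter)
    for w, nxt in zip(words, words[1:]):
        following[w][nxt] += 1
    return lambda prev: [w for w, _ in following[prev.lower()].most_common()]

rank = bigram_ranker("the cat sat on the mat and the cat slept")
print(rank("the"))  # ['cat', 'mat'] -- most frequent continuation first
</code></pre>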