It's a neat idea, though not what I expected from the title talking about "smart" :)<p>You might want to replace the single-page format with showing just one question at a time, giving instant feedback after each answer.<p>First, it'd be more engaging. Even the small version of the quiz is a bit long for something where you don't know what the payoff will be. Second, you'd get to see the correct answer while still having the context on why you replied the way you did.
I made a little game/quiz where you try to guess the next word in a bunch of Hacker News comments and compete against various language models. I used llama2 to generate three alternative completions for each comment, creating a multiple-choice question. For the local language models that you are competing against, I consider them to have picked the answer with the lowest total perplexity of prompt + answer. I am able to replicate this behavior with the OpenAI models by setting a logit_bias that limits the LLM to picking only one of the allowed answers. I tried just giving the full multiple-choice question as a prompt and having it pick an answer, but that led to really poor results. So I'm not able to compare with Claude or any online LLMs that don't support logit_bias.<p>I wouldn't call the quiz fun exactly. After playing with it a lot, I think I've been able to consistently get above 50% of questions right. I have slowed down a lot answering each question, which I think LLMs have trouble doing.
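For anyone curious what the perplexity-based selection might look like, here is a minimal sketch using Hugging Face transformers; the model name and scoring details are my assumptions, not the author's exact code. The logit_bias trick would be the OpenAI-side analogue: bias the logits so only the candidate answers remain available.<p><pre><code>
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed model; any causal LM works the same way.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf").eval()

def sequence_loss(text: str) -> float:
    """Average per-token cross-entropy of the sequence under the model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels=input_ids makes the model return its LM loss.
        return model(ids, labels=ids).loss.item()

def pick_answer(prompt: str, choices: list[str]) -> str:
    # Lowest loss == lowest perplexity, since perplexity = exp(loss).
    return min(choices, key=lambda c: sequence_loss(prompt + " " + c))
</code></pre>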
<p><pre><code> you: 4/15
gpt-4o: 0/15
gpt-4: 1/15
gpt-4o-mini: 2/15
llama-2-7b: 2/15
llama-3-8b: 3/15
mistral-7b: 4/15
unigram: 1/15
</code></pre>
Seems like none of us is really better than random guessing (with four choices, chance works out to about 4/15), so I'd wager that you cannot accurately predict the next word from the given information.<p>If one could instead sort the answers by likelihood and get scored based on how highly one ranked the correct answer, things would probably look better than random.<p>Also, I wonder how these LLMs were prompted. Were they just used to complete the text, or were they put in a "mood" where they would try to complete the text in the original author's voice?<p>Obviously, as a human, I'd try to put myself in the author's head and emulate their way of speaking, whereas an LLM might just complete things in its default voice.
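A minimal sketch of that rank-based scoring idea (my illustration, not something the site implements): order the choices by how likely you think they are, then score by where the truth lands.<p><pre><code>
def rank_score(ranked_choices: list[str], correct: str) -> int:
    """3 points if the correct word is ranked first, 0 if ranked last."""
    return len(ranked_choices) - 1 - ranked_choices.index(correct)

# Ranked the correct choice "c" second out of four -> 2 points.
print(rank_score(["a", "c", "b", "d"], "c"))
</code></pre>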
Nice. I found you can beat this by picking the word least likely to be selected by a language model, because it seems like the alternative choices are generated by an LLM. "Pick the outlier" is the best strategy.<p>This is presumably also a simple strategy for detecting AI content in general - see how many "high temperature" choices it makes.
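If the distractors really are LLM samples, the strategy could even be automated as the mirror image of the perplexity trick; a sketch under that assumption, with gpt2 as a stand-in scorer.<p><pre><code>
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # small stand-in model
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def loss(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        return lm(ids, labels=ids).loss.item()

def pick_outlier(prompt: str, choices: list[str]) -> str:
    # Take the HIGHEST loss, i.e. the word the model finds least likely.
    return max(choices, key=lambda c: loss(prompt + " " + c))
</code></pre>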
> You scored 11/15. The best language model, llama-2-7b, scored 10/15.<p>I see that you get a random quiz every time, so results aren't comparable between people. I think I got an easy one. Neat game! If you could find a corpus that makes it easy for average humans to beat the LLMs, and add some nice design, maybe Wordle-style daily challenge plus social sharing etc, I could see it going viral just as a way for people to "prove" that they are "smarter" than AI.
Got 8/15, best AI model got 7/15, and unigram got 1/15.<p>Finally a use for all the wasted hours I’ve spent on HN — my next word prediction is marginally better than that of the AI.
This is the best interactive website about LLMs at a meta level (so excluding prompt interfaces for actual AIs) that I've seen so far.<p>Quizzes can be magical.<p>Haven't seen any cooler new language-related interactive fun-project on the web since:<p><a href="https://wikispeedruns.com/" rel="nofollow">https://wikispeedruns.com/</a><p>It would be great if the quiz included an intro or note about the training data, but as-is it also succeeds because it's obvious from the quiz prompts/questions that they're related to HN comments.<p>Sharing this with a general audience could spark funny discussions about bubbles and biases :)
I don't quite understand: what makes "Okay I've" more correct than "Okay so"? No meaningful context was provided here, so how do we know "Okay I've" was at all meaningfully correct?<p>For the longer comments I understand, but for the ones where the prompt is one or two words and many of the options are correct English phrases, I don't understand why there's a bias towards one. Wouldn't we need a prompt here?<p>Also, I got bored halfway through and selected "D" for all of them.
If the samples came from HN, I wonder how likely it is that the text is already part of a dataset (i.e., a Common Crawl snapshot), so that the LLMs have already seen them?<p>edit: judging from the comments I saw, they were all quite recent, so I guess this isn't happening. Though I do know that ChatGPT can sometimes use a Bing search tool during chats, which can actually link to recently indexed text, but I highly doubt that the gpt-4o-mini API model is doing that.
Related:<p><i>Who's Smarter: AI or a 5-Year-Old?</i><p><a href="https://nautil.us/whos-smarter-ai-or-a-5-year-old-776799/" rel="nofollow">https://nautil.us/whos-smarter-ai-or-a-5-year-old-776799/</a><p>(<a href="https://news.ycombinator.com/item?id=41263363">https://news.ycombinator.com/item?id=41263363</a>)
This is just a test of how likely you are to generate the same word <i>as the LLM</i>. The LLM does not produce the "correct" next word as there are multiple correct words that fit grammatically and can be used to continue the sentence while maintaining context.<p>I don't see what this has to do with being "smarter" than anything. Example:<p>1. I see a business decision here.
Arm cores have licensing fees attached to them.
Arm is becoming ____<p>a) ether<p>b) a<p>c) the<p>d) more<p>But who's to say which is "correct"? Arm is becoming a household name. Arm is becoming the premier choice for new CPU architectures. Arm is becoming more valuable by the day. Any of b), c), or d) are equally good choices. What is there to be gained in divining which one the LLM would pick?
For anyone else daring the full 100-question quiz: you need to get at least a third right to be considered better than guessing by traditional statistical standards. (You'd need more than half to be better than the LLMs.)
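A quick back-of-the-envelope check of that threshold (my own calculation, using a one-sided binomial test at p &lt; 0.05):<p><pre><code>
from scipy.stats import binom

n, p = 100, 0.25  # 100 questions, four choices each
for k in range(n + 1):
    # binom.sf(k - 1, n, p) is P(X >= k) under pure guessing.
    if binom.sf(k - 1, n, p) < 0.05:
        print(k)  # 33, i.e. about a third
        break
</code></pre>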
You scored 6/15. The best language model, gpt-4o, scored 6/15. The unigram model, which just picks the most common word without reading the prompt, scored 2/15.<p>Keep in mind that you took 204 seconds to answer the questions, whereas the slowest language model, llama-3-8b, took only 10 seconds!
Was mine broken? One of my prompts was just '>'. So of course I guessed a random word. The answer key showed I got it wrong, but showed the right answer inserted into a longer prompt. Or is that how it's supposed to work?
This isn't really the challenge (loss function) that language models are trained on. It's not a simple next-word challenge; they get more context. See how BERT was trained for a reference.
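For contrast, here is what BERT-style masked prediction looks like: the model sees context on both sides of the blank, unlike this quiz (an illustration using the transformers fill-mask pipeline, not anything from the site):<p><pre><code>
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
# BERT conditions on text BOTH before and after the masked word.
for pred in fill("The capital of France is [MASK], which I visited last year."):
    print(pred["token_str"], round(pred["score"], 3))
</code></pre>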
Like an ML model, I would prefer being scored with cross-entropy and not right/wrong. Like, I might guess wrong, but it might not be that far off in likelihood.
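A sketch of that scoring scheme (my illustration): charge each player the negative log of the probability they assigned to the true word, so a confident near-miss costs little and a confident blunder costs a lot.<p><pre><code>
import math

def cross_entropy_score(probs: dict[str, float], correct: str) -> float:
    """Lower is better; zero only for full confidence in the truth."""
    return -math.log(probs[correct])

# Guessed "a" but still put 30% on the true answer "b": a mild penalty.
print(cross_entropy_score({"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}, "b"))
</code></pre>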
So... If I picked the same results, in the same timeframe... And I don't think glue should go on pizza... Does that mean LLMs are completely useless to me?
I like the website, but it could be a bit more explicit about the point it's trying to make. Given that a lot of people tend to think of an LLM as somehow a thinking entity rather than a statistical model for guessing the most likely next word, most will probably look at these questions and think the website is broken.
Of course not, but that does not mean LLMs will lead to AGI. We might never build AGI in fact: <a href="https://www.lycee.ai/blog/why-no-agi-openai" rel="nofollow">https://www.lycee.ai/blog/why-no-agi-openai</a>
>the quintessential language model task of predicting the next word?<p>Based on what? The whole test is flawed because of this. Even different LLMs would choose different answers and there's no objective argument to make for which one is the best.
> 8. All of local politics in the muni I live in takes place in a forum like this, on Facebook[.]
The electeds in our muni post on it; I've gotten two different local laws done by posting there (and I'm working on a bigger third); I met someone whose campaign I funded and helped run who is now a local elected. It is crazy to think you can HN-effortpost your way to changing the laws of the place you live in but I'm telling you right now that you can.<p>This is a magical experience. I've done something similar in my university's CS department when I pointed out how the learning experience in the first programming course varies too much depending upon who the professor is.<p>I've never experienced this anywhere else. American politicians at all levels don't appear to be the least bit responsive to the needs and issues of anyone but the wealthy and powerful.
7/15, 90 seconds.
I'll blame it on the fact that I'm not a native English speaker, right? Right?<p>On a more serious note, it was a cool thing to go through! It seemed like something that should have been so easy at first glance.
I think this is a good joke on the naysayers. But if the author is here, I would like clarification: is the user picking the next token or the next word? Because if it's the latter, I think this test is invalid.
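The distinction matters because one word is often several tokens. A quick illustration with tiktoken (my sketch; I don't know which tokenizer the site actually uses):<p><pre><code>
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4-family encoding
for word in ["the", "perplexity", "effortpost"]:
    ids = enc.encode(word)
    print(word, "->", len(ids), "token(s):", ids)
</code></pre>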
Everything I picked was grammatically correct, so I don't see the point. Is the point of a "language model" just to recall people's comments from the internet now?
5/15, so the same as choosing the most common word.<p>I think I did worse when the prompt was shorter. It just becomes a guessing game then, and I find myself thinking more like a language model.
The LLMs are better than me at knowing the finer probabilities of next words, and worse than me at guessing the points being made and reasoning about that.
Is this with the “temperature” parameter set to 0? Most LLM chatbots set it to something higher.<p>It would be interesting to try varying it, as well as the seed.
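For reference, temperature just rescales the logits before the softmax, so T=0 collapses to argmax while higher T flattens the distribution (a toy sketch, not the site's code):<p><pre><code>
import math

def softmax_with_temperature(logits: list[float], t: float) -> list[float]:
    if t == 0:  # deterministic: all mass on the argmax
        probs = [0.0] * len(logits)
        probs[max(range(len(logits)), key=lambda i: logits[i])] = 1.0
        return probs
    exps = [math.exp(x / t) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

print(softmax_with_temperature([2.0, 1.0, 0.5], 1.0))  # spread out
print(softmax_with_temperature([2.0, 1.0, 0.5], 0.0))  # [1.0, 0.0, 0.0]
</code></pre>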
This is a nonsense test. There is no context, so the 'next' word after the single word 'The' is effectively random.<p>I'm pretty certain that LLMs are unable to work at all without context.
7/10. This is more about set shattering than 'smarts'.<p>LLMs are effectively DAGs: they literally have to unroll infinite possibilities, in the absence of larger context, into finite options.<p>You can unroll a cyclic graph into a DAG, but you constrict the solution space.<p>Take the spoken sentence:<p>"I never said she stole my money"<p>and say it multiple times with emphasis on a different word each time, and notice how the meaning changes.<p>That is text being a forgetful functor.<p>Since you can describe PAC learning as compression, which is exactly equivalent to the finite set shattering above, you can assign probabilities to next tokens.<p>But that is existential quantification, limited by your corpus and based on pattern matching and finding.<p>I guess if "smart" is defined as pattern matching and finding, it would apply.<p>But this is exactly why there was a split between symbolic AI, which targeted universal quantification, and statistical learning, which targets existential quantification.<p>Even if ML had never been invented, I would assume that there were mechanical methods to stack-rank next tokens from a corpus (see the sketch below).<p>This isn't a case of 'smarter', just different. Whether that difference is meaningful depends on context.
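That "mechanical method" does exist: a plain bigram count over a corpus stack-ranks next words with no learning at all (my sketch):<p><pre><code>
from collections import Counter, defaultdict

def bigram_ranker(corpus: str):
    """Rank next words by raw co-occurrence counts in the corpus."""
    words = corpus.lower().split()
    following = defaultdict(Counter)
    for w, nxt in zip(words, words[1:]):
        following[w][nxt] += 1
    return lambda prev: [w for w, _ in following[prev.lower()].most_common()]

rank = bigram_ranker("the cat sat on the mat and the cat slept")
print(rank("the"))  # ['cat', 'mat'] -- most frequent continuation first
</code></pre>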