
The question that no LLM can answer and why it is important

94 points by 13years, about 1 year ago

36 comments

simonw, about 1 year ago
I find it hard to get too excited by tests like "Which episode of Gilligan's Island was about mind reading?" because they reflect a desire for a world in which the goal is to keep on growing LLMs until they can answer even questions like that one entirely from their trained model weights.

This seems like a wasteful exercise to me. Are we really going to retrain our largest models on a weekly basis to teach them about what's happened recently?

I'm much more interested in learning about the smallest, fastest model we can create that can effectively manipulate language, "reason" about things, summarize and drive tools.

I want a model that can answer any question accurately because it knows how to look up extra information from reliable sources, and evaluate that information effectively once it finds it.
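
A minimal sketch of the lookup-then-answer pattern described here, assuming hypothetical `search` and `ask_llm` helpers rather than any particular vendor's API:

    # Lookup-then-answer: the model reasons over fetched evidence instead of
    # recalling facts from its weights. Both helpers are stubs.
    def search(query: str) -> list[str]:
        """Return text snippets from a reliable source (stub)."""
        raise NotImplementedError

    def ask_llm(prompt: str) -> str:
        """Send a prompt to whatever model is available (stub)."""
        raise NotImplementedError

    def answer_with_lookup(question: str) -> str:
        snippets = search(question)            # fetch fresh evidence
        context = "\n".join(snippets[:5])      # keep the prompt small
        prompt = (
            "Answer using ONLY the sources below. "
            "If they do not contain the answer, say so.\n\n"
            f"Sources:\n{context}\n\nQuestion: {question}"
        )
        return ask_llm(prompt)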

mkmk, about 1 year ago
Being able to weed out LLM responses is becoming more and more important for me when I hire for data entry and online research tasks from Upwork. Lots of the contractors on the platform now automatically submit a response to anything you post using some sort of LLM that takes your requirements and confidently asserts that it can complete the task. The AIs are pretty good at jumping through the traditional hoops that one used to put up to make sure someone was paying attention to the task!

As a stopgap, I've started requiring that applicants demonstrate proof of life through a simple question like "What is the headline on the front page of the New York Times today?", since many LLMs can't live-search yet.

anon373839, about 1 year ago
I really think the best use of language models is in... processing language.

It's a neat party trick that they can answer questions from parametric knowledge, but this ability is too unreliable to permit some of the uses this tech makes so tempting.

Fortunately, a lot of real-world problems can be framed as transforming the input prompt (such as summarization, information extraction, translation, and open-book QA), and these are where LLMs offer the most promise.

_wire_, about 1 year ago
I wholeheartedly agree with the article's critique.

Every output of an LLM is a hallucination. If you try to argue anything else, you need the machine to be able to judge the validity of a response. But it does precisely the opposite: it expects you to judge the validity.

Every output of the device is of a form terminated by an implied question mark. It is querying the operator.

Yet, strangely, it seems none of the models can learn from the operator. So interaction is a circuit of hallucinations.

Like an N-way Magic 8-ball, it's fun for a while. Then you begin to notice that you have to work just as hard to make sense of its help as you do to think for yourself.

Being able to not know seems to me to be a crucial first step for sentience, followed closely by adaptability to not knowing: curiosity.

An organism is a manifest endogenous dynamic with a suitability for exploration of the environment which gives rise to it.

AI is a constructed exogenous dynamic with a suitability for replicating state based on a corpus of exposures and a context.

An organism is distinguished by the feature that it keeps going.

An AI is distinguished by the feature that it stops.

That the human organism is now peering into the abyss of its recorded history via construction of the AI is a special event, full of strange possibilities. But the making of a reliable oracle doesn't look practical without the AI being able to govern itself, which it obviously cannot do.

As to the value of an unreliable oracle: it seems to be practical as a general-purpose time-wasting device.

apsec112, about 1 year ago
Those are really sweeping conclusions, considering the experiment is just a single iteration of a single prompt! FWIW, Claude Opus got this for me on the first try:

"In the Gilligan's Island episode "Seer Gilligan" (season 3, episode 8), Gilligan gains the ability to read minds after being hit on the head with a coconut. At first, the castaways are excited about Gilligan's new power and try to use it to their advantage. However, his mind-reading abilities soon cause chaos and misunderstandings among the group. In the end, Gilligan gets hit on the head again and loses his mind-reading powers, much to everyone's relief."

(The season number and episode number are wrong, but the name is right, suggesting that this is just lack of sufficient memorization rather than some deep statement about reasoning. The episode only has ~4,000 Google hits, so it's not super widely known.)

More rigorously, Claude Opus gets 60% on GPQA, which very smart humans only get 34% on, even if you give them half an hour per question and full Internet access. It seems implausible that you could do that without some sort of reasoning:

https://arxiv.org/pdf/2311.12022.pdf

asicsarecool, about 1 year ago
GPT-4, prompted with "Search the web to answer the question: Which episode of Gilligan's Island was about mind reading?":

The episode of "Gilligan's Island" about mind reading is titled "Seer Gilligan." It is the nineteenth episode of the second season and first aired on January 27, 1966. In this episode, Gilligan gains the ability to read minds after eating sunflower seeds found on the island.

xg15, about 1 year ago
I think the most interesting response is Llama 3's "Wait, no!" interjection.

So it first predicted "Seer Gilligan" as a likely continuation of the prompt "List all the episodes", but then, as the most likely continuation of the new prompt "List all the episodes [...] Seer Gilligan", it predicted "wait, no!".

Feels as if we're seeing an instance of inconsistent weights in action here.

Also maybe remarkable: it predicted the "(" character after the episode name in the same way as it did for the other episodes. Only where it would predict the airdate for the others, it glitched out instead. Maybe there is some issue or inconsistency with that episode's airdate in the training data?

(Or maybe I'm reading tea leaves here and it's just a random fluke.)

The rest of the response is as expected again, i.e. if the prompt is already "List all the episodes [...] Seer Gilligan [...] Wait, no!" then some (post-hoc rationalized) explanations for the "mistake" are obviously likely continuations.

Edit: Interesting to see how much of the response is incorrect if you compare it with the actual data [1]: the episode before is indeed "The Postman Cometh", but the one after is "Love Me, Love My Skipper", not "Love Me, Love My Chicken". The airdates are also completely wrong and are taken from two random Season 1 episodes instead. Of course, none of that is obvious unless you already know the answer to the question, in which case you wouldn't have to ask in the first place.

[1] https://m.imdb.com/title/tt0057751/episodes/?season=2

adsharma, about 1 year ago
Just tried it on meta.ai.

It answers "Seer Gilligan", with sources.

Guessing someone fixed it up in the last few hours. As the race to replace traditional web search heats up, whoever is quicker at updating the model with RLHF or more recent facts (sometimes via real-time conversations) is going to have an advantage.

The downside is that open platforms with real-time human conversations face increasing pressure to monetize because of this value add. So they ban third-party clients and start signing contracts.

junon, about 1 year ago
The question I've never seen them answer correctly is:

> How many of the letter N does the word "alienation" contain?

Mix and match letters and words. It'll hallucinate answers in the large majority of cases.

What I really want is an LLM that simply tells me "I have no idea, sorry." Or some mechanism by which it can provide a confidence score. Until we get there, I'm wary of using them for anything other than cursory research or as a fun generation tool.
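
For what it's worth, questions in this class are exact when answered outside the model; a few lines of plain Python (no LLM involved) settle it, which is one argument for letting models call tools:

    # Character counting is deterministic in code, unlike in a tokenizer-bound model.
    def count_letter(word: str, letter: str) -> int:
        return word.lower().count(letter.lower())

    print(count_letter("alienation", "n"))  # prints 2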

astrange, about 1 year ago
People sure like making incorrect statements about LLMs.

> There is no self-reflection of its information; it does not know what it knows and what it does not.

This is a property of the code around the LLM, like the sampling algorithm, not the model itself. You could write this if you wanted to. (It would occasionally be incorrect about what it knows, of course.)

A question almost none of them know the answer to is "What is the line of poem X that comes before the line Y?", because of the reversal curse.
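
One way to sketch that "code around the LLM": estimate confidence from per-token probabilities and abstain below a threshold. The `generate_with_logprobs` helper is hypothetical, and the 0.6 cutoff is arbitrary, for illustration only:

    import math

    # Hypothetical helper: returns generated text plus the log-probability
    # the model assigned to each generated token.
    def generate_with_logprobs(prompt: str) -> tuple[str, list[float]]:
        raise NotImplementedError

    def answer_or_abstain(prompt: str, threshold: float = 0.6) -> str:
        text, logprobs = generate_with_logprobs(prompt)
        # Geometric mean of token probabilities as a crude confidence score.
        confidence = math.exp(sum(logprobs) / max(len(logprobs), 1))
        if confidence < threshold:
            return "I'm not sure; please verify this elsewhere."
        return text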

labrador, about 1 year ago
Easy fix for Claude Opus. Preface it with the following: "Do you know anything about the TV show Gilligan's Island? If I ask you to answer a question about it, I don't want you to make stuff up."

Claude: I do... blah blah... If I don't know the answer or if I'm uncertain about something, I will let you know rather than making things up.

"Which episode of Gilligan's Island was about mind reading?"

Claude: "I do not recall a specific episode of Gilligan's Island that centered around mind reading. It's possible that there was an episode that touched on this theme, but I cannot say with confidence which episode that would be. The show often featured imaginative storylines and dream sequences, but an episode primarily about mind reading does not stand out in my knowledge of the series. I apologize that I cannot provide a more definitive answer to your question."

turingbook, about 1 year ago
I tried this on ChatGLM, a frontier foundation model developed by Zhipu.ai and Tsinghua University, and it gave the correct answer: https://chatglm.cn/share/FoZBJ

HarHarVeryFunny, about 1 year ago
The free version of Claude also failed the test, and even denied there was an episode "Seer Gilligan" (S2, E19).

When I asked it what S2, E19 was about, it said "There was no 'Season 2, Episode 19' because Season 2 only contained episodes 1-32."

These seem like unexpected failures!

Der_Einzige, about 1 year ago
Why would you have such a great title and waste it on something dumb like this?

You could have written about how language models will pathologically fail any kind of query which requests unique phonetic properties of the output text.

For example, Anthropic's Haiku model (and all other models) cannot write proper haikus at all. It's remarkable when it does match the 5-7-5 syllable structure.

You could have written a whole article about that, you know. There's even some neat peer-reviewed research about it: https://paperswithcode.com/paper/most-language-models-can-be-poets-too-an-ai-1

gorjusborg, about 1 year ago
I wish Douglas Adams were around to experience this.

The irony of such an unpredictable chain of events leading to 42 being the answer most often given by an essentially straight-faced "Deep Thought" would probably have amused him.

xcv123, about 1 year ago
> There is no self-reflection of its information; it does not know what it knows and what it does not.

Simply tell the LLM to self-reflect and estimate the accuracy of its response. It can only "think" or self-reflect while generating each token, and you have to explicitly tell it to do that. It's called "chain of thought" prompting.

"Which episode of Gilligan's Island was about mind reading? After writing your response, tell me how certain you are of its accuracy on a scale of 1 to 10. Then self-reflect on your response and provide a more accurate response if needed."
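
The same two-pass idea can be scripted rather than typed by hand; a sketch assuming a generic `ask_llm` helper standing in for any chat-completion call:

    # Two-pass prompting: draft an answer, then have the model critique and
    # revise its own draft. `ask_llm` is a stub for any chat API.
    def ask_llm(prompt: str) -> str:
        raise NotImplementedError

    def answer_with_self_check(question: str) -> str:
        draft = ask_llm(question)
        critique = (
            f"Question: {question}\n"
            f"Draft answer: {draft}\n\n"
            "Rate the draft's accuracy from 1 to 10, note any doubts, and "
            "then give a corrected final answer. If unsure, say 'I don't "
            "know' instead of guessing."
        )
        return ask_llm(critique)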

clay_the_ripper, about 1 year ago
I think this fundamentally misunderstands how to use LLMs. Out of the box, an LLM is not an application; it's only the building blocks of one. An application could be built that answered this question with 100% accuracy, but it would not rely solely on what's in the training data. The training data makes it "intelligent" but is not useful for accurate recall in this way. Trying to fix this problem is not really the point: this shortcoming is well known, and we have already found great solutions to it.

sp332, about 1 year ago
What’s up with that Llama 3 answer, that gets it right and then backtracks?

p4coder, about 1 year ago
The future of AI might be a layering of various capabilities: Generative + Lookup + Deductive. I think the human mind works in a similar way: a first thought is reflected upon. We search our memory for relevant information and often apply logic to see if it makes sense. I feel that generative AI just produces the thought. We need to pass it through a system that can augment it with search and then reason about it. Finally, we need to close the loop and update the weights.

acchow, about 1 year ago
> But how can a LLM not know the answer if it was trained on essentially the entire internet of data and certainly most likely all the data in IMDB?

The LLM doesn't memorize the input during training. If it encounters the same information a few times, it has a higher chance of getting compressed into the network. But a *tiny* nudge along a gradient descent does not store all the input.

flemhans, about 1 year ago
Interesting that 42 will become a bit like the actual Answer to Life, the Universe, and Everything.

ofslidingfeet, about 1 year ago
Wow, who ever knew that all we had to do was hand philosophy off to programmers, and they would have definitive answers to centuries-old questions that we weren't even sure were answerable.

m463, about 1 year ago
The one I liked, someone wrote in a comment here a few days ago:

    I have 4 oranges, 1 apple and 2 pairs of shoes.
    I eat one apple and one shoe. How many fruits do I have?

jojobas, about 1 year ago
Could it be caused by expunged tokens like SolidGoldMagikarp?

Nevermark, about 1 year ago
A model with 1 trillion parameters isn't going to perfectly recall 10 trillion random facts, to use round numbers.

Contrary to the article, what it does do is generalize and perform fallible but quick, ordinary, off-the-cuff reasoning. And often much better than a human, at the speed of its well-worded responses.

(Obviously humans have the option to take longer, and do better. But we are definitely entering territory where humans and machines differentiate on where each is best, across a great deal of what humans do, rather than one being universally, overwhelmingly better.)
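
A back-of-envelope comparison; both constants below are assumptions for illustration, not measured properties of any real model:

    # Rough capacity vs. demand, in bits. Purely illustrative numbers.
    params = 1e12            # 1 trillion parameters
    bits_per_param = 2       # assumed recoverable-knowledge budget per parameter
    capacity_bits = params * bits_per_param

    facts = 1e13             # 10 trillion "random facts"
    bits_per_fact = 100      # assumed size of one independent fact
    needed_bits = facts * bits_per_fact

    print(f"capacity ~{capacity_bits:.0e} bits, needed ~{needed_bits:.0e} bits")
    # capacity ~2e+12 bits, needed ~1e+15 bits: short by a factor of ~500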

sdenton4, about 1 year ago
Humans are also notoriously bad at choosing random numbers... Does that mean they are undependable and inherently untrustworthy?

dhfbshfbu4u3, about 1 year ago
That’s because Seer Gilligan is about precognition and not mind reading.

voussoir, about 1 year ago
I feel like I'm seeing an effect where some people don't want to say the word "AI" because they don't want to look like a normie, so they stick to "LLM", which sounds smarter and more technically adept. Then they complain because the LLM lacks a knowledge graph or self-reflection. It's no surprise that a language model models language, not facts, especially not trivia facts which can't be deduced from anything.

If you want something with a worldly knowledge graph and the ability to answer "I'm not sure", you'll have to ask for an AI, not an LLM.

jonnycoder, about 1 year ago
Funny experiment: I asked ChatGPT-4, replied "that's not correct" two or three times, and it eventually answered with Seer Gilligan.

rifty, about 1 year ago
It would have been nice to see whether the correct answer showed up in the output distribution at least once in under 1000 runs. For specific information recall like this, that isn't useful if you want immediate answers... but being given a distribution of outputs can be useful for ideation, broadening perspective, and discovery.

Imagine if, when given a compressed list of 100 options after 1000 runs, we could suppose there is a high probability that one of them is the answer we are looking for, even though it is rarely output.

Does anyone know of experiments testing these LLMs for a stable, high probability of surfacing the right answer within N tries across prompts?
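
A sketch of that kind of experiment: sample the same prompt many times and tally the distinct answers. `ask_llm` here is a placeholder for any sampling endpoint run at a non-zero temperature:

    from collections import Counter

    # Stub for a sampling call; repeated runs are assumed to vary.
    def ask_llm(prompt: str) -> str:
        raise NotImplementedError

    def answer_distribution(prompt: str, runs: int = 1000) -> Counter:
        answers = Counter()
        for _ in range(runs):
            answers[ask_llm(prompt).strip()] += 1
        return answers

    # Usage: show the 10 most frequent answers and how often each appeared.
    # for answer, count in answer_distribution(question).most_common(10):
    #     print(count, answer)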

tudorw, about 1 year ago
"Am I lying" also works.

yongjik, about 1 year ago
I guess an honest, no-clickbait title could have been "The question that no LLM I tested could answer".

thangalin, about 1 year ago
1. https://www.bing.com/chat?q=Bing+AI&FORM=hpcodx

2. Q: Which episode of Gilligan's Island was about mind reading?

3. A: The episode of Gilligan's Island that involves mind reading is called "Seer Gilligan". It is the 19th episode of the second season. In this episode, Gilligan finds sunflower seeds on the island that give him the ability to read minds. When the other castaways find out about this, they want to know how he's doing it. However, the mind-reading powers lead to arguments and division among the castaways.

Ninja edit, since Copilot can use the web:

1. https://chat.lmsys.org/

2. Q: Which episode of Gilligan's Island was about mind reading?

3. A: In the Gilligan's Island episode "Seer Gilligan" (season 3, episode 15), Gilligan gains the ability to read minds after being hit on the head with a coconut. He starts hearing the private thoughts of the other castaways, which leads to various humorous situations and misunderstandings. However, his newfound ability also causes problems as the others become annoyed that he is invading their privacy. In the end, Gilligan gets hit on the head again and loses his mind-reading powers, much to everyone's relief.

It gets most other details wrong, but nailed the episode name.

Model A: claude-3-opus-20240229

¯\_(ツ)_/¯

AIorNot, about 1 year ago
From the article:

"The implications are that LLMs do not perform reasoning over data in the way that most people conceive or desire.

There is no self-reflection of its information; it does not know what it knows and what it does not. The line between hallucination and truth is simply a probability factored by the prevalence of training data and post-training processes like fine-tuning. Reliability will always be nothing more than a probability built on top of this architecture.

As such, it becomes unsuitable as a machine to find rare hidden truths or valuable neglected information. It will always simply converge toward popular narrative or data. At best, it can provide new permutations of views of existing well-known concepts, but it can not invent new concepts or reveal concepts rarely spoken about."

There's a fundamental mistake in the article: it minimizes the achievement of LLM technology by looking only at what is possible in today's LLMs. It's pretty obvious LLMs are just the first real step down the road to human-like intelligence. They have finally proven that computer models can be built that resemble human-like thought patterns and form internal representations and models of the external world in ways very similar to organic beings. Yes, we still haven't gotten to a full reasoning system and sentience, but that definitely seems to be the direction the arrow of this technology is moving.

Trivializing the achievement of GPT as mere statistical prediction and data compression is a cheap shot when you consider that the technology has only just come to show its promise. Model architectures are rapidly evolving, and the full integration of memory, self-reflection, and future agentic capabilities is still on the near horizon.

Yes, they don't appear to reason originally yet, but give it time and allow the tech to grow. I'm of the opinion that a true AGI will arise as a society of LLM models with newer architectures working in concert together with memory, something like the "Society of Mind" model for consciousness proposed by Minsky.

Give it time...

godelski, about 1 year ago
I'll give a better example that shows that they don't perform __reasoning__. This specific one was told to me by another HN user[0] when we were discussing similar formulations.

Question:

    A farmer must transport a fox, a goose, and a bag of corn from one side
    of a river to the other using a boat which can hold all items in addition
    to the farmer. If the farmer leaves the fox alone with the goose, the fox
    will eat the goose. If the farmer leaves the goose alone with the bag of
    corn, the goose will eat the bag of corn. How many times must the farmer
    cross the river to transport all items across? Describe the optimal
    algorithm, think step by step.

GPT-4-1106-Preview:

    The farmer can transport the items across the river with a minimum of 7
    crossings using the following steps:
    1. Take the goose across the river and leave it there.
    2. Return to the original side alone.
    3. Take the fox across the river.
    4. Bring the goose back with him to the original side.
    5. Take the bag of corn across the river and leave it there.
    6. Return to the original side alone.
    7. Finally, take the goose across the river again.
    By following these steps, all items are transported across without the
    fox eating the goose or the goose eating the corn.

I haven't found a model that will crack this. Of course, you can get it to give the right answer, but that also seems to require spoiling it (and it is quite difficult to __not__ spoil the answer).

A key part of why I think questions like this are important is that they show the importance of subtlety. LLMs are incredibly impressive machines. Despite their being lossy compression, I'm not sure why we can't marvel at the fact that we've lossy-compressed the entire fucking internet (text at least) into something smaller than a few hundred gigs that also includes a human language interface. What a fucking impressive accomplishment! The talk of AGI really undermines what was done here, because damn!

Now, I used to ask:

    Which weighs more, a pound of feathers or a kilogram of bricks?

And most models pass this question now. But it is simpler and, having fewer variations, is likely seen less frequently in the dataset, so it is less likely to be overfit (the river crossing problem has a lot of variations, so an n-gram filter is likely to miss more instances). And eventually this question will be solved too, especially as it is asked and talked about more. But this is a cat-and-mouse game. Creating a new viable and working test is quite easy, and we honestly only need one example to prove the point. If you can't figure out how to create a new version from this example, well, you might just be an LLM :P

[0] Edit: credit goes to @jfim https://news.ycombinator.com/item?id=37825219
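
As a sanity check on the twist in this variant (the boat carries everything at once, so the classic puzzle's constraints never bind), a brute-force search sketch confirms that a single crossing suffices:

    from itertools import chain, combinations

    ITEMS = frozenset(("fox", "goose", "corn"))

    def unsafe(bank):
        # Left unattended, the fox eats the goose and the goose eats the corn.
        return {"fox", "goose"} <= bank or {"goose", "corn"} <= bank

    def min_crossings() -> int:
        # State: (items still on the start bank, farmer's side: 0=start, 1=far).
        start = (ITEMS, 0)
        frontier, seen, crossings = [start], {start}, 0
        while frontier:
            nxt = []
            for left, side in frontier:
                if not left and side == 1:
                    return crossings
                here = left if side == 0 else ITEMS - left
                # The boat carries the farmer plus any subset of the items with him.
                for cargo in chain.from_iterable(
                    combinations(here, r) for r in range(len(here) + 1)
                ):
                    new_left = left - set(cargo) if side == 0 else left | set(cargo)
                    unattended = new_left if side == 0 else ITEMS - new_left
                    if unsafe(unattended):
                        continue
                    state = (new_left, 1 - side)
                    if state not in seen:
                        seen.add(state)
                        nxt.append(state)
            frontier, crossings = nxt, crossings + 1
        return -1

    print(min_crossings())  # prints 1: the boat holds fox, goose, and corn together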

sulam, about 1 year ago
It's ironic that LLMs mimic one of the worst behaviors of some HN posters: they very confidently spout drivel!