LLMs, Theory of Mind, and Cheryl's Birthday

290 points by stereoabuse 7 months ago

27 comments

whack 7 months ago
> *At least with respect to this problem, they had no theory of mind.*

This is very interesting and insightful, but I take issue with the above conclusion. Your average software engineer would probably fail to code up a python solution to this problem. But most people would agree that the average software engineer, and the average person, possesses some theory of mind.

This seems to be a pattern I'm noticing with AI. The goalposts keep moving. When I was a kid, the turing test was the holy grail for "artificial intelligence." Now, your run-of-the-mill LLM can breeze through the turing test. But no one seems to care. *"They are just imitating us, that doesn't count."* Every couple years, AI/ML systems make revolutionary advances, but everyone pretends it's not a big deal because of some new excuse. The latest one being *"LLMs can't write a python program to solve an entire class of very challenging logic problems. Therefore LLMs possess no theory of mind."*

Let me stick my neck out and say something controversial. Are the latest LLMs as smart as Peter Norvig? No. Are they smarter than your average human? Yes. Can they outperform your average human at a randomly chosen cognitive task that has real-world applications? Yes. This is pretty darn revolutionary. We have crossed the rubicon. We are watching history unfold in real-time.

jawns 7 months ago
A long time ago, I created a version of this challenge called "Cheryl's Murder."

My notebook not only solves logical induction problems like "Cheryl's Birthday," but it also generates them.

https://github.com/shaungallagher/cheryls-murder/blob/master/cheryls-murder.ipynb

oli5679 7 months ago
GPT o1-preview solves this puzzle correctly in 13 seconds and gives a thorough logical deduction in the comments and explanation.

I think it's a bit unfair on the LLM to ask it to retrieve the puzzle definition from its training data. I posted the info on the puzzle from his notebook.

https://chatgpt.com/share/670103ae-1c18-8011-8068-dd217937278a

erwald 7 months ago
o1 mini seems to get it on the first try (I didn't vet the code, but I tested it and it works on both examples provided in the notebook, `dates` and `gabe_dates`):

```python
from collections import defaultdict

def find_cheryls_birthday(possible_dates):
    # Parse the dates into month and day
    dates = [date.split() for date in possible_dates]
    months = [month for month, day in dates]
    days = [day for month, day in dates]

    # Step 1: Albert knows the month and says he doesn't know the birthday
    # and that Bernard doesn't know either. This implies the month has no unique days.
    month_counts = defaultdict(int)
    day_counts = defaultdict(int)
    for month, day in dates:
        month_counts[month] += 1
        day_counts[day] += 1

    # Months with all days appearing more than once
    possible_months = [month for month in month_counts
                       if all(day_counts[day] > 1 for m, day in dates if m == month)]
    filtered_dates = [date for date in dates if date[0] in possible_months]

    # Step 2: Bernard knows the day and now knows the birthday
    # This means the day is unique in the filtered dates
    filtered_days = defaultdict(int)
    for month, day in filtered_dates:
        filtered_days[day] += 1
    possible_days = [day for day in filtered_days if filtered_days[day] == 1]
    filtered_dates = [date for date in filtered_dates if date[1] in possible_days]

    # Step 3: Albert now knows the birthday, so the month must be unique in remaining dates
    possible_months = defaultdict(int)
    for month, day in filtered_dates:
        possible_months[month] += 1
    final_dates = [date for date in filtered_dates if possible_months[date[0]] == 1]

    # Convert back to original format
    return ' '.join(final_dates[0]) if final_dates else "No unique solution found."

# Example usage:
possible_dates = [
    "May 15", "May 16", "May 19",
    "June 17", "June 18",
    "July 14", "July 16",
    "August 14", "August 15", "August 17"
]

birthday = find_cheryls_birthday(possible_dates)
print(f"Cheryl's Birthday is on {birthday}.")
```

ynniv 7 months ago
The problem with evaluating LLMs is that there's a random component, and the specific wording of prompts is so important. I asked Claude to explain the problem, then write python to solve it. When it ran there was an exception, so I pasted that back in and got the correct answer. I'm not sure what this says about theory of mind (the first script it wrote was organized into steps based on who knew what when, so it seems to grok that), but the real lesson is that if LLMs are an emulation of "human" intelligence, they should probably be given a python interpreter to check their work.

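The interpreter loop ynniv describes is easy to make concrete. Below is a minimal sketch: run the generated script, and if it throws, paste the traceback back into the prompt, which is what ynniv did by hand. The `ask_llm` callable is a hypothetical stand-in for whatever chat API is actually used; nothing here reflects a specific provider.

```python
import subprocess
import sys
import tempfile

def run_with_feedback(task, ask_llm, max_rounds=3):
    """Ask for a script, run it, and feed any traceback back to the model.

    `ask_llm` is a hypothetical callable (prompt -> Python source) standing in
    for whatever chat API is actually used.
    """
    prompt = task
    for _ in range(max_rounds):
        code = ask_llm(prompt)
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
            path = f.name
        result = subprocess.run([sys.executable, path],
                                capture_output=True, text=True, timeout=60)
        if result.returncode == 0:
            return result.stdout  # the script ran cleanly; return what it printed
        # Same move as pasting the exception back into the chat by hand.
        prompt = (task + "\n\nYour previous script failed with:\n" + result.stderr
                  + "\nPlease return a corrected, complete script.")
    raise RuntimeError(f"no working script after {max_rounds} rounds")
```

The point is the structure, not the plumbing: the model gets the same error feedback a human programmer would get from running the code.
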
jfcoa 7 months ago
This seems like a terrible test case since python examples are readily available in the training data: https://rosettacode.org/wiki/Cheryl%27s_birthday

It's interesting that so many of the models fail to retrieve this, but any that do solve it should clearly be able to do so with no reasoning/theory of mind.

pfisherman 7 months ago
LLMs and NLP are to verbal reasoning what the calculator is to quantitative reasoning.

Language, and by extension verbal reasoning, is full of ambiguity and semantic slipperiness. For example, what degree of semantic similarity distinguishes synonymous from synonym-ish concepts? When do we partition concepts into homonyms?

I think part of the problem with how people evaluate LLMs is the expectations that people have. Natural language != ontology. The expectation should be more Chomsky and less Boole. Asking it to solve math problems written in paragraph form is a waste of time. Use a calculator for that! Solving riddles? Code it up in prolog!

Instead you should be thinking of what operations you can do on concepts, meaning, and abstract ideas! That is what these things do.

joe_the_user 7 months ago
Deducing things from the inability of an LLM to answer a specific question seems doomed by the "it will be able to on the next iteration" principle.

It seems like the only way you could systematically chart the weaknesses of an LLM is by having a class of problems that get harder for LLMs at a steep rate, so that a small increase in problem complexity requires a significant increase in LLM power.

extr 7 months ago
This is an interesting problem, but it's more of a logic problem than a true test of theory of mind. When I think "theory of mind" I think of being able to model an external agent with complete knowledge, incentives, and behavior. I would not doubt LLMs have something close to this for humans, almost by accident, since they are trained on human outputs.

gkfasdfasdf 7 months ago
This question was posed to o1, and it is able to reason through it, but now I wonder if that is because the model is already aware of the puzzle.

https://x.com/d_feldman/status/1834313124058726894

tel 7 months ago
I tried to replicate this and Claude 3.5 Sonnet got it correct on the first try. It generated a second set of dates which contained no solution, so I asked it to write another python program that generates valid date sets.

Here's the code it generated: https://gist.github.com/tel/8e126563d2d5fb13e7d53cf3adad862e

In my test, it has absolutely no trouble with this problem and can correctly translate the "theory of mind" into a progressive constraint solver.

Norvig is, of course, a well-respected researcher, but this is a bit disappointing. I feel confident he found that his tests failed, but to disprove his thesis (at least as is internally consistent with his experiment) we just need to find a single example of an LLM writing Python code that realizes the answer. I found that on the first try.

I think it's possible that there exists some implementation of this problem, or something close enough to it, already in Claude's training data. It's quite hard to disprove that assertion. But still, I am satisfied with the code and its translation. To relate the word problem to this solution requires contemplation of the characters' state of mind as a set of alternatives consistent with the information they've been given.

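tel's gist isn't reproduced here, but the "generate valid date sets" idea can be sketched independently: brute-force search over random date sets, keeping only those for which the three statements pin down a single date. This is a minimal sketch under the standard three-statement reading of the puzzle, not the code Claude produced; the month and day pools are arbitrary.

```python
import random
from collections import Counter

MONTHS = ["May", "June", "July", "August", "September"]
DAYS = range(14, 20)

def solve(dates):
    """Dates consistent with the three statements; a valid puzzle leaves exactly one."""
    month_count = Counter(m for m, _ in dates)
    day_count = Counter(d for _, d in dates)

    # Statement 1: Albert (told the month) doesn't know the date, and knows
    # Bernard (told the day) doesn't know either: every day in Albert's month
    # must be non-unique across the whole list.
    s1 = [(m, d) for m, d in dates
          if month_count[m] > 1
          and all(day_count[d2] > 1 for m2, d2 in dates if m2 == m)]

    # Statement 2: Bernard now knows, so his day is unique among the survivors.
    s1_days = Counter(d for _, d in s1)
    s2 = [(m, d) for m, d in s1 if s1_days[d] == 1]

    # Statement 3: Albert now knows too, so his month is unique among the survivors.
    s2_months = Counter(m for m, _ in s2)
    return [(m, d) for m, d in s2 if s2_months[m] == 1]

def generate_puzzle(n_dates=10, tries=100_000):
    """Randomly sample date sets until one admits exactly one solution."""
    pool = [(m, d) for m in MONTHS for d in DAYS]
    for _ in range(tries):
        candidate = random.sample(pool, n_dates)
        if len(solve(candidate)) == 1:
            return sorted(candidate)
    raise RuntimeError("no valid puzzle found")

if __name__ == "__main__":
    puzzle = generate_puzzle()
    print("Dates:   ", puzzle)
    print("Birthday:", solve(puzzle)[0])
```

Running `solve` on the canonical date set returns July 16, and changing the pools or set size yields fresh variants with the same logical shape.
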
IanCal 7 months ago
I'm not a huge fan of using these kinds of riddles or gotchas. Other comments have riddle variants which also feel like ways of tripping someone up: if you don't spot the change, you fail. And what's more, the originals are things that lots of people struggle with (that's why they're riddles, not basic questions).

There's also little prompting, which feels like calling up a random person and demanding they solve a riddle straight away without talking it through.

Part of the assumption here is that if LLMs don't write the right code, they don't understand what people know. I'd wager that a huge number of people couldn't solve this puzzle yet fully understand that different people have their own internal thoughts and knowledge.

willguest 7 months ago
This seems to amount to asking an LLM how it feels about Cheryl, discovering that it is performatively happy about her existence, and then deducing that the LLM has no capacity for genuine emotion, expressed in the form of logic.

The faulty premise lies in the formulation of the test: it makes the responses predictable, but it also does a disservice to 'mind' because it tries to interpret it in such a way that an LLM *could* begin to grapple with the basics, but not in a meaningful way.

Perhaps it is useful to help build better context-specific logic flows (generally known as software), but it doesn't seem to provide any progress on the "theory of mind" front, which I guess is a borrowed notion.

diwank 7 months ago
Script generated by o1-preview:

```python
# List of possible dates
dates = [
    ('May', 15), ('May', 16), ('May', 19),
    ('June', 17), ('June', 18),
    ('July', 14), ('July', 16),
    ('August', 14), ('August', 15), ('August', 17)
]

def solve_cheryls_birthday(dates):
    # Initial possible dates
    possible_dates = dates.copy()

    # Step 1: Albert's statement
    # Create a count of each day
    day_counts = {}
    for month, day in dates:
        day_counts[day] = day_counts.get(day, 0) + 1

    # Filter out months where a unique day exists (Albert knows Bernard doesn't know)
    possible_months = set()
    for month in set(month for month, day in dates):
        month_days = [day for m, day in dates if m == month]
        if not any(day_counts[day] == 1 for day in month_days):
            possible_months.add(month)

    possible_dates = [
        (month, day) for (month, day) in possible_dates
        if month in possible_months
    ]

    # Step 2: Bernard's statement
    # Recount the days in the filtered possible dates
    day_counts_in_possible = {}
    for month, day in possible_dates:
        day_counts_in_possible[day] = day_counts_in_possible.get(day, 0) + 1

    # Bernard can now deduce the date; keep dates where the day is unique
    possible_dates = [
        (month, day) for (month, day) in possible_dates
        if day_counts_in_possible[day] == 1
    ]

    # Step 3: Albert's final statement
    # Recount the months in the possible dates
    month_counts_in_possible = {}
    for month, day in possible_dates:
        month_counts_in_possible[month] = month_counts_in_possible.get(month, 0) + 1

    # Albert now knows the date; keep dates where the month is unique
    possible_dates = [
        (month, day) for (month, day) in possible_dates
        if month_counts_in_possible[month] == 1
    ]

    # The remaining date is Cheryl's birthday
    if len(possible_dates) == 1:
        return possible_dates[0]
    else:
        return None

# Solve the problem
birthday = solve_cheryls_birthday(dates)
if birthday:
    print(f"Cheryl's birthday is on {birthday[0]} {birthday[1]}")
else:
    print("Unable to determine Cheryl's birthday.")
```

Output:

*Cheryl's birthday is on July 16*

AdieuToLogic 7 months ago
What is a software program?

The codification of a solution.

What is a solution?

An answer to a problem.

What is a problem?

The identification and expression of a need to be satisfied.

What is a need?

A uniquely human experience, one which only exists within the minds of the people who experience it.

RevEng 7 months ago
It's important to remember that modern LLMs are trained on bloody everything. They know every common logic problem, at least when stated the way they would have seen it.

If you want to test an LLM, always make up a new problem. It can be the same idea as an existing problem, but change all names and numbers.

I tested if GPT 3.5 could recognize chaos theory. If I stated it as the typical "butterfly flaps its wings" it instantly recognized it as the chaos theory example. If I totally changed the problem statement, it correctly identified that weather isn't correlated with a single action by a single person, but it didn't associate it with chaos theory.

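RevEng's "change all names and numbers" advice can be applied to this very puzzle. The sketch below is one hypothetical way to do it: keep the logical structure of a known date set but apply a fresh bijection to the months, days, and character names, so the surface form no longer matches the training data while the answer stays recoverable. The template wording and name pool are made up for illustration.

```python
import random

TEMPLATE = (
    "{a} and {b} want to know {c}'s birthday. {c} gives them a list of "
    "possible dates: {dates}. She tells {a} only the month and {b} only the day.\n"
    "{a}: I don't know the birthday, but I know {b} doesn't know either.\n"
    "{b}: At first I didn't know, but now I do.\n"
    "{a}: Then I know it too. When is {c}'s birthday?"
)

NAMES = ["Priya", "Tomas", "Wei", "Amara", "Lukas", "Noor"]

def obfuscate(dates):
    """Rename the actors and relabel months/days; the logical structure is unchanged."""
    a, b, c = random.sample(NAMES, 3)
    months = sorted({m for m, _ in dates})
    days = sorted({d for _, d in dates})
    new_months = dict(zip(months, random.sample(
        ["January", "February", "March", "April", "October", "November"],
        len(months))))
    new_days = dict(zip(days, random.sample(range(1, 29), len(days))))
    renamed = [(new_months[m], new_days[d]) for m, d in dates]
    date_str = ", ".join(f"{m} {d}" for m, d in renamed)
    return TEMPLATE.format(a=a, b=b, c=c, dates=date_str)

# The canonical date set; the printed puzzle is logically identical but
# shares few surface tokens with the well-known version.
dates = [("May", 15), ("May", 16), ("May", 19), ("June", 17), ("June", 18),
         ("July", 14), ("July", 16), ("August", 14), ("August", 15), ("August", 17)]
print(obfuscate(dates))
```

Because the relabeling is a bijection, the uniqueness counts are preserved, so the obfuscated puzzle has the same (renamed) solution as the original.
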
wanderingbort 7 months ago
Related to this, I asked LLMs to directly solve the same riddle, then obfuscated the riddle so it wouldn't match training data, and as a final test added extraneous information to distract them.

Outside of o1, simple obfuscation was enough to throw off most of the group.

The distracting information also had a relevant effect. I don't think LLMs are properly fine-tuned for prompters lying to them. With RAG putting "untrusted prose" into the prompt, that's a big issue.

https://hackernoon.com/ai-loves-cake-more-than-truth

johnobrien1010 7 months ago
The approach is fundamentally flawed. You can't query an LLM as to whether it has a theory of mind. You need to analyze how its internal logic works.

Imagine the opposite result had occurred, and the LLM had outputted something which was considered a theory of mind… Does that prove it has one, or that it was trained on some data which made it sound like it has a theory of mind?

godelski 7 months ago
I think the test is better than many other commenters are giving it credit for. It reminds me of responses to the river crossing problems. The reason people do tests like this is because we know the answer a priori or can determine the answer. Reasoning tests are about generalization, and this means you have to be able to generalize based on the logic.

So the author knows that the question is spoiled, because they know that the model was trained on wiki. They also tested to see if the model is familiar with the problem in the first place. In fact, you too can confirm this by asking "What is the logic puzzle, Cheryl's birthday?" and they will spit out the correct answer.

The problem also went viral, so there are even variations of this. That should tell us that the model has not just been trained on it, but that it has seen it in various forms, and we know that this increases its ability to generalize and perform the task.

So then we're left with reasoning. How do we understand reasoning? It is the logical steps. But we need to make sure that this is distinct from memorization. So throwing in twists (as people do in the river puzzles) is a way to distinguish memory from logic. That's where these models fail.

People always complain that "oh, but humans can't do it." I refer to this as "proof by self-incompetence." (I also see it claimed when it isn't actually true.) But not everybody reasons, and not all the time (trivial cases are when you're asleep or in a coma, but it also includes things like when you're hangry or just dumb). Humans are different from LLMs. LLMs are giving it 100%, every time. "Proof by self-incompetence" is an exact example of this, where the goal is to explain a prior belief. But fitting data is easy, explaining data is hard (von Neumann's Elephant).

There's also a key part that many people are missing in the analysis. The models were explicitly asked to *generalize* the problem.

I'll give some comments about letting them attempt to solve iteratively, but this is often very tricky. I see this with the river crossing puzzles frequently, where there is information leakage passed back to the algo. Asking a followup question like "are you sure?" is actually a hint. You typically don't ask that question when it is correct. Though newer models will not always apologize for being wrong, when actually correct, when they are sufficiently trained on that problem. You'll find that in these situations, if you run the same prompt (in new clean sessions) multiple times, the variance in the output is very low.

Overall, a good way to catch LLMs in differentiating reasoning from memorization is getting them to show their work, the steps in between. It isn't uncommon for them to get the right answer but have wrong steps, even in math problems. This is always a clear demonstration of memorization rather than reasoning. It is literally the subtlety that matters.

I suspect that one of the difficulties in humans analyzing LLMs is that there is no other entity capable of performing such feats that does not also have a theory of mind and a world model. But a good analogy might be facts that you know without understanding why they are "the answer." I'm sure there are many people who have memorized complexities for many sorting algos or leetcode problems and couldn't derive the answer themselves.

But I really don't understand why we *need* LLMs to reason. A dictionary memorizes things, and so does wikipedia. Their lack of ability to reason does not make them any less marvelous as inventions/tools. But maybe, if we're looking to create intelligent and thinking machines, it isn't as simple as scale. We love simple things, but few things are simple and correct (though far more things are simple and approximately correct).

m3kw9 7 months ago
Could be an architectural issue with the LLMs, because you need to juggle a lot of states just from one statement regarding a big problem. Sort of like if you ask it to write an app like Facebook: it would give you a bunch of crap, which is worse.

nextworddev 7 months ago
The majority of humans in the flesh can't solve the problem, so we need alternate measures for judging theory of mind capabilities in LLMs.

dmead 7 months ago
I wonder if there are any unique properties of those programs from which we could figure out which Stack Overflow posts or textbooks they're copying.

JPLeRouzic 7 months ago
Most LLMs won a T-shirt with the following inscription: "*I am not as smart as Peter Norvig*"!

mark_l_watson 7 months ago
Nice! I use various LLMs many times a day as a limited coding tool and something to bounce ideas off of, and it is impossible not to think about how LLMs work and what their limitations are.

I tried just asking Claude Sonnet to solve the Cheryl's Birthday word problem, changing the dates. Pretty cool that it can solve it as a word problem, and LLMs will keep getting better at coding.

As a slight tangent: I used a combination of Gemini, GPT-4o, and Claude last week to write Common Lisp code for a simple RDF data store and the subset of SPARQL queries that I thought I would need in embedded Common Lisp applications. This process was far from automatic: I initially provided almost two pages of English instructions, and I had to help debug non-working code by adding debug statements and then showing the models the code with print statements and the new output. I also did the optional thing of asking for stylistic changes. TL;DR: it saved me time and I liked the final code.

I always enjoy it when people like Peter and Karpathy write relatively simple code to share ideas. I am a fairly good coder (I had the meaningless title Master Software Engineer at Capital One), but I like to read other people's code, and I must admit that I spend more time reading code on GitHub than I spend reading technical papers.

fny 7 months ago
How does solving a logic puzzle imply a theory of mind? I don't mean to say that LLMs don't have a theory of mind, just that deductive reasoning does not amount to empathetic evaluations of how someone else thinks and feels…

…unless you're a programmer.

mrbungie 7 months ago
Not really about Theory of Mind, but in the same line: I remember the other day someone argued with me that LLMs model the world, rather than just modelling language (that may represent the world).

I kept thinking about that problem and plausible experiments to show my point that LLMs are dumb about the physical world, even if they know perfectly how it works in terms of language/representation. So I thought: what happens if I give an LLM an image and ask for a representation of said image in ASCII art (obviously without relying on Python and the trivial pixel-intensity-to-character transform it usually proposes)? Remember:

- LLMs should've been trained with a lot of RGB image training data with associated captions => so they should understand images very well.

- LLMs should've been trained with a lot of ASCII training data with associated captions => so they should draw/write ASCII like an expert. Plus, they apparently understand vision (managed as tokens), so they should do well.

But it can't do a decent translation that captures the most interesting features of an image into ASCII art (I'm pretty sure a human with an hour of time should be able to do it, even if it's awful ASCII art). For example, I uploaded an image macro meme with text and two pictures of different persons kind of looking at each other. The ASCII art representation just showed two faces that didn't look at each other but rather away from each other. It just does not "understand" the concept of crossing sights (even if it "understands" the language and even image patches when you ask where they are looking, it will not draw that humanly important stuff by itself).

These things just work with tokens, and that is useful and seems like magic in a lot of domains. But there is no way in hell we are going to get to AGI without a fully integrated sensor platform that can model the world in its totality, including interacting with it (i.e. like humans in training, but not necessarily in substrate nor training time, hopefully). And I really don't know how something that has a very partial model of the world can have a Theory of Mind.

aithrowawaycomm 7 months ago
AI researchers need to learn what terms like "theory of mind" actually mean before they write dumb crap like this. Theory of mind is about attributing *mental states* to others, not *information*. What Norvig has done here is present a logic puzzle, one that works equally well when the agents are Prolog programs instead of clever children. There's no "mind" in this puzzle at all. Norvig is being childishly ignorant to call this "theory of mind." It's hard to overstate my contempt for this kind of useless junk science, especially when it comes from an impressive pedigree.

Of course he is hardly the only offender: arrogant disregard for psychology is astonishingly common among LLM researchers. Maybe they should turn off ChatGPT and read a book.
