TechEcho
A tech news platform built with Next.js, providing global tech news and discussions.

© 2025 TechEcho. All rights reserved.

GPT4 can surpass humans in Theory of Mind test, with appropriate prompt

2 points by rodoxcasta about 2 years ago

1 comment

rodoxcasta, about 2 years ago
With 2-shot and chain-of-thought prompting, GPT-4 got 100% right answers in the researchers' Theory of Mind tests. Humans got 87%.

Some points addressed in the paper:

- Is the model just emulating the reasoning of the 2-shot examples?

> The Davinci-3 and GPT-4 models experienced increases in ToM performance from all of the classes of CoT examples that we tested: Photo examples, Non-ToM Inferential examples, and ToM examples. The mean accuracy increases for each model and each type of CoT example are shown in Figure 4, while the accuracy changes for individual ToM questions are shown in Figure S.1. Prompting with Inferential and Photo examples boosted the models' performance on ToM scenarios even though these in-context examples did not follow the same reasoning pattern as the ToM scenarios. Therefore, our analysis suggests that the benefit of prompting for boosting ToM performance is not due to merely overfitting to the specific set of reasoning steps shown in the CoT examples. Instead, the CoT examples appear to invoke a mode of output that involves step-by-step reasoning, which improves the accuracy across a range of tasks.

- Is the test data included in the training?

> The LLMs may have seen some ToM or Photo scenarios during their training phase, but data leakage is unlikely to affect our findings. First, our findings concern the change in performance arising from prompting, and the specific prompts used to obtain this performance change were novel materials generated for this study. Second, if the model performance relied solely on prior exposure to the training data, there should be little difference between zero-shot Photo and ToM performance (Figure 2), as these materials were published in the same documents; however, the zero-shot performance patterns were very different across Photo and ToM scenarios. Third, the LLM performance improvements arose when the models elaborated their reasoning step-by-step, and this elaborated reasoning was not part of the training data. Therefore, although some data leakage is possible, it is unlikely to affect our conclusions concerning the benefits of prompting.

Other highlights:

- With the prompting techniques, GPT-3.5-turbo reached human-level performance.

- In the zero-shot setting, GPT-4 is already near human performance (90% of the human score).

- In the zero-shot setting, GPT-3.5-turbo is worse than davinci-3, but much better than it with prompting. This happens because turbo is too cautious by default and often refuses to draw conclusions.
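For readers unfamiliar with the setup: a 2-shot chain-of-thought prompt simply prepends two worked examples, each with explicit step-by-step reasoning, before the target question. The sketch below shows roughly how such a prompt could be assembled; the scenarios and wording are illustrative placeholders, not the paper's actual materials.

```python
# Rough sketch of assembling a 2-shot chain-of-thought (CoT) prompt for a
# Theory-of-Mind question. The example scenarios below are invented
# placeholders, not the study's test items.

COT_EXAMPLES = [
    {
        "scenario": ("Anna puts her keys in the drawer and leaves. While "
                     "she is out, Ben moves the keys to the shelf."),
        "question": "Where will Anna look for her keys first?",
        "reasoning": ("Anna last saw the keys in the drawer. She did not "
                      "see Ben move them, so her belief is outdated."),
        "answer": "In the drawer.",
    },
    {
        "scenario": ("Tom is told the box contains chocolates, but it "
                     "actually contains pencils. He has not opened it."),
        "question": "What does Tom think is in the box?",
        "reasoning": ("Tom only has the verbal report to go on; he has no "
                      "evidence of the real contents."),
        "answer": "Chocolates.",
    },
]

def build_two_shot_cot_prompt(scenario: str, question: str) -> str:
    """Prepend two worked examples with step-by-step reasoning, then
    append the target scenario so the model reasons before answering."""
    parts = []
    for ex in COT_EXAMPLES:
        parts.append(
            f"Scenario: {ex['scenario']}\n"
            f"Question: {ex['question']}\n"
            f"Let's think step by step: {ex['reasoning']}\n"
            f"Answer: {ex['answer']}\n"
        )
    # The prompt ends mid-pattern, inviting the model to continue with
    # its own step-by-step reasoning for the new scenario.
    parts.append(
        f"Scenario: {scenario}\n"
        f"Question: {question}\n"
        f"Let's think step by step:"
    )
    return "\n".join(parts)

prompt = build_two_shot_cot_prompt(
    "Sara hides a coin under the red cup, then leaves. Max moves it "
    "under the blue cup.",
    "Where will Sara look for the coin?",
)
print(prompt)
```

The paper's finding is that even examples with a different reasoning pattern (the Photo and Non-ToM Inferential examples) help, so what matters appears to be triggering the step-by-step output mode rather than the specific content of the examples.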