
GPT4 can surpass humans in Theory of Mind test, with appropriate prompt

2 points | by rodoxcasta | about 2 years ago

1 comment

rodoxcasta | about 2 years ago
With 2-shot and chain-of-thought prompting, GPT-4 got 100% of the answers right on the researchers' Theory of Mind (ToM) tests. Humans got 87%.

Some points addressed in the paper:

- Is the model just emulating the reasoning of the 2-shot examples?

> The Davinci-3 and GPT-4 models experienced increases in ToM performance from all of the classes of CoT examples that we tested: Photo examples, Non-ToM Inferential examples, and ToM examples. The mean accuracy increases for each model and each type of CoT example are shown in Figure 4, while the accuracy changes for individual ToM questions are shown in Figure S.1. Prompting with Inferential and Photo examples boosted the models' performance on ToM scenarios even though these in-context examples did not follow the same reasoning pattern as the ToM scenarios. Therefore, our analysis suggests that the benefit of prompting for boosting ToM performance is not due to merely overfitting to the specific set of reasoning steps shown in the CoT examples. Instead, the CoT examples appear to invoke a mode of output that involves step-by-step reasoning, which improves the accuracy across a range of tasks.

- Is the test data included in the training?

> The LLMs may have seen some ToM or Photo scenarios during their training phase, but data leakage is unlikely to affect our findings. First, our findings concern the change in performance arising from prompting, and the specific prompts used to obtain this performance change were novel materials generated for this study. Second, if the model performance relied solely on prior exposure to the training data, there should be little difference between zero-shot Photo and ToM performance (Figure 2), as these materials were published in the same documents; however, the zero-shot performance patterns were very different across Photo and ToM scenarios. Third, the LLM performance improvements arose when the models elaborated their reasoning step-by-step, and this elaborated reasoning was not part of the training data. Therefore, although some data leakage is possible, it is unlikely to affect our conclusions concerning the benefits of prompting.

Other highlights:

- With the prompting techniques, 3.5-turbo reached human-level performance.

- In zero-shot, GPT-4 is already near human performance (90% of the human score).

- In the zero-shot regime, 3.5-turbo is worse than davinci-3, but much better than it with prompting. This happens because turbo is too cautious by default and often refuses to draw conclusions.
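The 2-shot chain-of-thought setup described above can be sketched roughly as follows. This is a minimal illustration of the technique, not the paper's actual materials: the scenario texts, the false-belief examples, and the function name are all my own assumptions.

```python
# Sketch of a 2-shot chain-of-thought (CoT) prompt for a Theory-of-Mind
# (ToM) question. Each worked example shows explicit step-by-step belief
# reasoning before the answer; the scenarios below are invented for
# illustration and are NOT the study's test items.
COT_EXAMPLES = [
    {
        "scenario": "Anna puts her keys in the drawer and leaves the room. "
                    "While she is away, Ben moves the keys to the shelf.",
        "question": "Where will Anna look for her keys first?",
        "reasoning": "Let's think step by step. Anna last saw the keys in "
                     "the drawer. She did not see Ben move them, so her "
                     "belief about their location is unchanged.",
        "answer": "In the drawer.",
    },
    {
        "scenario": "Tom is told the meeting moved from 3pm to 4pm, but the "
                    "update never reached Sara.",
        "question": "When does Sara think the meeting starts?",
        "reasoning": "Let's think step by step. Sara never received the "
                     "update, so she still holds the original belief.",
        "answer": "At 3pm.",
    },
]

def build_tom_prompt(examples, scenario, question):
    """Assemble the worked CoT examples, then the new query, ending at
    'Reasoning:' so the model continues with step-by-step reasoning
    before committing to an answer."""
    blocks = [
        f"Scenario: {ex['scenario']}\n"
        f"Question: {ex['question']}\n"
        f"Reasoning: {ex['reasoning']}\n"
        f"Answer: {ex['answer']}"
        for ex in examples
    ]
    blocks.append(f"Scenario: {scenario}\nQuestion: {question}\nReasoning:")
    return "\n\n".join(blocks)
```

The key design point the paper's quoted passage makes is that the worked examples need not share the ToM reasoning pattern at all; any step-by-step examples (Photo or Inferential) pushed the models into an elaborated-reasoning mode that raised accuracy.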