TechEcho

AI Does NYT Connections

32 points by mikehearn 5 months ago

14 comments

tianshuo 5 months ago
Hmm, aren't there like five lives before "ending"? So instead of requiring a "perfect run" there should be chances, and feedback such as "one missing", like a real human player gets?
tantalor 5 months ago
Why is Gemini given 4 Xs for #547?

It has "Group 3" correct. It should be marked as having 1/4 groups correct.

The same thing happened on #535: Gemini actually got "Group 1" correct but was marked 0/4 correct.
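The per-group scoring tantalor describes could be sketched as below. This is only an illustration, not the benchmark's actual scoring code: the function name `count_correct_groups` and the puzzle words are made up, and a group counts as correct when its four words match a solution group regardless of order or of the connection named.

```python
def count_correct_groups(guessed, solution):
    """Return how many guessed groups exactly match a solution group.

    The order of the groups, the order of words inside a group, and
    letter case are all ignored. The connection label is not checked.
    """
    solution_sets = {frozenset(w.upper() for w in group) for group in solution}
    guessed_sets = {frozenset(w.upper() for w in group) for group in guessed}
    return len(solution_sets & guessed_sets)


# Made-up puzzle data, for illustration only.
solution = [
    ["BASS", "FLOUNDER", "SALMON", "TROUT"],
    ["ANT", "DRILL", "ISLAND", "OPAL"],
    ["CLOUD", "SLEET", "RAIN", "HAIL"],
    ["NET", "RACKET", "BALL", "COURT"],
]
guessed = [
    ["BASS", "FLOUNDER", "SALMON", "ANT"],
    ["TROUT", "DRILL", "ISLAND", "NET"],
    ["HAIL", "RAIN", "SLEET", "CLOUD"],   # only this group is right
    ["OPAL", "RACKET", "BALL", "COURT"],
]
print(f"{count_correct_groups(guessed, solution)}/4 groups correct")
```

Under this kind of scoring, a response with one right group is reported as 1/4 rather than 0/4.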
ravedave5 5 months ago
Argh, put a spoiler cover over today's at least!
troelsSteegin 5 months ago
This is nicely presented. I would like to see the prompts to the respective services, however. Did I miss them? The "side peek" would be a natural place for them.
smusamashah 5 months ago
These kinds of tests (or maybe all tests) should show *success rate* instead of a single pass/fail.

I believe Claude or even Gemini can succeed if the system prompt is improved, e.g. tell it to re-evaluate its answer before finalising; you can even tell it to do "thinking" within <thinking> tags. I use Claude like that and it often goes over its answer and corrects itself within the same reply. On the other hand, it can also incorrectly assume it made a mistake and sometimes uncorrect itself.

Edit: Using o1's step-by-step problem-solving example from the OpenAI blog post made Claude go step by step in similar depth too. One could even do that here to get a better success rate in non-o1 models.
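A success rate of the kind smusamashah asks for could be measured roughly as follows. This is a minimal sketch, not how the benchmark works: `success_rate` is a hypothetical helper, and `fake_model_attempt` is a random stand-in for a real model call, which would need its own API client and prompt.

```python
import random


def success_rate(attempt, trials=20):
    """Call `attempt` `trials` times and return the fraction of successes.

    `attempt` is any zero-argument callable that returns True when the
    puzzle was solved on that run.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    wins = sum(1 for _ in range(trials) if attempt())
    return wins / trials


# Stand-in for a real model call (an assumption for illustration):
# it "solves" the puzzle on roughly 30% of runs.
rng = random.Random(0)


def fake_model_attempt():
    return rng.random() < 0.3


rate = success_rate(fake_model_attempt, trials=50)
print(f"solved {rate:.0%} of 50 attempts")
```

Reporting the fraction over many runs separates a model that solves a puzzle occasionally from one that solves it reliably, which a single pass/fail cannot show.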
alexarena 5 months ago
This is very cool. It seems like the prompt is asking the LLM to one-shot an answer. Have you tried asking it to make a group, confirm whether it's correct, and repeat with the remaining words (like a human would)?
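The guess, confirm, repeat loop suggested here, together with the lives and "one away" style feedback tianshuo mentions, could look something like the sketch below. Everything in it is an assumption for illustration: the `play` function, the `propose` callback (which would wrap a model call in practice), the puzzle words, and the mistake limit, which is left as a parameter.

```python
def play(solution, propose, max_mistakes=4):
    """Play one puzzle a group at a time and return (groups_found, mistakes).

    solution: list of four-word groups.
    propose:  callable(remaining_words, history) -> four words to try.
              `history` is a list of (guess, feedback) pairs so far.
    """
    unsolved = [frozenset(group) for group in solution]
    remaining = set().union(*unsolved)
    history = []
    found = mistakes = 0
    while unsolved and mistakes < max_mistakes:
        guess = frozenset(propose(sorted(remaining), history))
        if guess in unsolved:
            unsolved.remove(guess)
            remaining -= guess
            found += 1
            feedback = "correct"
        else:
            mistakes += 1
            # Three of the four words belong together: tell the player.
            best_overlap = max(len(guess & group) for group in unsolved)
            feedback = "one away" if best_overlap == 3 else "incorrect"
        history.append((sorted(guess), feedback))
    return found, mistakes


# Made-up puzzle and a scripted "player" standing in for a model.
solution = [
    ["BASS", "FLOUNDER", "SALMON", "TROUT"],
    ["ANT", "DRILL", "ISLAND", "OPAL"],
    ["CLOUD", "SLEET", "RAIN", "HAIL"],
    ["NET", "RACKET", "BALL", "COURT"],
]
script = [["BASS", "FLOUNDER", "SALMON", "ANT"]] + solution


def scripted_player(remaining, history):
    return script[len(history)]


found, mistakes = play(solution, scripted_player)
print(f"found {found} groups with {mistakes} mistake(s)")
```

Because `propose` receives the remaining words and the feedback history on every turn, a model-backed player can narrow the board after each confirmed group instead of committing to all four groups at once.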
deskamess 5 months ago
Connections is a great game to test AI. It really relies on the ambiguity and loosely connected aspects of culture and language. I am shocked at how well o1-pro does.
KaoruAoiShiho 5 months ago
Beyond being able to solve Connections, can an LLM generate (good/challenging/solvable) Connections puzzles? It would be pretty cool to be able to generate a test set.
tantalor 5 months ago
> Correct group with the wrong connection

This seems highly subjective. We should not care about this. The game is to connect the words, not find the connection. For human players, it doesn't matter if you get the connection or not.
Workaccount2 5 months ago
This is completely unsurprising, as the latent space that LLMs rely on basically is a giant web of Connections.
world2vec 5 months ago
Quite shocked that O1-Pro isn't orders of magnitude better than O1 despite being 10x the price.

Cool benchmark nonetheless!
empath75 5 months ago
I'd love to see them try "Only Connect" puzzles, which are _much_ harder.
ditto664 5 months ago
Spoiler alert
zeroonetwothree 5 months ago
LLMs don’t have “intelligence”.