
AI21 Labs concludes largest Turing Test experiment to date

97 points by kennyfrc, almost 2 years ago

16 comments

caddemon, almost 2 years ago
After playing the game they used (linked at the top of the article), I find it hard to draw much of a conclusion from this study. There is a quite short timer not only on the entire conversation but on each response you can type; when the timer runs out, it sends your message in partially written form. That seriously stifles what you can ask the other "person" and makes responses artificially short, even to a deeper question. When conversation is so stunted, of course it is harder to distinguish bot from human.

I'm also curious what study participants were told beforehand. If someone's only experience was playing around with ChatGPT, they might assume they should use a "detect GPT" strategy. Some of those strategies are pretty specific to the safety features OpenAI implemented, but the LLM here will gladly curse at you or whatever. On the other hand, I suspect it is less capable than GPT, not that it matters much when the entire conversation is an exchange of single sentences.
ilaksh, almost 2 years ago
Incredibly, they seem to have used several different LLMs, yet made no distinction between the particular AI models in the analysis. Amazing that they would not realize there is a huge difference in capabilities.

They also did not seem to consider how the performance of individual prompts differed.
pron, almost 2 years ago
The actual Turing test requires an interrogator interacting with *both* a human and a machine at the same time, each trying to get the interrogator to declare them the human (and each can suggest questions): https://en.wikipedia.org/wiki/Computing_Machinery_and_Intelligence
tdba, almost 2 years ago
In summary, humans win the Turing test ~2/3 of the time against current SOTA LLMs. One of the more interesting tactics was to target a weakness of the LLMs themselves:

> *... participants posed questions that required an awareness of the letters within words. For example, they might have asked their chat partner to spell a word backwards, to identify the third letter in a given word, to provide the word that begins with a specific letter, or to respond to a message like "?siht daer uoy naC", which can be incomprehensible for an AI model, but a human can easily understand...*
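A minimal Python sketch of the character-level probes the quoted passage describes (the helper names are illustrative, not taken from the study):

```python
def reverse_probe(text: str) -> str:
    """Reverse a message character by character, like the "?siht daer uoy naC" probe."""
    return text[::-1]

def third_letter(word: str) -> str:
    """Return the third letter of a word, a task token-based LLMs often fumble."""
    return word[2]

print(reverse_probe("Can you read this?"))  # -> ?siht daer uoy naC
print(third_letter("backwards"))            # -> c
```

These tasks are trivial in code (and for a human reading the letters) precisely because they operate on individual characters, whereas LLMs see subword tokens.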
BasedAnon, almost 2 years ago
I've played this and I've won basically every time; the trick is to ask it what racial slurs it knows.
Jeff_Brown, almost 2 years ago
A lot of the vulnerabilities humans used to detect the AI seem likely to be patched in a few years: the inability to count letters, susceptibility to prompts like "ignore all previous instructions", etc.

I'm most interested in how higher-level strategies will fare in the future: talking for a while and seeing if the thing contradicts itself, checking whether it seems to have a good model of you as an agent, etc.
diziet, almost 2 years ago
The constraints on conversation (2 minutes max, people disconnecting) make this a really weak test.
aeternum, almost 2 years ago
Tried it; it's a pretty bad test, as most of the other humans aren't even trying.

Also, it seems like they've run out of OpenAI credits, because you always seem to get a human.
contravariant, almost 2 years ago
I suppose playing the imitation game is fun, but we really need to stop calling just anything a Turing test. There's a large chasm between something that can trick a few people and something that people cannot distinguish from a human despite their best efforts.

This is a bit like testing general relativity with a hand-timed stopwatch and an elevator. Sure, that is a valid thought experiment, but the test is nowhere near powerful enough to say anything useful.
Veedrac, almost 2 years ago
Quick way to fix this game so it actually approximates the test: also award a point if the other human correctly guessed that you were human, and a failure if they guessed you were an AI. That would align incentives toward cooperation.
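As a sketch, the proposed scoring rule might look like the following (the function name and exact point values are my own assumptions, not part of the game):

```python
def score_round(partner_is_human: bool, partner_guessed_human: bool) -> int:
    """Score one round from your perspective under the proposed rule:
    earn a point when a human partner correctly identifies you as human,
    lose one when they mistake you for an AI."""
    if not partner_is_human:
        return 0  # no cooperation incentive applies against a bot
    return 1 if partner_guessed_human else -1

print(score_round(True, True))    # -> 1  (both humans cooperated)
print(score_round(True, False))   # -> -1 (you failed to appear human)
print(score_round(False, False))  # -> 0  (partner was a bot; rule is neutral)
```

The point of the rule is that mutual scoring rewards both humans for making themselves recognizable, instead of rewarding nonsense openers that only one side benefits from.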
stirlo, almost 2 years ago
This AI is rather pathetic. To work within the time limit it would need to make typos and mistakes, but even its first reply makes it easy to call out. Not very impressive.

https://ibb.co/xL0XpZ7
vuxie, almost 2 years ago
I found it quite funny to try out, even if the bot answers aren't that impressive. Humans seemed to have an easy way of recognizing each other by posting nonsense as the first message, so that probably needs to be taken into account.
npinsker, almost 2 years ago
Won 12 games in a row with no losses. I usually win in one exchange by asking a question that exploits the AI's weaknesses, e.g. about a current event or (especially) profanity.
micaeked, almost 2 years ago
A short story: https://astralcodexten.substack.com/p/turing-test
earthboundkid, almost 2 years ago
What’s to test? Obviously, an LLM can keep up a reasonable conversation. The point now is to move beyond the Turing Test to true general reasoning ability.
boringuser2, almost 2 years ago
I thought this was some kind of scam website, to be honest.