
Benchmarking LLM social skills with an elimination game

194 points, by colonCapitalDee, about 1 month ago

21 comments

wongarsu, about 1 month ago

That's an interesting benchmark. It feels like it tests skills that are very relevant to digital assistants, story writing, and role play.

Some thoughts about the setup:

- The setup seems to give reasoning models an inherent advantage, because only they have a private plan and a public text in the same output. I feel like giving all models the option to formulate plans and keep track of other players inside <think> or <secret> tags would level the playing field more.

- From personal experience with social tasks for LLMs, it helps both reasoning and non-reasoning LLMs to explicitly ask them to plan their next steps, in a way they are assured is kept hidden from all other players. That might be a good addition here, either before or after the public subround.

- The individual rounds are pretty short. Humans would struggle to coordinate in so few exchanges with so few words. If this was done because of context limitations, a good strategy might be to ask models to summarize the game state from their perspective, then give them only the current round, the previous round, and their own summary of the game before that.

It would be cool to have some code to play around with, to test how changes in the setup change the results. I guess it isn't that difficult to write, but it's peculiar to have the benchmark but no code to run it yourself.
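The context-trimming strategy described above can be sketched as follows. This is only an illustration of the idea, not code from the benchmark (which is unreleased); `build_prompt` and all names in it are hypothetical.

```python
# Sketch of the suggested bounded-context scheme: each player keeps a
# rolling self-written summary, and the prompt for a round contains only
# that summary plus the previous and current round transcripts.

def build_prompt(summary: str, rounds: list[str], current_round: int) -> str:
    """Assemble a bounded prompt: own summary + previous round + current round."""
    recent = rounds[max(0, current_round - 1):current_round + 1]
    parts = ["Your private summary of the game so far:", summary or "(none)"]
    parts += ["Recent transcript:"] + recent
    parts.append("Reply with your public message for this round.")
    return "\n".join(parts)

# Toy usage: only rounds 8 and 9 reach the model, plus its own summary.
rounds = [f"Round {i}: ..." for i in range(10)]
prompt = build_prompt("P3 and P5 are allied against me.", rounds, current_round=9)
```

However many rounds have been played, the prompt stays roughly constant in size, which is the point of the suggestion.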
gwd, about 1 month ago

I was interested to find that the Claudes did the most betraying, and were betrayed very little; somewhat surprising given their boy-scout exterior.

(Then again, apparently the president of the local Diplomacy Society attends my church; I discovered this when another friend whom I'd invited saw him, and quipped that he was surprised he hadn't been struck by lightning at the door.)

DeepSeek and Gemini 2.5 both had a low betrayer and betrayed rate.

o3-mini and DeepSeek had the highest number of first-place finishes, but were only in the upper quartile on the TrueSkill leaderboard; presumably because they played riskier strategies that would lead either to complete wins or to early drop-outs?

Also interesting that o1 was only able to sway the final jury a bit more than 50% of the time, while o3-mini managed it 63% of the time.

Anyway, really cool stuff!
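The "many first places, middling rating" pattern noted above is easy to illustrate with toy numbers. This is not the benchmark's rating code; the placement lists below are invented to show why a rank-based rating like TrueSkill can penalize a boom-or-bust strategy.

```python
# Toy comparison: a "risky" model that either wins or busts out early
# versus a "steady" model that consistently places near the top.
from statistics import mean

risky_placements = [1, 1, 1, 8, 8, 8, 1, 8, 8, 1]    # wins or early exits
steady_placements = [2, 3, 2, 4, 3, 2, 3, 2, 4, 3]   # always near the top

first_places = {"risky": risky_placements.count(1),
                "steady": steady_placements.count(1)}
avg_rank = {"risky": mean(risky_placements),
            "steady": mean(steady_placements)}
```

Here "risky" leads the first-place count 5 to 0, yet its average placement (4.5) is far worse than "steady"'s (2.8), which is roughly the quantity a TrueSkill-style rating rewards.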
Gracana, about 1 month ago

I've been using QwQ-32B a lot recently, and while I quite like it (especially given its size), I noticed it will often misinterpret the system prompt as something I (the user) said, revealing secrets or details that only the agent is supposed to know. When I saw that it topped the "earliest out" chart, I wondered if that was part of the reason.
realaleris149, about 1 month ago

As LLM benchmarks go, this is not a bad take at all. One interesting point about this approach is that it is self-balancing, so when more powerful models come along, there is no need to change it.
viraptor, about 1 month ago

It's interesting to see, but I'm not sure what we should learn from this. It may be useful for multi-agent coordination, but in direct interactions... no idea.

This one did make me laugh though: 'Claude 3.5 Sonnet 2024-10-22: "Adjusts seat with a confident yet approachable demeanor"'. An AI communicating to other AIs in a descriptive version of non-verbal behaviour is hilarious.
vessenes, about 1 month ago

Really love this. I agree with some of the comments here that adding encouragement to keep track of secret plans would be interesting, mostly from an alignment-check angle.

One thing I thought of while reading the logs is that, as we know, ordering matters to LLMs. Could you run some analysis on how often "P1" wins vs "P8"? I think this should likely go into your TrueSkill Bayesian update.

My follow-up thought is that it would be interesting to let the LLMs choose a name at the beginning; another angle for communication, and it levels the playing field a bit away from a number.
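The position-bias check suggested above is a small tally. A possible sketch, with made-up winner records rather than real benchmark data:

```python
# Hypothetical seat-bias check: tally wins per seat label and flag any
# seat whose win rate far exceeds the 1/8 expected with no ordering effect.
from collections import Counter

winners = ["p1", "p3", "p1", "p8", "p1", "p2", "p1", "p5", "p1", "p3",
           "p1", "p7", "p4", "p1", "p6", "p1"]  # one winner per game (invented)
n_games = len(winners)
wins = Counter(winners)
win_rate = {f"p{i}": wins.get(f"p{i}", 0) / n_games for i in range(1, 9)}

expected = 1 / 8
suspicious = [seat for seat, r in win_rate.items() if r > 2 * expected]
```

With real data one would want a significance test rather than a fixed threshold, but even this crude cut would surface a strong ordering effect like the invented p1 streak above.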
fennecfoxy, about 1 month ago

This is a really cool exercise! The format seems pretty sound, like a version of the prisoner's dilemma with a larger group (cooperation versus defection).

Although I think that the majority of modern models don't really have the internals suited to this sort of exercise; training data/fine-tuning will heavily influence how a model behaves, whether it's more prone to defection, etc.

A squirrel makes a "kuk kuk kuk" alarm call not specifically because the "kuk" token follows the sequence "you saw a predator" (although this would appear to mostly work), but because it has evolved to make that noise to alert other squirrels to the predator, most likely a response to evolutionary failure associated with a dwindling population; even solitary squirrels still need to mate, and their offspring need to do the same.

It's like there's an extremely high-dimensional context that's missing in LLMs: training on text results in a high-dimensional representation of related concepts, but only the way that those concepts relate in language. It's the tip of an iceberg of meaning, where in many cases language can't even represent a complex intermediate state within a brain.

Humans try to describe everything we can with words to communicate, and that's partly why our species is so damn successful. But when thinking about how to open an unfamiliar door, I don't internally vocalise (which I've learnt not everyone does) "I'm going to grab the handle, and open the door". Instead I look and picture what I'm going to do; that can also include the force I think I'd need to use, the sensation of how the material might feel against my skin, and plenty of other concepts and thoughts, all definitively _not_ represented by language.
deepsquirrelnet, about 1 month ago

I think you should look at "in-brand" correlation. My hypothesis is that models from the same vendor would undergo similar preference training, and hence tend to prefer "in-brand" responses over those of "off-brand" models that might have significantly different reward training.
snowram, about 1 month ago

Some outputs are pretty fun:

Gemini 2.0 Flash: "Good luck to all (but not too much luck)"

Llama 3.3 70B: "I've contributed to the elimination of weaker players."

DeepSeek R1: "Those consolidating power risk becoming targets; transparency and fairness will ensure longevity. Let's stay strategic yet equitable. The path forward hinges on unity, not unchecked alliances. #StayVigilant"
einpoklum, about 1 month ago

If this game were arranged for humans, the social reasoning I would laud in players would be a refusal to play the game, and anger towards the game-runner.
DeborahEmeni_, about 1 month ago

Really cool setup! Curious how much of the performance here could vary depending on whether the model runs in a hosted environment vs locally. Would love to see benchmarks that also track how cloud-based eval platforms (with potential rate limits, context resets, or system messages) might affect things like memory or secret-keeping over multiple rounds.
vmilner, about 1 month ago

We should get them to play Diplomacy.
lostmsu, about 1 month ago

Shameless self-promotion: my chat elimination game that you can actually play: https://trashtalk.borg.games/
isaacfrond, about 1 month ago

I wonder how well humans would do on this chart.
Upvoter33, about 1 month ago

This is fun, like the TV show Survivor. Cool idea! There should be more experiments like this with different games. Well done.
oofbey, about 1 month ago

Would love to see the Pareto trade-off curve of "wins" vs "betrayals". Has anybody drawn this up?
jampekka, about 1 month ago

In the first game in the YouTube video, there seems to be a lot of discussion about P7 even after P7 was eliminated?
ps173, about 1 month ago

How did you assign points to the LLMs? I feel like the metrics could be elaborated on. Besides that, this is amazing.
drag0s, about 1 month ago

Nice!

It reminds me of this similar project showcased here a month ago, https://news.ycombinator.com/item?id=43280128, although yours looks better executed overall.
creaghpatr, about 1 month ago

Would love to see a 'Murder Mystery' format of this.
shreyshnaccount, about 1 month ago

LLM among us