
Show HN: Board Game Bench – arena-based evaluation of reasoning LLMs

2 points | by bjterry | about 2 months ago
Board Game Bench is an arena for comparing the performance of LLMs on competitive board games. While board games are simple for humans to play, LLMs struggle to even understand the rules well enough to consistently make valid moves. For example, none of the Scrabble games in this arena end in a complete game; it always comes down to how far the LLMs get before finding themselves unable to make a valid move.

Since it's a competitive setup, and there are hundreds of board games that could be implemented, this arena approach shouldn't become instantly saturated like other benchmarks, although it's certainly possible for individual labs to finetune their models for the specific games selected.

A notable gap is the exclusion of o1 and Google's Gemini 2.5. I may add o1 if there's enough interest, but the arena is a bit expensive to pay for out of pocket, and Gemini's rate limits were too low for me to add it right now.
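For readers curious what such a harness involves, here is a minimal, self-contained sketch of the core arena loop. All names here are hypothetical (this is not the Board Game Bench code): a toy game (Nim) stands in for Scrabble so the example runs end to end, and the Elo update is an assumption about how arena-style leaderboards are typically scored, since the post doesn't specify a rating scheme.

```python
# Hypothetical sketch of an arena loop: an illegal move forfeits the game
# immediately, matching the failure mode the post describes for Scrabble.
import random
from typing import Callable, List, Optional

class NimGame:
    """Toy stand-in game (Nim: take 1-3 stones; taking the last stone wins)."""

    def __init__(self, stones: int = 15):
        self.stones = stones
        self.current = 0  # index of the player to move

    def legal_moves(self) -> List[int]:
        return [n for n in (1, 2, 3) if n <= self.stones]

    def apply(self, move: int) -> None:
        self.stones -= move
        self.current = 1 - self.current

    def is_over(self) -> bool:
        return self.stones == 0

MovePicker = Callable[[NimGame], Optional[int]]

def arena_match(game: NimGame, players: List[MovePicker]) -> dict:
    """Play one game; an illegal move ends it early with a forfeit."""
    while not game.is_over():
        mover = game.current
        move = players[mover](game)        # in a real arena: an LLM API call
        if move not in game.legal_moves():
            return {"winner": 1 - mover, "forfeit_by": mover}
        game.apply(move)
    # apply() flipped current, so the player who took the last stone won.
    return {"winner": 1 - game.current, "forfeit_by": None}

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """Standard Elo update; score_a is 1.0 for a win by A, 0.0 for a loss."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

def careful_player(game: NimGame) -> Optional[int]:
    moves = game.legal_moves()
    return random.choice(moves) if moves else None

def careless_player(game: NimGame) -> Optional[int]:
    # Simulates a model that sometimes hallucinates an illegal move.
    return random.choice([1, 2, 3, 4])  # 4 is never legal in this variant

if __name__ == "__main__":
    result = arena_match(NimGame(), [careful_player, careless_player])
    print(result)
    print(elo_update(1000.0, 1000.0, 1.0 if result["winner"] == 0 else 0.0))
```

Treating an illegal move as an immediate forfeit, rather than reprompting until the model produces a legal one, is what makes "how far the LLMs get" a meaningful signal in its own right.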

No comments yet
