
Show HN: Board Game Bench – arena-based evaluation of reasoning LLMs

2 points | by bjterry | about 2 months ago
Board Game Bench is an arena for comparing the performance of LLMs on competitive board games. While board games are simple for humans to play, LLMs struggle to even understand the rules well enough to consistently make valid moves. For example, none of the Scrabble games in this arena end in a complete game; it always comes down to how far the LLMs get before finding themselves unable to make a valid move.

Since it's a competitive setup, and there are hundreds of board games that could be implemented, this arena approach shouldn't become instantly saturated like other benchmarks, although it's certainly possible for individual labs to finetune their models for the specific games selected.

A notable gap is the exclusion of o1 and Google's Gemini 2.5. I may add o1 if there's enough interest, but the arena is a bit expensive to pay for out of pocket, and Gemini's rate limits were too low for me to add it right now.
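For readers curious what such a harness involves, here is a minimal, self-contained sketch of the core arena loop. All names here are hypothetical (this is not the Board Game Bench code): a toy game (Nim) stands in for Scrabble so the example runs end to end, and the Elo update is an assumption about how arena-style leaderboards are typically scored, since the post doesn't specify a rating scheme.

```python
# Hypothetical sketch of an arena loop: an illegal move forfeits the game
# immediately, matching the failure mode the post describes for Scrabble.
import random
from typing import Callable, List, Optional

class NimGame:
    """Toy stand-in game (Nim: take 1-3 stones; taking the last stone wins)."""

    def __init__(self, stones: int = 15):
        self.stones = stones
        self.current = 0  # index of the player to move

    def legal_moves(self) -> List[int]:
        return [n for n in (1, 2, 3) if n <= self.stones]

    def apply(self, move: int) -> None:
        self.stones -= move
        self.current = 1 - self.current

    def is_over(self) -> bool:
        return self.stones == 0

MovePicker = Callable[[NimGame], Optional[int]]

def arena_match(game: NimGame, players: List[MovePicker]) -> dict:
    """Play one game; an illegal move ends it early with a forfeit."""
    while not game.is_over():
        mover = game.current
        move = players[mover](game)        # in a real arena: an LLM API call
        if move not in game.legal_moves():
            return {"winner": 1 - mover, "forfeit_by": mover}
        game.apply(move)
    # apply() flipped current, so the player who took the last stone won.
    return {"winner": 1 - game.current, "forfeit_by": None}

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """Standard Elo update; score_a is 1.0 for a win by A, 0.0 for a loss."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

def careful_player(game: NimGame) -> Optional[int]:
    moves = game.legal_moves()
    return random.choice(moves) if moves else None

def careless_player(game: NimGame) -> Optional[int]:
    # Simulates a model that sometimes hallucinates an illegal move.
    return random.choice([1, 2, 3, 4])  # 4 is never legal in this variant

if __name__ == "__main__":
    result = arena_match(NimGame(), [careful_player, careless_player])
    print(result)
    print(elo_update(1000.0, 1000.0, 1.0 if result["winner"] == 0 else 0.0))
```

Treating an illegal move as an immediate forfeit, rather than reprompting until the model produces a legal one, is what makes "how far the LLMs get" a meaningful signal in its own right.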

No comments yet
