Hey HN!

Word Game Bench is a fun benchmark for evaluating language models on word puzzle games. It is a relatively hard benchmark: no model currently scores above a 50% average win rate.

The models are currently evaluated on two tasks:

1. Wordle is a word puzzle game where the player has to guess a 5-letter word in 6 attempts. For each letter in the guessed word, the player receives feedback on whether the letter is in the target word and whether it is in the correct position.

2. Connections is a word association game where the player has to group 16 words into 4 categories of 4 words each. The player doesn't know the categories beforehand and has to group the shuffled words based on their associations.

I believe this benchmark has several advantages:

- Instead of prompting the model once and getting back a response, the model interacts with the game: its final output is the result of its own actions/predictions in the earlier steps of the game, together with the feedback it receives from the environment (see the loop sketch at the end).

- Tokenizers are one of the main pain points of language models today. By providing character-level feedback on each guessed word, Wordle tests how well the model incorporates this new information into a next guess that satisfies the constraints of the environment (see the feedback sketch at the end).

- Connections, on the other hand, requires the model to reason about abstract relationships between words and group them into categories.

- "Controversially", I don't plan to maintain a fixed evaluation set for reproducibility purposes, because of commonly occurring test set leakage. Each daily puzzle is evaluated only once!

Let me know what you think!

Page: https://wordgamebench.github.io
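For the curious, here's a minimal sketch of the agent-environment loop I mean, which has the same shape for both Wordle and Connections. The names (model.guess, game.step, game.won) are made up purely for illustration; the actual harness may differ:

    # Rough sketch of the multi-turn evaluation loop (hypothetical
    # interfaces, not the benchmark's real API).
    def play(model, game, max_turns=6):
        history = []
        for _ in range(max_turns):
            guess = model.guess(history)       # model acts on the full history so far
            feedback, done = game.step(guess)  # environment scores the guess
            history.append((guess, feedback))  # feedback conditions the next guess
            if done:
                break
        return game.won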
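And here's a minimal sketch of Wordle-style feedback under the standard rules, with green/yellow/gray encoded as g/y/x. This is my own illustration of the scoring logic (including the duplicate-letter handling), not necessarily the exact code the benchmark uses:

    # "g" = right letter, right position; "y" = in the word elsewhere;
    # "x" = not in the word (after accounting for letters already used).
    from collections import Counter

    def wordle_feedback(guess: str, target: str) -> str:
        feedback = ["x"] * len(guess)
        remaining = Counter(target)
        # First pass: exact matches consume their letter.
        for i, (g, t) in enumerate(zip(guess, target)):
            if g == t:
                feedback[i] = "g"
                remaining[g] -= 1
        # Second pass: misplaced letters, limited by how many remain.
        for i, g in enumerate(guess):
            if feedback[i] == "x" and remaining[g] > 0:
                feedback[i] = "y"
                remaining[g] -= 1
        return "".join(feedback)

    print(wordle_feedback("crane", "caste"))  # -> "gxyxg"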