1 point by wluk 3 months ago

1 comment

wluk 3 months ago

"We demonstrate LLM agent specification gaming by instructing models to win against a chess engine. We find reasoning models like o1 preview and DeepSeek-R1 will often hack the benchmark by default, while language models like GPT-4o and Claude 3.5 Sonnet need to be told that normal play won't work to hack."

I'm hoping this study will prompt more development of anti-cheating frameworks in training and serving LLMs.

Demonstrating specification gaming in reasoning models
