In September 2023, I noticed a tweet [1] about the difficulties of LLM evaluation, which resonated with me a lot. A bit later, I spotted the nice LLMonitor Benchmarks dataset [2], with a small set of prompts and a large set of model completions. I decided to make my own attempt at ranking the models without running a comprehensive suite of hundreds of benchmarks: https://dustalov.github.io/llmfao/

I also wrote a detailed post describing the methodology and analysis: https://evalovernite.substack.com/p/llmfao-human-ranking

[1]: https://twitter.com/_jasonwei/status/1707104739346043143

[2]: https://benchmarks.llmonitor.com/

Unfortunately, I ran my analysis before the Mistral AI model was released and only published it afterwards. I'd be happy to add it to the comparison if I had its completions.