In September 2023, I noticed a tweet [1] about the difficulties of LLM evaluation, which resonated with me a lot. A bit later, I spotted the nice LLMonitor Benchmarks dataset [2], with a small set of prompts and a large set of model completions. I decided to make my own attempt at ranking the models without running a comprehensive suite of hundreds of benchmarks: https://dustalov.github.io/llmfao/

I also wrote a detailed post describing the methodology and analysis: https://evalovernite.substack.com/p/llmfao-human-ranking

[1]: https://twitter.com/_jasonwei/status/1707104739346043143

[2]: https://benchmarks.llmonitor.com/

Unfortunately, I ran my analysis before the Mistral AI model was released but published it afterwards. I'd be happy to add it to the comparison if I had its completions.
This is really cool, nice work. Did you try out any of the grading yourself to compare it to the contractors you used? One thing I've found, especially for coding questions, is that models can produce an answer that _looks_ great but turns out to use libraries or methods that don't exist, and human graders tend to rate these highly since they don't actually run the code.
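For what it's worth, a quick automated sanity check can catch at least the hallucinated-library part of this before a grader ever reads the answer. The sketch below is plain Python and purely my own illustration (not part of the LLMFAO pipeline): it parses a model's code answer and flags imports that don't resolve in the grading environment. It only catches missing packages, not made-up methods, so it complements rather than replaces running the code.

    # Sketch: flag hallucinated imports in a model-generated Python answer,
    # without executing it.
    import ast
    import importlib.util

    def check_imports(code: str) -> list[str]:
        """Return a list of problems found in a code answer."""
        try:
            tree = ast.parse(code)
        except SyntaxError as exc:
            return [f"does not parse: {exc}"]

        # Collect the top-level package names the answer imports.
        packages = set()
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                packages.update(alias.name.split(".")[0] for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module:
                packages.add(node.module.split(".")[0])

        # Flag any package that cannot be found in the current environment.
        return [f"imports '{name}', which is not installed or does not exist"
                for name in sorted(packages)
                if importlib.util.find_spec(name) is None]

    answer = "import numpy as np\nimport magicplotlib\nprint(np.arange(3))\n"
    print(check_imports(answer))  # flags the nonexistent 'magicplotlib'

The fully honest check is still to execute the answer in a sandbox, which is exactly the step human graders tend to skip.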