Show HN: Llmfao – Human-Ranked LLM Leaderboard with Sixty Models

2 points by scoresmoke over 1 year ago
In September 2023, I noticed a tweet [1] on difficulties with LLM evaluation, which resonated with me a lot. A bit later, I spotted a nice LLMonitor Benchmarks dataset [2] with a small set of prompts and a large set of model completions. I decided to make my attempt without running a comprehensive suite of hundreds of benchmarks: https://dustalov.github.io/llmfao/

I also wrote a detailed post describing the methodology and analysis: https://evalovernite.substack.com/p/llmfao-human-ranking

[1]: https://twitter.com/_jasonwei/status/1707104739346043143

[2]: https://benchmarks.llmonitor.com/

Unfortunately, I did my analysis before the Mistral AI model was released, but published it after the model was released. I’d be happy to add it to the comparison if I had their completions.

1 comment

maxrmk over 1 year ago
This is really cool, nice work. Did you try out any of the grading yourself to compare it to the contractors you used? One thing I've found, especially for coding questions, is that models can produce an answer that _looks_ great but turns out to use libraries or methods that don't exist, and human graders tend to rate these highly since they don't actually run the code.