
Show HN: Llmfao – Human-Ranked LLM Leaderboard with Sixty Models

2 points, by scoresmoke, over 1 year ago
In September 2023, I noticed a tweet [1] on the difficulties of LLM evaluation, which resonated with me a lot. A bit later, I spotted a nice LLMonitor Benchmarks dataset [2] with a small set of prompts and a large set of model completions. I decided to make my own attempt without running a comprehensive suite of hundreds of benchmarks: https://dustalov.github.io/llmfao/

I also wrote a detailed post describing the methodology and analysis: https://evalovernite.substack.com/p/llmfao-human-ranking

[1]: https://twitter.com/_jasonwei/status/1707104739346043143

[2]: https://benchmarks.llmonitor.com/

Unfortunately, I completed my analysis before the Mistral AI model was released but published it afterward. I'd be happy to add it to the comparison if I had its completions.
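The linked post covers the full methodology; as a rough sketch (an assumption on my part, not necessarily LLMFAO's exact procedure), human pairwise judgments like these are commonly aggregated into a single leaderboard with the Bradley-Terry model. The `bradley_terry` helper below is a hypothetical, minimal MM-style fitter over `(winner, loser)` pairs:

```python
from collections import defaultdict

def bradley_terry(pairs, iters=100):
    """Fit Bradley-Terry strengths from (winner, loser) pairs via MM updates.

    Each item i gets a score p_i; the model assumes i beats j with
    probability p_i / (p_i + p_j). Higher score = stronger model.
    """
    items = {x for pair in pairs for x in pair}
    wins = defaultdict(int)      # total wins per item
    matches = defaultdict(int)   # comparison counts per unordered pair
    for winner, loser in pairs:
        wins[winner] += 1
        matches[frozenset((winner, loser))] += 1

    p = {i: 1.0 for i in items}  # uniform initial strengths
    for _ in range(iters):
        new_p = {}
        for i in items:
            # MM update: wins_i / sum over opponents of n_ij / (p_i + p_j)
            denom = sum(
                matches[frozenset((i, j))] / (p[i] + p[j])
                for j in items
                if j != i and frozenset((i, j)) in matches
            )
            new_p[i] = wins[i] / denom if denom > 0 else p[i]
        # Rescale so scores don't drift toward zero or infinity.
        total = sum(new_p.values())
        p = {i: v * len(items) / total for i, v in new_p.items()}
    return p
```

For example, feeding it judgments where model A mostly beats B, and both beat C, yields scores ordered A > B > C; sorting by score gives the leaderboard.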

1 comment

maxrmk, over 1 year ago
This is really cool, nice work. Did you try any of the grading yourself to compare against the contractors you used? One thing I've found, especially for coding questions, is that models can produce an answer that _looks_ great but turns out to use libraries or methods that don't exist. Human graders tend to rate these highly, since they don't actually run the code.