
Kagi LLM Benchmarking Project

64 points by bx376 10 months ago

5 comments

xianshou 10 months ago
This is *almost* perfect.

The gold standard for LLM evaluation would have the following qualities:

1. Categorized (e.g. coding, reasoning, general knowledge)
2. Multimodal (at least text and image)
3. Multiple difficulties (something like "GPT-4 saturates or scores >90%", a la MMLU, "GPT-4 scores 20-80%", and "GPT-4 scores <10%")
4. Hidden (under 10% of the dataset publicly available, enough methodological detail to inspire confidence but not enough to design to the test set)

The standard model card suite with MMLU, HumanEval etc. has already been optimized to the point of diminishing value - Goodhart's law in action. Meanwhile, arena Elo (https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard) is extremely useful, but also has the drawback of reflecting median-voter preferences that will not necessarily correlate with true intelligence as capabilities continue to advance, in the same sense as how the doctor with the best bedside manner is not necessarily the best doctor.

Until that happens, I'll pay attention to every eval I can find, but am also stuck asking "how many r's are in strawberry?" and "draw a 7-sided stop sign" to get a general impression of intelligence independent of gameable or overly general benchmarks.

But all that aside:

    Model             | Score
    ------------------+------
    GPT-4o            |  52
    Llama 3.1 405B    |  50
    Claude 3.5 Sonnet |  46
    Mistral Large     |  44
    Gemini 1.5 Pro    |  12

What an incredible contrast to MMLU, where all of these models score in the 80-90% range! For what it's worth, these scores also fall much closer to my impressions from daily use. Gemini is awful, Sonnet and 4o are amazing, and the new Llama puts fine-tunable, open-source 4o in the hands of anyone with a mini-cluster.
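A rough sketch of the kind of informal probe the commenter describes - fire a couple of "trick" questions at each model and eyeball the answers. This is not Kagi's actual harness; it assumes the openai Python SDK (>=1.0) and an OPENAI_API_KEY in the environment, and the model names are just examples.

    # Informal probe questions across a few models; not Kagi's eval harness.
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    PROBES = [
        "How many r's are in 'strawberry'?",
        "Describe, step by step, how you would draw a 7-sided stop sign.",
    ]

    MODELS = ["gpt-4o", "gpt-3.5-turbo"]  # swap in whatever models you have access to

    for model in MODELS:
        for probe in PROBES:
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": probe}],
                temperature=0,  # near-deterministic answers make comparison easier
            )
            print(f"[{model}] {probe}\n  -> {resp.choices[0].message.content}\n")

Obviously this only gives a vibe check, not a score, which is the commenter's point about needing hidden, categorized benchmarks for anything rigorous.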
tunesmith 10 months ago
Having very little familiarity with models other than using ChatGPT+, I can't quite make sense of how it scores GPT 3.5, GPT 4, and GPT 4o. Going from 3.5 to 4 felt like a revolutionary step up to me, and I still haven't quite grasped the quality improvement between 4 and 4o - 4o just seems a little chattier to me than 4. (My perceptions are through the web, not through the API.)
Tiberium 10 months ago
Would be nice if they at least published the system prompts used (they can affect the performance of the models), and whether they use few-shot prompting or not.
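For readers unfamiliar with the distinction the commenter is drawing, here is a minimal sketch of what "system prompt" and "few-shot prompting" mean in a chat API call. The message lists below are illustrative, not Kagi's actual prompts; it assumes the openai Python SDK and a placeholder question.

    # Zero-shot vs few-shot prompting with a system prompt; illustrative only.
    from openai import OpenAI

    client = OpenAI()

    question = "Q: <an eval question would go here>"  # placeholder, not a real benchmark item

    # Zero-shot: the model sees only the task instruction and the question.
    zero_shot = [
        {"role": "system", "content": "Answer concisely with the final answer only."},
        {"role": "user", "content": question},
    ]

    # Few-shot: worked examples are prepended, which can noticeably shift scores.
    few_shot = [
        {"role": "system", "content": "Answer concisely with the final answer only."},
        {"role": "user", "content": "Q: What is 2 + 2 * 3?"},
        {"role": "assistant", "content": "8"},
        {"role": "user", "content": question},
    ]

    for name, messages in [("zero-shot", zero_shot), ("few-shot", few_shot)]:
        resp = client.chat.completions.create(model="gpt-4o", messages=messages)
        print(name, "->", resp.choices[0].message.content)

Because the two setups can produce noticeably different scores, knowing which one an eval used matters for comparing results across leaderboards.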
attentive 10 months ago
I wonder why their Groq llama-3.1-70b-versatile gets 81 t/sec when I get 250 t/sec on the same model.
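One possible source of the gap is how throughput is measured. A rough way to check it yourself is to time a single non-streaming completion and divide output tokens by wall-clock time; note this includes queueing and time-to-first-token, so it will read lower than a provider's reported generation speed. The sketch below assumes Groq exposes an OpenAI-compatible endpoint at the base_url shown (my assumption) and uses the openai Python SDK.

    # Rough tokens/sec measurement against an assumed OpenAI-compatible Groq endpoint.
    import os
    import time

    from openai import OpenAI

    client = OpenAI(
        api_key=os.environ["GROQ_API_KEY"],
        base_url="https://api.groq.com/openai/v1",  # assumed endpoint, verify before use
    )

    start = time.perf_counter()
    resp = client.chat.completions.create(
        model="llama-3.1-70b-versatile",
        messages=[{"role": "user", "content": "Write a 300-word summary of the history of the transistor."}],
    )
    elapsed = time.perf_counter() - start

    tokens = resp.usage.completion_tokens
    print(f"{tokens} tokens in {elapsed:.2f}s -> {tokens / elapsed:.0f} t/s")

Differences in prompt length, measurement window, and server load could all account for part of an 81 vs 250 t/sec discrepancy.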
xfalcox 10 months ago
Is it just me, or are the two Llama 3.1 providers the two known for running quantization levels that undermine Llama perf?