Ask HN: What benchmarks are you using to judge AI models?

4 points by cowpig 14 days ago
There are so many models, and so many new ones being released all the time, that I have a hard time knowing which ones to prioritize testing anecdotally. What benchmarks have you found to be especially indicative of real-world performance?

I use:

* Aider's Polyglot benchmark seems to be a decent indicator of which models are going to be good at coding:

https://aider.chat/docs/leaderboards/

* I generally assume OpenRouter usage to be an indicator of a model's popularity, and by proxy, utility:

https://openrouter.ai/rankings

* LLM-Stats has a lot of charts of benchmarks that I look at:

https://llm-stats.com/

2 comments

paulcole 14 days ago
> There are so many models, and so many new ones being released all the time, that I have a hard time knowing which ones to prioritize testing anecdotally

Just pick one and use it. The ones you’ve heard of (if you are not obsessively refreshing AI model rankings pages) are basically the same.

I’m sure I’ll get a ton of pushback that the one somebody loves is obviously so much better than the other one, but whatever.

Just give me OpenAI’s most popular model, their fastest model, and their newest model. I’ll pick among those 3 based on what I’m prioritizing in the moment (speed, depth, everyday use).

kadushka 14 days ago
For me it's the opposite - we don't get enough models to test. In the last 6 months, we got Claude 3.7, OpenAI o1, Grok 3, Gemini 2.5 Pro, and OpenAI o3. That's it - 5 frontier models. Not that hard to test each one of them manually, which I did for many hours and with many different tasks. o1 --> o3 and 2.5 Pro are the ones I'm using the most.

I couldn't care less about benchmarks - I know what these models are capable of from personal experience.