科技回声

There are so many models, and so many new ones being released all the time, that I have a hard time knowing which ones to prioritize testing anecdotally. What benchmarks have you found to be especially indicative of real-world performance?

I use:

* Aider's Polyglot benchmark seems to be a decent indicator of which models are going to be good at coding: https://aider.chat/docs/leaderboards/
* I generally assume OpenRouter usage to be an indicator of a model's popularity, and by proxy, utility: https://openrouter.ai/rankings
* LLM-Stats has a lot of charts of benchmarks that I look at: https://llm-stats.com/

> There are so many models, and so many new ones being released all the time, that I have a hard time knowing which ones to prioritize testing anecdotally

Just pick one and use it. The ones you’ve heard of (if you are not obsessively refreshing AI model rankings pages) are basically the same.

I’m sure I’ll get a ton of pushback that the one somebody loves is obviously so much better than the other one, but whatever.

Just give me OpenAI’s most popular model, their fastest model, and their newest model. I’ll pick among those 3 based on what I’m prioritizing in the moment (speed, depth, everyday use).

For me it's the opposite - we don't get enough models to test. In the last 6 months, we got Claude 3.7, OpenAI o1, Grok 3, Gemini 2.5 Pro, and OpenAI o3. That's it - 5 frontier models. Not that hard to test each one of them manually, which I did for many hours and with many different tasks. o1 --> o3 and 2.5 Pro are the ones I'm using the most.

I couldn't care less about benchmarks - I know what these models are capable of from personal experience.

Ask HN: What benchmarks are you using to judge AI models?

2 comments
