There are so many models, and so many new ones being released all the time, that I have a hard time knowing which ones to prioritize for anecdotal testing. What benchmarks have you found to be especially indicative of real-world performance?

I use:

* Aider's Polyglot benchmark seems to be a decent indicator of which models are going to be good at coding: https://aider.chat/docs/leaderboards/

* I generally assume OpenRouter usage to be an indicator of a model's popularity and, by proxy, its utility: https://openrouter.ai/rankings

* LLM-Stats has a lot of benchmark charts that I look at: https://llm-stats.com/
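Partly to keep up with the flood of new releases, a quick script helps too. This is just a minimal sketch, assuming OpenRouter's public GET /api/v1/models listing endpoint and the id/name/created/context_length fields its docs describe; treat the exact field names as assumptions and check the current API reference:

    # Minimal sketch: list the most recently added models on OpenRouter.
    # Assumes the public GET /api/v1/models endpoint and the id/name/created/
    # context_length fields described in OpenRouter's API docs; verify against
    # the current documentation before relying on this.
    import datetime

    import requests

    resp = requests.get("https://openrouter.ai/api/v1/models", timeout=30)
    resp.raise_for_status()
    models = resp.json()["data"]

    # Sort by creation timestamp (Unix seconds), newest first.
    models.sort(key=lambda m: m.get("created", 0), reverse=True)

    for m in models[:15]:
        added = datetime.datetime.fromtimestamp(m.get("created", 0)).date()
        print(f"{added}  {m['id']:<45}  ctx={m.get('context_length')}")

I still rely on the rankings page for the popularity signal; the listing just makes it easy to notice what's new before deciding what's worth hands-on time.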
> There are so many models, and so many new ones being released all the time, that I have a hard time knowing which ones to prioritize for anecdotal testing

Just pick one and use it. The ones you’ve heard of (if you are not obsessively refreshing AI model rankings pages) are basically the same.

I’m sure I’ll get a ton of pushback that the one somebody loves is obviously so much better than the others, but whatever.

Just give me OpenAI’s most popular model, their fastest model, and their newest model. I’ll pick among those three based on what I’m prioritizing in the moment (speed, depth, everyday use).
For me it's the opposite - we don't get enough models to test. In the last 6 months, we got Claude 3.7, OpenAI o1, Grok 3, Gemini 2.5 Pro, and OpenAI o3. That's it - 5 frontier models. Not that hard to test each one of them manually, which I did for many hours and with many different tasks. o1 --> o3 and 2.5 Pro are the ones I'm using the most.

I couldn't care less about benchmarks - I know what these models are capable of from personal experience.