TechEcho

I’ve seen the standard evals and benchmarks for new LLMs, but they don’t really capture how I actually use them. My own test is pretty specific: whenever a new LLM drops, I ask it to “Write an advanced three.js music visualizer.” Then I compare it to older models by checking:

1. Does it use a recent version of three.js?
2. Does the generated code run out of the box?
3. How complex/innovative is the visualizer?

I’m really curious to hear about other people’s “real-world” benchmarks. What’s your personal test prompt or scenario that reveals whether a new LLM is actually useful for you? How do you decide if it’s truly better than the last version?
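The first check (which three.js version the output uses) could be partly automated. Below is a rough sketch, not a definitive tool: it assumes the generated code loads three.js from a CDN URL in either the npm style (`three@0.160.0`) or the cdnjs style (`three.js/r128`), and it relies on the npm minor version matching the three.js revision number. The names `three_revisions`, `uses_recent_three`, and `min_revision` are hypothetical, made up for this illustration.

```python
import re

# Assumed URL shapes; generated code may load three.js in other ways
# (local bundle, bare "three" import), which this sketch will not detect.
_PATTERNS = [
    re.compile(r"three@0\.(\d+)\.\d+"),  # npm-style, e.g. three@0.160.0
    re.compile(r"three\.js/r(\d+)"),     # cdnjs-style, e.g. three.js/r128
]


def three_revisions(generated_code: str) -> list[int]:
    """Return the sorted, de-duplicated three.js revisions referenced in the code."""
    found = set()
    for pattern in _PATTERNS:
        for match in pattern.finditer(generated_code):
            found.add(int(match.group(1)))
    return sorted(found)


def uses_recent_three(generated_code: str, min_revision: int) -> bool:
    """True if any referenced revision is at least min_revision (a threshold you pick)."""
    revisions = three_revisions(generated_code)
    return bool(revisions) and max(revisions) >= min_revision
```

For example, `uses_recent_three(model_output, min_revision=150)` would flag outputs that still pin a much older release. The other two checks (does it run, how inventive is it) still need a browser and a human eye.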

Ask HN: How do you personally evaluate LLMs?

no comments
