So I think this is an excellent post. Indeed, LLM maximalism is pretty dumb. They're awesome at specific things and mediocre at others. In particular, I get the most frustrated when I see people try to use them for tasks that need deterministic outputs *and where the thing you need to create is already known statically*. My hope is that it's just people being super excited by the tech.

I wanted to call this out, though, as it makes the case that to improve any component (and really make it production-worthy), you need an evaluation system:

> Intrinsic evaluation is like a unit test, while extrinsic evaluation is like an integration test. You do need both. It’s very common to start building an evaluation set, and find that your ideas about how you expect the component to behave are much vaguer than you realized. You need a clear specification of the component to improve it, and to improve the system as a whole. Otherwise, you’ll end up in a local maximum: changes to one component will seem to make sense in themselves, but you’ll see worse results overall, because the previous behavior was compensating for problems elsewhere. Systems like that are very difficult to improve.

I think this makes sense from the perspective of a team with deeper ML expertise.

What it doesn't mention is that building one is an enormous effort, made even larger when you don't have existing ML expertise. I've been finding this out the hard way.

I've found that if you have "hard criteria" to evaluate (e.g., getting the LLM to produce a given structure rather than the open-ended output of a chat app), you can quantify improvements with observability tooling (SLOs!) and iterate in production: ship changes daily, version what you're shipping, and watch behavior over time (roughly the sketch at the end of this comment). It's arguably a lot less "clean", but it's much faster, and because it works on real-world usage data, it's really effective. An ML engineer might call that some form of "online testing", but I don't think the term quite fits.

At any rate, there are other use cases where you really do need evaluations. The more important correct output is, the more it's worth investing in evals. I would argue that if bad outputs have high consequences, then maybe LLMs aren't the right tech for the job in the first place, but that will probably change in a few years. Hopefully building evaluations will get easier too.
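
To make the "hard criteria" idea concrete, here's a minimal sketch of what I mean, assuming the LLM is asked to return JSON with a known shape. The field names, the validate_output helper, and the logging-based metric emission are all hypothetical; you'd wire the pass/fail signal into whatever observability stack you actually use.

    import json
    import logging
    import time

    logger = logging.getLogger("llm_output_slo")

    # Hypothetical expected structure for the LLM's JSON response.
    REQUIRED_FIELDS = {"title": str, "priority": int, "tags": list}

    def validate_output(raw: str) -> bool:
        """Return True if the response parses and matches the expected structure."""
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            return False
        return all(
            isinstance(data.get(field), expected_type)
            for field, expected_type in REQUIRED_FIELDS.items()
        )

    def record_result(raw: str, prompt_version: str) -> bool:
        """Emit a pass/fail event; the observability tool aggregates these into an
        SLO, e.g. "99% of responses parse into the expected structure over 30 days"."""
        ok = validate_output(raw)
        logger.info(
            "llm_structured_output_check",
            extra={"ok": ok, "prompt_version": prompt_version, "ts": time.time()},
        )
        return ok

Tag each check with the prompt/template version you shipped, and the dashboard tells you whether yesterday's change actually moved the success rate.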