Hi HN!<p>This is Arjun and Saikat, and like other product engineers, we've been excited to build with LLMs. Having powerful models available as off-the-shelf HTTP endpoints is a huge leap forward for integrating and shipping ML to end-users.<p>While building on top of LLMs, we've also experienced the pain of non-deterministic behavior – especially for applications that require smaller models. Iterating on model configuration while ensuring no regressions across hundreds of scenarios is a tricky balance.<p>To make this easier, we built Empirical. Here’s a demo video: <a href="https://www.youtube.com/watch?v=p8gSGphcOSU" rel="nofollow">https://www.youtube.com/watch?v=p8gSGphcOSU</a><p>We've focused on:<p>- Fast iteration cycles and interactivity when you need to change the prompt or add a new sample. We wanted to build something that feels like “hot reload” for LLM development<p>- A capable UI that combines objective and subjective evaluation, since eyeballing outputs makes it easier to build intuition around model behavior<p>- The ability to customize which model to test, or how to score it
— with JavaScript (or Python, if you really must)<p>- Embedded analytics for evaluation results, powered by DuckDB under the hood (more coming up on this!)<p>You can try Empirical today – with a one-line CLI command – locally or in CI/CD. And oh, Empirical is 100% open source – so file an issue and we’d be happy to make it work for your use case:<p>$ npx empiricalrun<p>GitHub: <a href="https://github.com/empirical-run/empirical">https://github.com/empirical-run/empirical</a><p>Docs: <a href="https://docs.empirical.run/" rel="nofollow">https://docs.empirical.run/</a>
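<p>To give a flavor of what a custom JavaScript scorer can look like – note this is a simplified sketch, not Empirical's exact interface (the function name and parameter shape here are illustrative; see the docs for the real API) – a scorer is just a function from a model output and the sample's expected value to a score with an explanation:

```javascript
// Hypothetical scorer sketch – illustrative shape, not Empirical's actual API.
// It receives the model output and the sample's expected value, and returns
// a score between 0 and 1 plus a human-readable message.
function containsExpected({ output, expected }) {
  const passed = output.toLowerCase().includes(expected.toLowerCase());
  return {
    score: passed ? 1 : 0,
    message: passed
      ? "output contains the expected text"
      : `output is missing "${expected}"`,
  };
}

// Example usage:
const result = containsExpected({
  output: "The capital of France is Paris.",
  expected: "Paris",
});
console.log(result.score); // 1
```

Because scorers are plain functions, you can unit-test them on their own before wiring them into an eval run.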