Hey HN! Creator here. I recently found myself writing evaluations for actual production LLM projects and kept running into the same dilemma: either reinvent the wheel or adopt a heavyweight commercial system with tons of features I don't need right now.

Then it hit me: evaluations are (more or less) just tests, so why not write them as such with pytest?

That's why I created pytest-evals - a lightweight pytest plugin for building evaluations. It's intentionally not a sophisticated system with dashboards (and not meant as a "robust" solution). It's minimalistic, focused, and definitely not trying to be a startup.

<pre><code># Evaluate the LLM classifier on each case
@pytest.mark.eval(name="my_classifier")
@pytest.mark.parametrize("case", TEST_DATA)
def test_classifier(case: dict, eval_bag, classifier):
    # Run the prediction and store the results
    eval_bag.prediction = classifier(case["Input Text"])
    eval_bag.expected = case["Expected Classification"]
    eval_bag.accuracy = eval_bag.prediction == eval_bag.expected


# Now let's see how our app is performing across all cases...
@pytest.mark.eval_analysis(name="my_classifier")
def test_analysis(eval_results):
    accuracy = sum(result.accuracy for result in eval_results) / len(eval_results)
    print(f"Accuracy: {accuracy:.2%}")
    assert accuracy >= 0.7  # Ensure our performance is not degrading
</code></pre>
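For context, the example above assumes a TEST_DATA list and a classifier fixture coming from your own project. Here's a minimal, purely illustrative sketch of what those might look like (the dummy keyword check stands in for whatever prompt/LLM call your app actually makes):

<pre><code># conftest.py - illustrative only; swap in your own data loading and LLM call
import pytest

# Hypothetical ground-truth cases, e.g. loaded from a CSV/JSONL file in practice
TEST_DATA = [
    {"Input Text": "My order never arrived", "Expected Classification": "shipping"},
    {"Input Text": "I was charged twice", "Expected Classification": "billing"},
]


@pytest.fixture
def classifier():
    def classify(text: str) -> str:
        # Placeholder logic standing in for a real LLM call
        return "billing" if "charged" in text.lower() else "shipping"

    return classify
</code></pre>
In a real setup the fixture would wrap your prompt/LLM call; the plugin then separates the evaluation run (collecting per-case results) from the analysis run (asserting on the aggregate) - see the README for the exact invocation.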
Would love to hear your thoughts! If you find it useful, a GitHub star would be appreciated.
The pytest-evals README mentions that it's built on pytest-harvest, which works with pytest-xdist and pytest-asyncio.

pytest-harvest: https://smarie.github.io/python-pytest-harvest/

> Store data created during your pytest tests execution, and retrieve it at the end of the session, e.g. for applicative benchmarking purposes
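For anyone unfamiliar with pytest-harvest, the pattern it enables looks roughly like this (a minimal sketch assuming pytest-harvest and pandas are installed; the test names are made up, and pytest-evals' eval_bag/eval_results appear to build on the same mechanism):

<pre><code># test_harvest_demo.py - small pytest-harvest sketch, names are illustrative
import pytest


@pytest.mark.parametrize("n", [1, 2, 3])
def test_square(n, results_bag):
    # results_bag is provided by pytest-harvest; attributes set on it are harvested
    results_bag.square = n * n


def test_collect(module_results_df):
    # module_results_df is a pandas DataFrame of parameters and harvested values
    # for the tests in this module that have run so far
    print(module_results_df[["square"]])
</code></pre>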