Hey HN! Creator here. I recently found myself writing evaluations for actual production LLM projects and kept running into the same dilemma: either reinvent the wheel or adopt a heavyweight commercial system with tons of features I don't need right now.

Then it hit me: evaluations are (more or less) just tests, so why not write them as such with pytest?

That's why I created pytest-evals - a lightweight pytest plugin for building evaluations. It's intentionally not a sophisticated system with dashboards (and not meant as a "robust" solution). It's minimalistic, focused, and definitely not trying to be a startup.

<pre><code># Evaluate the LLM classifier on each case
@pytest.mark.eval(name="my_classifier")
@pytest.mark.parametrize("case", TEST_DATA)
def test_classifier(case: dict, eval_bag, classifier):
    # Run the prediction and store the results
    eval_bag.prediction = classifier(case["Input Text"])
    eval_bag.expected = case["Expected Classification"]
    eval_bag.accuracy = eval_bag.prediction == eval_bag.expected


# Now let's see how our app is performing across all cases...
@pytest.mark.eval_analysis(name="my_classifier")
def test_analysis(eval_results):
    accuracy = sum(result.accuracy for result in eval_results) / len(eval_results)
    print(f"Accuracy: {accuracy:.2%}")
    assert accuracy >= 0.7  # Ensure our performance is not degrading
</code></pre>
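For context, the example above assumes a TEST_DATA list and a classifier fixture coming from your own project. Here's a minimal, purely illustrative sketch of what those might look like (the dummy keyword check stands in for whatever prompt/LLM call your app actually makes):

<pre><code># conftest.py - illustrative only; swap in your own data loading and LLM call
import pytest

# Hypothetical ground-truth cases, e.g. loaded from a CSV/JSONL file in practice
TEST_DATA = [
    {"Input Text": "My order never arrived", "Expected Classification": "shipping"},
    {"Input Text": "I was charged twice", "Expected Classification": "billing"},
]


@pytest.fixture
def classifier():
    def classify(text: str) -> str:
        # Placeholder logic standing in for a real LLM call
        return "billing" if "charged" in text.lower() else "shipping"

    return classify
</code></pre>
In a real setup the fixture would wrap your prompt/LLM call; the plugin then separates the evaluation run (collecting per-case results) from the analysis run (asserting on the aggregate) - see the README for the exact invocation.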
Would love to hear your thoughts! If you find it useful, a GitHub star would be appreciated.
The pytest-evals README mentions that it's built on pytest-harvest, which works with pytest-xdist and pytest-asyncio.

pytest-harvest: https://smarie.github.io/python-pytest-harvest/

> Store data created during your pytest tests execution, and retrieve it at the end of the session, e.g. for applicative benchmarking purposes
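For anyone unfamiliar with pytest-harvest, the pattern it enables looks roughly like this (a minimal sketch assuming pytest-harvest and pandas are installed; the test names are made up, and pytest-evals' eval_bag/eval_results appear to build on the same mechanism):

<pre><code># test_harvest_demo.py - small pytest-harvest sketch, names are illustrative
import pytest


@pytest.mark.parametrize("n", [1, 2, 3])
def test_square(n, results_bag):
    # results_bag is provided by pytest-harvest; attributes set on it are harvested
    results_bag.square = n * n


def test_collect(module_results_df):
    # module_results_df is a pandas DataFrame of parameters and harvested values
    # for the tests in this module that have run so far
    print(module_results_df[["square"]])
</code></pre>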