Hey HN, Joe and Ethan from Tonic.ai here. We just released a new open-source Python package for evaluating the performance of Retrieval Augmented Generation (RAG) systems.

Earlier this year, we started developing a RAG-powered app to enable companies to talk to their free-text data safely.

During our experimentation, however, we realized that because the method is so new, there were no industry-standard evaluation metrics for measuring RAG performance. We built Tonic Validate Metrics (tvalmetrics, for short) to easily calculate the benchmarks we needed to meet in building our RAG system.

We're sharing this Python package with the hope that it will be as useful for you as it has been for us and become a key part of the toolset you use to build LLM-powered applications. We also made Tonic Validate Metrics open-source so that it can thrive and evolve with your contributions!

Please take it for a spin and let us know what you think in the comments.

Docs: https://docs.tonic.ai/validate

Repo: https://github.com/TonicAI/tvalmetrics

Tonic Validate: https://validate.tonic.ai
Related: are there any good end-to-end benchmark datasets for RAG? End-to-end meaning not just (context, question, answer) tuples (which ignore retrieval) but (document, question, answer). I know NQ (Natural Questions) is one such dataset:

https://ai.google.com/research/NaturalQuestions

But I don't see this dataset mentioned much in RAG discussions.
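For anyone unclear on the distinction, here's a toy sketch of why the document-level format matters: with (context, question, answer) the retriever never gets exercised, whereas a (document set, question, answer) record lets you score retrieval hit rate too. Everything below is made up for illustration; it's plain Python with no dependency on any particular dataset or library.

    # (context, question, answer): retrieval is skipped, only generation is tested
    qa_record = {
        "context": "Tonic.ai released tvalmetrics in 2023.",
        "question": "What did Tonic.ai release?",
        "answer": "tvalmetrics",
    }

    # (document set, question, answer): retrieval has to find the right document first
    corpus = {
        "doc_1": "Tonic.ai released tvalmetrics, a RAG evaluation package, in 2023.",
        "doc_2": "Natural Questions pairs real Google queries with Wikipedia pages.",
    }
    e2e_record = {
        "question": "What did Tonic.ai release?",
        "answer": "tvalmetrics",
        "gold_doc_id": "doc_1",  # lets you compute retrieval hit rate / recall@k
    }

    def keyword_retrieve(question, corpus, k=1):
        """Toy retriever: rank documents by word overlap with the question."""
        q_words = set(question.lower().split())
        ranked = sorted(
            corpus,
            key=lambda d: len(q_words & set(corpus[d].lower().split())),
            reverse=True,
        )
        return ranked[:k]

    retrieved = keyword_retrieve(e2e_record["question"], corpus)
    retrieval_hit = e2e_record["gold_doc_id"] in retrieved  # end-to-end: retrieval is scored too
    print(retrieved, retrieval_hit)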
How does it compare to ragas? https://github.com/explodinggradients/ragas
This package suggests building a dataset and then using LLM-assisted evaluation via GPT-3.5/4 to evaluate your RAG pipeline on the dataset. It relies heavily on GPT-4 (or an equivalent model) to provide realistic scores. How safe is that approach?
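For context, the pattern in question looks roughly like the sketch below. This is a generic LLM-as-judge scorer, not tvalmetrics' actual API; the prompt, scale, and model choice are all illustrative, and it assumes openai>=1.0 with an OPENAI_API_KEY set in the environment.

    from openai import OpenAI

    client = OpenAI()

    def judge_answer(question, reference, candidate):
        """Ask the judge model for a 0-5 integer score of answer similarity."""
        prompt = (
            "Score how well the candidate answer matches the reference answer "
            "on a 0-5 integer scale. Respond with the integer only.\n"
            f"Question: {question}\n"
            f"Reference answer: {reference}\n"
            f"Candidate answer: {candidate}"
        )
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,  # reduces run-to-run variance, but scores still aren't fully deterministic
        )
        return int(resp.choices[0].message.content.strip())

    print(judge_answer(
        "What does tvalmetrics evaluate?",
        "RAG systems",
        "Retrieval-augmented generation pipelines",
    ))

The safety concern is that the judge model's scores inherit GPT-4's own biases and variance, which is exactly what the comment is asking about.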
So disappointed :( I saw Metrics and RAG (I thought it would be Red-Amber-Green) and I was hoping for some cool metrics/heatmap thingie...

I wish you the best though!
If you build a dataset of questions and responses to test your RAG app with this metrics package, how do you know whether the distribution of questions matches in any way the distribution of questions you'll get from the app in production? Using a hand-made dataset of questions and responses could introduce a lot of bias into how you evaluate your RAG app.
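One rough way to sanity-check that concern is to embed both question sets and look at nearest-neighbor similarity: if many production questions have no close match in the hand-made eval set, the eval set is probably missing parts of the real distribution. The sketch below uses sentence-transformers; the model name, sample questions, and 0.5 threshold are illustrative assumptions, not recommendations.

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")

    eval_questions = ["What does tvalmetrics measure?", "How do I install the package?"]
    production_questions = ["how to compute answer similarity score", "does it work with llama 2?"]

    eval_emb = model.encode(eval_questions, convert_to_tensor=True)
    prod_emb = model.encode(production_questions, convert_to_tensor=True)

    # cosine similarity between every production question and every eval question
    sims = util.cos_sim(prod_emb, eval_emb)           # shape: (n_prod, n_eval)
    nearest = sims.max(dim=1).values                  # best-matching eval question per prod question

    # fraction of production questions with a reasonably close eval counterpart
    coverage = (nearest > 0.5).float().mean().item()
    print(nearest.tolist(), coverage)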