Hey HN, Joe and Ethan from Tonic.ai here. We just released a new open-source Python package for evaluating the performance of Retrieval Augmented Generation (RAG) systems.

Earlier this year, we started developing a RAG-powered app to enable companies to talk to their free-text data safely.

During our experimentation, however, we realized that because the method is so new, there were no industry-standard evaluation metrics for measuring RAG performance. We built Tonic Validate Metrics (tvalmetrics, for short) to easily calculate the benchmarks we needed to meet in building our RAG system.

We're sharing this Python package with the hope that it will be as useful for you as it has been for us and become a key part of the toolset you use to build LLM-powered applications. We also made Tonic Validate Metrics open-source so that it can thrive and evolve with your contributions!

Please take it for a spin and let us know what you think in the comments.

Docs: https://docs.tonic.ai/validate

Repo: https://github.com/TonicAI/tvalmetrics

Tonic Validate: https://validate.tonic.ai
Related: are there any good end-to-end benchmark datasets for RAG? End-to-end meaning not just (context, question, answer) tuples (which ignore retrieval) but (document, question, answer). I know NQ (Natural Questions) is one such dataset:

https://ai.google.com/research/NaturalQuestions

But I don't see this dataset mentioned much in RAG discussions.
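For anyone unclear on the distinction, here's a toy sketch of why the document-level format matters: with (context, question, answer) the retriever never gets exercised, whereas a (document set, question, answer) record lets you score retrieval hit rate too. Everything below is made up for illustration; it's plain Python with no dependency on any particular dataset or library.

    # (context, question, answer): retrieval is skipped, only generation is tested
    qa_record = {
        "context": "Tonic.ai released tvalmetrics in 2023.",
        "question": "What did Tonic.ai release?",
        "answer": "tvalmetrics",
    }

    # (document set, question, answer): retrieval has to find the right document first
    corpus = {
        "doc_1": "Tonic.ai released tvalmetrics, a RAG evaluation package, in 2023.",
        "doc_2": "Natural Questions pairs real Google queries with Wikipedia pages.",
    }
    e2e_record = {
        "question": "What did Tonic.ai release?",
        "answer": "tvalmetrics",
        "gold_doc_id": "doc_1",  # lets you compute retrieval hit rate / recall@k
    }

    def keyword_retrieve(question, corpus, k=1):
        """Toy retriever: rank documents by word overlap with the question."""
        q_words = set(question.lower().split())
        ranked = sorted(
            corpus,
            key=lambda d: len(q_words & set(corpus[d].lower().split())),
            reverse=True,
        )
        return ranked[:k]

    retrieved = keyword_retrieve(e2e_record["question"], corpus)
    retrieval_hit = e2e_record["gold_doc_id"] in retrieved  # end-to-end: retrieval is scored too
    print(retrieved, retrieval_hit)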
How does it compare to ragas? https://github.com/explodinggradients/ragas
This package suggests building a dataset and then using LLM-assisted evaluation via GPT-3.5/4 to evaluate your RAG pipeline on the dataset. It relies heavily on GPT-4 (or an equivalent model) to provide realistic scores. How safe is that approach?
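For context, the pattern in question looks roughly like the sketch below. This is a generic LLM-as-judge scorer, not tvalmetrics' actual API; the prompt, scale, and model choice are all illustrative, and it assumes openai>=1.0 with an OPENAI_API_KEY set in the environment.

    from openai import OpenAI

    client = OpenAI()

    def judge_answer(question, reference, candidate):
        """Ask the judge model for a 0-5 integer score of answer similarity."""
        prompt = (
            "Score how well the candidate answer matches the reference answer "
            "on a 0-5 integer scale. Respond with the integer only.\n"
            f"Question: {question}\n"
            f"Reference answer: {reference}\n"
            f"Candidate answer: {candidate}"
        )
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,  # reduces run-to-run variance, but scores still aren't fully deterministic
        )
        return int(resp.choices[0].message.content.strip())

    print(judge_answer(
        "What does tvalmetrics evaluate?",
        "RAG systems",
        "Retrieval-augmented generation pipelines",
    ))

The safety concern is that the judge model's scores inherit GPT-4's own biases and variance, which is exactly what the comment is asking about.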
So disappointed :( I saw Metrics and RAG (I thought it would be Red-Amber-Green) and I was hoping for some cool metrics/heatmap thingie...

I wish you the best though!
If you build a dataset of questions and responses to test your RAG app with this metrics package, how do you know whether the distribution of questions matches in any way the distribution of questions you'll get from the app in production? Using a hand-made dataset of questions and responses could introduce a lot of bias into how you evaluate your RAG app.
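One rough way to sanity-check that concern is to embed both question sets and look at nearest-neighbor similarity: if many production questions have no close match in the hand-made eval set, the eval set is probably missing parts of the real distribution. The sketch below uses sentence-transformers; the model name, sample questions, and 0.5 threshold are illustrative assumptions, not recommendations.

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")

    eval_questions = ["What does tvalmetrics measure?", "How do I install the package?"]
    production_questions = ["how to compute answer similarity score", "does it work with llama 2?"]

    eval_emb = model.encode(eval_questions, convert_to_tensor=True)
    prod_emb = model.encode(production_questions, convert_to_tensor=True)

    # cosine similarity between every production question and every eval question
    sims = util.cos_sim(prod_emb, eval_emb)           # shape: (n_prod, n_eval)
    nearest = sims.max(dim=1).values                  # best-matching eval question per prod question

    # fraction of production questions with a reasonably close eval counterpart
    coverage = (nearest > 0.5).float().mean().item()
    print(nearest.tolist(), coverage)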