
Show HN: Tonic Validate Metrics – an open-source RAG evaluation metrics package

40 points by Ephil012 over 1 year ago
Hey HN, Joe and Ethan from Tonic.ai here. We just released a new open-source Python package for evaluating the performance of Retrieval Augmented Generation (RAG) systems.

Earlier this year, we started developing a RAG-powered app to enable companies to talk to their free-text data safely.

During our experimentation, however, we realized that using such a new method meant that there weren't industry standards for evaluation metrics to measure the accuracy of RAG performance. We built Tonic Validate Metrics (tvalmetrics, for short) to easily calculate the benchmarks we needed to meet in building our RAG system.

We're sharing this Python package with the hope that it will be as useful for you as it has been for us and become a key part of the toolset you use to build LLM-powered applications. We also made Tonic Validate Metrics open-source so that it can thrive and evolve with your contributions!

Please take it for a spin and let us know what you think in the comments.

Docs: https://docs.tonic.ai/validate

Repo: https://github.com/TonicAI/tvalmetrics

Tonic Validate: https://validate.tonic.ai
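To make the idea of RAG evaluation metrics concrete, here is a minimal sketch of the kind of scores such a package computes (answer similarity to a reference, and retrieval precision). This is not the tvalmetrics API; the function names and the naive lexical-overlap approach are illustrative assumptions only.

```python
# Illustrative sketch: two toy RAG metrics. Real packages typically use an
# LLM judge or embeddings rather than lexical overlap; these names are
# hypothetical and not part of tvalmetrics.

def token_overlap_score(reference_answer: str, llm_answer: str) -> float:
    """Fraction of reference-answer tokens that also appear in the LLM answer."""
    ref_tokens = set(reference_answer.lower().split())
    ans_tokens = set(llm_answer.lower().split())
    if not ref_tokens:
        return 0.0
    return len(ref_tokens & ans_tokens) / len(ref_tokens)


def retrieval_precision(retrieved_chunks: list[str], question: str) -> float:
    """Fraction of retrieved chunks sharing at least one keyword with the question."""
    keywords = {w for w in question.lower().split() if len(w) > 3}
    if not retrieved_chunks:
        return 0.0
    relevant = sum(
        1 for chunk in retrieved_chunks
        if keywords & set(chunk.lower().split())
    )
    return relevant / len(retrieved_chunks)


print(token_overlap_score("paris is the capital of france",
                          "the capital of france is paris"))  # prints 1.0
```

A real benchmark run would average these scores over a dataset of (question, reference answer, retrieved context, LLM answer) records.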

8 comments

d4rkp4ttern over 1 year ago
Related: are there any good end-to-end benchmark datasets for RAG? End-to-end meaning not just (context, question, answer) tuples (which ignore retrieval) but (document, question, answer). I know NQ (Natural Questions) is one such dataset:

https://ai.google.com/research/NaturalQuestions

But I don't see this dataset mentioned much in RAG discussions.
elyase over 1 year ago
How does it compare to https://github.com/explodinggradients/ragas?
Ephil012 over 1 year ago
Hi all, if anyone has any questions about the open source library, Joe and I will be around today to answer them.
rwojo over 1 year ago
This package suggests building a dataset and then using LLM-assisted evaluation via GPT-3.5/4 to evaluate your RAG pipeline on the dataset. It relies heavily on GPT-4 (or an equivalent model) to provide realistic scores. How safe is that approach?
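The LLM-assisted evaluation described in the comment above (an "LLM as judge" grading answers against references) can be sketched roughly as follows. The prompt wording, the 0-5 scale, and the function names are assumptions for illustration, not tvalmetrics internals; the judge backend is injected as a callable so the logic can be exercised with a stub instead of a live GPT-4 call.

```python
from typing import Callable

# Hypothetical prompt template for an LLM judge; not the package's actual prompt.
JUDGE_PROMPT = (
    "Score how well the answer matches the reference on a scale of 0 to 5.\n"
    "Reference: {reference}\n"
    "Answer: {answer}\n"
    "Respond with a single integer."
)


def judge_answer(reference: str, answer: str,
                 call_llm: Callable[[str], str]) -> int:
    """Ask a judge model for a 0-5 similarity score.

    call_llm is any prompt -> completion function (GPT-4, a local model,
    or a stub in tests), so the scoring logic stays backend-agnostic.
    """
    prompt = JUDGE_PROMPT.format(reference=reference, answer=answer)
    score = int(call_llm(prompt).strip())
    if not 0 <= score <= 5:
        raise ValueError(f"judge returned out-of-range score: {score}")
    return score


# Stub backend for demonstration; a real backend would call an LLM API.
print(judge_answer("Paris", "The capital is Paris", lambda prompt: "4"))  # prints 4
```

The "how safe is that" concern maps directly onto this sketch: everything hinges on the judge model returning well-calibrated integers, which is exactly the assumption being questioned.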
capybara over 1 year ago
This is cool. What are your plans for supporting and building upon this going forward?
HenryBemis over 1 year ago
So disappointed :( I saw "Metric" and "RAG" (I thought it would be Red-Amber-Green) and I was hoping for some cool metrics/heatmap thingie...

I wish you the best though!
agautsc over 1 year ago
If you build a dataset of questions with responses to test your RAG app with this metrics package, how do you know whether the distribution of questions matches in any way the distribution of questions you'll get from the app in production? Using a hand-made dataset of questions and responses could introduce a lot of bias into your RAG app.
yukichi over 1 year ago
very cool! looking forward to trying it