Hi all, I’ve been working with tools like LlamaIndex at work and realized there weren’t many good options for monitoring the performance of RAG systems. So I built a pretty simple GitHub Action + PyTest setup to measure the performance of LlamaIndex over time. The setup uses a library called Tonic Validate, which scores the quality of your RAG system’s answers (disclaimer: my current company makes Tonic Validate and I’m an engineer on it). The setup scores LlamaIndex’s responses against a set of test data I created and then uploads the results to Tonic Validate’s UI for visualization. If anyone is interested, you can find the full source code here [1]. I also wrote a guest post on LlamaIndex’s blog about it here [2] if anyone wants more details about how it works.

1. https://github.com/TonicAI/llama-validate-demo

2. https://blog.llamaindex.ai/tonic-validate-x-llamaindex-implementing-integration-tests-for-llamaindex-43db50b76ed9
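
If you want a feel for what the PyTest side looks like before opening the repo, here’s a rough sketch (not the exact code from the repo): it builds a small LlamaIndex query engine, scores it with Tonic Validate, and uploads the run. The benchmark questions, document directory, environment variable names, and score threshold are all placeholders, and the LlamaIndex import path depends on which version you have installed.

    # test_rag_quality.py — rough sketch, not the repo's actual test
    import os

    from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
    from tonic_validate import Benchmark, ValidateApi, ValidateScorer

    # Toy benchmark; the real setup loads a larger set of question/answer pairs.
    benchmark = Benchmark(
        questions=["What does the project do?"],
        answers=["It monitors RAG answer quality over time."],
    )

    # Build a simple LlamaIndex query engine over local documents.
    documents = SimpleDirectoryReader("docs").load_data()
    query_engine = VectorStoreIndex.from_documents(documents).as_query_engine()


    def get_rag_response(question: str):
        """Return (answer, retrieved context list) for one benchmark question."""
        response = query_engine.query(question)
        context = [node.get_content() for node in response.source_nodes]
        return str(response), context


    def test_rag_quality():
        # Score every benchmark question with Tonic Validate's default metrics.
        scorer = ValidateScorer()
        run = scorer.score(benchmark, get_rag_response)

        # Push the run to the Tonic Validate UI so scores can be tracked over time.
        validate_api = ValidateApi(os.environ["TONIC_VALIDATE_API_KEY"])
        validate_api.upload_run(os.environ["TONIC_VALIDATE_PROJECT_ID"], run)

        # Fail the GitHub Action if answer similarity (0-5 scale) drops too low.
        assert run.overall_scores["answer_similarity"] >= 3.5

The GitHub Action part is then just a workflow that installs dependencies, exports the API key and project ID as secrets, and runs `pytest` on push or on a schedule.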