Operationalizing large language models (LLMs) is challenging, mainly because of their unpredictable behavior and potentially catastrophic failures. Among the essential requirements for productionizing LLM applications are evaluating them against a variety of checks and monitoring their performance in production.

This process carries two major friction points:

1. Identifying which evaluations to run is tricky: LLM prompts are growing in complexity, and it is now common to see prompts with 50+ instructions, many of them repeated. SPADE, developed by researchers at UC Berkeley, HKUST, LangChain, and Columbia University, provides a neat framework for identifying the critical instructions against which LLM responses should be validated.

It has four key steps:
a. Candidate generation: use an LLM to generate an over-complete list of candidate evaluations based on prompt diffs.

b. Filtering redundant evals: run the candidates on sample data and filter out those whose false-failure rate exceeds a threshold, as well as trivial ones that always pass.

c. Subsumption checks: test whether two or more evals effectively perform the same check. SPADE does this by prompting an LLM to construct a case where function one returns True and function two returns False; if no such case can be built, the two functions are effectively identical and one can be dropped.

d. Optimal selection: use an integer programming optimizer to find the evaluation set with maximum coverage while respecting the failure, accuracy, and subsumption constraints (a toy sketch of this selection step follows right after this list).

Their results are impressive. You can look at the SPADE paper for more details: https://arxiv.org/pdf/2401.03038.pdf
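To make step (d) concrete, here is a toy sketch of the kind of selection problem involved, written with PuLP. The candidate names, false-failure rates, and coverage data are all made up for illustration, and this is not the paper's exact formulation; see the SPADE paper for the real objective and constraints.

    # Toy sketch of the final selection step: choose a small set of candidate
    # evals that still covers every observed failure, while dropping evals whose
    # false-failure rate (FFR) is too high. Illustrative only, not the paper's
    # exact integer program. Requires `pip install pulp`.
    import pulp

    # Hypothetical candidates, their measured FFRs, and which observed bad
    # responses each one catches (all values invented for this example).
    candidates = ["no_pii", "is_concise", "has_citation", "no_pii_strict"]
    ffr = {"no_pii": 0.02, "is_concise": 0.10, "has_citation": 0.05, "no_pii_strict": 0.35}
    covers = {
        "leaked_email": ["no_pii", "no_pii_strict"],
        "rambling_answer": ["is_concise"],
        "unsupported_claim": ["has_citation"],
    }
    MAX_FFR = 0.25

    prob = pulp.LpProblem("select_evals", pulp.LpMinimize)
    use = {c: pulp.LpVariable(f"use_{c}", cat="Binary") for c in candidates}

    # Objective: keep the eval suite as small as possible.
    prob += pulp.lpSum(use.values())

    # Coverage: every observed failure is caught by at least one selected eval.
    for failure, catching in covers.items():
        prob += pulp.lpSum(use[c] for c in catching) >= 1, f"cover_{failure}"

    # Accuracy: exclude evals that fail good responses too often.
    for c in candidates:
        if ffr[c] > MAX_FFR:
            prob += use[c] == 0, f"too_noisy_{c}"

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    print("Selected evals:", [c for c in candidates if use[c].value() == 1])
    # -> Selected evals: ['no_pii', 'is_concise', 'has_citation']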
2. Running these evaluations reliably is tricky: Using LLMs as evaluators has recently emerged as a promising alternative to human evaluation and has proven quite effective in improving the accuracy of LLM applications. However, running these evals reliably, i.e. with high correlation to human judgments and stability across multiple runs, is still difficult. UpTrain is an open-source framework for evaluating LLM applications that provides high-quality scores. It lets you define custom evaluations via its GuidelineAdherence check, where you specify any custom guideline in plain English and check whether the LLM follows it (a minimal code sketch is at the end of this post). It also provides a simple interface to run these evaluations on production responses with a single API call, so you can systematically catch wrong LLM outputs.

I am one of the maintainers of UpTrain, and we recently integrated the SPADE framework into our open-source repo (https://github.com/uptrain-ai/uptrain/). The idea is simple:

1. Provide your prompt template.

2. We use the SPADE framework to identify which evaluations to run.

3. We configure UpTrain to run these evaluations on any data you provide, or to monitor these scores in production.

All of this happens seamlessly. I would love for you to check it out and provide feedback.

Link to the integration tutorial: https://github.com/uptrain-ai/uptrain/blob/main/examples/integrations/spade/evaluating_guidelines_generated_by_spade.ipynb
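For anyone curious what a guideline check looks like in code, here is a minimal sketch based on the current UpTrain docs. The guideline text and sample data are made up, and exact class and parameter names may drift over time, so please treat the tutorial notebook above as the source of truth. Guidelines generated by SPADE from your prompt template plug into the same interface.

    # Minimal sketch: run a plain-English guideline as an eval with UpTrain.
    # The guideline and data below are invented for illustration.
    from uptrain import EvalLLM, GuidelineAdherence

    data = [{
        "question": "How do I reset my password?",
        "response": "Go to Settings > Security and click 'Reset password'.",
    }]

    eval_llm = EvalLLM(openai_api_key="sk-...")  # an LLM acts as the evaluator

    results = eval_llm.evaluate(
        data=data,
        checks=[
            GuidelineAdherence(
                guideline="The response should give self-serve steps before telling the user to contact support.",
                guideline_name="self_serve_first",
            )
        ],
    )
    print(results)  # each row gets a score plus an explanation for the guideline check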