Operationalizing large language models (LLMs) is challenging, mainly because of their unpredictable behavior and potentially catastrophic failures. Among the essential requirements for productionizing LLM applications are evaluating them against a variety of checks and monitoring their performance in production.

This process carries two major friction points:

1. Identifying which evaluations to run is tricky: LLM prompts are growing in complexity, and it is now common to see prompts with 50+ instructions, many of them repeated. SPADE, developed by researchers at UC Berkeley, HKUST, LangChain, and Columbia University, provides a neat framework for identifying the critical instructions against which LLM responses should be validated.

It has four key steps:
a. Candidate generation: use an LLM to generate an over-complete list of candidate evaluations based on prompt diffs.

b. Filtering redundant evals: run the candidates on sample data and filter out those whose false-failure rate exceeds a threshold, as well as trivial ones that always pass.

c. Subsumption checks: test whether two or more evals effectively perform the same check. SPADE does this by prompting an LLM to construct a case where function one returns True and function two returns False; if no such case can be built, the two functions are effectively identical and one can be dropped.

d. Optimal selection: use an integer programming optimizer to find the evaluation set with maximum coverage while respecting the failure, accuracy, and subsumption constraints (a toy sketch of this selection step follows right after this list).

Their results are impressive. You can look at the SPADE paper for more details: https://arxiv.org/pdf/2401.03038.pdf
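To make step (d) concrete, here is a toy sketch of the kind of selection problem involved, written with PuLP. The candidate names, false-failure rates, and coverage data are all made up for illustration, and this is not the paper's exact formulation; see the SPADE paper for the real objective and constraints.

    # Toy sketch of the final selection step: choose a small set of candidate
    # evals that still covers every observed failure, while dropping evals whose
    # false-failure rate (FFR) is too high. Illustrative only, not the paper's
    # exact integer program. Requires `pip install pulp`.
    import pulp

    # Hypothetical candidates, their measured FFRs, and which observed bad
    # responses each one catches (all values invented for this example).
    candidates = ["no_pii", "is_concise", "has_citation", "no_pii_strict"]
    ffr = {"no_pii": 0.02, "is_concise": 0.10, "has_citation": 0.05, "no_pii_strict": 0.35}
    covers = {
        "leaked_email": ["no_pii", "no_pii_strict"],
        "rambling_answer": ["is_concise"],
        "unsupported_claim": ["has_citation"],
    }
    MAX_FFR = 0.25

    prob = pulp.LpProblem("select_evals", pulp.LpMinimize)
    use = {c: pulp.LpVariable(f"use_{c}", cat="Binary") for c in candidates}

    # Objective: keep the eval suite as small as possible.
    prob += pulp.lpSum(use.values())

    # Coverage: every observed failure is caught by at least one selected eval.
    for failure, catching in covers.items():
        prob += pulp.lpSum(use[c] for c in catching) >= 1, f"cover_{failure}"

    # Accuracy: exclude evals that fail good responses too often.
    for c in candidates:
        if ffr[c] > MAX_FFR:
            prob += use[c] == 0, f"too_noisy_{c}"

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    print("Selected evals:", [c for c in candidates if use[c].value() == 1])
    # -> Selected evals: ['no_pii', 'is_concise', 'has_citation']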
2. Running these evaluations reliably is tricky: Using LLMs as evaluators has recently emerged as a promising alternative to human evaluation and has proven quite effective in improving the accuracy of LLM applications. However, running these evals reliably, i.e. with high correlation to human judgments and stability across multiple runs, is still difficult. UpTrain is an open-source framework for evaluating LLM applications that provides high-quality scores. It lets you define custom evaluations via its GuidelineAdherence check, where you specify any custom guideline in plain English and check whether the LLM follows it (a minimal code sketch is at the end of this post). It also provides a simple interface to run these evaluations on production responses with a single API call, so you can systematically catch wrong LLM outputs.

I am one of the maintainers of UpTrain, and we recently integrated the SPADE framework into our open-source repo (https://github.com/uptrain-ai/uptrain/). The idea is simple:

1. Provide your prompt template.

2. We use the SPADE framework to identify which evaluations to run.

3. We configure UpTrain to run these evaluations on any data you provide, or to monitor these scores in production.

All of this happens seamlessly. I would love for you to check it out and provide feedback.

Link to the integration tutorial: https://github.com/uptrain-ai/uptrain/blob/main/examples/integrations/spade/evaluating_guidelines_generated_by_spade.ipynb
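For anyone curious what a guideline check looks like in code, here is a minimal sketch based on the current UpTrain docs. The guideline text and sample data are made up, and exact class and parameter names may drift over time, so please treat the tutorial notebook above as the source of truth. Guidelines generated by SPADE from your prompt template plug into the same interface.

    # Minimal sketch: run a plain-English guideline as an eval with UpTrain.
    # The guideline and data below are invented for illustration.
    from uptrain import EvalLLM, GuidelineAdherence

    data = [{
        "question": "How do I reset my password?",
        "response": "Go to Settings > Security and click 'Reset password'.",
    }]

    eval_llm = EvalLLM(openai_api_key="sk-...")  # an LLM acts as the evaluator

    results = eval_llm.evaluate(
        data=data,
        checks=[
            GuidelineAdherence(
                guideline="The response should give self-serve steps before telling the user to contact support.",
                guideline_name="self_serve_first",
            )
        ],
    )
    print(results)  # each row gets a score plus an explanation for the guideline check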