
Integrating Spade: Synthesizing Assertions for LLMs into My OSS Project

6 points by sourabh03agr, over 1 year ago

4 comments

mrraghav611, over 1 year ago
Addressing the challenges of operationalizing large language models is no small feat, and your approach with SPADE and UpTrain is both innovative and practical. It's fantastic to see advancements in evaluation techniques and the commitment to making LLM applications more reliable. Kudos for this impactful contribution to the open-source community.
shrjain13, over 1 year ago
I was just reading this paper; this integration would surely help!
ashish1600, over 1 year ago
Does this technique only work for certain prompts?
Comment #39105683 not loaded.
sourabh03agr, over 1 year ago
Operationalizing large language models (LLMs) is challenging, mainly due to their unpredictable behavior and potentially catastrophic failures. Two essential requirements for productionizing LLM applications are evaluating them against a variety of checks and monitoring their performance in production.

This process carries two major friction points:

1. Identifying which evaluations to run is tricky: LLM prompts are growing in complexity, and nowadays often contain 50+ instructions with many repetitions. SPADE, developed by researchers at UC Berkeley, HKUST, LangChain, and Columbia University, provides a neat framework to identify the critical instructions against which we should validate LLM responses.

It has four key steps (a rough sketch of step c follows after this comment):

a. Candidate generation: using an LLM to generate an over-complete list of evaluations based on prompt diffs.

b. Filtering redundant evals: running the evals on sample data and dropping those whose false-failure rate exceeds a certain threshold, or that are trivial and always pass.

c. Subsumption checks: checking whether two or more evals effectively perform the same check. This is done by prompting an LLM to construct a case where function one returns True and function two returns False. If no such case can be built, the two functions are effectively equivalent, and one can be dropped.

d. Using an integer programming optimizer to find the evaluation set with maximum coverage while respecting the failure-rate, accuracy, and subsumption constraints.

Their results are impressive. You can look at the SPADE paper for more details: https://arxiv.org/pdf/2401.03038.pdf

2. Running these evaluations reliably is tricky: using LLMs as evaluators has recently emerged as a promising alternative to human evaluation and has proven quite effective in improving the accuracy of LLM applications. However, difficulties remain in running these evals reliably, i.e. achieving high correlation with human judgments and stability across multiple runs. UpTrain is an open-source framework for evaluating LLM applications that provides high-quality scores. It lets you define custom evaluations via the GuidelineAdherence check, where you state any custom guideline in plain English and check whether the LLM follows it. It also provides an easy interface to run these evaluations on production responses with a single API call, so you can systematically use frameworks like UpTrain to catch wrong LLM outputs.

I am one of the maintainers of UpTrain, and we recently integrated the SPADE framework into our open-source repo (https://github.com/uptrain-ai/uptrain/). The idea is simple:

1. Provide your prompt template.

2. We use the SPADE framework to identify which evaluations to run.

3. We configure UpTrain to run these evaluations on any provided data, or to monitor these scores in production. All done seamlessly.

I would love for you to check it out and provide feedback.

Link for the integration tutorial: https://github.com/uptrain-ai/uptrain/blob/main/examples/integrations/spade/evaluating_guidelines_generated_by_spade.ipynb
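
To make the subsumption check in step (c) concrete, here is a rough Python sketch of the idea. This is not the actual SPADE or UpTrain implementation: check_subsumption, prune_redundant_evals, and the ask_llm callback are hypothetical names, and the prompt wording is only illustrative.

    def check_subsumption(eval_a: str, eval_b: str, ask_llm) -> bool:
        """Return True if eval_a subsumes eval_b, i.e. the LLM cannot construct
        a response that passes eval_a while failing eval_b, so eval_b is redundant.
        ask_llm is a hypothetical callback that sends a prompt to an LLM and
        returns its text reply."""
        prompt = (
            "You are given two evaluation criteria for LLM responses.\n"
            f"Criterion A: {eval_a}\n"
            f"Criterion B: {eval_b}\n"
            "Write a short example response that satisfies criterion A but "
            "violates criterion B. If no such response can exist, reply with "
            "the single word IMPOSSIBLE."
        )
        reply = ask_llm(prompt)
        # No separating example means that passing A always implies passing B.
        return reply.strip().upper().startswith("IMPOSSIBLE")

    def prune_redundant_evals(candidate_evals, ask_llm):
        """Greedily keep only evals not subsumed by one that is already kept."""
        kept = []
        for candidate in candidate_evals:
            if any(check_subsumption(k, candidate, ask_llm) for k in kept):
                continue  # candidate adds nothing beyond an already-kept check
            kept.append(candidate)
        return kept

    if __name__ == "__main__":
        # Stubbed LLM that claims every pair is inseparable, just to show the call shape.
        evals = ["Response is in English", "Response uses the English language"]
        print(prune_redundant_evals(evals, lambda prompt: "IMPOSSIBLE"))

In the real pipeline the candidate evals come from the prompt diffs in step (a), and the final selection also folds in the false-failure filtering of step (b) and the integer-programming coverage step of step (d) described above.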