
Ask HN: How to unit test AI responses?

10 points by bikamonki about 1 month ago
I am tasked with building a customer support chat. The AI should be trained on company docs. How can I be sure the AI will not hallucinate a bad response to a customer?

5 comments

senordevnyc about 1 month ago
You need evals. I found this post extremely helpful in building out a set of evals for my AI product: https://hamel.dev/blog/posts/evals/
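
A minimal sketch of what an eval set can look like in practice. The `answer` callable and the doc-derived facts here are hypothetical stand-ins for the real support bot; the pattern is just running a fixed set of questions through the pipeline and checking each answer for facts it must contain and claims it must not:

```python
# Minimal eval harness: run fixed support questions through the bot and
# check each answer against facts grounded in the company docs.
# `answer` is a hypothetical stand-in for the actual chat pipeline.

EVAL_CASES = [
    {
        "question": "What is your refund window?",
        "must_contain": ["30 days"],       # grounded in company docs
        "must_not_contain": ["60 days"],   # a known hallucination to guard against
    },
    {
        "question": "Do you ship internationally?",
        "must_contain": ["United States"],
        "must_not_contain": [],
    },
]

def run_evals(answer):
    failures = []
    for case in EVAL_CASES:
        response = answer(case["question"]).lower()
        for fact in case["must_contain"]:
            if fact.lower() not in response:
                failures.append((case["question"], f"missing: {fact}"))
        for claim in case["must_not_contain"]:
            if claim.lower() in response:
                failures.append((case["question"], f"hallucinated: {claim}"))
    return failures

if __name__ == "__main__":
    fake_bot = lambda q: "We accept refunds within 30 days and ship only within the United States."
    print(run_evals(fake_bot) or "all evals passed")
```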
PeterStuer about 1 month ago
Your question is very general. "A customer support app" can mean many things, from an FAQ to a case-management interface.

If you 100% cannot tolerate "bad" answers, only use the LLM in the front end to map the user's input onto a set of templated questions with templated answers. In the worst case, the user gets a right answer to the wrong question.
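
A rough sketch of that front-end mapping. `classify` is a hypothetical wrapper around the LLM that picks one intent ID from a closed list (e.g. via a constrained or JSON-mode prompt); the answer text shown to the customer never comes from the model, so it cannot hallucinate free text:

```python
# The LLM only chooses among known intents; every answer shown to the
# customer is a pre-written template.
TEMPLATES = {
    "refund_policy": "Refunds are accepted within 30 days of purchase.",
    "shipping_time": "Orders ship within 2 business days.",
    "fallback": "I'm not sure. Let me connect you with a human agent.",
}

def route(user_input, classify):
    # `classify` is a hypothetical LLM call returning one key from
    # TEMPLATES; anything unexpected falls through to the fallback.
    intent = classify(user_input, options=list(TEMPLATES))
    return TEMPLATES.get(intent, TEMPLATES["fallback"])

print(route("how long do I have to return this?",
            classify=lambda text, options: "refund_policy"))
```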
jdlshore about 1 month ago
You can’t (practically) unit test LLM responses, at least not in the traditional sense. Instead, you do runtime validation with a technique called “LLM as judge.”

This involves having another prompt, and possibly another model, evaluate the quality of the first response. Then you write your code to try again in a loop and raise an alert if it keeps failing.
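
A compact sketch of that judge-and-retry loop; `generate` and `judge` are hypothetical callables wrapping the two prompts (or two models), and the alerting is reduced to an exception:

```python
# LLM-as-judge at runtime: generate an answer, have a second prompt score
# it, retry a few times, and escalate if it never passes.
MAX_ATTEMPTS = 3

def answer_with_judge(question, generate, judge):
    for attempt in range(MAX_ATTEMPTS):
        draft = generate(question)
        verdict = judge(question=question, answer=draft)  # e.g. {"pass": bool, "reason": str}
        if verdict["pass"]:
            return draft
    # Keep failing responses away from the customer and alert a human.
    raise RuntimeError(f"Judge rejected {MAX_ATTEMPTS} answers for: {question!r}")

if __name__ == "__main__":
    print(answer_with_judge(
        "What is the refund window?",
        generate=lambda q: "Refunds are accepted within 30 days.",
        judge=lambda question, answer: {"pass": "30 days" in answer, "reason": "grounded"},
    ))
```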
jackchina about 1 month ago
To ensure the AI doesn't hallucinate bad responses, focus on the following steps:

Quality Training Data: Train the model on high-quality, up-to-date company documents, ensuring it reflects accurate information.

Fine-tuning: Regularly fine-tune the model on specific support use cases and real customer interactions.

Feedback Loops: Implement a system for human oversight where support agents can review and correct the AI's responses.

Context Awareness: Design the system to ask clarifying questions if uncertain, avoiding direct false information.

Monitoring: Continuously monitor and evaluate the AI's performance to catch and address any issues promptly.
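
The "context awareness" point is the most mechanical of these; one way to sketch it, under the assumption of a hypothetical `ask_model` call that returns an answer together with a confidence score in [0, 1]:

```python
# Ask a clarifying question instead of answering when confidence is low.
# `ask_model` is a hypothetical call returning (answer, confidence).
CONFIDENCE_THRESHOLD = 0.7

def respond(question, ask_model):
    answer, confidence = ask_model(question)
    if confidence < CONFIDENCE_THRESHOLD:
        return "Could you clarify? For example, which product or order is this about?"
    return answer

print(respond("can I return it?", ask_model=lambda q: ("Yes, within 30 days.", 0.9)))
```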
mfalcon about 1 month ago
You don't. You have to separate concerns between deterministic and stochastic code input/output. You need evals for the stochastic part and mocking when the stochastic output is consumed in deterministic code.
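
A sketch of that separation using the standard library's `unittest.mock`: the deterministic code around the model is unit-tested with the stochastic LLM call stubbed out, while the real call is left to evals. The `format_reply` function and `complete` method are hypothetical names for the boundary:

```python
# Deterministic code (unit-testable) consumes stochastic LLM output
# (covered by evals). In unit tests, mock the LLM at the boundary.
import unittest
from unittest.mock import Mock

def format_reply(llm_client, question):
    # Deterministic: wraps whatever the model returns in a fixed envelope.
    raw = llm_client.complete(question)
    return f"Support bot: {raw.strip()}"

class FormatReplyTest(unittest.TestCase):
    def test_wraps_llm_output(self):
        llm = Mock()
        llm.complete.return_value = "  Refunds take 30 days. "
        self.assertEqual(format_reply(llm, "refund?"),
                         "Support bot: Refunds take 30 days.")

if __name__ == "__main__":
    unittest.main()
```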