Ask HN: How to unit test AI responses?

10 points by bikamonki 29 days ago
I've been tasked with building a customer support chat. The AI should be trained on company docs. How can I be sure the AI will not hallucinate a bad response to a customer?

5 comments

senordevnyc 29 days ago
You need evals. I found this post extremely helpful in building out a set of evals for my AI product: https://hamel.dev/blog/posts/evals/
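
For a concrete starting point, a minimal eval harness in this spirit might look like the sketch below. Assumptions: answer_question is a hypothetical stand-in for the support-chat pipeline, and the required/forbidden strings come from your company docs.

    # Minimal eval harness: run the pipeline over curated cases and score them.
    # answer_question() is a hypothetical stand-in for the support-chat pipeline.
    EVAL_CASES = [
        {
            "question": "What is your refund window?",
            "must_include": ["30 days"],      # facts from the company docs
            "must_not_include": ["60 days"],  # a known hallucination to guard against
        },
    ]

    def run_evals(answer_question):
        failures = []
        for case in EVAL_CASES:
            answer = answer_question(case["question"]).lower()
            if not all(s.lower() in answer for s in case["must_include"]):
                failures.append((case["question"], "missing required fact"))
            if any(s.lower() in answer for s in case["must_not_include"]):
                failures.append((case["question"], "contains a forbidden claim"))
        return failures

String matching is crude but cheap: it catches regressions on every change before you graduate to model-graded or human-graded checks.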
PeterStuer 29 days ago
Your question is very general. "A customer support app" can mean many things, from a FAQ to a case-management interface.

If you 100% cannot tolerate "bad" answers, only use the LLM in the front end to map the user's input onto a set of templated questions with templated answers. In the worst case, the user gets a right answer to the wrong question.
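
A sketch of that front-end mapping, where the model only ever picks an intent and every customer-facing answer is a fixed, vetted template. The intents and the classify_intent callable are illustrative, not any particular library's API.

    # The LLM's only job is to map free text onto a fixed intent.
    # The answer the customer sees is always a vetted template.
    TEMPLATES = {
        "reset_password": "To reset your password, visit Settings > Security.",
        "billing_cycle": "Invoices are issued on the 1st of each month.",
        "unknown": "I'm not sure I understood. Could you rephrase, or contact support?",
    }

    def respond(user_input: str, classify_intent) -> str:
        # classify_intent is a hypothetical LLM call constrained to return
        # one of the keys above (e.g. via an enum or constrained decoding).
        intent = classify_intent(user_input, allowed=list(TEMPLATES))
        return TEMPLATES.get(intent, TEMPLATES["unknown"])

Because the model never generates free text for the customer, it cannot hallucinate facts; the failure mode degrades to a misrouted but still-correct answer.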
jdlshore 29 days ago
You can’t (practically) unit test LLM responses, at least not in the traditional sense. Instead, you do runtime validation with a technique called “LLM as judge.”

This involves having another prompt, and possibly another model, evaluate the quality of the first response. Then you write your code to try again in a loop and raise an alert if it keeps failing.
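
A sketch of that validate-and-retry loop; generate and judge are hypothetical callables, where judge wraps the second prompt or model that scores the draft:

    def answer_with_judge(question, generate, judge, max_attempts=3):
        # generate(question) -> draft answer (first LLM call)
        # judge(question, answer) -> True if the answer is grounded/acceptable
        for _ in range(max_attempts):
            draft = generate(question)
            if judge(question, draft):
                return draft
        # Persistent failure: escalate instead of sending a bad answer.
        raise RuntimeError("LLM response failed validation; alert a human")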
jackchina 27 days ago
To ensure the AI doesn't hallucinate bad responses, focus on the following steps:

Quality Training Data: Train the model on high-quality, up-to-date company documents, ensuring it reflects accurate information.

Fine-tuning: Regularly fine-tune the model on specific support use cases and real customer interactions.

Feedback Loops: Implement a system for human oversight where support agents can review and correct the AI's responses (see the sketch after this list).

Context Awareness: Design the system to ask clarifying questions if uncertain, avoiding direct false information.

Monitoring: Continuously monitor and evaluate the AI's performance to catch and address any issues promptly.
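
As a sketch of the feedback-loop step, every exchange can be logged with a review flag that support agents work through. The in-memory queue and the confidence threshold here are illustrative stand-ins for a real ticketing system.

    import datetime

    REVIEW_QUEUE = []  # stand-in for a real ticketing/annotation system

    def log_for_review(question, answer, confidence):
        record = {
            "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
            "question": question,
            "answer": answer,
            "needs_review": confidence < 0.8,  # threshold is illustrative
        }
        REVIEW_QUEUE.append(record)
        return record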
mfalcon 29 days ago
You don't. You have to separate concerns between deterministic and stochastic code input/output. You need evals for the stochastic part, and mocking when the stochastic output is consumed in deterministic code.
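
A sketch of that separation: the stochastic LLM call sits behind a seam, so the deterministic code around it can be unit tested with an ordinary mock (all names here are illustrative):

    from unittest.mock import Mock

    def format_reply(llm_answer: str) -> str:
        # Deterministic post-processing: this is what you unit test.
        return llm_answer.strip() + "\n\n-- Acme Support"

    def handle_question(question: str, llm) -> str:
        # llm is the stochastic dependency, injected so tests can replace it.
        return format_reply(llm(question))

    def test_handle_question():
        # The stochastic dependency is mocked out entirely.
        fake_llm = Mock(return_value="  You can reset it in Settings.  ")
        reply = handle_question("How do I reset my password?", fake_llm)
        assert reply == "You can reset it in Settings.\n\n-- Acme Support"

The real model's outputs are then covered separately by evals, not by these unit tests.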