Launch HN: Hamming (YC S24) – Automated Testing for Voice Agents

129 points, by sumanyusharma, 9 months ago
Hi HN! Sumanyu and Marius here from Hamming (https://www.hamming.ai). Hamming lets you automatically test your LLM voice agent. In our interactive demo, you play the role of the voice agent, and our agent plays the role of a difficult end user. We then score your performance on the call. Try it here: https://app.hamming.ai/voice-demo (no signup needed). In practice, our agents call your agent!

LLM voice agents currently require a lot of iteration and tuning. For example, one of our customers is building an LLM drive-through voice agent for fast-food chains. Their KPI is order accuracy. It's crucial for their system to gracefully handle dietary restrictions like allergies, as well as customers who get distracted or otherwise change their minds mid-order. Mistakes in this context could lead to unhappy customers, potential health risks, and financial losses.

How do you make sure that such a system actually works? Most teams spend hours calling their voice agent to find bugs, change the prompt or function definitions, and then call the agent again to confirm they fixed the problem without creating regressions. This is slow, ad hoc, and feels like a waste of time. In other areas of software development, automated testing has already eliminated this kind of repetitive grunt work, so why not here, too?

We initially spent a few months helping users create evals for prompts & LLM pipelines, but we noticed two things:

1) Many of our friends were building LLM voice agents.

2) They were spending too much time on manual testing.

This gave us evidence that there will be more voice companies in the future, and that they will need something to make the iteration process easier. We decided to build it!

Our solution involves four steps:

(1) Create diverse but realistic user personas and scenarios covering the expected conversation space. We create these ourselves for each of our customers. Getting LLMs to create diverse scenarios, even at high temperatures, is surprisingly tricky. We're learning a lot of tricks for adding randomness and getting more faithful role-play from the folks at https://www.reddit.com/r/LocalLLaMA/.

(2) Have our agents call your agent when we test your agent's ability to handle things like background noise, long silences, or interruptions. Or have us test just the LLM / logic layer (function calls, etc.) via an API hook.

(3) Score the outputs of each conversation using deterministic checks and LLM judges tailored to the specific problem domain (e.g., order accuracy, tone, friendliness). An LLM judge reviews the entire conversation transcript (including function calls and traces) against predefined success criteria, using examples of both good and bad transcripts as references. It then produces a classification and detailed reasoning to justify its decision. Building LLM judges that consistently align with human preferences is challenging, but we're improving with each judge we manually develop.

(4) Re-use the checks and judges above to score production traffic and track quality metrics in production (i.e., online evals).

We created a Loom recording showing our customers' logged-in experience. We cover how you store and manage scenarios, how you trigger an experiment run, and how we score each transcript. See the video here: https://www.loom.com/share/839fe585aa1740c0baa4faa33d772d3e

We're inspired by our experiences at Tesla, where Sumanyu led growth initiatives as a data scientist, and Anduril, where Marius headed a data infrastructure team. At both companies, simulations were key to testing autonomous systems before deployment. A common challenge, however, was that simulations often fell short of capturing real-world complexity, so outcomes didn't always translate to reality. In voice testing, we're optimistic about overcoming this issue. With tools like PlayHT and ElevenLabs, we can generate highly realistic voice interactions, and by integrating LLMs that exhibit human-like reasoning, we hope our simulations will closely replicate how real users interact with voice agents.

For now, we're manually onboarding and activating each user. We're working hard to make it self-serve in the next few weeks. The demo at https://app.hamming.ai/voice-demo doesn't require any signup, though!

Our current pricing is a mix of usage and the number of seats: https://hamming.ai/pricing. We don't use customer data for training or to benefit other customers, and we don't sell any data. We use PostHog to track usage. We're in the process of getting HIPAA compliance, with SOC 2 next on the list.

Looking ahead, we're focused on making scenario generation and LLM judge creation more automated and self-serve. We also want to create personas based on real production conversations, to make it easier to 'replay' a user on demand.

A natural next step beyond testing is optimization. We're considering building a voice agent optimizer (like DSPy) that takes scenarios which failed during testing and generates a new set of prompts or function-call definitions that make them pass. We find the potential of self-play and self-improvement here super exciting.

We'd love to hear about your experiences with voice agents, whether as a user or as someone building them. If you're building in the voice or agentic space, we're curious about what's working well for you and what challenges you're encountering. We're eager to learn from your insights on setting up evals and simulation pipelines, and your thoughts on where this space is heading.
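The scoring approach in steps (3) and (4) can be sketched as a small harness: cheap deterministic checks run over every transcript, with an LLM judge plugged in for fuzzier criteria. This is a minimal illustration, not Hamming's actual implementation; all names (`Turn`, `score_conversation`, the `order_accuracy` check) are hypothetical, and the judge is left as a pluggable callable rather than a real API call:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Turn:
    role: str   # "user" or "agent"
    text: str

@dataclass
class Verdict:
    passed: bool
    reasoning: str

def items_confirmed(transcript: list[Turn], expected: list[str]) -> Verdict:
    """Deterministic check: every expected order item must appear in an agent turn."""
    agent_text = " ".join(t.text.lower() for t in transcript if t.role == "agent")
    missing = sorted(i for i in expected if i.lower() not in agent_text)
    if missing:
        return Verdict(False, f"agent never confirmed: {missing}")
    return Verdict(True, "all items confirmed")

def score_conversation(
    transcript: list[Turn],
    checks: dict[str, Callable[[list[Turn]], Verdict]],
    llm_judge: Optional[Callable[[list[Turn]], Verdict]] = None,
) -> dict[str, Verdict]:
    """Run deterministic checks first; optionally add an LLM judge that reviews
    the whole transcript against predefined success criteria."""
    results = {name: check(transcript) for name, check in checks.items()}
    if llm_judge is not None:
        results["llm_judge"] = llm_judge(transcript)
    return results

# Example: a drive-through order with a dairy allergy.
transcript = [
    Turn("user", "A burger and fries please. No cheese, I'm allergic to dairy."),
    Turn("agent", "Got it: one burger with no cheese, and one order of fries."),
]
checks = {"order_accuracy": lambda t: items_confirmed(t, ["burger", "fries"])}
report = score_conversation(transcript, checks)
```

The same `checks` dictionary could then be re-run over production transcripts for the online-eval step, which is what makes step (4) cheap once step (3) exists.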

17 comments

themacguffinman, 9 months ago
AI voice agents are weird to me because voice is already a very inefficient and ambiguous medium; the only reason I would make a voice call is to talk to a human who is equipped to tackle the ambiguous edge cases that the engineers didn't already anticipate.

If you're going to develop AI voice agents to tackle pre-determined cases, why wouldn't you just develop a self-serve non-voice UI that's way more efficient? Why make your users navigate a nebulous conversation tree to fulfill a programmable task?

Personally, when I realize I can only talk to a bot, I lose interest and end the call. If I wanted to do something routine, I wouldn't have called.
neilk, 9 months ago
Why “Hamming”? As in Richard Hamming, ex-Bell Labs, “You and Your Research”?
pj_mukh, 9 months ago
My 2.5-year-old yesterday started saying "Hey, this is a test, can you hear me?", parroting me spending hours testing my LLM. Hah.

Will this work with a https://www.pipecat.ai-type system? Would love to wrap a continuous testing system around my bot.
zebomon, 9 months ago
As someone whose job has already been negatively impacted by LLMs, I'll echo the sentiment here that use cases like this one are sort of depressing, as they will primarily impact people who work long hours for small pay. It certainly seems like there's money to be made in this, so congratulations. The landing page is clear and inviting as well; I think I understand what my workflow inside it would look like based on your text and images.

I'm most excited to see well-done concepts in this space, though, as I hope it means we're fast-forwarding past this era to one in which we use AI to do new things for people, not just old things more cheaply. There's undeniably value in the latter, but I can't shake the feeling that the short-term effects are really going to sting for some low-income people, who can only hope that the next wave of innovations will benefit them too.
diwank, 9 months ago
Congratulations on the launch! We had a big QC need at https://kea.ai/, where we needed to stress-test our CX agents in real time too. This would be a big lifesaver. Kudos on the product and the brilliant demo!
atyro, 9 months ago
Nice! Great to see that the UI looks clean enough to be accessible to non-engineers. The prompt-management and active-monitoring combo looks especially useful; I've been looking for something with this combo for an expense app we're building.
serjester, 9 months ago
I feel like the better positioning would be evals for voice agents. It seems just as challenging to figure out all the ways your system can go wrong as it is to build the system in the first place, and doing this in a way that actually adds value, without any domain expertise, seems impossible.

If it did, wouldn't all the companies with production AI text interfaces be using similar techniques? That said, being able to easily replay a conversation that was recorded with a real user seems like a huge value add.
euvin, 9 months ago
The idea of testing an agent with annoying situations, like uncooperative people or vague responses, makes me wonder whether similar approaches might someday be tried on humans. People could be (unknowingly) subjected to automated "social benchmarks" built from artificially designed situations; I'm sure I don't have to explain how dystopian that would be.

It would essentially be another form of behavioral interview. I wonder if this exists already, in some form?
telecomhacker, 9 months ago
I work in the telecom space, and I don't think this paradigm will be adopted in the near future. Customers are already building voice bots on top of Google Dialogflow, e.g. with Cognigy. Cognigy does have LLM capabilities, but they are not widely adopted. I think voice bots will still have to be manually configured for some time.
xan_ps007, 9 months ago
Is there an open-source variant available? I am building https://github.com/bolna-ai/bolna, which is an open-source voice orchestration framework.

Would love to have something like this integrated into our open-source stack.
rstocker99, 9 months ago
That drive-through customer… oh my. I have newfound empathy for drive-through operators.
bazlan, 9 months ago
As someone who has worked in TTS for over 4 years now, I can tell you that evaluation is the most difficult aspect of generative audio ML.

How will this really verify that the models are performing well, versus just listening?
prithvi24, 9 months ago
This is great to see. Evals on voice are hard: we only have evals for text-based prompting, and they don't fully capture everything. Excited to give this a try.
kinard, 9 months ago
I'm working on AI voice agents for real estate professionals here in the UK; unfortunately, I couldn't try your service.
vizhang92, 9 months ago
Awesome work, guys! Which industries / jobs do you suspect will adopt voice agents the fastest?
meiraleal, 9 months ago
There isn't even one reliable, proven "voice agent" yet (correct me if I'm wrong, but the best available, ElevenLabs, isn't good enough yet to count as a voice agent), and yet there are already companies selling tests for voice agents?

Selling shovels in a gold rush seems to have become the only mantra here.
plurby, 9 months ago
Wow, gonna test this with my Retell AI agent.