I did some research into this about a year ago. Some fun facts I learned:

- The median delay between speakers in a human-to-human conversation is zero milliseconds. In other words, about half the time one speaker interrupts the other, making the delay negative.

- Humans don't care about delays when speaking to a known AI. They assume the AI will need time to think. Most users will rate a 1000ms delay as acceptable and a 500ms delay as exceptional.

- Every voice assistant up to that point (and probably still today) has a minimum delay of about 300ms, because they all use silence detection to decide when to start responding, and you need about 300ms of silence to reliably distinguish the end of an utterance from a speaker's normal pause (a rough sketch of that kind of endpointing is at the end of this comment).

- Alexa actually has a setting to increase this wait time for slower speakers.

You'll notice in this demo video that the AI never interrupts him, which is part of what makes it feel like a not-quite-human interaction (along with the stilted intonation of the voice).

Humans appear to process speech in a much more streaming way, constantly updating their parse of the sentence, drawing on context clues and prior knowledge, until they have a high enough confidence level to respond.

For a voice assistant to reach "human" levels, it will have to work more like this: process the incoming speech in real time and respond as soon as it's confident it has heard enough to understand the meaning (also sketched below).
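
To make the 300ms point concrete, here's a minimal sketch of silence-based endpointing. The frame size, energy threshold, and the 300ms figure are assumptions for illustration, not any particular product's values; the point is just that the assistant cannot decide you've finished until it has watched ~300ms of silence go by, so its response is always at least that late.

```python
FRAME_MS = 20             # audio processed in 20 ms frames (assumed)
SILENCE_THRESHOLD = 0.01  # RMS energy below this counts as silence (assumed)
ENDPOINT_MS = 300         # silence required before declaring "utterance over"

def frame_is_silent(frame_rms: float) -> bool:
    return frame_rms < SILENCE_THRESHOLD

def detect_endpoint(frame_rms_values):
    """Return the index of the frame at which the endpointer fires,
    or None if the utterance never ends in this stream."""
    silent_ms = 0
    for i, rms in enumerate(frame_rms_values):
        if frame_is_silent(rms):
            silent_ms += FRAME_MS
            if silent_ms >= ENDPOINT_MS:
                # Only now can the assistant start responding -- hence the
                # built-in ~300 ms minimum delay after the user stops talking.
                return i
        else:
            silent_ms = 0  # speech resumed; it was just a normal pause
    return None

# Example: 200 ms of speech followed by silence.
stream = [0.5] * 10 + [0.0] * 20
print((detect_endpoint(stream) + 1) * FRAME_MS)  # 500: fires 300 ms after speech ends at 200 ms
```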
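
And here's a minimal sketch of the streaming alternative: re-evaluate the partial transcript after every new word and respond as soon as the system is confident the meaning is complete, rather than waiting for silence. The completeness scorer here is a toy placeholder; a real system would use an incremental ASR hypothesis and a trained end-of-turn model, not a keyword heuristic.

```python
from typing import Iterable, Optional

CONFIDENCE_TO_RESPOND = 0.9  # assumed threshold

def completeness_confidence(partial_transcript: str) -> float:
    """Placeholder for a model scoring how likely it is that the speaker's
    meaning is already complete (0.0 - 1.0)."""
    # Toy heuristic for illustration only.
    text = partial_transcript.strip().lower()
    if text.endswith("?") or text.endswith("please"):
        return 0.95
    return 0.3

def respond_when_confident(words: Iterable[str]) -> Optional[str]:
    """Consume words as they arrive; respond mid-utterance once confident."""
    transcript = ""
    for word in words:
        transcript = (transcript + " " + word).strip()
        if completeness_confidence(transcript) >= CONFIDENCE_TO_RESPOND:
            return f"(responding to: {transcript!r})"
    return None  # never became confident; fall back to silence detection

print(respond_when_confident(["what's", "the", "weather", "today?"]))
```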