I built RealtimeVoiceChat because I was frustrated with the latency in most voice AI interactions. This is an open-source (MIT license) system designed for real-time, local voice conversations with LLMs.

Quick demo video (50s): https://www.youtube.com/watch?v=HM_IQuuuPX8

The goal is to get closer to natural conversation speed. It uses audio chunk streaming over WebSockets, RealtimeSTT (based on Whisper), and RealtimeTTS (supporting engines like Coqui XTTSv2 and Kokoro) to achieve around 500ms response latency, even when running larger local models like a 24B Mistral fine-tune via Ollama.

Key aspects: designed for local LLMs (Ollama primarily, with an OpenAI connector included); interruptible conversation; smart turn detection to avoid cutting the user off mid-thought; a Dockerized setup available for easier dependency management.

It requires a decent CUDA-enabled GPU for good performance due to the STT/TTS models.

Would love to hear your feedback on the approach, performance, potential optimizations, or any features you think are essential for a good local voice AI experience.

The code is here: https://github.com/KoljaB/RealtimeVoiceChat
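For anyone wondering what "audio chunk streaming over WebSockets" looks like on the client side, here's a minimal sketch. It is not the project's actual wire protocol; the endpoint path, chunk size, and message format are placeholder assumptions. Mic audio goes out as small PCM chunks, and partial transcripts or TTS events come back on the same socket.

```python
# Minimal client-side sketch (hypothetical /ws endpoint, 30 ms chunks).
import asyncio
import queue

import sounddevice as sd
import websockets

SAMPLE_RATE = 16_000
CHUNK_MS = 30  # ~30 ms of audio per WebSocket message

audio_q: "queue.Queue[bytes]" = queue.Queue()

def mic_callback(indata, frames, time_info, status):
    # Called by sounddevice on its own thread; hand raw bytes to asyncio.
    audio_q.put(bytes(indata))

async def stream(uri: str = "ws://localhost:8000/ws"):
    blocksize = SAMPLE_RATE * CHUNK_MS // 1000
    async with websockets.connect(uri) as ws:
        with sd.RawInputStream(samplerate=SAMPLE_RATE, blocksize=blocksize,
                               channels=1, dtype="int16",
                               callback=mic_callback):
            while True:
                # Forward queued mic chunks without blocking the event loop.
                while not audio_q.empty():
                    await ws.send(audio_q.get())
                # Drain any server messages (partial transcripts, TTS events).
                try:
                    msg = await asyncio.wait_for(ws.recv(), timeout=0.01)
                    print("server:", msg)
                except asyncio.TimeoutError:
                    pass

if __name__ == "__main__":
    asyncio.run(stream())
```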
Every time I see these things, they look cool as hell, I get excited, then I try to get them working on my gaming PC (the one with the GPU), I spend 1-2 hours fighting with Python, and I give up.

Today's issue is that my Python version is 3.12 instead of <3.12,>=3.9. Installing Python 3.11 from the official website does nothing, so I give up. It's a shame that the amazing work done by people like the OP gets underused because of this mess outside of their control.

"Just use Docker." Have you tried using Docker on Windows? There's a reason I never do dev work on Windows.

I spent most of my career in the JVM and Node, and despite the issues there, I never had to deal with this level of incompatibility.
Saying this as a user of these tools (OpenAI, Google voice chat, etc.): these are fast, yes, but they don't allow talking naturally with pauses. When we talk, we take long and short pauses for thinking or for other reasons.

With these tools, the AI starts talking as soon as we stop. This happens in both the text and voice chat tools.

I saw a demo on Twitter a few weeks back where the AI waited for the person to actually finish what he was saying. The length of the pauses wasn't a problem. I don't know how complex that problem is, though. Probably another AI needs to analyse the input so far and decide whether it's a pause or not.
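One way to attack that (a sketch, not how any shipping product does it): feed the partial transcript to a small local LLM and ask whether the utterance looks finished, and only treat silence as end-of-turn when the classifier agrees. The model name and prompt below are placeholders, and this assumes a local Ollama server.

```python
import requests

PROMPT = (
    "You are a turn-taking detector. Given a partial spoken utterance, answer "
    "only COMPLETE if the speaker has likely finished their thought, or "
    "INCOMPLETE if they are probably pausing mid-sentence.\n\nUtterance: {text}"
)

def looks_finished(partial_transcript: str,
                   model: str = "llama3.2",  # assumed local Ollama model
                   url: str = "http://localhost:11434/api/generate") -> bool:
    resp = requests.post(url, json={
        "model": model,
        "prompt": PROMPT.format(text=partial_transcript),
        "stream": False,
    }, timeout=10)
    resp.raise_for_status()
    answer = resp.json()["response"].strip().upper()
    return answer.startswith("COMPLETE")

# "So what I was thinking is"      -> INCOMPLETE: keep waiting through the pause
# "What's the weather tomorrow?"   -> COMPLETE:   respond now
```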
This is very, very cool! The interrupting was a "wow" moment for me (I know it's not "new new", but to see it so well done in open source was awesome).

A question about the interrupt feature: how does it handle "Mmk", "Yes", "Of course", *cough*, etc.? Aside from the sycophancy of OpenAI's voice chat (no, not every question I ask is a "great question!"), I dislike that a noise sometimes stops the AI from responding and there isn't a great way to get back on track, to pick up where you left off.

It's a hard problem: how do you stop replying quickly AND make sure you are stopping for a good reason?
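One cheap heuristic for the backchannel part (a sketch; the word list and length threshold are arbitrary choices): only treat user audio as a real interruption if the transcribed fragment is more than a short acknowledgement or empty noise.

```python
BACKCHANNELS = {"mm", "mmk", "mhm", "yeah", "yes", "ok", "okay", "right",
                "sure", "of course", "uh huh", "got it"}

def is_real_interruption(fragment: str, min_words: int = 3) -> bool:
    text = fragment.lower().strip(" .,!?")
    if not text:                      # noise that produced no transcript
        return False
    if text in BACKCHANNELS:          # acknowledgement: keep talking
        return False
    return len(text.split()) >= min_words

# is_real_interruption("mmk")                  -> False (keep speaking)
# is_real_interruption("wait, go back a step") -> True  (stop TTS, listen)
```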
I did some research into this about a year ago. Some fun facts I learned:

- The median delay between speakers in a human-to-human conversation is zero milliseconds. In other words, about half the time, one speaker interrupts the other, making the delay negative.

- Humans don't care about delays when speaking to known AIs. They assume the AI will need time to think. Most users will qualify a 1000ms delay as acceptable and a 500ms delay as exceptional.

- Every voice assistant up to that point (and probably still today) has a minimum delay of about 300ms, because they all use silence detection to decide when to start responding, and you need about 300ms of silence to reliably differentiate that from a speaker's normal pause.

- Alexa actually has a setting to increase this wait time for slower speakers.

You'll notice in this demo video that the AI never interrupts him, which is part of what makes it feel like a not-quite-human interaction (plus the stilted intonations of the voice).

Humans appear to process speech in a much more streaming way, constantly updating their parse of the sentence until they have a high enough confidence level to respond, using context clues and prior knowledge.

For a voice assistant to reach "human" levels, it will have to work more like this, processing the incoming speech in real time and responding when it's confident it has heard enough to understand the meaning.
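For concreteness, the silence-detection baseline I'm describing looks roughly like this (a sketch using webrtcvad on 30 ms, 16 kHz mono frames; the 300ms hold time is the knob that Alexa's "wait longer" setting effectively turns up):

```python
import webrtcvad

class SilenceEndpointer:
    def __init__(self, hold_ms: int = 300, frame_ms: int = 30,
                 sample_rate: int = 16_000, aggressiveness: int = 2):
        self.vad = webrtcvad.Vad(aggressiveness)
        self.frame_ms = frame_ms
        self.sample_rate = sample_rate
        self.hold_ms = hold_ms          # the "wait for slower speakers" knob
        self.silence_ms = 0
        self.heard_speech = False

    def push(self, frame: bytes) -> bool:
        """Feed one frame of 16-bit PCM; returns True when the turn ends."""
        if self.vad.is_speech(frame, self.sample_rate):
            self.heard_speech = True
            self.silence_ms = 0
        elif self.heard_speech:
            self.silence_ms += self.frame_ms
        return self.heard_speech and self.silence_ms >= self.hold_ms
```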
Maybe of interest: I built and open-sourced a similar (web-based) end-to-end voice project last year for an AMD Hackathon: https://github.com/lhl/voicechat2

As a submission for an AMD Hackathon, one big thing is that I tested all the components to work with RDNA3 cards. It's built to allow swappable components for the SRT, LLM, and TTS (the tricky parts were making WebSockets work and doing some sentence-based interleaving to lower latency).

Here's a full write-up on the project: https://www.hackster.io/lhl/voicechat2-local-ai-voice-chat-4c48f2

(I don't really have time to maintain that project, but it can be a good starting point for anyone looking to hack their own thing together.)
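The sentence-based interleaving is simple in principle: cut the LLM's token stream at sentence boundaries and hand each sentence to TTS as soon as it completes, rather than waiting for the whole reply. A stripped-down sketch, where `token_stream` and `speak` stand in for whatever LLM and TTS backends get plugged in:

```python
import re
from typing import Callable, Iterable

SENTENCE_END = re.compile(r"([.!?])\s")  # naive: will split on "Dr. Smith"

def interleave(token_stream: Iterable[str], speak: Callable[[str], None]):
    buffer = ""
    for token in token_stream:
        buffer += token
        match = SENTENCE_END.search(buffer)
        while match:
            sentence, buffer = buffer[:match.end(1)], buffer[match.end():]
            speak(sentence.strip())     # TTS starts while the LLM keeps going
            match = SENTENCE_END.search(buffer)
    if buffer.strip():
        speak(buffer.strip())           # flush the final partial sentence
```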
This is great. Poking into the source, I find it interesting that the author implemented a custom turn detection strategy instead of using Silero VAD (which is standard in the voice agents space). I'm very curious why they did it this way and what benefits they observed.

For folks who are curious about the state of the voice agents space, Daily (the WebRTC company) has a great guide [1], as well as an open-source framework that lets you build AI voice chat similar to OP's, with lots of utilities [2].

Disclaimer: I work at Cartesia, which serves a lot of these voice agent use cases, and Daily is a friend.

[1]: https://voiceaiandvoiceagents.com

[2]: https://docs.pipecat.ai/getting-started/overview
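For reference, the standard Silero VAD path looks roughly like this (based on the snakers4/silero-vad README; check the repo for the current API before relying on it):

```python
import torch

model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
(get_speech_timestamps, _, read_audio, VADIterator, _) = utils

wav = read_audio("sample.wav", sampling_rate=16_000)
speech = get_speech_timestamps(wav, model, sampling_rate=16_000)
print(speech)  # e.g. [{'start': 4000, 'end': 28000}, ...] in samples
```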
Cool for a weekend project, but honestly ChatGPT is still kinda shit at dialogue. I wonder whether that's an issue with the technology or with OpenAI's fine-tuning (I suspect the latter), but it cannot talk like normal people do: shut up if it has nothing of value to add, ask *reasonable* follow-up questions if the user doesn't understand something or there's ambiguity in the question. Also, on the topic of follow-up questions: I don't remember which update introduced that attempt to increase engagement by finishing every post with a stupid, irrelevant follow-up question, but it's really annoying. It also works on me: despite hating ChatGPT, it's kind of an instinct to treat something that speaks vaguely like a human as if it were human.
I'm starting to feel like LLMs need to be tuned for shorter responses. For every short sentence you give them, they output paragraphs of text. Sometimes it's even good text, but not every input sentence needs a mini-essay in response.

Very cool project, though. Maybe you can fine-tune the prompt to change how chatty your AI is.
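A sketch of the prompt-level version of this, assuming a local Ollama server (the model name and exact wording are placeholders): a system prompt that caps answer length for spoken replies.

```python
import requests

SYSTEM = ("You are a voice assistant. Answer in at most two short sentences. "
          "No lists, no preamble, no follow-up questions unless asked.")

def ask(user_text: str, model: str = "mistral-small") -> str:
    resp = requests.post("http://localhost:11434/api/chat", json={
        "model": model,
        "stream": False,
        "messages": [
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": user_text},
        ],
    }, timeout=60)
    resp.raise_for_status()
    return resp.json()["message"]["content"]
```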
We really, really need something to take Whisper's crown for streaming. Faster-whisper is great, but Whisper itself was never built for real-time use.

For this demo to be real-time, it relies on having a beefy enough GPU that it can push 30 seconds of audio through one of the more capable (therefore bigger) models in a couple of hundred milliseconds. It's basically throwing hardware at the problem to paper over the fact that Whisper is just the wrong architecture.

Don't get me wrong, it's great where it's great, but that's just not streaming.
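To make that concrete, the brute-force pattern these real-time setups lean on is roughly "re-transcribe the whole buffer every time a chunk arrives and take the latest text". A sketch with faster-whisper (model size and chunking are arbitrary choices); note that the cost grows with utterance length, which is exactly the problem:

```python
import numpy as np
from faster_whisper import WhisperModel

model = WhisperModel("small", device="cuda", compute_type="float16")
buffer = np.zeros(0, dtype=np.float32)   # 16 kHz mono float32 samples

def on_chunk(chunk: np.ndarray) -> str:
    """Append ~0.5 s of audio and return the current best transcript."""
    global buffer
    buffer = np.concatenate([buffer, chunk])
    segments, _ = model.transcribe(buffer, language="en", beam_size=1)
    return "".join(seg.text for seg in segments)
```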
Kind of surprised nobody has brought up https://www.sesame.com/research/crossing_the_uncanny_valley_of_voice#demo

It interacts nearly like a human, can and does interrupt me once it has enough context in many situations, and has exceedingly low latency. Using it for the first time was a fairly shocking experience for me.
This is an impressive project, great work! I'm curious whether anyone has come across similar work for multi-lingual voice agents, especially ones that handle non-English languages and English + X well.

Does a translation step right after the ASR step make sense at all?

Any pointers (papers, repos) would be appreciated!
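Not a full answer, but one relevant data point on the ASR-then-translate question: Whisper-family models can fold X-to-English translation into the ASR pass itself (task="translate"), so a separate MT stage may only matter when the output language isn't English. A sketch with faster-whisper; the model size is arbitrary:

```python
from faster_whisper import WhisperModel

model = WhisperModel("medium", device="cuda", compute_type="float16")

def transcribe_to_english(wav_path: str) -> str:
    # task="translate" emits English text regardless of the spoken language.
    segments, info = model.transcribe(wav_path, task="translate")
    print("detected language:", info.language)
    return "".join(seg.text for seg in segments)
```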
This kind of thing immediately made me think of the 512GB Mac Studio. If this works as well on that hardware as it does on the recommended Nvidia cards, then the $15k is not so much the price of the hardware as the price of having a fully conversational AI at home, privately.
That's a big improvement over Siri, tbh (the interruption and the latency), but Siri's answers are generally shorter than this. My general experience with Siri hasn't been great lately. For complex questions, it just redirects to ChatGPT, with an extra step for me to confirm. It often stops listening when I'm not even finished with my sentence, and gives "I don't know anything about that" way too often.
Impressive! I guess the speech synthesis quality is the best available in open source at the moment?

Surely the endgame of this is a continuously running wave-to-wave model with no text tokens at all? Or at least none in the main path.
Quite good, though it would sound much better with SOTA voices: https://github.com/nari-labs/dia
The demo reminded me of this amazing post: https://sambleckley.com/writing/church-of-interruption.html
In the demo, is there any specific reason the voice doesn't go up in pitch when asking questions? Even the (many) rhetorical questions would, in my view, improve with a bit of a pitch change before the question mark.
What are currently the best options for low latency TTS and STT as external services? If you want to host an app with these capabilities on a VPS, anything that requires a GPU doesn't seem feasible.
Have you considered using Dia for the TTS?
I believe this is currently "best in class": https://github.com/nari-labs/dia
Can this be tweaked somehow to try to reproduce the experience of Aqua Voice? https://withaqua.com/