Our goal with this project is to build a completely open-source, state-of-the-art turn detection model that can be used in any voice AI application.<p>I've been experimenting with LLM voice conversations since GPT-4 was first released. (There's a previous front page Show HN about Pipecat, the open-source voice AI orchestration framework I work on. [1])<p>It's been almost two years, and for most of that time, I've been expecting that someone would "solve" turn detection. We all built initial, pretty good 80/20 versions of turn detection on top of VAD (voice activity detection) models. And then, as an ecosystem, we kind of got stuck.<p>A few production applications have recently started using Gemini 2.0 Flash to do context-aware turn detection. [2] But because latency is ~500ms, that's a more complicated approach than using a specialized model. The team at LiveKit released an open-weights model that does text-based turn detection. [3] I was really excited to see that, but I'm not super-optimistic that a text-input model will ever be good enough for this task. (A good rule of thumb in deep learning is that you should bet on end-to-end.)<p>So ... I spent Christmas break training several little proof-of-concept models and experimenting with generating synthetic audio data. So, so, so much fun. The results were promising enough that I nerd-sniped a few friends and we started working in earnest on this.<p>The model now performs really well on a subset of turn detection tasks. Too well, really. We're overfitting on a not-terribly-broad initial data set of about 8,000 samples. Getting to this point was the initial bar we set for doing a public release and seeing if other people want to get involved in the project.<p>There are lots of ways to contribute. [4]<p>Medium-term goals for the project are:<p><pre><code> - Support for a wide range of languages
- Inference time of <50ms on GPU and <500ms on CPU
- Much wider range of speech nuances captured in training data
- A completely synthetic training data pipeline. (Maybe?)
- Text conditioning of the model, to support "modes" like credit card, telephone number, and address entry.
</code></pre>
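<p>For anyone who hasn't built one, here is a rough sketch of the kind of VAD-based 80/20 turn detection mentioned earlier. It is not the smart-turn model and not Pipecat's implementation. The class name, the 20ms frame size, the 0.5 speech threshold, and the 800ms silence timeout are all assumptions picked for illustration. The input is assumed to be one speech probability per audio frame, coming from whatever VAD model you use. The weakness is easy to see: a fixed silence timeout can't tell a thinking pause from a finished turn.

```python
from dataclasses import dataclass


@dataclass
class SilenceTurnDetector:
    """Naive end-of-turn detector driven by per-frame VAD speech probabilities.

    All default values below are illustrative assumptions, not tuned numbers.
    """

    frame_ms: int = 20             # assumed duration of one audio frame
    speech_threshold: float = 0.5  # assumed VAD probability cutoff for "speech"
    end_silence_ms: int = 800      # assumed trailing silence that ends a turn

    heard_speech: bool = False     # has the user said anything in this turn?
    silence_ms: int = 0            # trailing silence accumulated so far

    def update(self, speech_prob: float) -> bool:
        """Feed one frame's VAD probability; return True when the turn ends."""
        if speech_prob >= self.speech_threshold:
            self.heard_speech = True
            self.silence_ms = 0
            return False
        if not self.heard_speech:
            # Leading silence before the user starts talking never ends a turn.
            return False
        self.silence_ms += self.frame_ms
        if self.silence_ms >= self.end_silence_ms:
            # Reset so the same detector can be reused for the next turn.
            self.heard_speech = False
            self.silence_ms = 0
            return True
        return False


def first_end_of_turn(probs, detector):
    """Return the index of the frame where the turn ends, or None if it never does."""
    for i, p in enumerate(probs):
        if detector.update(p):
            return i
    return None


if __name__ == "__main__":
    # 100ms of silence, 200ms of speech, then 1s of silence.
    probs = [0.1] * 5 + [0.9] * 10 + [0.1] * 50
    print(first_end_of_turn(probs, SilenceTurnDetector()))
```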
If you're interested in voice AI or in audio model ML engineering, please try the model out and see what you think. I'd love to hear your thoughts and ideas.<p>[1] <a href="https://news.ycombinator.com/item?id=40345696">https://news.ycombinator.com/item?id=40345696</a><p>[2] <a href="https://x.com/kwindla/status/1870974144831275410" rel="nofollow">https://x.com/kwindla/status/1870974144831275410</a><p>[3] <a href="https://blog.livekit.io/using-a-transformer-to-improve-end-of-turn-detection/" rel="nofollow">https://blog.livekit.io/using-a-transformer-to-improve-end-o...</a><p>[4] <a href="https://github.com/pipecat-ai/smart-turn#things-to-do">https://github.com/pipecat-ai/smart-turn#things-to-do</a>