科技回声

Our goal with this project is to build a completely open source, state of the art turn detection model that can be used in any voice AI application.

I've been experimenting with LLM voice conversations since GPT-4 was first released. (There's a previous front page Show HN about Pipecat, the open source voice AI orchestration framework I work on. [1])

It's been almost two years, and for most of that time, I've been expecting that someone would "solve" turn detection. We all built initial, pretty good 80/20 versions of turn detection on top of VAD (voice activity detection) models. And then, as an ecosystem, we kind of got stuck.

A few production applications have recently started using Gemini 2.0 Flash to do context aware turn detection. [2] But because latency is ~500ms, that's a more complicated approach than using a specialized model. The team at LiveKit released an open weights model that does text-based turn detection. [3] I was really excited to see that, but I'm not super-optimistic that a text-input model will ever be good enough for this task. (A good rule of thumb in deep learning is that you should bet on end-to-end.)

So ... I spent Christmas break training several little proof of concept models, and experimenting with generating synthetic audio data. So, so, so much fun. The results were promising enough that I nerd-sniped a few friends and we started working in earnest on this.

The model now performs really well on a subset of turn detection tasks. Too well, really. We're overfitting on a not-terribly-broad initial data set of about 8,000 samples. Getting to this point was the initial bar we set for doing a public release and seeing if other people want to get involved in the project.

There are lots of ways to contribute. [4]

Medium-term goals for the project are:

- Support for a wide range of languages
- Inference time of <50ms on GPU and <500ms on CPU
- Much wider range of speech nuances captured in training data
- A completely synthetic training data pipeline. (Maybe?)
- Text conditioning of the model, to support "modes" like credit card, telephone number, and address entry.

If you're interested in voice AI or in audio model ML engineering, please try the model out and see what you think. I'd love to hear your thoughts and ideas.

[1] https://news.ycombinator.com/item?id=40345696

[2] https://x.com/kwindla/status/1870974144831275410

[3] https://blog.livekit.io/using-a-transformer-to-improve-end-of-turn-detection/

[4] https://github.com/pipecat-ai/smart-turn#things-to-do
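
For readers new to the problem, the "80/20" baseline mentioned in the post (turn detection layered on top of a VAD) can be sketched roughly as a silence timeout. This is an illustrative sketch only, not code from Pipecat or smart-turn: the class name, the 20 ms frame size, and the timeout values are all assumptions, and the per-frame speech/non-speech flags stand in for the output of a real VAD model.

```python
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass
class SilenceTimeoutTurnDetector:
    """Hypothetical baseline: end the turn after `stop_ms` of continuous silence."""

    frame_ms: int = 20   # assumed duration of one VAD frame
    stop_ms: int = 800   # assumed silence needed before the turn is considered over
    _speaking: bool = field(default=False, init=False)
    _silence_ms: int = field(default=0, init=False)

    def update(self, is_speech: bool) -> bool:
        """Feed one VAD decision; return True on the frame where a turn ends."""
        if is_speech:
            self._speaking = True
            self._silence_ms = 0
            return False
        if not self._speaking:
            return False  # silence before any speech: no turn in progress
        self._silence_ms += self.frame_ms
        if self._silence_ms >= self.stop_ms:
            self._speaking = False
            self._silence_ms = 0
            return True
        return False


def end_of_turn_frames(vad_flags: Iterable[bool],
                       detector: SilenceTimeoutTurnDetector) -> List[int]:
    """Return the frame indices at which the detector declares end of turn."""
    return [i for i, flag in enumerate(vad_flags) if detector.update(flag)]


if __name__ == "__main__":
    # Speech, a 300 ms "thinking" pause, more speech, then 1 s of real silence.
    flags = [True] * 10 + [False] * 15 + [True] * 10 + [False] * 50
    print(end_of_turn_frames(flags, SilenceTimeoutTurnDetector(stop_ms=200)))  # [19, 44]
    print(end_of_turn_frames(flags, SilenceTimeoutTurnDetector(stop_ms=800)))  # [74]
```

With the short timeout the detector cuts in during the mid-sentence pause (frame 19), while the long timeout waits out the pause but only reacts 800 ms after the speaker has actually finished. That tradeoff between interrupting and lagging is what a model that listens to the audio itself, rather than only to silence, is meant to improve on.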

10 comments

pzo 2 months ago

I will have a look at this. I played with Pipecat before and it's great, but I switched to sherpa-onnx since I need something that compiles to native code and can run on edge devices.

I'm not sure turn detection can really be solved short of a dedicated push-to-talk button, like on a walkie-talkie. I have often tried the Google Translate app, and the problem is that when you speak a longer sentence you will often stop or slow down a little to gather your thoughts before continuing (especially if you are not a native speaker). For this reason I avoid conversation mode in apps like Google Translate, and when using the Perplexity app I prefer the push-to-talk mode over the new one.

I think this could be solved, but we would need not only low-latency turn detection but also low-latency speech interruption detection and a very fast, low-latency on-device LLM. And when there is an interruption, we need good recovery, so that the system knows we are continuing the last sentence instead of discarding the previous audio and starting anew.

Lots of things can also be improved regarding I/O latency: using a low-latency audio API, a very short audio buffer, a dedicated audio category and mode (on iOS), wired headsets instead of the built-in speaker, and turning off system processing such as audio boosting or polar pattern on the iPhone. And streaming mode for everything: STT, transport (when using a remote LLM), and TTS. I'm not sure if we can have TTS in streaming mode; I think most of the time they split by sentence.

I think push to talk is a good solution if well designed: a big button in a place easily reached with your thumb, integration with the iPhone Action button, haptics for feedback, using an Apple Watch as a big push button, etc.

kwindla 2 months ago

A couple of interesting updates today:

- 100ms inference using CoreML: https://x.com/maxxrubin_/status/1897864136698347857
- An LSTM model (1/7th the size) trained on a subset of the data: https://github.com/pipecat-ai/smart-turn/issues/1

foundzen 2 months ago

I got most of my answers from the README. Well written. I read most of it. Can you share what kind of resources (and how much of them) were required to fine tune Wav2Vec2-BERT?

remram 2 months ago

Ok what's turn detection?

xp84 2 months ago

I'm excited to see this particular technology developing more. From the absolute worst speech systems such as Siri, who will happily interrupt to respond with nonsense at the slightest half-pause, to even ChatGPT voice mode which at least tries, we haven't yet successfully got computers to do a good job of this - and I feel it may be the biggest obstacle in making 'agents' that are competent at completing simple but useful tasks. There are so many situations where humans "just know" when someone hasn't yet completed a thought, but "AI" still struggles, and those errors can just destroy the efficiency of a conversation or worse, lead to severe errors in function.

zamalek 2 months ago

As a [diagnosed] HF autistic person, this is unironically something I would go for in an earpiece. How many parameters does the model have?

written-beyond 2 months ago

Having reviewed a few turn-based models, I think your implementation is pretty much in line with them. Excited to see how this matures!

prophesi 2 months ago

I'd love for Vedal to incorporate this in Neuro-sama's model. An osu bot turned AI Vtuber [0].

[0] https://www.youtube.com/shorts/eF6hnDFIKmA

lostmsu 2 months ago

Does this support multiple speakers?

cyberbiosecure 2 months ago

forking...

Show HN: Open-source, native audio turn detection model
