
Show HN: Real-time voice chat with AI, no transcription

33 points, by huac, about 1 year ago
Hi HN -- voice chat with AI is very popular these days, especially with YC startups (https://twitter.com/k7agar/status/1769078697661804795). The current approaches are all cascaded: audio -> transcription -> language model -> speech synthesis. This approach is easy to get started with, but it introduces a lot of complexity and has a few glaring limitations. Most notably, transcription is slow, is lossy (and any error propagates to the rest of the system), cannot capture emotional affect, is often not robust to code-switching or accents, and more.

Instead, what if we fed audio directly to the LLM? LLMs are really smart -- can they figure it out? This approach is faster (we skip transcription decoding) and less lossy/more robust, because the big language model should be smarter than a smaller transcription decoder.

I've trained a model in just that fashion. For more architectural information and some training details, see this first post: https://tincans.ai/slm . For details about this model and some ideas for how to prompt it, see this post: https://tincans.ai/slm3 . We trained this on a very limited budget, but the model is able to do some things that even GPT-4, Gemini, and Claude cannot, e.g. reasoning about long-context audio directly, without transcription. We also believe this is the first model in the world to conduct adversarial attacks and apply preference modeling in the speech domain.

The demo is unoptimized (unquantized bf16 weights, default Hugging Face inference, serverless speed bumps) but achieves 120ms time to first token with audio. You can basically think of it as Mistral 7B, so it'll be very fast and can also run basically anywhere. I am especially optimistic about embedded usage -- not needing the transcription step means that the resulting model is smaller and cheaper to use on the edge.

Would love to hear your thoughts and how you would use it! Weights are Apache-2 and on Hugging Face: https://huggingface.co/collections/tincans-ai/gazelle-v02-65f9b667385ba36893e82469
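The core idea -- project the output of a speech encoder into the LLM's embedding space and let the language model attend over audio and text jointly, with no transcription step -- can be sketched in a few lines of PyTorch. This is an illustrative outline only, not the released Gazelle code: the encoder choice, projector shape, and model names below are assumptions for the example; the actual architecture and training recipe are described in the linked tincans.ai posts.

```python
# Sketch of an "audio encoder -> projection -> LLM" pipeline (assumed details,
# not the Gazelle implementation): a frozen speech encoder and a frozen LLM,
# with only a small projector mapping between their embedding spaces.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer, Wav2Vec2Model

class AudioProjector(nn.Module):
    """Maps audio-encoder hidden states into the LLM's token-embedding space."""
    def __init__(self, audio_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(audio_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, audio_hidden: torch.Tensor) -> torch.Tensor:
        return self.proj(audio_hidden)  # (batch, audio_frames, llm_dim)

audio_encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
llm = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1", torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

projector = AudioProjector(
    audio_dim=audio_encoder.config.hidden_size,  # 768 for wav2vec2-base
    llm_dim=llm.config.hidden_size,              # 4096 for Mistral 7B
)

def build_inputs(waveform: torch.Tensor, prompt: str) -> torch.Tensor:
    """Concatenate projected audio embeddings with the prompt's token embeddings."""
    with torch.no_grad():
        audio_hidden = audio_encoder(waveform).last_hidden_state  # (1, frames, 768)
    audio_embeds = projector(audio_hidden).to(llm.dtype)          # (1, frames, 4096)
    text_ids = tokenizer(prompt, return_tensors="pt").input_ids
    text_embeds = llm.get_input_embeddings()(text_ids)            # (1, tokens, 4096)
    # The LLM attends over audio frames and text tokens together -- no transcript.
    return torch.cat([audio_embeds, text_embeds], dim=1)

# The result can be passed to llm.generate(inputs_embeds=...) for decoding.
```

In a setup like this, only the projector would be trained, which is one reason such an approach can work on a limited budget; skipping the transcription decode is also where the latency win over a cascaded pipeline comes from.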

4 comments

drippingfist, about 1 year ago
I'm building various prototypes for VR training simulations using Inworld, but they also use the cascaded approach. I'm also building a customer service agent product which we would love to add voice to, but Whisper and ElevenLabs (and others) are just too slow. Is Tincans available via API?
akrymski, about 1 year ago
Very cool. If I ask it to deduce the gender of my voice, can it do that? Training a projection layer makes sense, but ultimately you'd want to output audio conditioned on the input rather than text. Is there a way to train a reverse projection with some kind of skip connections to take the audio input into account? Or an end-to-end audio model?
codekansas, about 1 year ago
Very cool! How is this differentiated from ChatGPT voice?
Diris, about 1 year ago
Very cool!!! I had this idea a while ago. Is the conversational part of the dataset open?