科技回声 (Tech Echo) — a tech news platform built with Next.js, offering global tech news and discussion.

Show HN: Open source framework OpenAI uses for Advanced Voice

266 points | by russ | 7 months ago
Hey HN, we've been working with OpenAI for the past few months on the new Realtime API.

The goal is to give everyone access to the same stack that underpins Advanced Voice in the ChatGPT app.

Under the hood it works like this:

- A user's speech is captured by a LiveKit client SDK in the ChatGPT app
- Their speech is streamed using WebRTC to OpenAI's voice agent
- The agent relays the speech prompt over websocket to GPT-4o
- GPT-4o runs inference and streams speech packets (over websocket) back to the agent
- The agent relays generated speech using WebRTC back to the user's device

The Realtime API that OpenAI launched is the websocket interface to GPT-4o. This backend framework covers the voice agent portion. Besides additional logic like function calling, the agent fundamentally proxies WebRTC to websocket.

The reason is that websocket isn't the best choice for client-server communication. The vast majority of packet loss occurs between a server and a client device, and websocket doesn't provide programmatic control or intervention in lossy network environments like WiFi or cellular. Packet loss leads to higher latency and choppy or garbled audio.
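The agent's core job described above — pump audio frames from the WebRTC uplink to the model's websocket, and generated speech back the other way — can be sketched with plain asyncio queues standing in for the two transports. This is an illustration of the proxy pattern only; none of these names are LiveKit's actual API.

```python
import asyncio

async def relay(source: asyncio.Queue, sink: asyncio.Queue) -> None:
    """Forward frames from one transport to the other until end-of-stream."""
    while True:
        frame = await source.get()
        await sink.put(frame)
        if frame is None:  # sentinel: stream closed
            return

async def run_agent(user_uplink, model_uplink, model_downlink, user_downlink):
    """Run both directions concurrently, like the voice agent in the post:
    user speech (WebRTC side) -> model websocket, and generated speech
    (websocket side) -> back to the user's device."""
    await asyncio.gather(
        relay(user_uplink, model_uplink),
        relay(model_downlink, user_downlink),
    )
```

In the real system the interesting work lives at the edges of this loop — jitter buffering, loss concealment, and function-calling logic — which is exactly the part websocket alone doesn't give you on a lossy last mile.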

11 comments

racecar789, 7 months ago
Imagine being able to tell an app to call the IRS during the day, endure the on-hold wait times, then ask the question to the IRS rep and log the answer. Then deliver the answer when you get home.

Or, have the app call a pharmacy every month to refill prescriptions. For some drugs, the pharmacy requires a manual phone call to refill, which gets very annoying.

So many use cases for this.
throw14082020, 7 months ago
This is really helpful, thanks!

OpenAI hired the ex fractional CTO of LiveKit, who created Pion, a popular WebRTC library/tool.

I'd expect OpenAI to migrate off of LiveKit within 6 months. LiveKit is too expensive. Also, WebRTC is hard, and OpenAI, now being a less open company, will want to keep improvements to itself.

Not affiliated with any competitors, but I did work at a PaaS company similar to LiveKit that used Websockets instead.
pj_mukh, 7 months ago
Super cool! Didn't realize OpenAI is just using LiveKit.

Does the pricing break down to be the same as having an OpenAI Advanced Voice socket open the whole time? It's like $9/hr!

It would be theoretically cheaper to use this without keeping the Advanced Voice socket open the whole time: just use the GPT-4o streaming service [1] whenever inference is needed (pay per token) and use LiveKit's other components to do the rest (TTS, VAD, etc.).

What's the trade-off here?

[1]: https://platform.openai.com/docs/api-reference/streaming
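The ~$9/hr figure is easy to sanity-check with back-of-envelope arithmetic. A sketch, assuming the launch-era per-minute audio rates that were widely quoted at the time ($0.06/min audio in, $0.24/min audio out) — treat the defaults as placeholders, since pricing changes:

```python
def session_cost(minutes: float, output_fraction: float,
                 in_rate: float = 0.06, out_rate: float = 0.24) -> float:
    """Rough cost of one Realtime session in dollars.

    Assumes input audio is billed for the whole session, while output
    audio is billed only for the fraction of time the model is speaking.
    The default rates are assumed launch-era figures, not authoritative.
    """
    return minutes * in_rate + minutes * output_fraction * out_rate

# A one-hour session where the model speaks ~37.5% of the time lands
# right on the ~$9/hr figure mentioned above:
hourly = session_cost(60, 0.375)  # 3.60 input + 5.40 output = 9.00
```

Which is why a pay-per-token text pipeline plus separate TTS/VAD can pencil out cheaper: you stop paying the always-on audio meter and trade away some latency and expressiveness instead.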
solarkraft, 7 months ago
That's some crazy marketing for an "our library happened to support this relatively simple use case" situation. Impressive!

By the way: the Cerebras voice demo *also* uses LiveKit for this: https://cerebras.vercel.app/
FanaHOVA, 7 months ago
Olivier, Michelle, and Romain gave you guys a shoutout like 3 times in our DevDay recap podcast, if you need more testimonial quotes :) https://www.latent.space/p/devday-2024
spuz, 7 months ago
Is there anyone besides OpenAI working on a speech-to-speech model? I find it incredibly useful, and it's the sole reason I pay for their service, but I do find it very limited. I'd be interested to know if any other groups are doing research on voice models.
mycall, 7 months ago
I wonder when Azure OpenAI will get this.
0x1ceb00da, 7 months ago
This suggests that the AI "brain" receives the user input as a text prompt (the agent relays the speech prompt to GPT-4o) and generates audio as output (GPT-4o streams speech packets back to the agent).

But when I asked Advanced Voice mode, it said the exact opposite: that it receives input as audio and generates text as output.
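Both directions are in fact configurable per session: the Realtime API's documented `session.update` event lets a client request text and/or audio output from the speech-native model, and optionally attach a transcription of the audio input. A sketch of the payload — field values beyond the documented event shape are illustrative:

```python
import json

# session.update asks the server to reconfigure the current session.
session_update = {
    "type": "session.update",
    "session": {
        "modalities": ["text", "audio"],    # what the model may emit
        "input_audio_format": "pcm16",      # raw 16-bit PCM on the uplink
        "input_audio_transcription": {      # optional: also transcribe the
            "model": "whisper-1",           # user's speech to text
        },
    },
}

# Events travel over the websocket as JSON text frames.
wire_message = json.dumps(session_update)
```

So "audio in, audio out" and "audio in, text out" aren't mutually exclusive — the model can be asked for both, which may explain the contradictory self-description.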
gastonmorixe, 7 months ago
Nice, they have many partners on this. I see Azure as well.

There is a common consensus that the new Realtime API is not actually using the same Advanced Voice model/engine (or however it works), since at least the TTS part doesn't seem to be as capable as the one shipped with the official OpenAI app.

Any idea on this?

Source: https://github.com/openai/openai-realtime-api-beta/issues/2
lolpanda, 7 months ago
So WebRTC helps with the unreliable network between mobile clients and the server side. If the application is backend-only, would it make sense to use WebRTC, or should I go directly to the Realtime API?
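For a backend-only application there is no lossy last-mile WiFi/cellular hop, so connecting straight to the Realtime API's websocket is the simpler route. A minimal sketch, assuming the documented endpoint and headers; the model name is the launch-era preview snapshot and may have been superseded, and the `websockets` package is a third-party dependency:

```python
import asyncio
import json
import os

# Endpoint and handshake headers per OpenAI's Realtime API docs.
REALTIME_URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"

def realtime_headers(api_key: str) -> dict:
    """Headers the Realtime websocket handshake expects."""
    return {
        "Authorization": f"Bearer {api_key}",
        "OpenAI-Beta": "realtime=v1",
    }

async def talk_to_model() -> None:
    import websockets  # third-party: pip install websockets

    headers = realtime_headers(os.environ["OPENAI_API_KEY"])
    # Note: older websockets versions name this parameter extra_headers.
    async with websockets.connect(REALTIME_URL,
                                  additional_headers=headers) as ws:
        # The server opens every session with a session.created event.
        event = json.loads(await ws.recv())
        print(event["type"])

if __name__ == "__main__":
    asyncio.run(talk_to_model())
```

WebRTC still earns its keep server-side if you need its media-level features (jitter buffers, loss recovery), but between two well-connected datacenters a plain websocket is usually sufficient.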
willsmith72, 7 months ago
That was cool, but got up to $1 usage real quick