Hi guys,

After some research and no luck finding anyone who seems to be working on this, I thought I'd try a Hail Mary and post on here.

I'm looking to speak to anyone who is working on speech-to-video (real-time speech rendering). We already have software that can take audio (speech) input and render a video resembling a person or avatar speaking, but the rendering takes a long time.

How long will it be before the video of the person/avatar speaking can be rendered in near real time, with latency similar to existing speech-to-text models?

What would a prototype that reduces the latency look like? Is anyone working on anything like this?

For context, I run a language learning app where you can practice speaking with an AI conversation partner. It would be far more engaging if the user had an avatar/person to speak to, rather than staring at the chat history whilst talking to the AI.

Thanks,
Chris

For context, here's the original post: https://news.ycombinator.com/item?id=36973400