
Show HN: VivaChat – FaceTime with AI

10 points | by varunjain99 | 8 months ago
TLDR: Video chat with realistic AI persons: https://app.vivalabs.ai/chat

Hey AI enthusiasts,

After GPT's Advanced Voice Mode launch, we're more excited than ever about authentic conversations with AI. We think the next frontier will be incorporating video feeds.

On VivaChat, you can video chat with an AI therapist, digital marketing consultant, English language tutor, or a tech recruiter. We're also beginning to support business use cases around customer support, HR, and education.

Building this was a blast! We strung together multiple systems to coordinate an entire conversation: ASR + VAD models to process user speech, LLMs to generate text responses, TTS providers to generate speech, and a custom video generation model to generate lipsynced video frames.

The key technical challenges were achieving respectable visual clarity (HD quality) and reasonable latency (~1 s end to end). We trained custom NeRF (neural radiance field) models to generate video frames consistent with the generated audio in real time. The vast majority of the latency is in the ASR + LLM + TTS flow, so we're looking forward to replacing it with end-to-end speech models!

Many lipsync models don't generate frames in real time (at least 25 FPS). Our model generates frames at 30-70 FPS, depending on the GPU. Currently, it requires 330 ms of audio to be generated ahead of time to produce a lipsynced frame.

Conversations usually involve periods of silence and sudden interruptions. We built on top of pipecat to handle this flow. Because a generated video frame depends on earlier frames, we have to throttle how much audio is processed by the NeRF model.

We wanted to make sure that users could talk to avatars ASAP while not breaking the bank. To this end, we use serverless GPU providers like Modal, and we spent time optimizing our cold start setup. The end-to-end cold start time for a conversation is about 10-20 seconds. It involves starting the containers, loading libraries, and loading model artifacts.

Shoutout to Modal's Memory Snapshot feature! And shoutout to the Daily team for pipecat and video chat SDKs!

We're actively experimenting with approaches to increase visual clarity, emotional expressiveness, rendering speed, and training robustness.

While the implicit 3D representation of NeRFs is powerful, we've found them to be quite sensitive to training data when representing dynamic, time-dependent scenes. We'll be exploring other 3D approaches like Gaussian Splatting, which also renders faster. We're considering 2D approaches, such as GANs or latent consistency models, as well.

Our model's handling of emotional expressions is still limited, as we focused primarily on generating accurate lipsync. To address this, we'd like to more explicitly model how facial expressions and head poses vary with audio.

--

Try VivaChat for free: https://app.vivalabs.ai/chat

Any inquiries (e.g. custom avatars, API access, custom conversation topics, etc.): founders@vivalabs.ai
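For readers curious how the stages described in the post fit together, here is a minimal asyncio sketch of the ASR + VAD → LLM → TTS → lipsync-renderer loop. It is not VivaChat's code or the pipecat API; every function here is a dummy stand-in that only simulates latency so the control flow is runnable.

```python
import asyncio
from typing import AsyncIterator

# Dummy stand-ins for the real components (ASR+VAD, LLM, TTS, video renderer).

async def transcribe(mic_audio: AsyncIterator[bytes]) -> AsyncIterator[str]:
    async for _chunk in mic_audio:
        await asyncio.sleep(0.2)          # pretend ASR + VAD latency
        yield "user utterance"

async def llm_respond(text: str) -> str:
    await asyncio.sleep(0.4)              # pretend LLM latency
    return f"reply to: {text}"

async def tts_stream(text: str) -> AsyncIterator[bytes]:
    for _ in range(5):
        await asyncio.sleep(0.05)
        yield b"\x00" * 3200              # 100 ms of 16 kHz 16-bit audio

async def render_frames(audio: bytes) -> AsyncIterator[bytes]:
    for _ in range(3):
        await asyncio.sleep(0.01)
        yield b"frame"                    # one lipsynced video frame

async def mic() -> AsyncIterator[bytes]:
    for _ in range(2):
        yield b"mic chunk"

async def main() -> None:
    async for utterance in transcribe(mic()):
        reply = await llm_respond(utterance)
        async for audio_chunk in tts_stream(reply):
            async for frame in render_frames(audio_chunk):
                pass                      # send frame into the video call here

asyncio.run(main())
```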
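The post mentions a 330 ms audio lookahead per lipsynced frame and the need to throttle how much audio the renderer consumes. The sketch below illustrates one way such a throttle could work; the sample rate, frame duration, and class are assumptions for illustration, not the production implementation.

```python
from collections import deque  # not strictly needed; a bytearray buffer is used below

SAMPLE_RATE = 16_000          # assumed sample rate (not stated in the post)
BYTES_PER_SAMPLE = 2
LOOKAHEAD_MS = 330            # audio that must exist beyond the current frame
FRAME_MS = 33                 # ~30 FPS worth of audio consumed per frame

def ms_to_bytes(ms: int) -> int:
    return SAMPLE_RATE * BYTES_PER_SAMPLE * ms // 1000

class LookaheadThrottle:
    """Release audio to the renderer one frame at a time, and only when at
    least LOOKAHEAD_MS of audio beyond that frame is already buffered."""

    def __init__(self) -> None:
        self.buffer = bytearray()

    def push(self, audio_chunk: bytes) -> None:
        self.buffer.extend(audio_chunk)

    def next_frame_audio(self) -> bytes | None:
        need = ms_to_bytes(FRAME_MS) + ms_to_bytes(LOOKAHEAD_MS)
        if len(self.buffer) < need:
            return None                   # not enough lookahead yet
        frame_audio = bytes(self.buffer[:ms_to_bytes(FRAME_MS)])
        del self.buffer[:ms_to_bytes(FRAME_MS)]
        return frame_audio

# Usage: push TTS audio as it streams in, pop per-frame slices for the renderer.
throttle = LookaheadThrottle()
throttle.push(b"\x00" * ms_to_bytes(500))
while (frame_audio := throttle.next_frame_audio()) is not None:
    pass  # feed `frame_audio` to the lipsync model for one frame
```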
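The cold-start optimization the post credits to Modal's Memory Snapshot feature generally follows the pattern below: heavy imports and weight loading happen before the snapshot, and GPU placement happens after restore. This is a generic sketch based on Modal's documented snapshot API; the app name, GPU type, and toy model are assumptions, not VivaChat's setup.

```python
import modal

app = modal.App("avatar-renderer-example")        # hypothetical app name

image = modal.Image.debian_slim().pip_install("torch")

@app.cls(gpu="A10G", image=image, enable_memory_snapshot=True)
class Renderer:
    @modal.enter(snap=True)
    def load_weights(self):
        # Runs once before the memory snapshot is taken: import heavy
        # libraries and load model artifacts into CPU memory.
        import torch
        self.model = torch.nn.Linear(8, 8)        # stand-in for the real model

    @modal.enter(snap=False)
    def move_to_gpu(self):
        # Runs on every restore from the snapshot: GPU state is not
        # snapshotted, so move the weights onto the GPU here.
        self.model = self.model.cuda()

    @modal.method()
    def render(self, audio_ms: int) -> int:
        # Stand-in for generating lipsynced frames from buffered audio.
        return audio_ms // 33
```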

no comments