I built a realtime visual intelligence that connects a user's phone camera to a multimodal LLM. I use the Pipecat open-source framework, WebRTC, and a few other services to tie it all together.

It's similar to ChatGPT Advanced Voice, and it's grounded with google_search for asynchronous internet searches based on the conversation transcript or on video frames, which are sent to the LLM at 1 fps.

Let me know what you think, and whether you'd like to work on some fun scaling problems with me on this project.

www.withsen.com
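For the "frames at 1 fps" part, here's a minimal sketch of the throttling idea in Python. This is not the actual Pipecat pipeline or my production code: the camera source and the LLM call are simulated placeholders, and the function names are made up for illustration. It just shows how a ~30 fps camera stream gets thinned to roughly one frame per second before anything reaches the model.

```python
import asyncio
import time

async def fake_camera(queue: asyncio.Queue, fps: int = 30) -> None:
    """Simulate a phone camera pushing ~30 decoded frames per second."""
    frame_id = 0
    while True:
        await queue.put(f"frame-{frame_id}")
        frame_id += 1
        await asyncio.sleep(1 / fps)

async def send_to_llm(frame: str) -> None:
    """Placeholder for handing a frame (plus transcript) to the multimodal LLM."""
    print(f"{time.strftime('%H:%M:%S')} -> LLM got {frame}")

async def throttle_to_llm(queue: asyncio.Queue, target_fps: float = 1.0) -> None:
    """Drop frames so the LLM only sees ~target_fps frames per second."""
    interval = 1.0 / target_fps
    next_send = time.monotonic()
    while True:
        frame = await queue.get()      # always drain the camera stream
        now = time.monotonic()
        if now >= next_send:           # but only forward one frame per interval
            await send_to_llm(frame)
            next_send = now + interval

async def main() -> None:
    queue: asyncio.Queue = asyncio.Queue(maxsize=60)
    await asyncio.gather(fake_camera(queue), throttle_to_llm(queue))

if __name__ == "__main__":
    asyncio.run(main())
```

In the real app this happens inside the WebRTC/Pipecat pipeline rather than a standalone loop, but the drop-frames-between-ticks logic is the same.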
One interesting note with voice AI: you can shove static datasets into the long context windows of newer models like 2.0-flash-lite. This gives you a kind of Model Assisted Generation (MAG), with super low latency and ~99% relevant information back to the bot. There's a good example in the foundational examples of the Pipecat GitHub repo.
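A rough sketch of that pattern, assuming the google-generativeai Python client (adjust for whichever Gemini SDK version you're on); the JSON file, its contents, and the example question are hypothetical stand-ins for whatever static dataset you want the bot to answer from:

```python
import json
import google.generativeai as genai

# Stuff the whole static dataset into the model's context instead of doing
# retrieval at runtime. API key handling and file name are placeholders.
genai.configure(api_key="YOUR_API_KEY")

with open("menu_dataset.json") as f:
    dataset = json.load(f)

system_prompt = (
    "You are a voice assistant. Answer only from the reference data below, "
    "and keep answers short enough to speak aloud.\n\n"
    "REFERENCE DATA:\n" + json.dumps(dataset, indent=2)
)

model = genai.GenerativeModel(
    model_name="gemini-2.0-flash-lite",
    system_instruction=system_prompt,   # the whole dataset rides along in context
)

response = model.generate_content("Do you have any gluten-free options?")
print(response.text)
```

Because the dataset is already in context, there's no extra retrieval hop on each turn, which is where the low latency comes from.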