I built a realtime visual intelligence app that connects a user's phone camera to a multimodal LLM. I use the Pipecat open-source framework, WebRTC, and a few other services to connect it all together.

It's similar to ChatGPT Advanced Voice, and it's grounded with google_search for async internet searches based on transcripts or on video frames, which are sent to the LLM at 1 fps.

Let me know what you think, and whether you'd like to work on some fun scaling problems with me on this project.

www.withsen.com
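For anyone curious about the 1 fps frame rate: the camera typically delivers far more frames than that, so something has to throttle what reaches the model. Here's a minimal sketch of that idea in plain Python (names and structure are hypothetical, not the actual implementation, which runs inside Pipecat's pipeline):

```python
import time

class FrameThrottle:
    """Forward at most one video frame per interval to the LLM (illustrative sketch)."""

    def __init__(self, interval_s=1.0, now=time.monotonic):
        self.interval_s = interval_s
        self.now = now          # injectable clock, handy for testing
        self.last_sent = None   # timestamp of the last forwarded frame

    def should_send(self):
        """Return True if enough time has passed to forward another frame."""
        t = self.now()
        if self.last_sent is None or t - self.last_sent >= self.interval_s:
            self.last_sent = t
            return True
        return False

# Usage: the camera might produce ~30 fps, but only ~1 fps passes the gate:
#   throttle = FrameThrottle(interval_s=1.0)
#   for frame in camera_frames:          # hypothetical frame source
#       if throttle.should_send():
#           send_to_llm(frame)           # hypothetical sink
```

Dropping frames rather than queueing them keeps the LLM's view close to realtime, which matters more here than seeing every frame.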