The "magic" of the fake Gemini demo was the way it seemed like the LLM was continually receiving audio + video input and knew when to jump in with a response.<p>It appeared to be able to wait until the user had finished the drawing, or even jumping in slightly before the drawing finished. At one point the LLM was halfway through a response and then saw the user was now colouring the duck in blue, and started talking about how the duck appearing to be blue. The LLM also appeared to know when a response wasn't needed because the user was just agreeing with the LLM.<p>I'm not sure how many people noticed that on a conscious level, but I positive everyone noticed it subconsciously, and felt the interaction was much more natural, and much more advanced than current LLMs.<p>-----------------<p>Checking the source code, the demo takes screenshots of the video feed every 800ms, waits until the user finishes taking and then sends the last three screenshots.<p>While this demo is impressive, it kind of proves just how unnatural it feels to interact with an LLM in this manner when it doesn't have continuous audio-video input. It's been technically possible to do kind of thing for a while, but there is a good reason why nobody tried to present it as a product.