I did some research into this about a year ago. Some fun facts I learned:

- The median delay between speakers in a human-to-human conversation is zero milliseconds. In other words, about half the time one speaker interrupts the other, making the delay negative.

- Humans don't care about delays when speaking to a known AI. They assume the AI will need time to think. Most users will rate a 1000ms delay as acceptable and a 500ms delay as exceptional.

- Every voice assistant up to that point (and probably still today) has a minimum delay of about 300ms, because they all use silence detection to decide when to start responding, and you need about 300ms of silence to reliably distinguish the end of an utterance from a speaker's normal pause (a rough sketch of that kind of endpointing is at the end of this comment).

- Alexa actually has a setting to increase this wait time for slower speakers.

You'll notice in this demo video that the AI never interrupts him, which is part of what makes it feel like a not-quite-human interaction (along with the stilted intonation of the voice).

Humans appear to process speech in a much more streaming way, constantly updating their parse of the sentence, drawing on context clues and prior knowledge, until they have a high enough confidence level to respond.

For a voice assistant to reach "human" levels, it will have to work more like this: process the incoming speech in real time and respond as soon as it's confident it has heard enough to understand the meaning (also sketched below).
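
To make the 300ms point concrete, here's a minimal sketch of silence-based endpointing. The frame size, energy threshold, and the 300ms figure are assumptions for illustration, not any particular product's values; the point is just that the assistant cannot decide you've finished until it has watched ~300ms of silence go by, so its response is always at least that late.

```python
FRAME_MS = 20             # audio processed in 20 ms frames (assumed)
SILENCE_THRESHOLD = 0.01  # RMS energy below this counts as silence (assumed)
ENDPOINT_MS = 300         # silence required before declaring "utterance over"

def frame_is_silent(frame_rms: float) -> bool:
    return frame_rms < SILENCE_THRESHOLD

def detect_endpoint(frame_rms_values):
    """Return the index of the frame at which the endpointer fires,
    or None if the utterance never ends in this stream."""
    silent_ms = 0
    for i, rms in enumerate(frame_rms_values):
        if frame_is_silent(rms):
            silent_ms += FRAME_MS
            if silent_ms >= ENDPOINT_MS:
                # Only now can the assistant start responding -- hence the
                # built-in ~300 ms minimum delay after the user stops talking.
                return i
        else:
            silent_ms = 0  # speech resumed; it was just a normal pause
    return None

# Example: 200 ms of speech followed by silence.
stream = [0.5] * 10 + [0.0] * 20
print((detect_endpoint(stream) + 1) * FRAME_MS)  # 500: fires 300 ms after speech ends at 200 ms
```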
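
And here's a minimal sketch of the streaming alternative: re-evaluate the partial transcript after every new word and respond as soon as the system is confident the meaning is complete, rather than waiting for silence. The completeness scorer here is a toy placeholder; a real system would use an incremental ASR hypothesis and a trained end-of-turn model, not a keyword heuristic.

```python
from typing import Iterable, Optional

CONFIDENCE_TO_RESPOND = 0.9  # assumed threshold

def completeness_confidence(partial_transcript: str) -> float:
    """Placeholder for a model scoring how likely it is that the speaker's
    meaning is already complete (0.0 - 1.0)."""
    # Toy heuristic for illustration only.
    text = partial_transcript.strip().lower()
    if text.endswith("?") or text.endswith("please"):
        return 0.95
    return 0.3

def respond_when_confident(words: Iterable[str]) -> Optional[str]:
    """Consume words as they arrive; respond mid-utterance once confident."""
    transcript = ""
    for word in words:
        transcript = (transcript + " " + word).strip()
        if completeness_confidence(transcript) >= CONFIDENCE_TO_RESPOND:
            return f"(responding to: {transcript!r})"
    return None  # never became confident; fall back to silence detection

print(respond_when_confident(["what's", "the", "weather", "today?"]))
```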