I'm impressed by the latency, given that this is built on request-response. It looks like this runs speech detection locally with the Silero voice activity detection model on ONNX Runtime Web, collects audio, then performs a POST. It doesn't look like the POST is submitted until I'm done speaking, though. The response depends on chaining together several AI APIs that are themselves very, very fast to provide a seamless experience.<p>This is very good. But it is, unfortunately, still bound by the dominant paradigm of web APIs. The speech to text model doesn't get its first byte until I'm done talking, the LLM doesn't get its first byte until the speech to text model is done transcribing, and the text to speech model doesn't get its first byte until the LLM call is complete.<p>When all of these things are very fast, it can be very seamless, but each of them contributes to a floor of latency that makes it hard to get to lifelike conversation. Most of these models should be capable of streaming prefill - if not decode (for the transformer-like models) - but inference servers are targeting the lowest common denominator on the web: a synchronous POST.<p>When only 3 very fast models are involved, that's great. But this only compounds when trying to combine these with agentic systems and tool calling.<p>The sooner we adopt end-to-end, bidirectional streaming for AI, the sooner we'll reach more lifelike, friendly, low latency experiences. After all, inter-speaker gaps in person to person conversations are often in the sub-100ms range and, between friends, can even be negative! We won't have real "agents" until models can interrupt one another and talk over each other. Otherwise these latencies compound to a pretty miserable experience.<p>Relatedly, Guillermo - I've contributed PRs to the AI SDK to reduce the latency of tool calling APIs, and PRs for WebSocket support in Next.js. Let's break free of request-response and remove the floor on latency.
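<p>To make the latency floor concrete, here is a rough back-of-the-envelope sketch. Every number in it is a made-up placeholder, not a measurement of this demo or of any real API, and the `Stage` type and both functions are hypothetical names used only for illustration. It assumes an idealized streaming pipeline in which each stage can begin work as soon as the first chunk arrives from upstream.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    first_chunk_ms: int  # time from first input to first output chunk (assumed)
    total_ms: int        # time from first input to final output (assumed)


def request_response_latency(stages):
    """Synchronous POSTs: each stage waits for the previous one to finish."""
    return sum(s.total_ms for s in stages)


def streaming_latency(stages):
    """Idealized streaming: each stage starts on the first upstream chunk."""
    return sum(s.first_chunk_ms for s in stages)


# Placeholder numbers chosen for illustration only.
PIPELINE = [
    Stage("speech-to-text", first_chunk_ms=50, total_ms=200),
    Stage("llm", first_chunk_ms=100, total_ms=600),
    Stage("text-to-speech", first_chunk_ms=80, total_ms=400),
]

print(request_response_latency(PIPELINE))  # 1200
print(streaming_latency(PIPELINE))         # 230
```

Under these assumptions, every extra stage (a tool call, a second model) adds its full duration to the request-response total, but only its time-to-first-chunk to the streaming total.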