QUIC costs something like 2x to 4x as much CPU time per byte as TCP when serving large files or streams. This is because the anti-middlebox protections <i>also</i> mean that the modern network hardware and software offloads that greatly reduce CPU time cannot work with QUIC. Combined with the fact that QUIC runs in userspace, that's just deadly for performance. I'm talking about TSO, LRO (aka GRO), kTLS, and kTLS + hw encryption.<p>Let's compare a 100MB file served via TCP to the same file served via QUIC; rough sketches of both send loops follow the comparison.<p><pre><code> TCP:
- web server sends 2MB at a time, 50 times, via async sendfile (50 syscalls & kqueue notifications)
- kernel reads data from disk and encrypts it; the data is read once and written once by kTLS in the kernel
- TCP sends data to the NIC in large-ish chunks, 1.5k to 64k at a time; let's say an average of 16k, so the network stack runs 6,250 times to transmit
- the client acks every other packet, so that's 33,333 acks; let's say they are collapsed 2:1 by LRO, so the TCP stack runs 16,666 times to process acks
QUIC:
- web server mmaps or read()s the file, encrypts it in userspace, and sends it 1500 bytes at a time (1 extra memory copy & 66,666 system calls)
- UDP stack runs 66,666 times to send data
- UDP stack runs 33,333 times to receive QUIC acks (no idea what the aggregation is; let's say 2:1)
- kernel wakes up the web server to process QUIC acks 33,333 times
</code></pre>
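To make the TCP side concrete, here's a minimal sketch of that kTLS + sendfile loop. It's Linux-flavored rather than the FreeBSD async sendfile + kqueue path described above, the kTLS key setup is abbreviated, and names like enable_ktls_tx and serve_tcp are illustrative, not any particular server's API.<p><pre><code> /*
 * Sketch of the TCP path: hand the TLS session keys to the kernel
 * (kTLS), then let sendfile() move the file without it ever touching
 * userspace.  ~50 syscalls for a 100MB file.
 */
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <linux/tls.h>

#ifndef SOL_TLS
#define SOL_TLS 282                 /* not exported by older libc headers */
#endif
#ifndef TCP_ULP
#define TCP_ULP 31                  /* attach a ULP ("tls") to the socket */
#endif

#define CHUNK (2UL * 1024 * 1024)   /* 2MB per sendfile() call */

/* Keys come from the userspace TLS handshake; after this, the kernel
 * encrypts everything written to the socket, including sendfile data. */
static int enable_ktls_tx(int sock, struct tls12_crypto_info_aes_gcm_128 *ci)
{
    if (setsockopt(sock, IPPROTO_TCP, TCP_ULP, "tls", sizeof("tls")) < 0)
        return -1;
    return setsockopt(sock, SOL_TLS, TLS_TX, ci, sizeof(*ci));
}

static int serve_tcp(int sock, int fd, off_t filesize)
{
    off_t off = 0;
    while (off < filesize) {
        size_t want = (size_t)(filesize - off);
        if (want > CHUNK)
            want = CHUNK;
        /* Kernel reads the page cache, encrypts, and hands large
         * segments to the NIC (TSO): one syscall per 2MB. */
        if (sendfile(sock, fd, &off, want) < 0)
            return -1;
    }
    return 0;
}
</code></pre>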
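And the QUIC-side equivalent, roughly: copy the file into userspace, seal each ~1500-byte packet there, and push it out with one sendto() per datagram. encrypt_packet() stands in for whatever AEAD sealing the QUIC library does; it and the other names are assumptions for illustration, not a real API.<p><pre><code> /*
 * Sketch of the QUIC path: the payload crosses into userspace, gets
 * encrypted there, and every ~1500B packet is its own syscall and its
 * own trip through the UDP stack (~66,666 of them for 100MB).
 */
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

#define PKT 1500   /* roughly one QUIC packet per UDP datagram */

/* Placeholder for the QUIC library's AEAD seal; illustrative only. */
ssize_t encrypt_packet(const unsigned char *in, size_t len,
                       unsigned char *out, size_t outcap);

static int serve_quic_ish(int sock, int fd,
                          const struct sockaddr *dst, socklen_t dstlen)
{
    static unsigned char plain[2 * 1024 * 1024];   /* extra kernel->user copy */
    unsigned char cipher[PKT + 64];
    ssize_t n;

    while ((n = read(fd, plain, sizeof(plain))) > 0) {
        for (ssize_t off = 0; off < n; off += PKT) {
            size_t len = (size_t)(n - off) > PKT ? PKT : (size_t)(n - off);
            ssize_t clen = encrypt_packet(plain + off, len,
                                          cipher, sizeof(cipher));
            if (clen < 0)
                return -1;
            /* one syscall + one UDP stack traversal per packet */
            if (sendto(sock, cipher, (size_t)clen, 0, dst, dstlen) < 0)
                return -1;
        }
    }
    return n < 0 ? -1 : 0;
}
</code></pre>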
So for QUIC we have:<p><pre><code> - 4x as many network stack traversals due to the lack of TSO/LRO.
- 1,000x as many system calls, due to doing all the packet handling in userspace
- at least one more data copy (kernel -> user) due to data handling in userspace.
</code></pre>
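For what it's worth, those multipliers fall straight out of the packet counts in the example above (100MB file, 2MB sendfile chunks, ~16k TSO chunks, 1500-byte packets, 2:1 ack aggregation); here's the back-of-the-envelope arithmetic as a sketch:<p><pre><code> /* Back-of-the-envelope check of the ratios above. */
#include <stdio.h>

int main(void)
{
    double file = 100e6;                      /* 100MB */

    double tcp_tx   = file / 16e3;            /* ~6,250 stack runs (TSO) */
    double tcp_ack  = (file / 1500) / 2 / 2;  /* ~16,666 ack runs (LRO 2:1) */
    double quic_tx  = file / 1500;            /* ~66,666 stack runs */
    double quic_ack = quic_tx / 2;            /* ~33,333 ack runs */

    double tcp_calls  = file / 2e6;           /* ~50 sendfile() calls */
    double quic_calls = quic_tx;              /* ~66,666 sendto() calls */

    /* prints ~4x and ~1333x, i.e. the "4x" and "1,000x" above */
    printf("stack traversals: %.0fx\n",
           (quic_tx + quic_ack) / (tcp_tx + tcp_ack));
    printf("system calls:     %.0fx\n", quic_calls / tcp_calls);
    return 0;
}
</code></pre>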
Some of these can be solved, either by moving QUIC into the kernel or by using a DPDK-like userspace networking solution. However, the lack of TSO/LRO is, even by itself, a killer for performance.<p>Disclaimer: I work on CDN performance. We've served 90Gb/s with a 12-core Xeon-D. To serve the same amount of traffic with QUIC, you'd probably need multiple Xeon Gold CPUs. I guess that Google can afford this.