I wonder what is causing the bottleneck in the TLS case, particularly with kTLS, where the CPU is only at 2% but the throughput stayed the same. Perhaps there is a limited-size stream buffer somewhere on the crypto side that doesn't allow high throughput, whereas the raw networking buffers are more dynamic? I know the Caddy guys read HN, so maybe they can chime in with some actual knowledge :).