As others have already hit upon, the problem forever lies in standardization of whatever is intended to replace TCP in the data center, or the lack thereof. You’re basically looking for a protocol supported in hardware from endpoint to endpoint, including in firewalls, switches, routers, load balancers, traffic shapers, proxies, et cetera - a <i>very</i> tall order indeed. Then, to add to that very expensive list of criteria, you also need the talent to support it - engineers who know it just as thoroughly as the traditional TCP/IP stack and Ethernet frames, but now with the added specialty of data center tuning. <i>Then</i> you also need the software to support and understand it, which is up to each vendor and out of your control - unless you wrap/encapsulate it in TCP/IP anyway, in which case there go all the nifty features you wanted from such a standard.<p>By the time all of the proverbial planets align, all but the most niche or cutting-edge customer is looking at a project whose total cost could fund 400G endpoint bandwidth with the associated backhaul and infrastructure to support it - twice over. It’s the problem of diminishing returns against the problem of entrenchment: nobody is saying modern TCP is great for the kinds of datacenter workloads we’re building today, but the cost of solving those problems is prohibitively expensive for all but the most entrenched public cloud providers out there, and they’re not likely to “share their work,” as it were. Even if they do (e.g., Google with QUIC), the broad vibe I get is that folks aren’t likely to trust that those offerings are free of ulterior motives.
I wonder why Fibre Channel isn't used as a replacement for TCP in the datacenter. It is a very robust protocol. It was designed to connect block storage devices to servers while making the OS think they are directly connected. OSs do NOT tolerate dropped data when reading and writing to block devices, and so Fibre Channel has an extremely robust credit-based flow control mechanism (buffer-to-buffer credits). It prevents congestion by allowing receivers to control how much data senders can send. I have worked with a lot of VMware clusters that use FC to connect servers to storage arrays and it has ALWAYS worked perfectly.
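(For anyone who hasn't run FC: the receiver-driven part is essentially just a credit counter. Below is a toy C sketch of the idea - the constants and function names are made up for illustration, not taken from the FC spec.)

    #include <stdio.h>

    #define RX_BUFFERS 8               /* credits advertised by the receiver at login */

    static int credits = RX_BUFFERS;   /* sender's view of the credits it still holds */

    /* Sender may put a frame on the wire only while it holds a credit. */
    static int try_send_frame(int frame_id) {
        if (credits == 0)
            return 0;                  /* back-pressure: wait, nothing gets dropped */
        credits--;
        printf("sent frame %d, credits left %d\n", frame_id, credits);
        return 1;
    }

    /* Receiver returns a credit (R_RDY in FC terms) once it drains a buffer. */
    static void credit_returned(void) {
        credits++;
    }

    int main(void) {
        for (int i = 0; i < 12; i++) {
            while (!try_send_frame(i))
                credit_returned();     /* stand-in for the receiver freeing a buffer */
        }
        return 0;
    }

Because the sender simply cannot transmit without a credit in hand, the link stays lossless, which is exactly the property block storage needs.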
This has already been done at scale with HTTP/3 (QUIC), it's just not widely distributed beyond the largest sites & most popular web browsers. gRPC for example is still on multiplexed TCP via HTTP/2, which is "good enough" for many.<p>Though it doesn't really replace TCP, it's just that the predominant requirements have changed (as Ousterhout points out). Bruce Davie has a series of articles on this: <a href="https://systemsapproach.substack.com/p/quic-is-not-a-tcp-replacement" rel="nofollow">https://systemsapproach.substack.com/p/quic-is-not-a-tcp-rep...</a><p>Also see Ivan Pepelnjak's commentary (he disagrees with Ousterhout): <a href="https://blog.ipspace.net/2023/01/data-center-tcp-replacement/" rel="nofollow">https://blog.ipspace.net/2023/01/data-center-tcp-replacement...</a>
> If Homa becomes widely deployed, I hypothesize that core congestion will cease to exist as a significant networking problem, as long as the core is not systemically overloaded.<p>Yep, sure; but what happens when it becomes overloaded?<p>> Homa manages congestion from the receiver, not the sender. [...] but the remaining scheduled packets may only be sent in response to grants from the receiver<p>I hypothesize it will not be a great day when you do become "systemically" overloaded.
Previous discussions:<p>Homa, a transport protocol to replace TCP for low-latency RPC in data centers <a href="https://news.ycombinator.com/item?id=28204808">https://news.ycombinator.com/item?id=28204808</a><p>Linux implementation of Homa <a href="https://news.ycombinator.com/item?id=28440542">https://news.ycombinator.com/item?id=28440542</a>
> Although Homa is not API-compatible with TCP,<p>IPv6 anyone? People must start to understand that "Because this is the way it is" is a valid, actually extremely valid, answer to any question like "Why don't we just switch technology A with technology B?"<p>Despite all the shortcomings of the old technology, and the advantages of the new one, inertia _is_ a factor, and you must accept that most users will simply refuse to even acknowledge the problem you want to describe.<p>For your solution to get any traction, it must deliver value right now, in the current ecosystem. Otherwise, it's doomed to fail by being ignored over and over.
The problem with trying to replace TCP only inside the DC is that TCP will still be used outside the DC.<p>Network engineering is already convoluted and troublesome as it is right now, using only the TCP stack.<p>When you start using Homa inside but TCP outside, things will break, because a lot of DC requests are created in response to an inbound request from outside the DC (like a client trying to send an RPC request).<p>I cannot imagine trying to troubleshoot hybrid problems at the intersection of TCP and Homa; it's gonna be a nightmare.<p>Plus I don't understand why create a new L4 transport protocol for a specific L7 application (RPC). This seems like a suboptimal choice, because the RPC of today could be replaced with something completely different, like RDMA over Ethernet for AI workloads or transfer of large streams like training data/AI model state.<p>I think tuning the TCP stack in the kernel, adding more configuration knobs for TCP, and switching from stream (TCP) to datagram (UDP) protocols where it is warranted will give more incremental benefits.<p>One major thing the author missed is security; these are considered table stakes:
1. encryption in transit: handshake/negotiation
2. ability to intercept and do traffic inspection for enterprise security purposes
3. resistance to attacks like flooding
4. security of sockets in containerized Linux environments
Maybe I am misunderstanding something about the issue, but isn't DCTCP a standard?<p>See the RFC here: <a href="https://www.rfc-editor.org/rfc/rfc8257" rel="nofollow">https://www.rfc-editor.org/rfc/rfc8257</a><p>DCTCP doesn't seem like a silver bullet for the issue, but it does seem to address some of the pitfalls of TCP in an HPC or data center environment. I've even spun up some VMs and used some old hardware to play with it to see how smooth it is and what hurdles might exist, but that was so long ago and stuff has undoubtedly changed.
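For anyone who wants to poke at it: DCTCP has shipped in mainline Linux for years and can be selected per socket. A minimal sketch (assumes the tcp_dctcp module is available and that the switches on the path actually do ECN marking, otherwise there's little point):

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>

    int main(void) {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        const char *cc = "dctcp";

        /* Per-socket congestion control override; the system-wide equivalent is
           sysctl net.ipv4.tcp_congestion_control=dctcp. Unprivileged users may
           need the algorithm listed in net.ipv4.tcp_allowed_congestion_control. */
        if (setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, cc, strlen(cc)) < 0)
            perror("TCP_CONGESTION");

        char buf[16];
        socklen_t len = sizeof(buf);
        if (getsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, buf, &len) == 0)
            printf("congestion control in use: %s\n", buf);

        close(fd);
        return 0;
    }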
On a related topic, has anyone had luck deploying TCP fastopen in a data center? Did it make any difference?<p>In theory for shortlived TCP connections, fastopen ought to be a win. It's very easy to implement in Linux (just a couple of lines of code in each client & server, and a sysctl knob). And the main concern about fastopen is middleboxes, but in a data center you can control what middleboxes are used.<p>In practice I found in my testing that it caused strange issues, especially where one side was using older Linux kernels. The issues included not being able to make a TCP connection, and hangs. And when I got it working and benchmarked it, I didn't notice any performance difference at all.
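For reference, the code changes really are tiny. A rough sketch of what "a couple of lines on each side" looks like (Linux-specific, error handling trimmed, assumes sysctl net.ipv4.tcp_fastopen=3 on both ends; the queue length and addresses are placeholders):

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <arpa/inet.h>

    /* Server: allow a queue of pending Fast Open requests on the listener. */
    void enable_tfo_server(int listen_fd) {
        int qlen = 128;
        setsockopt(listen_fd, IPPROTO_TCP, TCP_FASTOPEN, &qlen, sizeof(qlen));
    }

    /* Client: hand the first chunk of data to sendto() with MSG_FASTOPEN instead
       of calling connect(); it rides in the SYN when a TFO cookie is cached and
       falls back to a normal three-way handshake when it isn't. */
    ssize_t tfo_request(int fd, const char *ip, int port, const void *buf, size_t len) {
        struct sockaddr_in addr = { .sin_family = AF_INET, .sin_port = htons(port) };
        inet_pton(AF_INET, ip, &addr.sin_addr);
        return sendto(fd, buf, len, MSG_FASTOPEN,
                      (struct sockaddr *)&addr, sizeof(addr));
    }

Newer kernels also have a TCP_FASTOPEN_CONNECT socket option that keeps the usual connect()/write() flow on the client, which may be worth trying if the sendto() path misbehaves.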
The original paper was discussed previously at <a href="https://news.ycombinator.com/item?id=33401480">https://news.ycombinator.com/item?id=33401480</a>
I wonder which problem is bigger - modifying all the things to work with IPv6 only, or modifying all the things to work with (something-yet-to-be-standardized-that-isn’t-TCP)?
The cost of standards is very high, probably second to the cost of no standards!<p>Joking aside, I've seen this first hand when using things like Ethernet/TCP to transfer huge amounts of data in hardware. The final encapsulation of the data is simple, but there are so many layers on top, and it adds huge overhead. Then standards have modes, and even if you use a subset the hardware must usually support all of them to be compliant, adding much more cost in hardware. A clean-room design could save a lot of hardware power and area, but the loss of compatibility and interop would cost later in software... hard problem to solve for sure.
It's already happening. For the more demanding workloads such as AI training, RDMA has been the norm for a while, either over Infiniband or Ethernet, with Ethernet gaining ground more recently. RoCE is pretty flawed though for reasons Ousterhout mentions, plus others, so a lot of work has been happening on new protocols to be implemented in hardware in next-gen high performance NICs.<p>The Ultra Ethernet Transport specs aren't public yet so I can only quote the public whitepaper [0]:<p>"The UEC transport protocol advances beyond the status quo by providing the following:<p>● An open protocol specification designed from the start to run over IP and Ethernet<p>● Multipath, packet-spraying delivery that fully utilizes the AI network without causing congestion or head-of-line blocking, eliminating the need for centralized load-balancing algorithms and route controllers<p>● Incast management mechanisms that control fan-in on the final link to the destination host with minimal drop<p>● Efficient rate control algorithms that allow the transport to quickly ramp to wire-rate while not causing performance loss for competing flows<p>● APIs for out-of-order packet delivery with optional in-order completion of messages, maximizing concurrency in the network and application, and minimizing message latency<p>● Scale for networks of the future, with support for 1,000,000 endpoints<p>● Performance and optimal network utilization without requiring congestion algorithm parameter tuning specific to the network and workloads<p>● Designed to achieve wire-rate performance on commodity hardware at 800G, 1.6T and faster Ethernet networks of the future"<p>You can think of it as the love-child of NDP [2] (including support for packet trimming in Ethernet switches [1]) and something similar to Swift [3] (also see [1]).<p>I don't know if UET itself will be what wins, but my point is the industry is taking the problems seriously and innovating pretty rapidly right now.<p>Disclaimer: in a previous life I was the editor of the UEC Congestion Control spec.<p>[0] <a href="https://ultraethernet.org/wp-content/uploads/sites/20/2023/10/23.07.12-UEC-1.0-Overview-FINAL-WITH-LOGO.pdf" rel="nofollow">https://ultraethernet.org/wp-content/uploads/sites/20/2023/1...</a><p>[1] <a href="https://ultraethernet.org/ultra-ethernet-specification-update/" rel="nofollow">https://ultraethernet.org/ultra-ethernet-specification-updat...</a><p>[2] <a href="https://ccronline.sigcomm.org/wp-content/uploads/2019/10/acmdl19-343.pdf" rel="nofollow">https://ccronline.sigcomm.org/wp-content/uploads/2019/10/acm...</a><p>[3] <a href="https://research.google/pubs/swift-delay-is-simple-and-effective-for-congestion-control-in-the-datacenter/" rel="nofollow">https://research.google/pubs/swift-delay-is-simple-and-effec...</a>
You want to replace TCP because it is bad? Then give us a better "connected" protocol over raw IP and other raw network topologies. Use it. Done.<p>Don't mess with another IP -> UDP -> something
There was a ton of effort and money especially during the dot-com boom to rethink datacenter communications. A fair number of things did happen under the covers--offload engines and the like. And InfiniBand persevered in HPC, albeit as a pretty pale shadow of what its main proponents hoped for--including Intel and seemingly half the startups in Austin.
TLDR: No, it's not.<p>Homa is great, but not good enough to justify a wholesale ripout of TCP in the “datacentre”.<p>Sure, a lot of traffic is message oriented, but TCP is just a medium to transport those messages. Moreover it's trivial to do external requests with TCP because it's supported. There is no need to have Homa terminators at the edge of each datacentre to make sure that external RPCs can be done.<p>The author assumes that the main bottleneck to performance in a datacentre is TCP. That's just not the case; in my datacentre, the main bottleneck is that 100 gigs point to point isn't enough.
Network protocols are slow to change. Just look at IPv6 adoption. Some of this is for good reason. Some isn't. Because of everything from threat reduction to lack of imagination, equipment at every step of the process will tend to throw away anything that looks <i>weird</i>, a process somebody dubbed <i>ossification</i>. You'll be surprised how long-lasting some of these things are.<p>Story time: I worked on Google Fiber years ago. One of the things I worked on was services to support the TV product. Now if you know anything about video delivery over IP you know you have lots of choices. There are also layers, like the protocols, the container format and the transport protocol. The TV product, for whatever reason, used a transport protocol called MPEG2-TS (Transport Streams).<p>What is that? It's a CBR (constant bit rate) protocol that stuffs 7 188-byte payloads into a single UDP packet that was (IPv4) multicast. Why 7? Well, because 7 payloads come to 1316 bytes, which plus UDP/IP headers stays under 1500 bytes, and you start to run into problems on any IP network once your packets are larger than that (ie an MTU of 1500 or 1536 is pretty standard); an 8th payload would push you over. This is a big issue with high bandwidth NICs, such that you have things like jumbo frames to increase throughput and decrease CPU overhead, but support is sketchy on a heterogeneous network.<p>Why 188-byte payloads? For <i>compatibility with Asynchronous Transfer Mode ("ATM")</i>, a long-dead fixed-packet-size protocol (53-byte cells including 48 bytes of payload IIRC; the jump from 48 to 188 apparently comes from AAL1 taking one byte per cell, leaving 47, and 4x47=188) designed for fiber networks. I kind of thought of it as Fibre Channel 2.0. I'm not sure that's correct however.<p>But my point is that this was an entirely owned and operated Google network and it still had 20-30+ year old decisions impacting its architecture.<p>Back to Homa, five thoughts:<p>1. Focusing on at-least-once delivery instead of at-most-once delivery seems like a good goal. It allows you to send the same packet twice. Plus you're worried about data offset, not ACKing each specific packet;<p>2. Priority never seems to work out. Like, this has been tried. TCP has an urgent pointer; IP has ToS/DSCP bits. You have QoS on even consumer routers. If you're saying it's fine to discard a packet, then what happens to that data if the receiver is still expecting it? It's well-intentioned but I suspect it just won't work in practice, like it never has previously;<p>3. Lack of connections also means lack of a standard model for encryption (ie SSL/TLS). Yes, encryption still matters inside a data center on purely internal connections;<p>4. QUIC (HTTP/3) has become the de facto standard for this sort of thing, although it's largely implementing your own connections in userspace over UDP; and<p>5. A ton of hardware has been built to optimize TCP and offload as much as possible from the CPU (eg checksumming packets). You see this effect with QUIC: it has significantly higher CPU overhead per payload byte than TCP does. Now maybe it'll catch up over time. It may also change as QUIC gets migrated into the Linux kernel (which is an ongoing project) and other OSs.
Homa: A Receiver-Driven Low-Latency Transport Protocol Using Network Priorities<p><a href="https://people.csail.mit.edu/alizadeh/papers/homa-sigcomm18.pdf" rel="nofollow">https://people.csail.mit.edu/alizadeh/papers/homa-sigcomm18....</a>
In your data center there is usually a collection of switches from different vendors purchased through the years. Vendor A tries to outdo vendor B with some magic sauce that promises higher bandwidth. With open standards. To avoid vendor lock-in. The risk-averse manager knows the equipment might need to be re-used or re-sold elsewhere. Want to try something new? Plus: who is ready to debug and maintain the new fancy standard?
> For many years, RDMA NICs could cache the state for only a few hundred connections; if the number of active connections exceeded the cache size, information had to be shuffled between host memory and the NIC, with a considerable loss in performance.<p>A massively parallel task? Sounds like something doable with GPGPU.
I feel like there ought to be a corollary to Betteridge's law which gets invoked whenever a blog, vlog, paper, or news headline begins with "It's Time to..."<p>But the new law doesn't simply negate the assertion. It comes back with: "Or else, what?"<p>If this somehow catches on, I recommend the moniker "Valor's Law".
How long have we needed to support IPv6 now? Is it even supported and more widely in use than IPv4 yet - e.g. in mobile networks, where everything is stashed behind NAT and IPv4 is kept around?<p>Another protocol, something completely new? Good luck with that; I would rather bet on global warming to put us out of our misery (/s)...<p><a href="https://imgs.xkcd.com/comics/standards.png" rel="nofollow">https://imgs.xkcd.com/comics/standards.png</a>
Unrelated to this article, are there any reasons to use raw TCP/IP over WebSockets? The latter is such a clean, message-based interface that I don't see a reason to use raw TCP/IP.