As others have already hit upon, the problem forever lies in standardization of whatever is intended to replace TCP in the data center, or the lack thereof. You’re basically looking for a protocol supported in hardware from endpoint to endpoint, including in firewalls, switches, routers, load balancers, traffic shapers, proxies, et cetera - a <i>very</i> tall order indeed. Then, to add to that very expensive list of criteria, you also need the talent to support it - engineers who know it just as thoroughly as the traditional TCP/IP stack and Ethernet frames, but now with the added specialty of data center tuning. <i>Then</i> you also need the software to support and understand it, which is up to each vendor and out of your control - unless you wrap/encapsulate it in TCP/IP anyway, in which case there go all the nifty features you wanted from such a standard.<p>By the time all of the proverbial planets align, all but the most niche or cutting-edge customer is looking at a project whose total cost could fund 400G endpoint bandwidth with the associated backhaul and infrastructure to support it - twice over. It’s the problem of diminishing returns against the problem of entrenchment: nobody is saying modern TCP is great for the kinds of datacenter workloads we’re building today, but the cost of solving those problems is prohibitively expensive for all but the most entrenched public cloud providers out there, and they’re not likely to “share their work,” as it were. Even if they do (e.g., Google with QUIC), the broad vibe I get is that folks aren’t likely to trust that those offerings are free of ulterior motives.
I wonder why Fibre Channel isn't used as a replacement for TCP in the datacenter. It is a very robust protocol. It was designed to connect block storage devices to servers while making the OS think they are directly connected. OSs do NOT tolerate dropped data when reading and writing to block devices, and so Fibre Channel has an extremely robust credit-based flow control mechanism (buffer-to-buffer credits). It prevents congestion by allowing receivers to control how much data senders can send. I have worked with a lot of VMware clusters that use FC to connect servers to storage arrays and it has ALWAYS worked perfectly.
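(For anyone who hasn't run FC: the receiver-driven part is essentially just a credit counter. Below is a toy C sketch of the idea - the constants and function names are made up for illustration, not taken from the FC spec.)

    #include <stdio.h>

    #define RX_BUFFERS 8               /* credits advertised by the receiver at login */

    static int credits = RX_BUFFERS;   /* sender's view of the credits it still holds */

    /* Sender may put a frame on the wire only while it holds a credit. */
    static int try_send_frame(int frame_id) {
        if (credits == 0)
            return 0;                  /* back-pressure: wait, nothing gets dropped */
        credits--;
        printf("sent frame %d, credits left %d\n", frame_id, credits);
        return 1;
    }

    /* Receiver returns a credit (R_RDY in FC terms) once it drains a buffer. */
    static void credit_returned(void) {
        credits++;
    }

    int main(void) {
        for (int i = 0; i < 12; i++) {
            while (!try_send_frame(i))
                credit_returned();     /* stand-in for the receiver freeing a buffer */
        }
        return 0;
    }

Because the sender simply cannot transmit without a credit in hand, the link stays lossless, which is exactly the property block storage needs.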
This has already been done at scale with HTTP/3 (QUIC), it's just not widely distributed beyond the largest sites & most popular web browsers. gRPC for example is still on multiplexed TCP via HTTP/2, which is "good enough" for many.<p>Though it doesn't really replace TCP, it's just that the predominant requirements have changed (as Ousterhout points out). Bruce Davie has a series of articles on this: <a href="https://systemsapproach.substack.com/p/quic-is-not-a-tcp-replacement" rel="nofollow">https://systemsapproach.substack.com/p/quic-is-not-a-tcp-rep...</a><p>Also see Ivan Pepelnjak's commentary (he disagrees with Ousterhout): <a href="https://blog.ipspace.net/2023/01/data-center-tcp-replacement/" rel="nofollow">https://blog.ipspace.net/2023/01/data-center-tcp-replacement...</a>
> If Homa becomes widely deployed, I hypothesize that core congestion will cease to exist as a significant networking problem, as long as the core is not systemically overloaded.<p>Yep, sure; but what happens when it becomes overloaded?<p>> Homa manages congestion from the receiver, not the sender. [...] but the remaining scheduled packets may only be sent in response to grants from the receiver<p>I hypothesize it will not be a great day when you do become "systemically" overloaded.
Previous discussions:<p>Homa, a transport protocol to replace TCP for low-latency RPC in data centers <a href="https://news.ycombinator.com/item?id=28204808">https://news.ycombinator.com/item?id=28204808</a><p>Linux implementation of Homa <a href="https://news.ycombinator.com/item?id=28440542">https://news.ycombinator.com/item?id=28440542</a>
> Although Homa is not API-compatible with TCP,<p>IPv6 anyone? People must start to understand that "Because this is the way it is" is a valid, actually extremely valid, answer to any question like "Why don't we just switch technology A with technology B?"<p>Despite all the shortcomings of the old technology, and the advantages of the new one, inertia _is_ a factor, and you must accept that most users will simply refuse to even acknowledge the problem you want to describe.<p>For your solution to get any traction, it must deliver value right now, in the current ecosystem. Otherwise, it's doomed to fail by being ignored over and over.
The problem with trying to replace TCP only inside the DC is that TCP will still be used outside the DC.<p>Network engineering is already convoluted and troublesome as it is right now, using only the TCP stack.<p>When you start using Homa inside but TCP outside, things will break, because a lot of DC requests are created in response to an inbound request from outside the DC (like a client trying to send an RPC request).<p>I cannot imagine trying to troubleshoot hybrid problems at the intersection of TCP and Homa; it's gonna be a nightmare.<p>Plus I don't understand why create a new L4 transport protocol for a specific L7 application (RPC). This seems like a suboptimal choice, because the RPC of today could be replaced with something completely different, like RDMA over Ethernet for AI workloads or transfer of large streams like training data/AI model state.<p>I think tuning the TCP stack in the kernel, adding more configuration knobs for TCP, and switching from stream (TCP) to datagram (UDP) protocols where it is warranted will give more incremental benefits.<p>One major thing the author missed is security; these are considered table stakes:
1. encryption in transit: handshake/negotiation
2. ability to intercept and do traffic inspection for enterprise security purposes
3. resistance to attacks like flooding
4. security of sockets in containerized Linux environments
Maybe I am misunderstanding something about the issue, but isn't DCTCP a standard?<p>See the RFC here: <a href="https://www.rfc-editor.org/rfc/rfc8257" rel="nofollow">https://www.rfc-editor.org/rfc/rfc8257</a><p>DCTCP doesn't seem like a silver bullet for the issue, but it does seem to address some of the pitfalls of TCP in an HPC or data center environment. I've even spun up some VMs and used some old hardware to play with it to see how smooth it is and what hurdles might exist, but that was so long ago and stuff has undoubtedly changed.
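For anyone who wants to poke at it: DCTCP has shipped in mainline Linux for years and can be selected per socket. A minimal sketch (assumes the tcp_dctcp module is available and that the switches on the path actually do ECN marking, otherwise there's little point):

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>

    int main(void) {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        const char *cc = "dctcp";

        /* Per-socket congestion control override; the system-wide equivalent is
           sysctl net.ipv4.tcp_congestion_control=dctcp. Unprivileged users may
           need the algorithm listed in net.ipv4.tcp_allowed_congestion_control. */
        if (setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, cc, strlen(cc)) < 0)
            perror("TCP_CONGESTION");

        char buf[16];
        socklen_t len = sizeof(buf);
        if (getsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, buf, &len) == 0)
            printf("congestion control in use: %s\n", buf);

        close(fd);
        return 0;
    }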
On a related topic, has anyone had luck deploying TCP fastopen in a data center? Did it make any difference?<p>In theory for shortlived TCP connections, fastopen ought to be a win. It's very easy to implement in Linux (just a couple of lines of code in each client & server, and a sysctl knob). And the main concern about fastopen is middleboxes, but in a data center you can control what middleboxes are used.<p>In practice I found in my testing that it caused strange issues, especially where one side was using older Linux kernels. The issues included not being able to make a TCP connection, and hangs. And when I got it working and benchmarked it, I didn't notice any performance difference at all.
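For reference, the code changes really are tiny. A rough sketch of what "a couple of lines on each side" looks like (Linux-specific, error handling trimmed, assumes sysctl net.ipv4.tcp_fastopen=3 on both ends; the queue length and addresses are placeholders):

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <arpa/inet.h>

    /* Server: allow a queue of pending Fast Open requests on the listener. */
    void enable_tfo_server(int listen_fd) {
        int qlen = 128;
        setsockopt(listen_fd, IPPROTO_TCP, TCP_FASTOPEN, &qlen, sizeof(qlen));
    }

    /* Client: hand the first chunk of data to sendto() with MSG_FASTOPEN instead
       of calling connect(); it rides in the SYN when a TFO cookie is cached and
       falls back to a normal three-way handshake when it isn't. */
    ssize_t tfo_request(int fd, const char *ip, int port, const void *buf, size_t len) {
        struct sockaddr_in addr = { .sin_family = AF_INET, .sin_port = htons(port) };
        inet_pton(AF_INET, ip, &addr.sin_addr);
        return sendto(fd, buf, len, MSG_FASTOPEN,
                      (struct sockaddr *)&addr, sizeof(addr));
    }

Newer kernels also have a TCP_FASTOPEN_CONNECT socket option that keeps the usual connect()/write() flow on the client, which may be worth trying if the sendto() path misbehaves.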
The original paper was discussed previously at <a href="https://news.ycombinator.com/item?id=33401480">https://news.ycombinator.com/item?id=33401480</a>
I wonder which problem is bigger - modifying all the things to work with IPv6 only, or modifying all the things to work with (something-yet-to-be-standardized-that-isn’t-TCP)?
The cost of standards is very high, probably second to the cost of no standards!<p>Joking aside, I've seen this first hand when using things like Ethernet/TCP to transfer huge amounts of data in hardware. The final encapsulation of the data is simple, but there are so many layers on top, and it adds huge overhead. Then standards have modes, and even if you use a subset the hardware must usually support all of them to be compliant, adding much more cost in hardware. A clean-room design could save a lot of hardware power and area, but the loss of compatibility and interop would cost later in software... hard problem to solve for sure.
It's already happening. For the more demanding workloads such as AI training, RDMA has been the norm for a while, either over Infiniband or Ethernet, with Ethernet gaining ground more recently. RoCE is pretty flawed though for reasons Ousterhout mentions, plus others, so a lot of work has been happening on new protocols to be implemented in hardware in next-gen high performance NICs.<p>The Ultra Ethernet Transport specs aren't public yet so I can only quote the public whitepaper [0]:<p>"The UEC transport protocol advances beyond the status quo by providing the following:<p>● An open protocol specification designed from the start to run over IP and Ethernet<p>● Multipath, packet-spraying delivery that fully utilizes the AI network without causing congestion or head-of-line blocking, eliminating the need for centralized load-balancing algorithms and route controllers<p>● Incast management mechanisms that control fan-in on the final link to the destination host with minimal drop<p>● Efficient rate control algorithms that allow the transport to quickly ramp to wire-rate while not causing performance loss for competing flows<p>● APIs for out-of-order packet delivery with optional in-order completion of messages, maximizing concurrency in the network and application, and minimizing message latency<p>● Scale for networks of the future, with support for 1,000,000 endpoints<p>● Performance and optimal network utilization without requiring congestion algorithm parameter tuning specific to the network and workloads<p>● Designed to achieve wire-rate performance on commodity hardware at 800G, 1.6T and faster Ethernet networks of the future"<p>You can think of it as the love-child of NDP [2] (including support for packet trimming in Ethernet switches [1]) and something similar to Swift [3] (also see [1]).<p>I don't know if UET itself will be what wins, but my point is the industry is taking the problems seriously and innovating pretty rapidly right now.<p>Disclaimer: in a previous life I was the editor of the UEC Congestion Control spec.<p>[0] <a href="https://ultraethernet.org/wp-content/uploads/sites/20/2023/10/23.07.12-UEC-1.0-Overview-FINAL-WITH-LOGO.pdf" rel="nofollow">https://ultraethernet.org/wp-content/uploads/sites/20/2023/1...</a><p>[1] <a href="https://ultraethernet.org/ultra-ethernet-specification-update/" rel="nofollow">https://ultraethernet.org/ultra-ethernet-specification-updat...</a><p>[2] <a href="https://ccronline.sigcomm.org/wp-content/uploads/2019/10/acmdl19-343.pdf" rel="nofollow">https://ccronline.sigcomm.org/wp-content/uploads/2019/10/acm...</a><p>[3] <a href="https://research.google/pubs/swift-delay-is-simple-and-effective-for-congestion-control-in-the-datacenter/" rel="nofollow">https://research.google/pubs/swift-delay-is-simple-and-effec...</a>
You want to replace TCP because it is bad? Then give us a better "connected" protocol over raw IP and other raw network topologies. Use it. Done.<p>Don't mess with another IP -> UDP -> something
There was a ton of effort and money especially during the dot-com boom to rethink datacenter communications. A fair number of things did happen under the covers--offload engines and the like. And InfiniBand persevered in HPC, albeit as a pretty pale shadow of what its main proponents hoped for--including Intel and seemingly half the startups in Austin.
TLDR: No, it's not.<p>Homa is great, but not good enough to justify a wholesale ripout of TCP in the “datacentre”.<p>Sure, a lot of traffic is message oriented, but TCP is just a medium to transport those messages. Moreover it's trivial to do external requests with TCP because it's supported. There is no need to have Homa terminators at the edge of each datacentre to make sure that external RPCs can be done.<p>The author assumes that the main bottleneck to performance in a datacentre is TCP. That's just not the case; in my datacentre, the main bottleneck is that 100 gigs point to point isn't enough.
Network protocols are slow to change. Just look at IPv6 adoption. Some of this is for good reason. Some isn't. Because of everything from threat reduction to lack of imagination, equipment at every step of the process will tend to throw away anything that looks <i>weird</i>, a process somebody dubbed <i>ossification</i>. You'll be surprised how long-lasting some of these things are.<p>Story time: I worked on Google Fiber years ago. One of the things I worked on was services to support the TV product. Now if you know anything about video delivery over IP you know you have lots of choices. There are also layers, like the protocols, the container format and the transport protocol. The TV product, for whatever reason, used a transport protocol called MPEG2-TS (Transport Streams).<p>What is that? It's a CBR (constant bit rate) protocol that stuffs 7 188-byte payloads into a single UDP packet that was (IPv4) multicast. Why 7? Well, because 7 payloads come to 1316 bytes, which plus UDP/IP headers stays under 1500 bytes, and you start to run into problems on any IP network once your packets are larger than that (ie an MTU of 1500 or 1536 is pretty standard); an 8th payload would push you over. This is a big issue with high bandwidth NICs, such that you have things like jumbo frames to increase throughput and decrease CPU overhead, but support is sketchy on a heterogeneous network.<p>Why 188-byte payloads? For <i>compatibility with Asynchronous Transfer Mode ("ATM")</i>, a long-dead fixed-packet-size protocol (53-byte cells including 48 bytes of payload IIRC; the jump from 48 to 188 apparently comes from AAL1 taking one byte per cell, leaving 47, and 4x47=188) designed for fiber networks. I kind of thought of it as Fibre Channel 2.0. I'm not sure that's correct however.<p>But my point is that this was an entirely owned and operated Google network and it still had 20-30+ year old decisions impacting its architecture.<p>Back to Homa, five thoughts:<p>1. Focusing on at-least-once delivery instead of at-most-once delivery seems like a good goal. It allows you to send the same packet twice. Plus you're worried about data offset, not ACKing each specific packet;<p>2. Priority never seems to work out. Like, this has been tried. TCP has an urgent pointer; IP has ToS/DSCP bits. You have QoS on even consumer routers. If you're saying it's fine to discard a packet, then what happens to that data if the receiver is still expecting it? It's well-intentioned but I suspect it just won't work in practice, like it never has previously;<p>3. Lack of connections also means lack of a standard model for encryption (ie SSL/TLS). Yes, encryption still matters inside a data center on purely internal connections;<p>4. QUIC (HTTP/3) has become the de facto standard for this sort of thing, although it's largely implementing your own connections in userspace over UDP; and<p>5. A ton of hardware has been built to optimize TCP and offload as much as possible from the CPU (eg checksumming packets). You see this effect with QUIC: it has significantly higher CPU overhead per payload byte than TCP does. Now maybe it'll catch up over time. It may also change as QUIC gets migrated into the Linux kernel (which is an ongoing project) and other OSs.
Homa: A Receiver-Driven Low-Latency Transport Protocol Using Network Priorities<p><a href="https://people.csail.mit.edu/alizadeh/papers/homa-sigcomm18.pdf" rel="nofollow">https://people.csail.mit.edu/alizadeh/papers/homa-sigcomm18....</a>
In your data center there is usually a collection of switches from different vendors purchased through the years. Vendor A tries to outdo vendor B with some magic sauce that promises higher bandwidth. With open standards. To avoid vendor lock-in. The risk-averse manager knows the equipment might need to be re-used or re-sold elsewhere. Want to try something new? Plus: who is ready to debug and maintain the new fancy standard?
> For many years, RDMA NICs could cache the state for only a few hundred connections; if the number of active connections exceeded the cache size, information had to be shuffled between host memory and the NIC, with a considerable loss in performance.<p>A massively parallel task? Sounds like something doable with GPGPU.
I feel like there ought to be a corollary to Betteridge's law which gets invoked whenever a blog, vlog, paper, or news headline begins with "It's Time to..."<p>But the new law doesn't simply negate the assertion. It comes back with: "Or else, what?"<p>If this somehow catches on, I recommend the moniker "Valor's Law".
How long have we needed to support IPv6 now? Is it even supported and more widely in use than IPv4 yet - e.g. in mobile networks, where everything is stashed behind NAT and IPv4 is kept around?<p>Another protocol, something completely new? Good luck with that; I would rather bet on global warming to put us out of our misery (/s)...<p><a href="https://imgs.xkcd.com/comics/standards.png" rel="nofollow">https://imgs.xkcd.com/comics/standards.png</a>
Unrelated to this article, are there any reasons to use raw TCP/IP over WebSockets? The latter is such a clean, message-based interface that I don't see a reason to use raw TCP/IP.