> The connectx() functionality is valuable, and should be added to Linux one way or another. It's not trivial to get all its use cases right. Hopefully, this blog post shows how to achieve this in the best way given the operating system API constraints.<p>I think this write-up is valuable since it shows a deficiency that impacts real-world industry usage. This kind of public explanation of the problem and workaround is perfect ammunition to get support behind such an initiative. Also, the fact that Darwin already supports this is another point in their favour.<p>A few times during this read I thought "that was a bad decision on their part", but upon completing the article I changed my mind. Their needs are definitely complex but they aren't unreasonable. It seems like having kernel support for their use cases is appropriate.
A few things that weren’t mentioned in detail or which I skimmed over without noticing:<p>- The problem with bind before connect is that the OS thinks you might call listen after bind instead of connect, and listen requires the src port/ip to be unique.<p>- Running out of ports may affect short-lived connections too: by default, when you (the client) close a connection, it goes into a ‘TIME_WAIT’ state that locks up the {src,dst}_{ip,port} quadruple for (by default) 2 minutes, to protect new connections with that quadruple from getting packets that were meant for the previous connection.<p>One thing that mildly surprised me about the article was that Cloudflare seem to be mostly using the standard BSD sockets API and, presumably, the kernel interface to the network cards. I would have expected their main products (i.e. being a CDN) to use more specialised network cards and userspace networking (e.g. the final function they give for UDP does a ton of syscalls, and for TCP it still does several) with a non-BSD API. The main advantage would be having more control over their system and the ability to avoid a lot of unnecessary overhead from syscalls or APIs, but there could be other advantages like NIC TLS (though this can also be accessed through the kernel as kernel TLS).<p>I’m sure Cloudflare have reasons for it though. Maybe the hardware is more expensive than underutilisation, or hard to source, or buggy, or userspace networking just doesn’t improve things much. Many of those things can make servers more efficient, but that might not be necessary if many servers are needed at lots of edge locations, with each individual server not needing to be efficient. Or maybe those things are mostly just saving microseconds when the latency over the wire is in milliseconds.
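The first bullet above can be seen directly from the sockets API. A minimal sketch (mine, not from the article): a plain bind() has to reserve the local ip/port exclusively, because the kernel cannot know whether listen() or connect() comes next, so a second bind to the same pair is rejected.

```python
import errno
import socket

# First socket: let the kernel pick a free ephemeral port at bind() time.
a = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
a.bind(("127.0.0.1", 0))
port = a.getsockname()[1]

# Second socket: try to bind the exact same src ip/port.
b = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    b.bind(("127.0.0.1", port))
    conflict = None
except OSError as e:
    conflict = e.errno  # EADDRINUSE: the port was reserved exclusively

print(conflict == errno.EADDRINUSE)  # True

a.close()
b.close()
```

This exclusivity is exactly what you pay for when you bind before connect, even though two outgoing connections with different destinations could in principle have shared the port.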
I always thought the 16-bit port was too little.<p>GNUnet CADET (admittedly a very experimental and non-production system) uses a 512-bit port number. A port can be a string, like "https" or "my-secret-port", which is hashed with SHA-512 to produce the port number. I like this idea very much and dream of it becoming a reality on the web.
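The string-to-port mapping described above is trivial to illustrate. This is a toy version based on my reading of the comment, not GNUnet's actual wire format:

```python
import hashlib

def cadet_port(name: str) -> bytes:
    """Hash a human-readable port name into a 512-bit port identifier."""
    return hashlib.sha512(name.encode("utf-8")).digest()

p = cadet_port("my-secret-port")
print(len(p) * 8)  # 512 bits of port space instead of 16
```

With 512 bits the port space is effectively unguessable, so a port name can double as a shared secret, which is presumably part of the appeal.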
A fantastic article that deals with an issue of real, practical importance. I was surprised, however, to hear no mention of file descriptor limits, and I'm curious as to why that's not relevant. I think the article could be at least marginally improved by adding a note about this topic, particularly around where it discusses the ~28k available ephemeral ports.
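For reference, the limit the comment is asking about is easy to inspect (a Unix-only sketch; the relationship to the article's port discussion is my framing): every open connection costs one file descriptor, so RLIMIT_NOFILE caps concurrent sockets independently of the ephemeral port range.

```python
import resource  # Unix-only stdlib module

# Per-process file descriptor limits: the soft limit is what bites
# first; a process may raise it up to the hard limit without privilege.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(soft, hard)
```

A likely reason it goes unmentioned: the fd limit is per-process and trivially raised via `setrlimit` or ulimit, whereas the ephemeral port range is a property of the 4-tuple space that no single process can configure its way out of.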
This article is nice and it clearly shows what kind of problems servers for TCP-based infrastructure on the internet have to face.<p>The issue I have with it: it shows how the problem can be solved for systems that you are in control of, but not for systems (clients) that you aren't in control of.<p>The reason why servers on the web are so easily DDoS-able is the long-lived connection defaults in most higher-level protocol implementations...and the fact that ISPs all over the globe intentionally delay the initial SYN/SYN-ACK by up to 30-60 seconds when their "flatrates" run out on mobile networks.<p>Server administrators then face the decision to either drop the mobile world completely or accept a situation where a couple of malicious actors can take down their infrastructure with methods as simple as a slowloris or a SYN flood attack.<p>This also leaves aside the problem that ISPs allow (relayed) amplification attacks because they don't think it's their responsibility to track where traffic comes from and where it is supposed to go; I would disagree on that.<p>If I had influence like Cloudflare does in the IETF/IANA/Internet Society, I'd try to push initiatives like these:<p>- Disallow ISPs from forwarding network traffic that they can easily detect as an amplification attack. If the UDP origin in the request packet != the UDP target in the response packet, simply drop it.<p>- Force ISPs to block SYN floods. This isn't hard to do on their end, but it gets harder to do inside your own little infrastructure. If an ASN doesn't block SYN floods regularly, punish it with blocked traffic.<p>- Force lower network protocol timeouts. A 60-second socket timeout is total overkill for this small planet and only serves to enable this shitty ISP behaviour.
Even in the world of 56k modems, such a timeout didn't serve a purpose anymore.<p>- Push (like Google has) UDP-based networking stacks that (with the above exception) solve this connection-timeout problem by dynamically multiplexing / pipelining traffic through the framing protocol of the same UDP socket.
Nice article, and the UDP section shows some heavy socket wizardry.<p>Since it requires all the clients to cooperate in using the SO_REUSEADDR pseudomutex, I wonder if some more explicit process-shared resource would have been better.
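For readers who skipped the UDP section, the cooperation the comment refers to looks roughly like this (a rough sketch of my understanding, assuming Linux semantics, with made-up remote addresses): every socket sets SO_REUSEADDR before bind, so several connected UDP sockets can share one local port as long as each connect()s to a different remote, keeping the 4-tuples distinct.

```python
import socket

def shared_udp(local_port: int, remote: tuple) -> socket.socket:
    """Bind a UDP socket to a shareable local port, then fix its 4-tuple."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # Must be set BEFORE bind, on every cooperating socket.
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(("127.0.0.1", local_port))
    s.connect(remote)  # UDP connect sends no packet; it just pins the 4-tuple
    return s

a = shared_udp(0, ("127.0.0.1", 5300))      # hypothetical remote endpoints
port = a.getsockname()[1]
b = shared_udp(port, ("127.0.0.1", 5301))   # same local port, distinct remote

print(a.getsockname()[1] == b.getsockname()[1])  # True: port is shared
```

The "pseudomutex" fragility is visible here: if any one process on the machine binds that port without the option, or two sockets pick the same remote, the bind or connect fails with EADDRINUSE, which is why all users of the port must cooperate.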
28000 ephemeral ports is enough for the c10k problem, but 100k+ seems a stretch. At what point is increasing the number of ports from 28k to a higher number the right answer? Reuse as described here sounds like a useful optimization, but at some point (or due to a pathological workload) even that will be exhausted, I’d think. What to do then?
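Some back-of-the-envelope arithmetic on the question above, with assumed numbers: once ports are picked at connect() time, the ephemeral range limits connections per (src ip, dst ip, dst port) combination rather than globally, so exhaustion is fought by multiplying tuples, not just widening the range.

```python
# Default Linux ip_local_port_range endpoints (tunable via sysctl).
low, high = 32768, 60999
ephemeral = high - low + 1
print(ephemeral)  # 28232 -- the ~28k in the comment

# Scaling past the range means adding source IPs (hypothetical count):
src_ips = 4
print(ephemeral * src_ips)  # 112928 possible 4-tuples per destination
```

Widening the range (the full 16-bit space minus reserved ports tops out near 64k) buys at most ~2x, so the durable answers are more source IPs, more destination endpoints, or connection multiplexing, which is presumably why the article focuses on reuse rather than sysctl tuning.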
Just one more reason to redesign the tcp/ip stack. Can you imagine what our ABIs will look like if we're still hacking in kludges for another 40+ years? Opening a connection "the right way" will look like casting a spell with voodoo. Ooh, maybe we'll get cooler titles then, like Master of the Dark Network Arts.
This article is written as though Linux isn't an open source OS. Basically every big player rolls their own kernel to get features they want. This use case here is pretty exotic and for 99% of people using Linux it's perfectly fine for their needs.