Stop Buying Load Balancers and Start Controlling Your Traffic Flow with Software

166 points by danmccorm almost 11 years ago

16 comments

mattzito almost 11 years ago
I'm really reaching back into the depths of my memory, but I've implemented this in the past. It's not quite as simple as they make it sound - there are a lot of sticky edge cases that crop up here (some of which have no doubt been addressed in subsequent years).

- It heavily limits the number of nodes you can have - that is something the article does say, but I want to highlight it here. It strikes me as a really bad strategy for scale-out.

- I've run into weirdness with a variety of different router platforms (Linux, Cisco, Foundry) when you withdraw and publish BGP routes over and over and over again (i.e. you have a flapping/semi-available service).

- It is true that when a node goes down, BGP dead peer detection will kick in and remove the node. *However*, the time to remove the node will vary and requires tuning on the router/switch side of things.

This is a fairly crude implement to swing - a machete rather than a scalpel. You lose a lot of the flexibility load balancers give you, and depend a lot more on software stacks (routers/switches) that you have less insight and visibility into and that were not designed to do this.

My suggestion would be that this is a great way to scale across multiple load balancer/haproxy nodes. Use BGP to load balance across individual haproxy nodes - that keeps the neighbor count low, minimizes flapping scenarios, and you get to keep all the flexibility a real load balancer gives you.

One last note - the OP doesn't talk about this, but the trick I used back in the day was that I actually advertised a /24 (or /22, maybe?) from my nodes to my router, which then propagated it to a decent chunk of the Internet. This is good for doing CloudFlare-style datacenter distribution, but has the added benefit that if all of your nodes go down, the BGP route will be withdrawn automatically and traffic will stop flowing to that datacenter. It also makes maintenance a lot easier.
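For readers who haven't seen the technique, here is a minimal sketch of the kind of announcement being discussed, written in ExaBGP's config style; the addresses, ASN, and VIP are placeholders, not values from the article:

```
# Hypothetical ExaBGP config on one haproxy node: peer with the top-of-rack
# router and announce a /32 service VIP with this node as the next hop.
neighbor 10.0.0.1 {            # router / ToR switch
    router-id 10.0.0.2;
    local-address 10.0.0.2;    # this haproxy node
    local-as 64512;
    peer-as 64512;

    static {
        # Every haproxy node announces the same VIP; the router ECMPs
        # across whichever nodes are currently announcing it.
        route 192.0.2.10/32 next-hop 10.0.0.2;
    }
}
```

If the ExaBGP process or the whole host dies, the BGP session drops and the route is withdrawn, which is the failure behaviour described above.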
donavanm almost 11 years ago
This works well if you need to push high bit rates and are looking for relatively simple load balancing. A Trident box, a la Juniper QFX, can push a few hundred Gb/s for ~$25,000. That's an incredibly low price point compared to any other LB solution.

Some caveats and comments about the technique.

BGP & ExaBGP are implementation details. OSPF, Quagga, & BIRD will all accomplish the same thing. Use whatever you're comfortable with.

Scale-out can get arbitrarily wide. In a simplistic design you'll ECMP on the device (ToR) where your hosts are connected. Any network device will give you 8-way ECMP. Most Junos stuff does up to 32-way today, and 64-way with an update. You can ECMP before that as well, in your agg or border layer. That would give you 64 x 64 = 4096 endpoints per external "VIP."

ECMP giveth and taketh away. If you change your next hops, expect all those flows to scramble. The reason is that the ordering of next hops / egress interfaces is generally included in the assignment of flows to next hops. In a traditional routing application this has no effect. When the next hops are terminating TCP sessions, you'll be sending RSTs to half of your flows.

For this same reason you'll have better luck advertising more specific routes, like /32s instead of a whole /24. This can help limit the blast radius of flow-rehash events to a single destination "VIP."

There are more tricks you can play to mitigate flow rehashes. It's quite a bit of additional complexity though.

For the same reason, make double-plus sure that you don't count the ingress interface in the ECMP hash key. On Junos this is incoming-interface-index and family inet { layer-4 }, IIRC.

You *really* don't want to announce your routes from the same host that will serve your traffic. Separate your control plane and data plane. It's terrible when a host has a gray failure, say OOM or a read-only disk, and route announcements stay up while the host fails to serve traffic. You end up null-routing or throwing 500s for 1/Nth of your traffic.
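For reference, per-flow ECMP on Junos looks roughly like the following; this is a from-memory sketch, the exact statements vary by platform and release, and as noted above you want layer-3/layer-4 fields in the hash key but not the incoming interface:

```
policy-options {
    policy-statement ECMP {
        /* despite the name, modern hardware hashes per flow, not per packet */
        then {
            load-balance per-packet;
        }
    }
}
routing-options {
    forwarding-table {
        export ECMP;          /* install all equal-cost next hops */
    }
}
forwarding-options {
    hash-key {
        family inet {
            layer-3;          /* source/destination IP */
            layer-4;          /* source/destination port */
        }
    }
}
```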
keepper almost 11 years ago
Half a million dollars of load balancers? Either you are buying from the wrong vendor, or you have some wonky ideas of how many load balancers you need per data center, or you are not using them correctly. (Hint: check A10 Networks and Zeus.)

The reality is that if your problem is only L3, then arguably this can be solved many ways. For example, networks have been doing tens of gigabits of L3 load balancing using DSR for ages. Dynamic route propagation doesn't have a monopoly on this (albeit it's more "elegant").

But most people do more than L3, and really do L4-L7 load balancing, and most modern "application load balancing" platforms are really software packages bundled up in a nice little appliance. This is where packages like Varnish with its VCL/vmods and caching, aFleX (from A10 Networks) and TrafficScript from Zeus, amongst others, come in. Shuffling bits is the easy part! Understanding the request, and making decisions on it, is the harder part.

If you split the problem and are using Varnish or nginx as your application load balancer, you can't claim you've gotten rid of load balancers - you were either not buying the right platform initially, or not using it correctly. When you say "stop buying load balancers"... you must first define what you mean by "load balancer" ;)

For the record, I've used commercial load balancing platforms as well as contributed patches to and used OSS load balancing platforms.
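As a concrete illustration of that "split the problem" approach, the L7 tier in front of the app servers can be as small as this nginx sketch (the addresses and ports are invented; BGP/ECMP would then spread traffic across several identical boxes running it):

```
# Minimal nginx L7 load-balancing tier with passive health checks.
upstream app {
    server 10.0.1.10:8080 max_fails=3 fail_timeout=10s;
    server 10.0.1.11:8080 max_fails=3 fail_timeout=10s;
}

server {
    listen 80;

    location / {
        proxy_pass http://app;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
```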
at-fates-hands almost 11 years ago
This is actually a great idea. The last company I worked for ran into several issues with load-balancing servers. Two of the three major releases I was around for were unmitigated disasters.

The first was because old load-balancing servers got bogged down with traffic. The CIO got pissed and dropped three million on brand spanking new SSD drives and new "state-of-the-art" servers. Cue the next release.

Pretty much the same issue. It was two lines in a program calling a file from SharePoint - thousands of times a second - and it bogged down all three of the load-balancer servers with traffic within minutes of the release. It took the back-end developers a week and some help from Microsoft to fix the bug. I just sat back and giggled, since the CIO had spent two hours in a meeting with the whole IT department lecturing them on the importance of load testing immediately after the first release's failure.

Needless to say, they didn't do any load testing for the applications either time, which contributed to the issue. Of course, it just goes to show that even with the bestest, newest hardware, you can still bring your site/applications to their knees.
lazyjones almost 11 years ago
This article is a bit light on details, so it's hard to judge the quality of the proposed solution (without having tested a similar setup).

We faced the choice of either upgrading our aging Foundry load balancers or building our own solution a few years ago, and came up with a very stable and scalable setup:

2+ load balancers (old web servers sufficed) running Linux and:

* wackamole for IP address failover (detects peer failure with very low latency, informs upstream routers; an identical setup works for all load balancers, and it can be tuned to keep particular IP addresses preferably on particular load balancer hosts) http://www.backhand.org/wackamole/

* Varnish for HTTP proxying and load balancing (identical setup on all load balancers) - www.varnish.org

* Pound for HTTPS load balancing (identical configuration on all load balancers; can handle SNI, client certificates, etc.) http://www.apsis.ch/pound/

This scales pretty much arbitrarily: just add more load balancers for more Varnish cache or SSL/TLS handshakes per second. We also have nameservers on all load balancers (also with replicated configuration and IP address failover). Configuration is really easy; only Varnish required some tuning (larger buffers etc.), and Pound (OpenSSL really) was set up carefully for PFS and good compatibility with clients.

The only drawback is that the actual traffic distribution over the load balancers is arbitrary and thus unbalanced (wackamole assigns the IP addresses randomly unless configured to prefer a particular distribution), but the more IP addresses your traffic is spread out over, the less of a problem this becomes.
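For illustration, the Varnish side of a setup like this is a short VCL file. This sketch uses Varnish 3-era syntax with made-up backend addresses and a hypothetical /healthz probe URL, not the commenter's actual configuration:

```
# Two app servers behind a round-robin director, with an active health probe.
probe healthz {
    .url = "/healthz";
    .interval = 3s;
    .timeout = 1s;
    .window = 5;
    .threshold = 3;
}

backend web1 { .host = "10.0.1.10"; .port = "80"; .probe = healthz; }
backend web2 { .host = "10.0.1.11"; .port = "80"; .probe = healthz; }

director www round-robin {
    { .backend = web1; }
    { .backend = web2; }
}

sub vcl_recv {
    set req.backend = www;   # backends failing the probe are skipped
}
```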
peterwwillis almost 11 years ago
Really he's talking about layer 4 load balancing, not layer 3, and assuming your Juniper router has an Internet Processor II ASIC to juggle TCP flows. You're still buying hardware to do the load balancing; you just use software to do the BGP announce.

Honestly, it all seems a bit crude and unreliable. If I'm writing a software load balancer, I'm not going to use curl, bash scripts and pipes to do it. But this is why devops people shouldn't be designing highly available traffic control software.
mjolk almost 11 years ago
This is a cool setup, but with the caveat that Allan stated: it forces you to think a little more about a layer that most systems people are less experienced in. The software approach is particularly useful because one could take the "healthcheck" setup and have it keep your alerting/dashboards in sync with reality (e.g. do the healthcheck, fork: return the exit code; POST {$hostname: 'ok'} to a metric collector).

I also see that Shutterstock is actively hiring. For anyone looking, Shutterstock is a great place to work and employs some really brilliant people.

Disclaimer: Ex-Shutterstock employee
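A rough sketch of that idea, written as an ExaBGP-style health-check process in Python; the VIP, the local /healthz URL, and the metrics endpoint are all hypothetical:

```python
#!/usr/bin/env python3
"""Announce the VIP while the local service answers, withdraw it when it
stops, and report the result to a metrics collector (all names assumed)."""
import json
import socket
import sys
import time
import urllib.request

VIP = "192.0.2.10/32"                              # assumed service VIP
CHECK_URL = "http://127.0.0.1:8080/healthz"        # assumed local health endpoint
METRICS_URL = "http://metrics.example.com/report"  # assumed collector

def healthy():
    try:
        return urllib.request.urlopen(CHECK_URL, timeout=2).status == 200
    except OSError:
        return False

def report(state):
    # POST {"<hostname>": "ok"|"down"} so dashboards track the same check.
    body = json.dumps({socket.gethostname(): state}).encode()
    req = urllib.request.Request(METRICS_URL, data=body,
                                 headers={"Content-Type": "application/json"})
    try:
        urllib.request.urlopen(req, timeout=2)
    except OSError:
        pass  # metrics are best effort

announced = False
while True:
    ok = healthy()
    if ok and not announced:
        # ExaBGP reads announce/withdraw commands from this process's stdout.
        sys.stdout.write("announce route %s next-hop self\n" % VIP)
        announced = True
    elif not ok and announced:
        sys.stdout.write("withdraw route %s next-hop self\n" % VIP)
        announced = False
    sys.stdout.flush()
    report("ok" if ok else "down")
    time.sleep(3)
```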
jdubs almost 11 years ago
I can boil water for tea in my oven, but should I?

This is totally nonstandard, and it will be a nightmare to document and difficult to hand off to a new teammate.
transitorykris almost 11 years ago
This can shift complexity elsewhere in your stack. A couple of points to add.

Be mindful of the specific routing hardware you're using:

Announcing and withdrawing prefixes can cause the router to select new next hops (i.e. servers). This is mostly a problem with TCP and other connection-oriented protocols (or even connectionless ones if you're expecting a client to be sticky to a server).

You may also lose the ability to do unequal-cost load balancing.
jauer almost 11 years ago
You can also do this (equal-cost multipath to servers) without a dynamic routing protocol, but you are at the mercy of whatever health checks your top-of-rack switch supports.

On Cisco switches you can use an IP SLA check to monitor for DNS replies from a DNS server and then have a static route that tracks the SLA check. If your DNS server stops responding, the route is withdrawn and traffic routed away. This can happen within a few seconds. Slides from a NANOG talk about this (PDF): http://www.nanog.org/meetings/nanog41/presentations/Kapela-lightning.pdf
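In IOS terms that looks roughly like the following; this is a from-memory sketch, the tracked prefix and server addresses are placeholders, and the exact tracking syntax varies by IOS version:

```
! Probe the DNS server by asking it to resolve a name every 5 seconds.
ip sla 10
 dns www.example.com name-server 10.0.0.53
 frequency 5
ip sla schedule 10 life forever start-time now
!
! Track the probe and tie a static route for the service address to it;
! if the probe fails, the route is removed and traffic goes elsewhere.
track 1 ip sla 10 reachability
ip route 192.0.2.53 255.255.255.255 10.0.0.53 track 1
```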
nonuby almost 11 years ago
> "it is actually more of a load-balance per-flow since each TCP session will stick to one route rather than individual packets going to different backend servers."

This strikes me as expensive. Does this mean packets no longer pass through the ASIC-only side of a router, so the software in the router has to do some of the heavy lifting, limiting the capacity/throughput to a mere fraction of what the router is really capable of?

Disclaimer: I have only a high-level overview of router tech.
dmourati almost 11 years ago
I've been running software load balancers for over a decade. I started with LVS (Linux Virtual Server), now called IPVS. Now we run HAProxy, and we're looking at Apache Traffic Server.

Some of the load balancers have even run BGP, as called out in the OP. Nothing really fancy, but enough to be interesting.

One of the coolest things I built was a global server load balancer to balance load balancers. We needed it initially to move data centers. It was built on top of PowerDNS and a ketama hash.
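The ketama part of a GSLB like that is just consistent hashing. A small self-contained sketch of the idea in Python (the data center names and client key are invented; in a real setup a PowerDNS backend would return the chosen site's address from its lookup hook):

```python
import hashlib
from bisect import bisect

def build_ring(datacenters, points_per_dc=160):
    """Ketama-style continuum: many pseudo-random points per data center."""
    ring = []
    for dc in datacenters:
        for i in range(points_per_dc):
            h = int(hashlib.md5(f"{dc}-{i}".encode()).hexdigest()[:8], 16)
            ring.append((h, dc))
    ring.sort()
    return ring

def pick_dc(ring, key):
    """Map a client key (e.g. resolver IP) to the next point on the ring."""
    h = int(hashlib.md5(key.encode()).hexdigest()[:8], 16)
    idx = bisect([p for p, _ in ring], h) % len(ring)
    return ring[idx][1]

ring = build_ring(["iad", "ord", "sfo"])
print(pick_dc(ring, "198.51.100.7"))   # answer stays stable until the DC set changes
```

The appeal of the consistent hash is that adding or removing one data center only remaps roughly 1/Nth of the keys, which matters when you are using it to drain traffic during a data center move.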
vhost- almost 11 years ago
As someone who works with NetScalers on the regular, I have to say I love this idea. Citrix support is terrible and NetScalers are such a pain to configure. Then I see the bill, and it frosts the cake.

We recently upgraded from version 9 to version 10, and it took down our production site because of some asinine, undocumented rate limiting they "finally enforced" in version 10.

I'd like to play with software load balancing in the testing facility.
mseebach almost 11 years ago
*Even though the above says load-balance per-packet, it is actually more of a load-balance per-flow since each TCP session will stick to one route rather than individual packets going to different backend servers. As far as I can tell, the reasoning for this stems from legacy chipsets that did not support per-flow packet distribution.*

Is this not fairly risky? It's essentially relying on a bug?
contingencies almost 11 years ago
Yeah, I spent last weekend configuring Cisco gear that I feel basically should have been done in software on Linux. As far as hardware goes, the era of the hardware firewall / load balancer is over. Buy a dedicated box (or two) and configure it... it's faster and more predictable/reliable.
techprotocol almost 11 years ago
AWS's offering, which is software-based: http://aws.amazon.com/elasticloadbalancing/