Nice writeup, but it leaves me curious about the root cause:

"For some reason, our switches were unable to learn a significant percentage of our MAC addresses and this aggregate traffic was enough to saturate all of the links between the access and aggregation switches, causing the poor performance we saw throughout the day."

Did you work with your vendor to understand what caused the above problem? Was the MAC table simply too small to hold all the entries?

This problem aside, I am wondering why you still run a layer 2 network in a tree-like configuration. These are known not to scale well beyond a small LAN. An appropriate layer 3 network (with multipath routing) would ensure there is no such flooding, and that you actually use all the precious capacity in your switches!
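To make the flooding mechanics concrete, here is a toy sketch (Python, invented numbers and port names, nothing to do with their actual topology) of why unlearned MACs translate directly into saturated uplinks:

    # Toy model of unknown-unicast flooding. When a switch cannot learn a
    # destination MAC, every frame to that MAC is flooded out all ports
    # instead of being sent out one learned port.

    MAC_TABLE_CAPACITY = 4          # hypothetical, deliberately tiny
    mac_table = {}                  # dst MAC -> egress port
    ports = ["uplink1", "uplink2", "host1", "host2", "host3"]

    def learn(src_mac, in_port):
        # Learning silently fails once the table is full -- the failure
        # mode the parent comment is asking about.
        if src_mac not in mac_table and len(mac_table) >= MAC_TABLE_CAPACITY:
            return
        mac_table[src_mac] = in_port

    def forward(dst_mac, in_port):
        if dst_mac in mac_table:
            return [mac_table[dst_mac]]                 # unicast: one link
        return [p for p in ports if p != in_port]       # flood: every link

    learn("00:00:00:00:00:01", "host1")
    print(forward("00:00:00:00:00:01", "host2"))   # learned: ['host1']
    print(forward("aa:bb:cc:dd:ee:ff", "host1"))   # not learned: copied to every other port

One unlearned destination turns a single frame into N-1 copies, which is how aggregate traffic ends up crushing the access/aggregation uplinks.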
"We experienced 18 minutes of complete unavailability along with sporadic bursts of slow responses and intermittent errors for the entire day."<p>Well, I can say we experienced worse than that. Our private repositories were unavailable starting at 9am until 4pm PST.
Welcome to the club of STP meltdown survivors!

Unfortunately, large L2 Ethernet networks are not scalable and are prone to episodic catastrophic failure. You can read more here: http://blog.ioshints.info/2012/05/transparent-bridging-aka-l2-switching.html

One way to make an L2 network somewhat stable is to replace many little switches with one big modular chassis with hundreds of ports, like a Cat6500/BlackDiamond.

The other is to minimize the L2 segments and connect them at L3 (IP routing).
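A back-of-the-envelope sketch (all numbers invented) of why shrinking the broadcast domain and routing between segments helps so much:

    # Every host's broadcasts (ARP, DHCP, etc.) reach every other host
    # in the same L2 domain, so the load scales with domain size.
    hosts_flat = 4000            # one big bridged domain
    hosts_segmented = 100        # per L3-routed segment
    bcast_pps_per_host = 5
    frame_bytes = 100

    def broadcast_load(hosts):
        return hosts * bcast_pps_per_host * frame_bytes * 8 / 1e6  # Mbit/s

    print(f"flat L2 domain:  {broadcast_load(hosts_flat):.1f} Mbit/s seen on every port")
    print(f"routed segments: {broadcast_load(hosts_segmented):.1f} Mbit/s seen on every port")

And that is just steady-state background noise, before you account for flooding or a bridging loop.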
Wow, I totally empathize with that pain. Let me guess, you've got switches from Blade Networks (aka IBM)? :-)

Large-scale networking changes like this are so challenging to pull off: one missed cable, one misconfigured switch, and blam! Everything is nuts.
> Last week the new aggregation switches finally arrived and were installed in our datacenter.

It sounds like you rushed these switches into production, maybe with insufficient testing.

There are all kinds of bugs and weird interactions in network hardware and software that cause problems you can't anticipate.

You have got to lab it up and do sanity checks before deploying (referring specifically to the LACP/port-channel problems).
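A toy example of the kind of pre-cutover check I mean; the interface names, bundle names, and data format here are all made up for illustration:

    # Hypothetical sanity check: compare the interfaces you *intend* to be
    # in each port-channel against what the switch actually reports.

    intended = {
        "po1": {"eth1/1", "eth1/2"},
        "po2": {"eth1/3", "eth1/4"},
    }

    # Imagine this came from your switch's API or a parsed LACP status dump.
    observed = {
        "po1": {"eth1/1", "eth1/2"},
        "po2": {"eth1/3"},            # eth1/4 never bundled -- catch it in the lab
    }

    for bundle, want in intended.items():
        got = observed.get(bundle, set())
        if got != want:
            print(f"{bundle}: expected {sorted(want)}, got {sorted(got)}")

Catching one unbundled member in a lab is a lot cheaper than finding it during a production cutover.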
The excuse that the application grew faster than the infrastructure could scale is amateur hour. This entire article makes any true sys engineer cringe.

You are a ~100 million dollar company and it seems like you drew your systems architecture with crayons. The article is upsetting. The lack of segmentation is embarrassing.

"Oh, it's the switch's fault, it doesn't learn MACs fast enough" - actually, you could subnet your racks and use f*n VLANs. You might use public IPs on everything, but this could still be educational for the company.

Your solution to all of this was to spend twice as much on a "staging" network. Something doesn't seem right here.

It makes me cringe when I see any one sentence that has the following three words in it: escalate, network, vendor.

This isn't a Boeing airplane; you cannot just rely on the vendor. This article just gives me a good sense of job security in the field of sys engineering. I really think they should sit down and really go over their network. A bridge loop like this for a company this large is pretty amateur. GitHub, you can do so much better.
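To be concrete about the "subnet your racks" point, here's a rough sketch (addresses, rack count, and VLAN numbers invented) of carving one block into per-rack subnets so each rack is its own broadcast domain behind a routed or VLAN-tagged boundary:

    import ipaddress

    block = ipaddress.ip_network("10.20.0.0/16")
    racks = 40

    for rack_id, subnet in zip(range(1, racks + 1), block.subnets(new_prefix=24)):
        vlan = 100 + rack_id
        # First usable address in each subnet reserved as the gateway.
        print(f"rack {rack_id:02d}: vlan {vlan}  subnet {subnet}  gw {subnet.network_address + 1}")

A bridging loop or MAC-table problem in one rack then stays in that rack instead of taking the whole site down.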
[shameless plug]

Hey GitHub, sounds like you need SevOne. You could have diagnosed this issue with one TopN report and been done with it.

[/shameless plug]

edit: See the following thread for a full explanation.
This is the kind of situation that I think screams for OpenFlow[1]. It seems issues like this would be easier to avoid and faster to troubleshoot.

1. http://www.openflow.org/
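For anyone unfamiliar, the core idea in a conceptual sketch (plain Python, not a real controller API): a central controller installs explicit match-to-action rules, so forwarding is programmed rather than flooded-and-learned hop by hop.

    flow_table = []   # ordered list of (match_dict, action)

    def install_flow(match, action):
        # In OpenFlow this would be a flow_mod pushed from the controller.
        flow_table.append((match, action))

    def handle_packet(pkt):
        for match, action in flow_table:
            if all(pkt.get(k) == v for k, v in match.items()):
                return action
        # Table miss: punt to the controller, which can decide and install a rule.
        return "send_to_controller"

    install_flow({"dst_mac": "aa:bb:cc:dd:ee:01"}, "output:port3")
    print(handle_packet({"dst_mac": "aa:bb:cc:dd:ee:01"}))   # output:port3
    print(handle_packet({"dst_mac": "ff:ff:ff:ff:ff:ff"}))   # send_to_controller

With a global view of the topology, the controller can also spot (or refuse to program) the kind of forwarding loop described in the postmortem.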
I've got a pet theory that this is going to be a trend over the next few years. A lot of companies GitHub's age were built on the "we misinterpreted devops as noops" attitude, which works great for a few years, but somewhere in the year 3 to 5 range the entropy and technical debt compound faster than a nonexistent or small, inexperienced ops team can keep up with.