Nice writeup, but it leaves me curious about the root cause:

"For some reason, our switches were unable to learn a significant percentage of our MAC addresses and this aggregate traffic was enough to saturate all of the links between the access and aggregation switches, causing the poor performance we saw throughout the day."

Did you work with your vendor to understand what caused the above problem? Was the MAC table simply too small to hold all the entries?

This problem aside, I am wondering why you still run a layer 2 network in a tree-like configuration. These are known not to scale well beyond a small LAN. An appropriate layer 3 network (with multipath routing) would ensure there is no such flooding, and that you actually use all the precious capacity in your switches!
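To make the flooding mechanics concrete, here is a toy sketch (Python, invented numbers and port names, nothing to do with their actual topology) of why unlearned MACs translate directly into saturated uplinks:

    # Toy model of unknown-unicast flooding. When a switch cannot learn a
    # destination MAC, every frame to that MAC is flooded out all ports
    # instead of being sent out one learned port.

    MAC_TABLE_CAPACITY = 4          # hypothetical, deliberately tiny
    mac_table = {}                  # dst MAC -> egress port
    ports = ["uplink1", "uplink2", "host1", "host2", "host3"]

    def learn(src_mac, in_port):
        # Learning silently fails once the table is full -- the failure
        # mode the parent comment is asking about.
        if src_mac not in mac_table and len(mac_table) >= MAC_TABLE_CAPACITY:
            return
        mac_table[src_mac] = in_port

    def forward(dst_mac, in_port):
        if dst_mac in mac_table:
            return [mac_table[dst_mac]]                 # unicast: one link
        return [p for p in ports if p != in_port]       # flood: every link

    learn("00:00:00:00:00:01", "host1")
    print(forward("00:00:00:00:00:01", "host2"))   # learned: ['host1']
    print(forward("aa:bb:cc:dd:ee:ff", "host1"))   # not learned: copied to every other port

One unlearned destination turns a single frame into N-1 copies, which is how aggregate traffic ends up crushing the access/aggregation uplinks.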
"We experienced 18 minutes of complete unavailability along with sporadic bursts of slow responses and intermittent errors for the entire day."<p>Well, I can say we experienced worse than that. Our private repositories were unavailable starting at 9am until 4pm PST.
Welcome to the club of STP meltdown survivors!

Unfortunately, large L2 Ethernet networks are not scalable and are prone to episodic catastrophic failure. You can read more here: http://blog.ioshints.info/2012/05/transparent-bridging-aka-l2-switching.html

One way to make an L2 network somewhat stable is to replace many little switches with one big modular chassis with hundreds of ports, like a Cat6500/BlackDiamond.

The other is to minimize the L2 segments and connect them at L3 (IP routing).
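A back-of-the-envelope sketch (all numbers invented) of why shrinking the broadcast domain and routing between segments helps so much:

    # Every host's broadcasts (ARP, DHCP, etc.) reach every other host
    # in the same L2 domain, so the load scales with domain size.
    hosts_flat = 4000            # one big bridged domain
    hosts_segmented = 100        # per L3-routed segment
    bcast_pps_per_host = 5
    frame_bytes = 100

    def broadcast_load(hosts):
        return hosts * bcast_pps_per_host * frame_bytes * 8 / 1e6  # Mbit/s

    print(f"flat L2 domain:  {broadcast_load(hosts_flat):.1f} Mbit/s seen on every port")
    print(f"routed segments: {broadcast_load(hosts_segmented):.1f} Mbit/s seen on every port")

And that is just steady-state background noise, before you account for flooding or a bridging loop.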
Wow, I totally empathize with that pain. Let me guess, you've got switches from Blade Networks (aka IBM)? :-)

Large-scale networking changes like this are so challenging to pull off: one missed cable, one misconfigured switch, and blam! Everything is nuts.
> Last week the new aggregation switches finally arrived and were installed in our datacenter.

It sounds like you rushed these switches into production, maybe with insufficient testing.

There are all kinds of bugs and weird interactions in network hardware and software that cause problems you can't anticipate.

You have got to lab it up and do sanity checks before deploying (referring specifically to the LACP/port-channel problems).
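A toy example of the kind of pre-cutover check I mean; the interface names, bundle names, and data format here are all made up for illustration:

    # Hypothetical sanity check: compare the interfaces you *intend* to be
    # in each port-channel against what the switch actually reports.

    intended = {
        "po1": {"eth1/1", "eth1/2"},
        "po2": {"eth1/3", "eth1/4"},
    }

    # Imagine this came from your switch's API or a parsed LACP status dump.
    observed = {
        "po1": {"eth1/1", "eth1/2"},
        "po2": {"eth1/3"},            # eth1/4 never bundled -- catch it in the lab
    }

    for bundle, want in intended.items():
        got = observed.get(bundle, set())
        if got != want:
            print(f"{bundle}: expected {sorted(want)}, got {sorted(got)}")

Catching one unbundled member in a lab is a lot cheaper than finding it during a production cutover.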
The excuse that the application grew faster than the infrastructure could scale is amateur hour. This entire article makes any true sys engineer cringe.

You are a ~100 million dollar company and it seems like you drew your systems architecture with crayons. The article is upsetting. The lack of segmentation is embarrassing.

"Oh, it's the switch's fault, it doesn't learn MACs fast enough" - actually, you could subnet your racks and use f*n VLANs. You might use public IPs on everything, but this could still be educational for the company.

Your solution to all of this was to spend twice as much on a "staging" network. Something doesn't seem right here.

It makes me cringe when I see any one sentence that has the following three words in it: escalate, network, vendor.

This isn't a Boeing airplane; you cannot just rely on the vendor. This article just gives me a good sense of job security in the field of sys engineering. I really think they should sit down and really go over their network. A bridge loop like this for a company this large is pretty amateur. GitHub, you can do so much better.
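To be concrete about the "subnet your racks" point, here's a rough sketch (addresses, rack count, and VLAN numbers invented) of carving one block into per-rack subnets so each rack is its own broadcast domain behind a routed or VLAN-tagged boundary:

    import ipaddress

    block = ipaddress.ip_network("10.20.0.0/16")
    racks = 40

    for rack_id, subnet in zip(range(1, racks + 1), block.subnets(new_prefix=24)):
        vlan = 100 + rack_id
        # First usable address in each subnet reserved as the gateway.
        print(f"rack {rack_id:02d}: vlan {vlan}  subnet {subnet}  gw {subnet.network_address + 1}")

A bridging loop or MAC-table problem in one rack then stays in that rack instead of taking the whole site down.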
[shameless plug]

Hey GitHub, sounds like you need SevOne. You could have diagnosed this issue with one TopN report and been done with it.

[/shameless plug]

edit: See the following thread for a full explanation.
This is the kind of situation that I think screams for OpenFlow[1]. It seems issues like this would be easier to avoid and faster to troubleshoot.

1. http://www.openflow.org/
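For anyone unfamiliar, the core idea in a conceptual sketch (plain Python, not a real controller API): a central controller installs explicit match-to-action rules, so forwarding is programmed rather than flooded-and-learned hop by hop.

    flow_table = []   # ordered list of (match_dict, action)

    def install_flow(match, action):
        # In OpenFlow this would be a flow_mod pushed from the controller.
        flow_table.append((match, action))

    def handle_packet(pkt):
        for match, action in flow_table:
            if all(pkt.get(k) == v for k, v in match.items()):
                return action
        # Table miss: punt to the controller, which can decide and install a rule.
        return "send_to_controller"

    install_flow({"dst_mac": "aa:bb:cc:dd:ee:01"}, "output:port3")
    print(handle_packet({"dst_mac": "aa:bb:cc:dd:ee:01"}))   # output:port3
    print(handle_packet({"dst_mac": "ff:ff:ff:ff:ff:ff"}))   # send_to_controller

With a global view of the topology, the controller can also spot (or refuse to program) the kind of forwarding loop described in the postmortem.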
I've got a pet theory that this is going to be a trend over the next few years. A lot of companies GitHub's age were built on the "we misinterpreted devops as noops" attitude, which works great for a few years, but somewhere in the year 3 to 5 range the entropy and technical debt compound faster than a nonexistent or small, inexperienced ops team can keep up with.