
Network problems last Friday

84 points by silenteh over 12 years ago

13 comments

xtacy over 12 years ago
Nice writeup, but it leaves me curious about the root cause:

"For some reason, our switches were unable to learn a significant percentage of our MAC addresses and this aggregate traffic was enough to saturate all of the links between the access and aggregation switches, causing the poor performance we saw throughout the day."

Did you work with your vendor to understand what caused the above problem? Was it a lack of entries in the MAC table?

This problem aside, I am wondering why you still run a layer 2 network in a tree-like configuration. These are known not to scale well beyond a small LAN. An appropriate layer 3 network (with multipath routing) would ensure there is no such flooding, and ensure you use all the precious capacity in your switches!
Comment #4878270 not loaded
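To make xtacy's point concrete, here is a minimal Python sketch of flood-and-learn forwarding (the `Switch` class, port count, and table capacity are made up for illustration, not GitHub's actual gear): once a switch fails to learn a destination MAC, frames for it are flooded out every port, which is the kind of aggregate traffic that saturates access/aggregation uplinks.

```python
# Hypothetical sketch: when a switch cannot learn a destination MAC, it
# floods the frame out every port, so uplink load grows with the fraction
# of unlearned addresses.

class Switch:
    def __init__(self, ports, mac_table_capacity):
        self.ports = ports
        self.capacity = mac_table_capacity
        self.mac_table = {}          # MAC address -> port

    def learn(self, src_mac, port):
        # Learning silently fails once the table is full -- a stand-in for
        # whatever kept the real switches from learning.
        if src_mac in self.mac_table or len(self.mac_table) < self.capacity:
            self.mac_table[src_mac] = port

    def forward(self, dst_mac, in_port):
        if dst_mac in self.mac_table:
            return [self.mac_table[dst_mac]]              # unicast: one link
        return [p for p in self.ports if p != in_port]    # unknown: flood all

sw = Switch(ports=list(range(48)), mac_table_capacity=2)
for mac, port in [("aa", 0), ("bb", 1), ("cc", 2)]:       # "cc" never learned
    sw.learn(mac, port)

print(len(sw.forward("bb", in_port=5)))   # 1 port  (learned)
print(len(sw.forward("cc", in_port=5)))   # 47 ports (flooded)
```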
akoumjian over 12 years ago
"We experienced 18 minutes of complete unavailability along with sporadic bursts of slow responses and intermittent errors for the entire day."<p>Well, I can say we experienced worse than that. Our private repositories were unavailable starting at 9am until 4pm PST.
Comment #4878178 not loaded
zyztem over 12 years ago
Welcome to the club of STP meltdown survivors!

Unfortunately, large L2 Ethernet networks are not scalable and are prone to episodic catastrophic failure. You can read more here: http://blog.ioshints.info/2012/05/transparent-bridging-aka-l2-switching.html

One way to make an L2 network somewhat stable is to replace many little switches with one big modular chassis with hundreds of ports, like a Cat6500/BlackDiamond.

Or minimize the L2 segments and connect them at L3 (IP routing).
Comment #4878956 not loaded
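A back-of-the-envelope sketch of why zyztem's advice to shrink L2 segments helps (all numbers below are illustrative, not measurements): flood traffic is replicated across the whole L2 domain, so splitting one large domain into routed segments bounds how far any meltdown can spread.

```python
# Illustrative arithmetic only: the flood load an uplink carries scales with
# the size of the L2 domain it sits in. Routing between smaller segments
# shrinks that blast radius.

def flooded_load(hosts_in_domain, per_host_flood_mbps):
    # Every flooded frame is replicated toward every host in the L2 domain.
    return hosts_in_domain * per_host_flood_mbps

one_big_l2  = flooded_load(hosts_in_domain=4000, per_host_flood_mbps=0.5)
per_segment = flooded_load(hosts_in_domain=4000 // 20, per_host_flood_mbps=0.5)

print(f"single L2 domain  : ~{one_big_l2:.0f} Mb/s of flood traffic")
print(f"20 routed segments: ~{per_segment:.0f} Mb/s per segment")
```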
ajtaylor over 12 years ago
This is a great writeup! I love reading these types of postmortems because I always end up learning something new.
ChuckMcM over 12 years ago
Wow, I totally empathize with that pain. Let me guess, you've got switches from Blade Networks (aka IBM)? :-)

Large-scale networking changes like this are so challenging to pull off; one missed cable, one misconfigured switch, and blam! everything is nuts.
Comment #4878854 not loaded
Comment #4878605 not loaded
jauer over 12 years ago
> Last week the new aggregation switches finally arrived and were installed in our datacenter.

It sounds like you rushed these switches into production, maybe with insufficient testing.

There are all kinds of bugs and weird interactions in network hardware and software that cause problems you can't anticipate.

You have got to lab it up and do sanity checks before deploying (referring specifically to the LACP/port-channel problems).
Comment #4878400 not loaded
Comment #4878618 not loaded
raides over 12 years ago
The excuse that the application grew faster than it could be scaled is amateur hour. This entire article makes any true sys engineer cringe.

You are a ~$100 million company and it seems like you drew your systems architecture with crayons. The article is upsetting. The lack of segmentation is embarrassing.

"Oh, it's the switch's fault, it doesn't learn MACs fast enough" - actually, you could subnet your racks and use f*n VLANs. You might use public IPs on everything, but this could still be educational for the company.

Your solution to all of this was to spend twice as much on a "staging" network. Something doesn't seem right here.

It makes me cringe when I see any one sentence that has the following three words in it: escalate, network, vendor.

This isn't a Boeing airplane; you cannot just rely on the vendor. This article just gives me a good sense of job security in the field of sys engineering. I really think they should sit down and really go over their network. A bridge loop like this, at a company this large, is pretty amateur. GitHub, you can do so much better.
Comment #4887694 not loaded
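As a rough illustration of the per-rack segmentation raides is suggesting (the addressing plan, rack count, and VLAN numbering are hypothetical), Python's ipaddress module can carve a site block into one subnet per rack, each mapped to its own VLAN:

```python
# Rough sketch: give each rack its own VLAN and subnet so a single rack's
# L2 misbehaviour cannot flood the whole datacenter.
import ipaddress

site = ipaddress.ip_network("10.20.0.0/16")       # hypothetical allocation
racks = list(site.subnets(new_prefix=24))[:40]    # one /24 per rack, 40 racks

for vlan_id, subnet in enumerate(racks, start=100):
    gateway = next(subnet.hosts())                # first usable address
    print(f"VLAN {vlan_id}: {subnet}  gateway {gateway}")
```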
dkhenry over 12 years ago
[shameless plug]

Hey GitHub, sounds like you need SevOne. You could have diagnosed this issue with one TopN report and been done with it.

[/shameless plug]

Edit: see the following thread for a full explanation.
Comment #4878898 not loaded
dkhenry over 12 years ago
This is the kind of situation that I think screams for OpenFlow [1]. It seems issues like this would be easier to avoid and faster to troubleshoot.

1. http://www.openflow.org/
Comment #4878927 not loaded
Comment #4879591 not loaded
Comment #4879572 not loaded
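A toy sketch of the OpenFlow-style model dkhenry is pointing at (the flow rules and field names are invented for illustration, this is not a real controller): forwarding follows explicit match/action rules installed by a central controller, and traffic with no matching rule is punted to the controller rather than flooded across the fabric.

```python
# Toy match/action flow table: explicit rules replace flood-and-learn, and
# misses go to the controller instead of being flooded everywhere.

flow_table = [
    # (match fields, action)
    ({"dst_mac": "aa:aa:aa:aa:aa:aa"}, {"output": 3}),
    ({"dst_mac": "bb:bb:bb:bb:bb:bb"}, {"output": 7}),
]

def handle_frame(frame):
    for match, action in flow_table:
        if all(frame.get(k) == v for k, v in match.items()):
            return action
    return {"punt_to_controller": True}   # no rule: ask the controller

print(handle_frame({"dst_mac": "aa:aa:aa:aa:aa:aa"}))  # {'output': 3}
print(handle_frame({"dst_mac": "cc:cc:cc:cc:cc:cc"}))  # punted, not flooded
```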
cagenut over 12 years ago
I've got a pet theory that this is going to be a trend over the next few years. A lot of companies GitHub's age were built on the "we misinterpreted devops as noops" attitude, which works great for a few years, but somewhere in the year 3-5 range the entropy and technical debt compound faster than a nonexistent or small/inexperienced ops team can keep up with.
Comment #4879597 not loaded
Comment #4878926 not loaded
Comment #4879040 not loaded
ctime over 12 years ago
This is Cisco Nexus gear with Bridge Assurance enabled, probably 5K to 7K uplinks, IMHO
Comment #4879091 not loaded
brooksbp over 12 years ago
Were you running LACP on the LAGs?
Comment #4878277 not loaded
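For context on brooksbp's question, here is a minimal sketch of how a LACP port-channel typically spreads traffic (the hash inputs and link counts are illustrative): flows are hashed onto member links, so a misconfigured or missing member silently changes where traffic lands.

```python
# Minimal flow-hashing sketch: a port-channel picks a member link by hashing
# flow fields; CRC32 stands in for the switch's hash function.
import zlib

def lag_member(src_mac, dst_mac, active_links):
    key = f"{src_mac}-{dst_mac}".encode()
    return zlib.crc32(key) % active_links

print(lag_member("aa:aa", "bb:bb", active_links=4))
print(lag_member("aa:aa", "bb:bb", active_links=3))  # spread changes if a link drops
```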
hcarvalhoalves over 12 years ago
GitHub is unresponsive today, again.
Comment #4878096 not loaded
Comment #4878140 not loaded