A small tilt like that wouldn't impact most water cooling systems. Most cooling systems would run with a 45 degree tilt, and some are completely sealed and would work any way up.<p>I suspect this <i>isn't</i> a water cooling problem, but instead a problem with a heat pipe system, which has a phase-change working fluid inside (often butane). Heat pipes are used to conduct heat from CPUs to heatsinks in most laptops. They contain a liquid which boils at the hot end, condenses at the other end of the pipe, and then flows back again as liquid. Heat pipes usually look like a thick bar of copper, but are in fact way more thermally conductive than copper.<p>The inside of the pipe usually has a felt-like material to 'wick' the liquid from the wet end to the dry end, but wicking is quite slow compared to keeping the pipe perfectly level and just letting gravity carry the liquid back downhill.<p>I'm 99% sure that's the reason this system doesn't work with a slight slope.
We had a very similar situation recently where the colocation facility had replaced some missing block-out panels in a rack and it caused the top-of-rack switches to recirculate hot air... the system temps of both switches were north of 100°C and, to their credit (dell/force10 s4820Ts), they ran flawlessly and didn't degrade any traffic, sending the appropriate notices to the reporting email.<p>Something as benign as that can take out an entire infra if unchecked.<p>I've seen racking fail in the past (usually someone loading a full 42U with high-density disk systems, putting the rack over its weight capacity by a notable factor) and it is definitely a disaster situation. One datacenter manager recounted a story of a rack falling through the raised flooring back in the days when that was the standard (surprise: it kept running until noticed by a tech on a walkthrough).<p>Good story, but it comes across as back-patting a bit.
The most epic bug I've seen investigated at Google was a CPU bug in the CPUs that Intel made specifically for Google at the time. Only some chips had the bug, only some cores on those chips had it, and to make matters worse it would only manifest non-deterministically. I don't recall the full details, nor would I share them even if I did, but what struck me is that this mindset ("at Google, one-in-a-million events happen all the time") and the software engineering practices that go with it (checksumming all over the place, overbuilding for fault tolerance and data integrity) were basically the only reason this bug was identified and fixed. In a company with a less capable engineering staff it would linger forever, silently corrupting customer data, and the company would be totally powerless to do anything about it. In fact, they probably wouldn't even notice it until it was far too late. At Google, not only did they notice it right away, they were also able to track it all the way down to the hardware (something software engineers typically treat as infallible) in a fairly short amount of time.
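The checksumming point generalizes well beyond Google-scale infrastructure. Here's a minimal sketch of the "checksum everything" mindset, assuming a toy record format; the function names and layout are made up for illustration:

```python
import zlib

def pack_record(payload: bytes) -> bytes:
    # Store a CRC32 next to the payload so silent corruption
    # (bad RAM, a bad disk, or a misbehaving CPU core) is detectable on read.
    crc = zlib.crc32(payload)
    return crc.to_bytes(4, "big") + payload

def unpack_record(blob: bytes) -> bytes:
    stored_crc = int.from_bytes(blob[:4], "big")
    payload = blob[4:]
    if zlib.crc32(payload) != stored_crc:
        # At "one-in-a-million happens all the time" scale, this branch
        # fires often enough to show up on a dashboard and get investigated.
        raise ValueError("checksum mismatch: possible silent corruption")
    return payload

blob = pack_record(b"user data")
assert unpack_record(blob) == b"user data"
```

It's exactly this kind of cheap, pervasive verification that turns "mystery flakiness" into a signal you can trace back to a specific machine, and eventually a specific core.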
I wonder why it took the actual software running on the machine failing, with user-facing consequences, for them to notice that something was wrong. With all that bragging about how good they are, why didn't they have alerts that would let them know the temperature was higher than normal <i>before</i> it got to a level critical enough to affect software operation?
This was a fun story. Could've been told in a more engaging way, but still a good read.<p>I can't imagine a better way the phrase "bottom of the Google stack" could have been used. That phrase can now be retired.
> In this event, an SRE on the traffic and load balancing team was alerted that some GFEs (Google front ends) in Google's edge network, which statelessly cache frequently accessed content, were producing an abnormally high number of errors<p>What does "statelessly cache" mean? Stateless means it has no state, and cache means it saves frequently requested operations. How can it save anything without state?
This is great work, but I feel like the wrong thing triggered the alert.<p>The temperature spike should have been the first alert. The throttling should have been the second alert. The high error count should have been third.<p>If you're thermal throttling, you have many problems that could give all sorts of puzzling indications.
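To make the ordering concrete, here's a rough sketch of layered alert thresholds; the metric names and limits are invented for illustration, not anything Google actually uses:

```python
# Hypothetical escalation ladder: each signal should fire before the next
# one becomes the first symptom anyone notices.
THRESHOLDS = [
    ("cpu_temp_celsius",     85.0,  "temperature above normal"),   # first warning
    ("thermal_throttle_pct",  1.0,  "CPU is thermal throttling"),  # second warning
    ("request_error_rate",    0.01, "user-visible errors"),        # last resort
]

def evaluate(metrics: dict) -> list[str]:
    """Return the alerts that would fire for one machine, in escalation order."""
    return [message for name, limit, message in THRESHOLDS
            if metrics.get(name, 0.0) > limit]

# A machine that is already throttling but not yet serving errors
# should have paged twice before anyone ever sees an error graph.
print(evaluate({"cpu_temp_celsius": 92.0, "thermal_throttle_pct": 30.0}))
```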
The most interesting part of this piece is how strongly it implies that the BGP announcement process in question is single-threaded. The "overloaded" machine is showing utilization just below 1 CPU, while the normal ones are loafing along at < 10% of 1 CPU.
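If you wanted to confirm that kind of single-threaded saturation on a box you control, one rough approach (sketched with psutil; the threshold is a placeholder) is to look for processes pinned near 100% of exactly one core on a multi-core machine:

```python
import psutil

def find_single_core_hogs(threshold: float = 90.0):
    """Flag processes stuck near 100% of one core on a multi-core machine,
    the classic signature of a single-threaded bottleneck."""
    procs = list(psutil.process_iter(["pid", "name"]))
    for p in procs:
        p.cpu_percent(None)            # prime the per-process counters
    psutil.cpu_percent(interval=1.0)   # block for a one-second sample window
    hogs = []
    for p in procs:
        try:
            usage = p.cpu_percent(None)  # % of one core; >100 means multithreaded
        except psutil.NoSuchProcess:
            continue
        if threshold <= usage <= 110.0 and psutil.cpu_count() > 1:
            hogs.append((p.info["pid"], p.info["name"], usage))
    return hogs

print(find_single_core_hogs())
```

A process sitting at ~95-100% while the rest of the machine idles is exactly the "just below 1 CPU" pattern in the graphs.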
So they didn't monitor the temperature of all systems by default to catch cooling problems?<p>Sounds more embarrassing to me.<p>Those are fairly basic mistakes.
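Host-level temperature monitoring doesn't even need special tooling on most Linux boxes. A bare-bones sketch with psutil (the 80°C limit is an arbitrary example, and sensor availability varies by platform):

```python
import psutil

TEMP_LIMIT_C = 80.0  # example threshold, not a recommendation

def check_temps(limit: float = TEMP_LIMIT_C) -> list[str]:
    """Return a warning for every sensor reading above the limit (Linux only)."""
    warnings = []
    for chip, readings in psutil.sensors_temperatures().items():
        for r in readings:
            if r.current is not None and r.current > limit:
                warnings.append(f"{chip}/{r.label or 'sensor'}: {r.current:.1f}C")
    return warnings

for warning in check_temps():
    print("ALERT:", warning)
```

Fleet-wide, the same idea is just this check exported as a metric and alerted on centrally.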
It's cool to see them track down the errors like this, but I'd like to point out some weird things along the way:<p>* why do the racks have wheels at all? Doesn't seem like a standard build, and it turns out to be risky<p>* there should be at least daily checks on the data center, including a visual inspection of cooling systems and the like. I don't know if daily visual inspection of racks is also a thing, but that should find a problem like this pretty quickly.<p>* monitoring temperatures in a data center is pretty essential, though I must admit I don't know enough to say whether rack-level temperature monitoring would have caught the overheating of CPUs in the rack.
This is a fun read, but it should read as a description of a fairly common method of developing products and remediating "deviations". A lot of words are spent on describing what is essentially a root-cause analysis (see GxP, ISO 9000, fishbone diagram). Hopefully you thought "check", "check", "check" as you read it.<p>If you thought "couldn't Bob have shimmed it with a block of plywood?", you might want to read up on continuous improvement. Have Bob put the shim in to fix the problem right quick, then start up the continuous-improvement engine...
Why doesn't Google have a visualization system for the temperature of the racks, with monitoring and alarms? It seems the problem went undetected because the SREs did a poor job to begin with .... No DC has more than a few tens of thousands of machines, which is easily handled by 3 borgmon or 100 monarch machines LOL ...
Another way to tell the same story:<p>"Someone at Google bought cheap casters designed to hold up an office table, and put them in the bottom of a server rack that weighed half a tonne. They failed. Tens of thousands of dollars were spent replacing them all"
this is awesome. one of the by-products of the public cloud era is losing the habit of considering the physical side of things as part of operational art.