A small tilt like that wouldn't impact most water cooling systems. Most cooling systems would run with a 45 degree tilt, and some are completely sealed and would work any way up.<p>I suspect this <i>isn't</i> a water cooling problem, but instead a problem with a heat pipe system, which has a phase-change working fluid inside (often butane). Heat pipes are used to conduct heat from CPUs to heatsinks in most laptops. They contain a liquid which boils at the hot end, condenses at the other end of the pipe, and then flows back again as liquid. Heat pipes usually look like a thick bar of copper, but are in fact way more thermally conductive than copper.<p>The inside of the pipe usually has a felt-like material to 'wick' the liquid from the wet end to the dry end, but wicking is quite slow compared to keeping the pipe perfectly level and just letting gravity carry the liquid back downhill.<p>I'm 99% sure that's the reason this system doesn't work with a slight slope.
We had a very similar situation recently where the colocation facility had replaced some missing block-out panels in a rack and it caused the top-of-rack switches to recirculate hot air... the system temps of both switches were north of 100°C and, to their credit (dell/force10 s4820Ts), they ran flawlessly and didn't degrade any traffic, sending the appropriate notices to the reporting email.<p>Something as benign as that can take out an entire infra if unchecked.<p>I've seen racking fail in the past (usually someone loading a full 42U with high-density disk systems, putting the rack over its weight capacity by a notable factor) and it is definitely a disaster situation. One datacenter manager recounted a story of a rack falling through the raised flooring back in the days when that was the standard (surprise: it kept running until noticed by a tech on a walkthrough).<p>Good story, but it comes across as back-patting a bit.
The most epic bug I've seen investigated at Google was a CPU bug in the CPUs that Intel made specifically for Google at the time. Only some chips had the bug, only some cores on those chips had it, and to make matters worse it would only manifest non-deterministically. I don't recall the full details, nor would I share them even if I did, but what struck me is that this mindset ("at Google, one-in-a-million events happen all the time") and the software engineering practices that go with it (checksumming all over the place, overbuilding for fault tolerance and data integrity) were basically the only reason this bug was identified and fixed. In a company with a less capable engineering staff it would linger forever, silently corrupting customer data, and the company would be totally powerless to do anything about it. In fact, they probably wouldn't even notice it until it was far too late. At Google, not only did they notice it right away, they were also able to track it all the way down to the hardware (something software engineers typically treat as infallible) in a fairly short amount of time.
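The checksumming point generalizes well beyond Google-scale infrastructure. Here's a minimal sketch of the "checksum everything" mindset, assuming a toy record format; the function names and layout are made up for illustration:

```python
import zlib

def pack_record(payload: bytes) -> bytes:
    # Store a CRC32 next to the payload so silent corruption
    # (bad RAM, a bad disk, or a misbehaving CPU core) is detectable on read.
    crc = zlib.crc32(payload)
    return crc.to_bytes(4, "big") + payload

def unpack_record(blob: bytes) -> bytes:
    stored_crc = int.from_bytes(blob[:4], "big")
    payload = blob[4:]
    if zlib.crc32(payload) != stored_crc:
        # At "one-in-a-million happens all the time" scale, this branch
        # fires often enough to show up on a dashboard and get investigated.
        raise ValueError("checksum mismatch: possible silent corruption")
    return payload

blob = pack_record(b"user data")
assert unpack_record(blob) == b"user data"
```

It's exactly this kind of cheap, pervasive verification that turns "mystery flakiness" into a signal you can trace back to a specific machine, and eventually a specific core.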
I wonder why it took the actual software running on the machine failing, with user-facing consequences, for them to notice that something was wrong. With all that bragging about how good they are, why didn't they have alerts that would let them know the temperature was higher than normal <i>before</i> it got to a level critical enough to affect software operation?
This was a fun story. Could've been told in a more engaging way, but still a good read.<p>I can't imagine a better way the phrase "bottom of the Google stack" could have been used. That phrase can now be retired.
> In this event, an SRE on the traffic and load balancing team was alerted that some GFEs (Google front ends) in Google's edge network, which statelessly cache frequently accessed content, were producing an abnormally high number of errors<p>What does "statelessly cache" mean? Stateless means it has no state, and cache means it saves frequently requested operations. How can it save anything without state?
This is great work, but I feel like the wrong thing triggered the alert.<p>The temperature spike should have been the first alert. The throttling should have been the second alert. The high error count should have been third.<p>If you're thermal throttling, you have many problems that could give all sorts of puzzling indications.
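To make the ordering concrete, here's a rough sketch of layered alert thresholds; the metric names and limits are invented for illustration, not anything Google actually uses:

```python
# Hypothetical escalation ladder: each signal should fire before the next
# one becomes the first symptom anyone notices.
THRESHOLDS = [
    ("cpu_temp_celsius",     85.0,  "temperature above normal"),   # first warning
    ("thermal_throttle_pct",  1.0,  "CPU is thermal throttling"),  # second warning
    ("request_error_rate",    0.01, "user-visible errors"),        # last resort
]

def evaluate(metrics: dict) -> list[str]:
    """Return the alerts that would fire for one machine, in escalation order."""
    return [message for name, limit, message in THRESHOLDS
            if metrics.get(name, 0.0) > limit]

# A machine that is already throttling but not yet serving errors
# should have paged twice before anyone ever sees an error graph.
print(evaluate({"cpu_temp_celsius": 92.0, "thermal_throttle_pct": 30.0}))
```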
The most interesting part of this piece is how strongly it implies that the BGP announcement process in question is single-threaded. The "overloaded" machine is showing utilization just below 1 CPU, while the normal ones are loafing along at < 10% of 1 CPU.
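If you wanted to confirm that kind of single-threaded saturation on a box you control, one rough approach (sketched with psutil; the threshold is a placeholder) is to look for processes pinned near 100% of exactly one core on a multi-core machine:

```python
import psutil

def find_single_core_hogs(threshold: float = 90.0):
    """Flag processes stuck near 100% of one core on a multi-core machine,
    the classic signature of a single-threaded bottleneck."""
    procs = list(psutil.process_iter(["pid", "name"]))
    for p in procs:
        p.cpu_percent(None)            # prime the per-process counters
    psutil.cpu_percent(interval=1.0)   # block for a one-second sample window
    hogs = []
    for p in procs:
        try:
            usage = p.cpu_percent(None)  # % of one core; >100 means multithreaded
        except psutil.NoSuchProcess:
            continue
        if threshold <= usage <= 110.0 and psutil.cpu_count() > 1:
            hogs.append((p.info["pid"], p.info["name"], usage))
    return hogs

print(find_single_core_hogs())
```

A process sitting at ~95-100% while the rest of the machine idles is exactly the "just below 1 CPU" pattern in the graphs.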
So they didn't monitor the temperature of all systems by default to catch cooling problems?<p>Sounds more embarrassing to me.<p>Those are fairly basic mistakes.
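Host-level temperature monitoring doesn't even need special tooling on most Linux boxes. A bare-bones sketch with psutil (the 80°C limit is an arbitrary example, and sensor availability varies by platform):

```python
import psutil

TEMP_LIMIT_C = 80.0  # example threshold, not a recommendation

def check_temps(limit: float = TEMP_LIMIT_C) -> list[str]:
    """Return a warning for every sensor reading above the limit (Linux only)."""
    warnings = []
    for chip, readings in psutil.sensors_temperatures().items():
        for r in readings:
            if r.current is not None and r.current > limit:
                warnings.append(f"{chip}/{r.label or 'sensor'}: {r.current:.1f}C")
    return warnings

for warning in check_temps():
    print("ALERT:", warning)
```

Fleet-wide, the same idea is just this check exported as a metric and alerted on centrally.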
It's cool to see them track down the errors like this, but I'd like to point out some weird things along the way:<p>* why do the racks have wheels at all? Doesn't seem like a standard build, and it turns out to be risky<p>* there should be at least daily checks on the data center, including a visual inspection of cooling systems and the like. I don't know if daily visual inspection of racks is also a thing, but that should find a problem like this pretty quickly.<p>* monitoring temperatures in a data center is pretty essential, though I must admit I don't know enough to say whether rack-level temperature monitoring would have caught the overheating of CPUs in the rack.
This is a fun read, but it should read as a description of a fairly common method of developing products and remediating "deviations". A lot of words are spent on describing what is essentially a root-cause analysis (see GxP, ISO 9000, fishbone diagram). Hopefully you thought "check", "check", "check" as you read it.<p>If you thought "couldn't Bob have shimmed it with a block of plywood?", you might want to read up on continuous improvement. Have Bob put the shim in to fix the problem right quick, then start up the continuous-improvement engine...
Why doesn't Google have a visualization system for the temperature of the racks, with monitoring and alarms? It seems the problem went undetected because the SREs did a poor job to begin with .... No DC has more than a few tens of thousands of machines, which is easily handled by 3 borgmon or 100 monarch machines LOL ...
Another way to tell the same story:<p>"Someone at Google bought cheap casters designed to hold up an office table, and put them in the bottom of a server rack that weighed half a tonne. They failed. Tens of thousands of dollars were spent replacing them all"
this is awesome. one of the by-products of the public cloud era is losing the habit of considering the physical side of things as part of operational art.