
My Philosophy on Alerting: Observations of a Site Reliability Engineer at Google

573 points · by ismavis · over 10 years ago

23 comments

beat · over 10 years ago
This reminds me of an excellent talk my friend Dan Slimmon gave called "Car Alarms and Smoke Alarms". He relates monitoring to the concepts of sensitivity and specificity in medical testing (http://en.wikipedia.org/wiki/Sensitivity_and_specificity). Sensitivity is about the likelihood that your monitor will detect the error condition. Specificity is about the likelihood that it will *not* create false alarms.

Think about how people react to smoke alarms versus car alarms. When the smoke alarm goes off, people mostly follow the official procedure. When car alarms go off, people ignore them. Why? Car alarms have very poor specificity.

I'd add another layer: car alarms are Not My Problem. But that's just me and not part of Dan's excellent original talk.
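To put numbers on the two terms, here is a minimal sketch (my own illustration, not from Dan's talk or the article) that scores an alert rule against a history of labeled incidents; all counts and names are made up:

```python
# Minimal sketch: sensitivity/specificity of an alert rule, computed from
# counts of true/false positives and negatives. All numbers are illustrative.

def sensitivity(true_positives: int, false_negatives: int) -> float:
    """Fraction of real problems the alert actually fired for."""
    return true_positives / (true_positives + false_negatives)

def specificity(true_negatives: int, false_positives: int) -> float:
    """Fraction of healthy periods the alert correctly stayed quiet for."""
    return true_negatives / (true_negatives + false_positives)

# A "car alarm": catches 9 of 10 real problems, but also fires during
# 400 of 1000 healthy periods -- people learn to ignore it.
print(sensitivity(9, 1), specificity(600, 400))   # 0.9 0.6

# A "smoke alarm": slightly less sensitive, but it almost never cries wolf.
print(sensitivity(8, 2), specificity(990, 10))    # 0.8 0.99
```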
falcolas · over 10 years ago
> Err on the side of removing noisy alerts – over-monitoring is a harder problem to solve than under-monitoring.

Absolutely this. Our team is having more problems with this issue than anything else. However, there are two points which seem to contradict:

    - Pages should be [...] actionable
    - Symptoms should be monitored, not causes

The problem is that you can't act on symptoms, only research them and then act on the causes. If you get an alert that says the DB is down, that's an actionable page - start the DB back up. Whereas being paged that the connections to the DB are failing is far less actionable - you have to spend precious downtime researching the actual cause first. It could be the network, it could be an intermediary proxy, or it could be the DB itself.

Now granted, if you're only catching causes, there is the possibility you might miss something with your monitoring, and if you double up on your monitoring (that is, checking symptoms as well as causes), you could get noise. That said, most monitoring solutions (such as Nagios) include dependency chains, so you get alerted on the cause, and the symptom is silenced while the cause is in an error condition. And if you missed a cause, you still get the symptom alert and can fill your monitoring gaps from there.

Leave your research for the RCA and the follow-up development to prevent future downtime. When stuff is down, an SA's job is to get it back up.
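For the dependency-chain behaviour mentioned above, here is a minimal sketch (my own illustration in Python, not Nagios configuration): a failing check stays quiet if any check it depends on is also failing, so only the cause pages. The check names and the dependency graph are made up.

```python
# Sketch of cause/symptom suppression via declared dependencies.

FAILING = "FAILING"
OK = "OK"

# child check -> the cause it depends on (hypothetical services)
DEPENDS_ON = {
    "db_connections": "db_process",
    "checkout_latency": "db_connections",
}

def should_page(check: str, states: dict[str, str]) -> bool:
    """Page for a failing check only if no upstream cause is also failing."""
    if states.get(check) != FAILING:
        return False
    parent = DEPENDS_ON.get(check)
    while parent is not None:
        if states.get(parent) == FAILING:
            return False          # a cause is already paging; stay quiet
        parent = DEPENDS_ON.get(parent)
    return True

states = {"db_process": FAILING, "db_connections": FAILING, "checkout_latency": FAILING}
pages = [c for c in states if should_page(c, states)]
print(pages)  # ['db_process'] -- the cause pages, the symptoms are silenced
```

If a cause was never instrumented, no suppression applies and the symptom check still pages, which is the "fill your monitoring gaps from there" case.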
praptak · over 10 years ago
Having your application reviewed by SREs who are going to support it is a legendary experience. They have no motivation to be gentle.

It changes the mindset from "Failure? Just log an error, restore some 'good'-ish state and move on to the next cool feature." towards "New cool feature? What possible failures will it cause? How about improving logging and monitoring on our existing code instead?"
ChuckMcM · over 10 years ago
Great writeup. Should be in any operations handbook. One of the challenges I've found has been dynamic urgency, which is to say something is urgent when it first comes up, but once it's known and being addressed it isn't urgent anymore - unless there is something else going on that we don't know about.

Example: you get a server failure which affects a service, and you begin working on replacing that server with a backup, but a switch is also dropping packets, so you are getting alerts on degraded service (the symptom) but believe you are fixing the cause (the down server), when in fact you will still have a problem after the server is restored. So my challenge is figuring out how to alert on that additional input in a way that folks won't just say "oh yeah, this service, we're working on it already."
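One way to attack that (a sketch of my own, not something from the article): let an acknowledged incident suppress repeat symptom pages, but re-check the suppression the moment the known cause is marked resolved. If the symptom is still firing at that point, page again, because the known cause evidently wasn't the whole story. The names and fields below are invented for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Incident:
    cause: str                      # e.g. "server down" (hypothetical)
    resolved: bool = False
    suppressed_symptoms: set[str] = field(default_factory=set)

def handle_symptom(symptom: str, firing: bool, incident: Incident) -> str:
    if not firing:
        return "ok"
    if not incident.resolved:
        incident.suppressed_symptoms.add(symptom)
        return "suppressed (known incident in progress)"
    # Cause fixed but symptom still firing: there is a second cause
    # (the packet-dropping switch), so this must page as a new event.
    return "PAGE: symptom persists after known cause resolved"

inc = Incident(cause="server down")
print(handle_symptom("service degraded", True, inc))   # suppressed
inc.resolved = True                                     # backup server is up
print(handle_symptom("service degraded", True, inc))   # pages again
```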
jakozaur · over 10 years ago
That's a harder problem than I originally realized. It's easy to write noisy alerts, and super easy to not have them at all (or to not catch some issues).

It's hard to tune them so the signal-to-noise ratio is high.
jonbarker · over 10 years ago
Where I work, at a mobile ad network, they put everyone on call on a rotating basis even if they are not devops or server engineers. We use PagerDuty and it works well. Since there is always a primary and a secondary on-call person, and the company is pretty small and technical, everyone feels "responsible" during their shifts, and at least one person is capable of handling rare, catastrophic events. I often wonder which is more important: good docs on procedures for failure modes, or a heightened sense of responsibility. A good analogy may be commercial airline pilots. They can override the autopilot, but I am told they rarely do. The safest airlines are good at maintaining their heightened sense of vigilance despite the lack of the need for it 99.999% of the time.
leef · over 10 years ago
"If you want a quiet oncall rotation, it's imperative to have a system for dealing with things that need timely response, but are not imminently critical."

This is an excellent point that is missed in most monitoring setups I've seen. A classic example is some request that kills your service process. You get paged for that, so you wrap the service in a supervisor-like daemon. The immediate issue is fixed and, typically, any future causes of the service process dying are hidden unless someone happens to be looking at the logs one day.

I would love to see smart ways to surface "this will be a problem soon" in alerting systems.
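One low-tech way to surface that, as a minimal sketch (my own illustration; the threshold, window, and file_ticket() destination are all assumptions): count the supervisor's restarts and file a ticket - not a page - once the rate over the last day looks like a trend rather than a one-off.

```python
import time
from collections import deque

RESTART_WINDOW_SECONDS = 24 * 3600
TICKET_THRESHOLD = 3                 # restarts/day worth a human's attention

restart_times: deque[float] = deque()

def file_ticket(summary: str) -> None:
    # stand-in for whatever ticket queue you use
    print(f"TICKET: {summary}")

def record_restart(now=None) -> None:
    now = time.time() if now is None else now
    restart_times.append(now)
    # drop restarts that fell out of the 24h window
    while restart_times and restart_times[0] < now - RESTART_WINDOW_SECONDS:
        restart_times.popleft()
    if len(restart_times) >= TICKET_THRESHOLD:
        file_ticket(f"service restarted {len(restart_times)} times in 24h; "
                    "investigate before it becomes an outage")

for hour in (1, 5, 9):               # three crashes in one morning
    record_restart(now=hour * 3600.0)
```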
0xbadcafebee · over 10 years ago
Most of this appears to be just end-to-end testing, and whether you're alerting on a failure when testing the entire application stack or just individual components. He probably got paged by too many individual alerts versus an actual failure to serve data, which I agree would be annoying.

In a previous position, we had a custom ticketing system that was designed to also be our monitoring dashboard. Alerts that were duplicates would become part of a thread, and each was either its own ticket or part of a parent ticket. Custom rules would highlight or reassign parts of the dashboard, so critical recurrent alerts were promoted higher than urgent recurrent alerts, and none would go away until they had been addressed and closed with a specific resolution log. The whole thing was designed so a single NOC engineer at 3am could close thousands of alerts per minute while logging the reason why, and keep them from recurring if it was a known issue. The NOC guys even created a realtime console version so they could use a keyboard to close tickets with predefined responses just seconds after they were opened.

The only paging we had was when our end-to-end tests showed downtime for a user - alerts generated by some paid service providers who test your site around the globe. We caught issues before they happened by having rigorous metric-trending tools.
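A minimal sketch (my own reconstruction, not the actual system described) of that threading behaviour: an alert whose fingerprint matches an open ticket joins that ticket's thread instead of opening a new one, and a ticket only leaves the board once a resolution is logged. All field names are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Ticket:
    fingerprint: str
    occurrences: int = 0
    resolution: str | None = None
    thread: list[str] = field(default_factory=list)

open_tickets: dict[str, Ticket] = {}

def ingest_alert(fingerprint: str, message: str) -> Ticket:
    """Append duplicate alerts to the existing ticket's thread."""
    ticket = open_tickets.get(fingerprint)
    if ticket is None:
        ticket = Ticket(fingerprint=fingerprint)
        open_tickets[fingerprint] = ticket
    ticket.occurrences += 1
    ticket.thread.append(message)
    return ticket

def close_ticket(fingerprint: str, resolution: str) -> None:
    # a ticket only leaves the dashboard once a resolution is logged
    open_tickets.pop(fingerprint).resolution = resolution

ingest_alert("disk_full:web-17", "disk 91% full")
t = ingest_alert("disk_full:web-17", "disk 93% full")
print(t.occurrences)                     # 2 -- one thread, not two tickets
close_ticket("disk_full:web-17", "rotated logs; added trending graph")
```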
shackattack · over 10 years ago
Thanks for posting this! I'm on the product team at PagerDuty, and this lines up with a lot of our thinking on how to effectively design alerting + incident response. I love the line "Pages should be urgent, important, actionable, and real."
gk1 · over 10 years ago
Here's another good writeup on effective alerting, by a former Google Staff Engineer: http://blog.scalyr.com/2014/08/99-99-uptime-9-5-schedule/
Someone1234 · over 10 years ago
Why does a company the size of Google even have on-call rotations? Shouldn't they have 24/7 shifts of reliability engineers who can manually call in additional people as and when they're needed?

I can totally understand why SMBs have rotations. They have fewer staff. But a monster corporation? This seems like lame penny-pinching. Heck, for the amount of effort they're clearly putting into automating these alerts, they could likely use the same wage-hours to just hire someone else for a shift. Heck, with an international company like Google they could have UK-based staff monitoring US-based sites overnight and vice versa. Keep everyone on 9-5 and still get 24-hour engineers at their desks.
ecaron · over 10 years ago
Here's the link to it as a PDF for anyone else wanting a printable copy to pin to their wall: https://docs.google.com/document/export?format=pdf&id=199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q
AloisReitbauer · over 10 years ago
Good article. Alerting systems unfortunately are still at the same level they were decades ago. Today we work in highly distributed environments that scale dynamically, and finding symptoms is a key problem. That is why a lot of people alert on causes or anomalies. In reality they should just detect them and log them for further dependency analysis once a real problem is found. We, for example, differentiate between three levels of alerts: infrastructure only, application services, and users. Our approach is to have NO alerts at all but to monitor a ton of potential anomalies. Once these anomalies have user impact, we report back problem dependencies.

If you are interested, you can also get my point of view from my Velocity talk on monitoring without alerts: https://www.youtube.com/watch?v=Gqqb8zEU66s. If you are interested, also check out www.ruxit.com and let me know what you think of our approach.
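Sketching that approach (my own illustration, not the actual product): anomalies at the infrastructure and service levels are recorded quietly, and a problem is only opened - with the recent anomalies attached as candidate dependencies - once a user-level signal degrades. All names and levels below are assumptions.

```python
from dataclasses import dataclass
from typing import Literal

Level = Literal["infrastructure", "service", "user"]

@dataclass
class Anomaly:
    level: Level
    description: str

recent_anomalies: list[Anomaly] = []

def observe(anomaly: Anomaly) -> None:
    recent_anomalies.append(anomaly)          # logged, never paged
    if anomaly.level == "user":
        open_problem(anomaly)                 # only user impact opens a problem

def open_problem(trigger: Anomaly) -> None:
    related = [a for a in recent_anomalies if a.level != "user"]
    print(f"PROBLEM: {trigger.description}")
    for a in related:
        print(f"  possible dependency [{a.level}]: {a.description}")

observe(Anomaly("infrastructure", "host cpu saturation on node-4"))
observe(Anomaly("service", "checkout service error rate up 3x"))
observe(Anomaly("user", "checkout latency above 2s for 5% of users"))
```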
icco · over 10 years ago
This is huge. One of the big benefits dev teams get from bringing an SRE team onto their project is learning things like this and how to run a sustainable on-call rotation.
dhpe · over 10 years ago
My startup http://usetrace.com is a web monitoring (+ regression testing) tool with the "monitor for your users" philosophy mentioned in Rob's article. Monitoring is done at the application/feature level -> alerts are always about a feature visible to the users.
omouse · over 10 years ago
This was very informative. I like the idea of monitoring symptoms that are user-facing rather than causes, which are devops/sysadmin/dev-facing. I'm just thankful that my next project doesn't require pager duty.
annnnd · over 10 years ago
Can't access the site - seems like there's some quota on docs.google.com... Does anyone have a cached version? (WebArchive can't crawl it due to robots.txt)
0xdeadbeefbabe · over 10 years ago
So I guess the author uses a smartphone as a pager, but given his passion for uptime, reliability, latency, etc., I wonder if he has experimented with an actual pager.
sabmd · over 10 years ago
"Any alert should be for a good cause" sounds right to me.
lalc · over 10 years ago
I just want to say: HN is bursting with great articles today.
wanted_ · over 10 years ago
Great article @robewaschuk :)

-- Marcin, former Google SRE
zubairismail · over 10 years ago
In today's world, 90% of bloggers rely on Google for their living.
djclegit · over 10 years ago
Very cool