Car alarms and smoke alarms: tradeoff between sensitivity and specificity (2012)

134 points by lngarner about 2 years ago

14 comments

raldi about 2 years ago
When the oncall gets paged, an SLO should be in jeopardy in a way that requires immediate measures to be taken by a well-trained human, as described in actionable terms in a linked playbook.

No SLO in jeopardy, or no immediate measure that needs to be taken? Don't page the oncall; send a low-priority ticket for the service owner to investigate the next business day.

Steps need to be taken, but they're mechanical in nature or otherwise don't give the SRE an opportunity to exercise their brain in an interesting fashion? Replace the alert with an automated handler that only pages the oncall if it encounters an exception.

No playbook, or the playbook consists of useless non-actionable items like, "This alert means the service is running out of frobs"? Write a playbook that explains what the oncall is expected to *do* when the service needs frobs.

Edit: A dead reply asks if I've ever experienced a novel incident. Of course. Say, for instance, a "This should never happen" error-level log is suddenly happening like crazy, for the first time ever. In that case, you page the oncall, they do their best to debug it, see if they can reach the SWE service owners, read through the code to see if it could be an indicator that SLOs are being violated (e.g., user data corruption) or might be violated soon, and then write a stub playbook to be fleshed out the next business day, probably alongside a code change to handle this situation without spamming the logs so much.
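The "automated handler" step described above can be sketched in a few lines of Python. This is a minimal illustration of the pattern, not any real system; all the helper names (add_frobs, file_ticket, page_oncall) are hypothetical stand-ins:

```python
# Sketch of the pattern above: run the mechanical remediation automatically
# and page a human only if the automation itself fails.
# All helpers here are hypothetical stand-ins, not a real API.

def add_frobs(service: str) -> None:
    """Hypothetical: the mechanical remediation step from the playbook."""

def file_ticket(service: str, note: str, priority: str = "low") -> None:
    """Hypothetical: record what happened for next-business-day review."""

def page_oncall(service: str, reason: str) -> None:
    """Hypothetical: escalate to the human oncall."""

def handle_low_frobs_alert(service: str) -> None:
    try:
        add_frobs(service)
        file_ticket(service, note="auto-remediated low-frobs alert")
    except Exception as exc:
        # The automation hit something it couldn't handle; this is the
        # only case where waking a well-trained human is justified.
        page_oncall(service, reason=f"auto-remediation failed: {exc}")
```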
sammalloy about 2 years ago
The entire car alarm industry is a scam, promoted by Republican congressman Darrell Issa. It has seriously disrupted our lives in every way imaginable and has drowned out the beauty of nature. I can't think of a single car that has been protected by a car alarm since they were invented. They are useless and should be banned, on health and safety grounds, to mitigate noise pollution.
dfox about 2 years ago
Smoke alarms (and fire alarms in general) are not a good example of a thing with high specificity. You perceive them that way, but what you see is the result of somebody getting paged about the alarm and then checking (preferably physically, but also through e.g. CCTV) whether there really is an emergency, and canceling it before its escalation timeout. Apparently, for a typical commercial building, false fire alarms are more or less a weekly occurrence.

Edit: in large-scale fire alarm systems there are also rules about combinations of triggered sensors that cause immediate escalation (if there is smoke and elevated temperature in two adjacent zones, it probably is not a false alarm, and so on; often this even takes into account the failure modes of the physical alarm loop wiring). This is an interesting idea for IT monitoring: page someone only when multiple metrics indicate an issue.
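That corroborating-sensors rule translates naturally into monitoring logic. A minimal sketch, with made-up signal names:

```python
# Sketch of the "combination of triggered sensors" idea for IT monitoring:
# escalate immediately only when independent signals agree; a single
# signal waits for its normal escalation timeout. Signal names are made up.

def should_page_immediately(signals: dict[str, bool]) -> bool:
    """Page only when at least two independent signals indicate a problem,
    e.g. error rate AND latency, rather than either one alone."""
    return sum(signals.values()) >= 2

signals = {
    "error_rate_high": True,
    "latency_high": True,
    "cpu_saturated": False,
}

if should_page_immediately(signals):
    print("page the oncall: multiple corroborating signals")
else:
    print("hold: single signal, wait for the escalation timeout")
```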
mertd about 2 years ago
The post is somewhat incomplete without also discussing the cost of the wrong decision.

You obey the smoke alarm because the cost of ignoring the alarm when it is a true positive is potentially infinite (you die). You ignore the car alarm because (1) most likely it is a false positive, but also (2) most likely it is somebody else's car.
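This is just expected-value arithmetic; a toy calculation (with made-up numbers) makes the asymmetry concrete:

```python
# Toy expected-cost calculation for the point above: the rational response
# depends on both the alarm's chance of being real and the cost to you of
# ignoring a real one. All numbers are made up for illustration.

def expected_cost_of_ignoring(p_real: float, cost_to_you_if_real: float) -> float:
    return p_real * cost_to_you_if_real

# Smoke alarm: even a 1% chance of a real fire is intolerable when the
# downside is effectively unbounded.
print(expected_cost_of_ignoring(0.01, 10_000_000))  # 100000.0 -> get out

# Car alarm: a real theft is equally unlikely, and it's not your car anyway.
print(expected_cost_of_ignoring(0.01, 0))           # 0.0 -> keep walking
```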
compumike about 2 years ago
I do like how the author presents the case for how damaging false positives can be in SRE monitoring. But, FYI, it can get worse if these monitors are hooked to self-actuating feedback loops! I recently wrote about a production incident on the Heii On-Call blog, in the context of witnessing how Kubernetes liveness probes and CPU limits worked together to create a self-reinforcing CrashLoopBackOff [1], partially because the liveness probe thresholds (the timeoutSeconds and failureThreshold fields) were too aggressive.

We have a similar message about setting monitoring thresholds in our documentation [2], because users have to explicitly specify a downtime timeout before they're alerted about their website / API endpoint / cron job being down. The timeout / "grace period" is necessary because in many cases a failure is some transient network glitch which will fix itself before a human is alerted.

If you make the timeout too short, you'll get lots of false positive alerts, and, as the article says, your on-call engineers will be overwhelmed or will just start ignoring the alerts.

If you make the timeout too long, it just takes that many minutes of downtime longer before you find out about it.

It may sound counterintuitive, but the latter is usually preferable. :)

[1] https://heiioncall.com/blog/kubernetes-liveness-probes-and-cpu-limits-risks-self-reinforcing-crashloopbackoff

[2] https://heiioncall.com/docs
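The grace-period behavior described here can be sketched as a tiny state machine. This is an illustrative sketch only, not Heii On-Call's actual implementation:

```python
# Sketch of a "grace period" / downtime timeout: only alert once a check
# has been failing continuously for longer than the configured timeout,
# so a transient glitch that fixes itself never pages anyone.

import time

class DowntimeMonitor:
    def __init__(self, grace_period_s: float = 300.0):
        self.grace_period_s = grace_period_s
        self.failing_since: float | None = None

    def record(self, check_ok: bool, now: float | None = None) -> bool:
        """Feed in one check result; returns True when a human should be alerted."""
        now = now if now is not None else time.monotonic()
        if check_ok:
            self.failing_since = None      # recovered on its own: reset
            return False
        if self.failing_since is None:
            self.failing_since = now       # first failure: start the clock
        return now - self.failing_since >= self.grace_period_s
```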
cbarrick about 2 years ago
I think this article is missing the forest for the trees.

The article is about finding the appropriate sensitivity of alerts on some signal in order to maximize the predictive value.

But you should care more about the quality of the signals you are monitoring than about the sensitivity of your thresholds.

The article mentions load average as an example signal, but to me, that's a poor signal to monitor. Instead, if your SLO is defined for error rate, alert on error rate.

Alerts on your SLO will have a high predictive value for predicting violations of your SLO, by definition. The tunable parameter here is the time window, not the threshold. E.g. if your error budget is defined for a 30d window, you may want alerts at the SLO threshold for 24h and 1h windows.

Alert on symptoms, not causes.
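A rough sketch of this multi-window SLO alerting follows; the 14.4x fast-window burn rate is the commonly cited example from Google's SRE workbook, and all the other numbers are illustrative:

```python
# Sketch of multi-window SLO burn-rate alerting: given a 30d error budget,
# page only when both a fast and a slow window are consuming budget well
# above the sustainable rate. Thresholds here are illustrative.

def burn_rate(error_rate: float, slo_target: float) -> float:
    """How many times faster than 'sustainable' we're spending budget.
    Budget = 1 - slo_target; a burn rate of 1.0 spends exactly the whole
    budget over the full SLO window."""
    return error_rate / (1.0 - slo_target)

SLO = 0.999  # 99.9% over 30 days -> 0.1% error budget

def should_page(err_1h: float, err_24h: float) -> bool:
    # Requiring both windows to be hot filters out brief spikes while the
    # slow window alone would react too late. 14.4 is the workbook's
    # fast-window example; the slow-window threshold is illustrative.
    return burn_rate(err_1h, SLO) >= 14.4 and burn_rate(err_24h, SLO) >= 3.0

print(should_page(err_1h=0.02, err_24h=0.005))  # True: page
print(should_page(err_1h=0.02, err_24h=0.001))  # False: spike, hold
```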
rtkwe about 2 years ago
It's a constant pain of mine trying to get people to stop sending business-as-usual or "successfully completed $PROCESS" emails from the batch processes on our teams at work. They absolutely drown my inbox, so I'm forced to filter them, and then the actual failures get buried in the unchecked "batch spam" folders.
tra3 about 2 years ago
I need to sit down and go through the math again; I got lost somewhere in the middle. All I know is that our alerts are now so noisy they're useless.
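For reference, the base-rate arithmetic the article turns on fits in a few lines; the numbers below are illustrative:

```python
# The base-rate fallacy in miniature: even an alert with high sensitivity
# and high specificity has a low positive predictive value when real
# incidents are rare. Numbers are illustrative.

def positive_predictive_value(sensitivity: float, specificity: float,
                              base_rate: float) -> float:
    """P(real problem | alert fired), via Bayes' rule."""
    true_pos = sensitivity * base_rate
    false_pos = (1 - specificity) * (1 - base_rate)
    return true_pos / (true_pos + false_pos)

# 99% sensitive, 99% specific, but a real incident in only 0.1% of checks:
print(positive_predictive_value(0.99, 0.99, 0.001))
# ~0.09 -> with these numbers, roughly 91% of pages are false alarms.
```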
bdamm about 2 years ago
Both of these are exactly the kind of problem where our AI future is going to deliver cost-effective modern alternatives: primitive sensors wake up more sophisticated analyzers, which use deeper sensors (including video) to determine whether there is a real problem.

Witness companies like Rivian triggering car alarms on aggressive behavior detected by ML on video. You don't even need to touch the car.
gmuslera about 2 years ago
Some complementary reading: My Philosophy on Alerting (https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q) and https://how.complexsystems.fail/

In any case, not all signals are the same. Most systems have a lot of interacting components, and what turns out to be dangerous is usually a combination of factors; but in the end, what defines whether something was a real problem is whether the system is doing what it should. You can set guessed thresholds, but you must check them against whether the system actually works.

Alerts should be actionable too (as opposed to slow-day notifications, or metrics that merely give context to perceived problems), which would take the guessing out of the thresholds.
yafbum about 2 years ago
I'd like to know more about the chip designer who, perhaps unwittingly, created the alarm-filled soundscape of most American cities: https://youtu.be/tmCnleSBAIg. Would love to know more about the composition process that went into it.
Syonyk about 2 years ago
Now, if you're annoyed by the false positive rate on your *actual smoke alarms,* go replace the one nearest your kitchen with a photoelectric type, not the standard ionization type that's cheaper, the default style installed, and ought to be illegal in homes (IMO).

There's been quite a bit of research done, generally easy to find if you look, that talks about the difference and tests them, but the short summary:

- Ionization-type sensors detect the products of fast flaming combustion and "things cooking in the kitchen." Your oven, if a bit dirty, will reliably trip an ionization type. They are quick on the draw for this. The downside is that they're very, very poor at detecting the sort of slow, smoking, smoldering combustion that is associated with house fires that kill people in the middle of the night.

- The photoelectric type is very good at detecting smoke in the air, but it isn't nearly as prone to false triggers on ovens, a burner burning some spills off, etc.

They've been A/B tested in a wide variety of conditions, and in some cases, the ionization type is a bit quicker. In other cases, the ionization type is slower, by time ranges north of *half an hour*; I've seen some test reports where there was a 45-minute gap, while the photoelectric type was going off, before the ionization type fired!

In general, "rapid fires during the day" are somewhat destructive to property, but rarely kill people. If your kitchen catches on fire while you're cooking, it may burn the house down, but generally people are able to get out.

The fires that kill people are "slow starting fires during the night": the sort that smolder for potentially hours, often slowly filling the house with toxic smoke, before actually bursting into open flames. On this sort of fire, the photoelectric type will fire long, long before the ionization type; in some cases, the ionization type gets around to alarming quite literally "after the occupants are dead from the smoke."

Using smoke alarms as a way to talk about monitoring systems is nice, but in terms of actual smoke detectors, get at least a few photoelectric sorts in the main areas of your home.

Do *not* get the "combined sensor" sort, since these tend to be and-gated and the worst of both worlds.

Edited to add some resources:

A presentation on the matter from a while back by one of the experts in this field: https://wahigroup.com/Resources/Documents/Ion%20vs%20Photo%20Smoke%20Alarms%20WAHI%20Conference%20Slides%20031415.pdf

Another paper: https://www.semanticscholar.org/paper/Detection-of-Smoke-%3A-Full-Scale-Tests-with-Flaming-Einar/b1227142af0ff8d38df5e2ed69371866442c6859

> *Full-scale fire tests are carried out to study the effectiveness of the various types of smoke detectors to provide an early warning of a fire. Both optical smoke detectors and ionization smoke detectors have been used. Alarm times are related to human tenability limits for toxic effects, visibility loss and heat stress. During smouldering fires it is only the optical detectors that provide satisfactory safety. With flaming fires the ionization detectors react before the optical ones. If a fire were started by a glowing cigarette, optical detectors are generally recommended. If not, the response time with these two types of detectors are so close that it is only in extreme cases that this difference between optical and ionization detectors would be critical in saving lives.*
yamtaddle about 2 years ago
> When presented with this tradeoff, the path of least resistance is to say "Let's just keep the threshold lower. We'd rather get woken up when there's nothing broken than sleep through a real problem." And I can sympathize with that attitude. Undetected outages are embarrassing and harmful to your reputation. Surely it's preferable to deal with a few late-night fire drills.

> It's a trap.

> In the long run, false positives can — and will often — hurt you more than false negatives. Let's learn about the base rate fallacy.

Not sure about anyone else, but speaking of alarms, this style of writing trips my "self-promoting snake-oil Internet bullshitter" alarm. It's like nails on a damn chalkboard, and if you're writing like this, you've already lost me; however, maybe I ought not be pointing that out, since signals are nice to have.

Incidentally, I wasn't sure which way the author was gonna go with the core analogy. My smoke alarms have false-alarmed probably 10x as much as my car alarm, even counting times one of us has hit the alarm button on the fob by accident. I've certainly never been so annoyed by my car alarm that I've ripped it out and stuck it in a freezer, as I have with a smoke alarm.

(If I were writing like the author, I suppose that last part would have read:

"I've certainly never been so annoyed by my car alarm that I've ripped it out and stuck it in a chest freezer.

I have, with a smoke alarm."

Except also I'd have found a way to use "we" and "you" a bunch.)
deathanatos about 2 years ago
Nothing in the article is wrong, per se, but it all seems awfully disconnected from the realities I see in monitoring and alerting.

The end advice is right: you want to build the smoke detector, not the car alarm. But … getting that done, now that's the trick. If the org has car alarms, that's the same org whose PMs will not see the "impact" of the ticket to get the monitoring made shipshape, and that ticket will be backlog-icebox-graveyard'ed.

I've had to get a number of "technical" non-technical roles to see that, no, the monitoring software cannot automatically generate¹ metrics around *your* application. Yes, you have to actually instrument the code to add those!

Then that gets combined with systems that are just … not the best at their job. Datadog has so many rough corners around statistics, where graphs will alias, graphs will change shape depending on zoom, units are a PITA or nowhere to be seen, etc. Sumo has a god-awful UI (literally tabs inside tabs, and I can't copy the URL?!) and barely understands structured logging. Splunk is marginally better. PagerDuty permits only the simplest handling of events: don't limit me to a handful of tailored rules; rules are logic, logic is a *function*: give me WASM. And I want usable business-hours-only alerts².

Self-hosted systems are perpetually met with "that's not our core focus", but nobody ever seems to convert the cost of managed monitoring/alerting systems into "number of FTEs that could be hired to maintain a self-hosted system".

(Oddly, the example in the article is a car alarm. Load average is, IMO, a useless metric. Better to measure CPU consumption and IOPS consumption separately, or, probably better, more derivative stats around the things doing the IOPS/CPU.)

¹ Yes, most systems come with some collectors to get system-level stuff like CPU usage, etc. I mean metrics specific to your application.

² PD claims to support them, but in practice, they don't work: alerts received off-hours don't alert, true … but they never alert once business hours resume, either! If you're in an org trying to dig itself out of a mess, you need them to not die in the low-priority pile.

(Ugh. Give me ACL systems in these tools that don't *suck*: PD locks the routing rules behind "Admin", and security doesn't want to grant the rank and file "Admin", so 80% of my devs have no idea how the system works, because they're not allowed to see how it works! Give me the ability to do a WMA business-days-only line for diurnal patterns! The list just goes on and on…)
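On the instrumentation point: application-level metrics exist only because someone adds them in code. Roughly what that looks like, using the Python prometheus_client library as one example (the metric names and the process_order function are invented for illustration):

```python
# Rough illustration of manual application instrumentation: the collector
# can scrape system-level stats, but business metrics like these have to be
# added in the code. Uses prometheus_client; metric names are invented.

from prometheus_client import Counter, Histogram

ORDERS_PROCESSED = Counter(
    "orders_processed_total", "Orders handled, by outcome", ["outcome"]
)
ORDER_LATENCY = Histogram(
    "order_processing_seconds", "Time spent processing one order"
)

def process_order(order):
    with ORDER_LATENCY.time():          # record the duration of each call
        try:
            ...  # actual business logic goes here
            ORDERS_PROCESSED.labels(outcome="ok").inc()
        except Exception:
            ORDERS_PROCESSED.labels(outcome="error").inc()
            raise
```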