> Err on the side of removing noisy alerts – over-monitoring is a harder problem to solve than under-monitoring.<p>Absolutely this. Our team is having more problems with this issue than anything else. However, there are two points which seem to contradict each other:<p><pre><code> - Pages should be [...] actionable
- Symptoms should be monitored, not causes
</code></pre>
The problem is that you can't act on symptoms; you can only research them and then act on the causes. If you get an alert that says the DB is down, that's an actionable page - start the DB back up. Being paged that the connections to the DB are failing, by contrast, is far less actionable - you have to spend precious downtime researching the actual cause first. It could be the network, it could be an intermediary proxy, or it could be the DB itself.<p>Now granted, if you're only catching causes, there is the possibility that your monitoring misses something, and if you double up on your monitoring (that is, checking symptoms as well as causes), you could get noise. That said, most monitoring solutions (such as Nagios) include dependency chains, so you get alerted on the cause, and the symptom alert is silenced while the cause is in an error condition. And if you missed a cause, you still get the symptom alert and can fill your monitoring gaps from there.<p>Leave your research for the RCA and the follow-up development work that prevents future downtime. When stuff is down, an SA's job is to get it back up.
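<p>To make the dependency-chain idea concrete, here is a minimal sketch in Python. It is not how Nagios implements it; the `Check` class, the `alerts_to_page` function and the check names are all made up for illustration, and it only looks at direct dependencies. A failing check pages only if none of the causes it depends on is also failing.

```python
from dataclasses import dataclass, field


@dataclass
class Check:
    """A hypothetical monitoring check; not a Nagios object."""
    name: str
    failing: bool = False
    # Cause checks that, when failing, explain (and silence) this one.
    depends_on: list = field(default_factory=list)


def alerts_to_page(checks):
    """Return the names of failing checks that should actually page.

    A failing check is suppressed when any check it directly depends on
    is also failing, so the page goes out for the cause, not the symptom.
    """
    pages = []
    for check in checks:
        if not check.failing:
            continue
        if any(dep.failing for dep in check.depends_on):
            continue  # a known cause is already alerting; stay quiet
        pages.append(check.name)
    return pages


# Illustrative setup: one symptom with three possible causes.
db = Check("db_process")
network = Check("network")
proxy = Check("db_proxy")
symptom = Check("app_db_connections", depends_on=[db, network, proxy])
all_checks = [db, network, proxy, symptom]
```

<p>With `db.failing` and `symptom.failing` both set, only `db_process` pages. If the symptom fails while every known cause looks healthy, `app_db_connections` pages on its own, which is the signal that a cause check is missing.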