The corollary of this post is "things we've been monitoring and/or alerting on which we shouldn't have been".

Starting at a new shop, one of the first things I'll do is:

1. Set up a high-level "is the app / service / system responding sanely" check which tells me, from the top of the stack, whether everything else is functioning properly.

2. Go through the various alerting and alarming systems and generally dial the alerts *way* back. If it's broken at the top, or if some vital resource is headed to the red, let me know. But if you're going to alert based on a cascade of prior failures (and DoS my phone, email, pager, whatever), then STFU.

In Nagios, the keys are setting dependency relationships between services and hosts, deciding which services actually notify, and setting thresholds appropriately (rough sketches of both ideas are at the end of this comment).

For a lot of thresholds you're going to want to find out *why* they were set to their current values and what the historical reason for that was. It's like the old pot roast recipe where Mom cut off the ends of the roast 'coz that's how Grandma did it, not realizing it was because Grandma's oven was too small for a full-sized roast....

Sadly, that level of technical annotation is often lacking in shops, especially where there's been significant staff turnover through the years.

I'm also a fan of simple system tools such as sysstat, which log data that can then be graphed for visualization.
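
For (1), a minimal sketch in Nagios terms using the stock check_http plugin; the app01 host name, the /healthz endpoint, the oncall contact group, and the generic-service template are placeholders for whatever the shop actually has:

    define command {
        command_name    check_app_top
        command_line    $USER1$/check_http -H $HOSTADDRESS$ -u /healthz -t 10 -e "200 OK"
    }

    define service {
        use                     generic-service   ; assumes the stock template from the sample configs
        host_name               app01
        service_description     App Responding Sanely
        check_command           check_app_top
        contact_groups          oncall
        max_check_attempts      3
        notification_interval   60
    }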
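
For (2), one way to express "the top-level check already paged, so don't pile on" is a servicedependency; again, the host and service names here are made up, and which check you treat as the master is a judgment call for your environment:

    define servicedependency {
        host_name                       app01
        service_description             App Responding Sanely
        dependent_host_name             app01
        dependent_service_description   Worker Queue Depth
        notification_failure_criteria   c,u   ; no notifications from the dependent while the master is CRITICAL or UNKNOWN
        execution_failure_criteria      n     ; but keep running the dependent check so the data is still there
    }

The parents directive on host definitions does the same kind of thing at the host level: boxes behind a dead switch go UNREACHABLE instead of DOWN, and you can notify on one and not the other.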