Things We Forgot to Monitor

232 points by jehiah, over 11 years ago

15 comments

AznHisoka, over 11 years ago

Also:

1) Maximum # of open file descriptors.
2) Whether your slave DB stopped replicating because of some error.
3) Whether something is screwed up in your SOLR/ElasticSearch instance so that it doesn't respond to search queries but still responds to simple heartbeat pings.
4) If your Redis db stopped saving to disk because of lack of space, or not enough memory, or because you forgot to set overcommit memory.
5) If you're running out of space in a specific partition where you usually store random stuff, like /var/log.

I've had my ass bitten by all of the above :)
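
Items 1 and 5 are cheap to check by hand. A minimal sketch in Python, assuming a Linux host: the function names are made up for illustration, and the 0.9 threshold in the usage note is an arbitrary placeholder rather than a recommendation.

```python
import shutil


def parse_file_nr(text):
    """Parse the contents of /proc/sys/fs/file-nr.

    The file holds three numbers: allocated file handles, allocated but
    unused handles, and the system-wide maximum.
    """
    allocated, _unused, maximum = (int(field) for field in text.split())
    return allocated, maximum


def fraction_used(used, total):
    """Return used/total as a float, treating an empty total as 0.0."""
    return used / total if total else 0.0


def disk_fraction_used(path):
    """Fraction of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return fraction_used(usage.used, usage.total)
```

To use it, read /proc/sys/fs/file-nr, pass the text to parse_file_nr, and alert when fraction_used(allocated, maximum) or disk_fraction_used("/var/log") goes above something like 0.9. Per-process descriptor limits (ulimit -n) are a separate check that this sketch does not cover.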


otterley, over 11 years ago

Swap rate (as opposed to space consumed) is probably the #1 metric that monitoring agents fail to report.

One thing that drives me nuts is how frequently monitoring agents/dashboards report and graph only free memory on Linux, which gives misleading results. It's fine to report it, but to make sense of it, you have to stack free memory along with cached and buffered memory, if you care about what's actually available for applications to use.

Another often-overlooked metric that's important for web services in particular is the TCP accept queue depth, per listening port. Once the accept queue is full, remote clients will get ECONNREFUSED, which is a bad place to be. This value is somewhat difficult to obtain, though, because AFAIK Linux doesn't expose it.
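
The first two metrics can be pulled straight from /proc. A rough sketch in Python, assuming Linux: the helper names are hypothetical, "free + buffers + cached" follows the stacking described above, and swap rate is taken as the change in the pswpin and pswpout counters from /proc/vmstat between two samples. Newer kernels also report a MemAvailable line in /proc/meminfo, which may be a better estimate where it exists.

```python
def parse_kv(text):
    """Parse 'Key: 123 kB' (meminfo) or 'key 123' (vmstat) lines into ints."""
    values = {}
    for line in text.splitlines():
        parts = line.replace(":", " ").split()
        if len(parts) >= 2 and parts[1].isdigit():
            values[parts[0]] = int(parts[1])
    return values


def usable_memory_kb(meminfo):
    """Free memory stacked with buffers and page cache, in kB."""
    return meminfo["MemFree"] + meminfo["Buffers"] + meminfo["Cached"]


def swap_rate(before, after, seconds):
    """Pages swapped in plus pages swapped out, per second, between two samples."""
    pages = (after["pswpin"] - before["pswpin"]) + (after["pswpout"] - before["pswpout"])
    return pages / seconds
```

Feed parse_kv the text of /proc/meminfo for the memory figure, and two reads of /proc/vmstat taken a known number of seconds apart for the swap rate.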


bradleyland, over 11 years ago

Interestingly, an out-of-the-box Munin configuration on Debian contains nearly all of these. I recommend setting up Munin and having a look at what it monitors by default, even if you don't intend to use it as your monitoring solution.


tantalor, over 11 years ago

Some people, when confronted with a problem, think “I know, I'll send an email whenever it happens.” Now they have two problems.


dredmorbius, over 11 years ago

The corollary of this post is "things we've been monitoring and/or alerting on which we shouldn't have been".

Starting at a new shop, one of the first things I'll do is:

1. Set up a high-level "is the app / service / system responding sanely" check which lets me know, from the top of the stack, whether or not everything else is functioning properly.
2. Go through the various alerting and alarming systems and generally dial the alerts way back. If it's broken at the top, or if some vital resource is headed to the red, let me know. But if you're going to alert based on a cascade of prior failures (and DoS my phone, email, pager, whatever), then STFU.

In Nagios, setting relationships between services and systems, for alerting services, setting thresholds appropriately, etc., is key.

For a lot of thresholds you're going to want to find out why they were set to what they were and what historical reason there was for that. It's like the old pot roast recipe where Mom cut off the ends of the roast 'coz that's how Grandma did it, not realizing it was because Grandma's oven was too small for a full-sized roast.

Sadly, that level of technical annotation is often lacking in shops, especially where there's been significant staff turnover through the years.

I'm also a fan of some simple system tools such as sysstat which log data that can then be graphed for visualization.
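
The "don't alert on a cascade of prior failures" idea can be sketched independently of any particular tool. The Python below is an illustration only, not Nagios configuration: root_causes and the depends_on mapping are hypothetical names, and the assumption is that each check lists the checks it directly depends on.

```python
def root_causes(failing, depends_on):
    """Return the failing checks that have no failing dependency, direct or transitive.

    failing:    iterable of check names currently in a failed state
    depends_on: dict mapping a check name to the checks it depends on
    """
    failing = set(failing)

    def has_failing_ancestor(check, seen):
        for parent in depends_on.get(check, ()):
            if parent in seen:
                continue  # already visited; also guards against cycles
            seen.add(parent)
            if parent in failing or has_failing_ancestor(parent, seen):
                return True
        return False

    return sorted(c for c in failing if not has_failing_ancestor(c, {c}))
```

With depends_on = {"web": ["app"], "app": ["db"]} and all three checks failing, only "db" is returned, so one page goes out instead of three.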

jlgaddis, over 11 years ago

Be sure to monitor your monitoring system as well (preferably from outside your network/datacenters)! If you don't have anything else in place, you can use Pingdom to monitor one website/server for free [0].

I was off work for a few months recently (motorcycle wreck) and removed my e-mail accounts from my phone. Now, I have all my alerts go to a specific e-mail address and those are the only mails I receive on my phone. It has really helped me overcome the problem of ignoring messages.

[0]: https://www.pingdom.com/free/
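
One common way to watch the watcher is a dead man's switch: the monitoring system checks in regularly, and an independent process raises the alarm when the check-ins stop. A minimal Python sketch of that logic, with a hypothetical class name. It assumes the switch runs somewhere outside the monitored network; how the heartbeat gets delivered there is left open.

```python
import time


class DeadMansSwitch:
    """Trips when no heartbeat has arrived within max_silence_seconds."""

    def __init__(self, max_silence_seconds):
        self.max_silence_seconds = max_silence_seconds
        self.last_beat = None

    def beat(self, now=None):
        """Record a heartbeat; called whenever the monitoring system checks in."""
        self.last_beat = time.time() if now is None else now

    def is_tripped(self, now=None):
        """True if no heartbeat was ever seen or the last one is too old."""
        if self.last_beat is None:
            return True
        now = time.time() if now is None else now
        return now - self.last_beat > self.max_silence_seconds
```

The explicit now parameter exists only to make the behavior easy to test. In normal use both methods fall back to the current time.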

comice, over 11 years ago

We monitor outgoing smtp and http connections from anything that requires those services.

And the best general advice I have is to split your alerts into "stuff that I need to know is broken" and "stuff that just helps me diagnose other problems". You don't want to be disturbing your on-call people for stuff that doesn't directly affect your service (or isn't even something you can fix).

mnw21cam, over 11 years ago

Also, are your backups working?

jsmeaton, over 11 years ago

We had a perfect storm of problems only 2 weeks ago.

1. A vendor tomcat application had a memory leak, consumed all the RAM on a box, and crashed with an OOM.
2. The warm standby application was slightly misconfigured, and was unable to take over when the primary app crashed.
3. Our nagios was configured to email us, but something had gone wrong with ssmtp 2 days prior, and it was unable to contact google apps.
3a. No one was paying any attention to our server metric graphs / we didn't have good enough "pay attention to these specific graphs because they are currently outside the norm" alerting.

A very embarrassing day for us that one.

We're now working on better graphing, and have set up a basic ssmtp check to SMS us if there is an issue. Monitoring is hard.
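
The email-then-SMS arrangement amounts to trying alert channels in order until one works. A small Python sketch under that assumption: notify_with_fallback is a hypothetical name, and the channel functions are stand-ins for whatever actually sends the email or the SMS.

```python
def notify_with_fallback(message, channels):
    """Try each (name, send) pair in order; return the name of the first that succeeds.

    channels: list of (name, send) where send(message) raises on failure.
    Raises RuntimeError if every channel fails.
    """
    errors = []
    for name, send in channels:
        try:
            send(message)
            return name
        except Exception as exc:
            # Remember why this channel failed, then move on to the next one.
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all alert channels failed: " + "; ".join(errors))
```

Called as notify_with_fallback("tomcat OOM", [("email", send_email), ("sms", send_sms)]), where send_email and send_sms are placeholders, a broken mail relay degrades to an SMS instead of silence.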


sp332, over 11 years ago

You're using icanhazip.com in production? I see from a quick Google search that Puppy Linux seems to use it in some scripts, but how reliable is it?


baruch, over 11 years ago

About reboot monitoring, I suggest using kdump to dump the oops information and save it for later debugging and understanding of the issue. It may even be an uncorrectable memory or PCIe error you are seeing, where the info is logged in the oops but is hard to figure out otherwise. Also, if you consistently hit a single kernel bug, you may want to fix it or work around it.

lincolnpark, over 11 years ago

Also, are your API endpoints working properly?


jlgaddis, over 11 years ago

I have gear in three different facilities and I'm typically not visiting any of them unless I'm installing hardware or replacing it. Shortly after starting at $job, I realized there was no monitoring of the RAID arrays in the servers we have. That could have ended badly.

herokusaki, over 11 years ago

How oversold your VPS provider's server is: commonly blamed for slowdowns but rarely measured.

stephengillie, over 11 years ago

Between PRTG and Windows, almost all of that is handled for us. And PRTG can call OMSA by SNMP.
