Things We Forgot to Monitor

232 points by jehiah, over 11 years ago

15 comments

AznHisoka, over 11 years ago
Also:

1) Maximum # of open file descriptors

2) Whether your slave DB stopped replicating because of some error.

3) Whether something is screwed up in your SOLR/ElasticSearch instance so it doesn't respond to search queries, but does respond to simple heartbeat pings.

4) If your Redis db stopped saving to disk because of lack of space, or not enough memory, or you forgot to set overcommit memory.

5) If you're running out of space in a specific partition you usually store random stuff like /var/log.

I've had my ass bitten by all of the above :)

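Item 5 above is cheap to check from a script. A minimal sketch in Python, assuming a 90% threshold and a caller-supplied list of mount points (both are illustrative choices, not something the comment specifies):

```python
import shutil


def disk_used_fraction(path: str) -> float:
    """Return the fraction (0.0 to 1.0) of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Some pseudo-filesystems report a size of zero; treat them as empty.
        return 0.0
    return usage.used / usage.total


def check_partitions(paths, threshold=0.90):
    """Return (path, fraction) pairs for every path at or above `threshold`.

    `threshold` is an assumed default; pick one that matches how fast
    each partition normally fills.
    """
    alerts = []
    for path in paths:
        fraction = disk_used_fraction(path)
        if fraction >= threshold:
            alerts.append((path, fraction))
    return alerts
```

Calling `check_partitions(["/", "/var/log"])` from cron and alerting on a non-empty result would cover the /var/log case, assuming /var/log is its own mount point on the host in question.
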
otterley, over 11 years ago
Swap rate (as opposed to space consumed) is probably the #1 metric that monitoring agents fail to report.

One thing that drives me nuts is how frequently monitoring agents/dashboards report and graph only free memory on Linux, which gives misleading results. It's fine to report it, but to make sense of it, you have to stack free memory along with cached and buffered memory, if you care about what's actually available for applications to use.

Another often-overlooked metric that's important for web services in particular is the TCP accept queue depth, per listening port. Once the accept queue is drained, remote clients will get ECONNREFUSED, which is a bad place to be. This value is somewhat difficult to obtain, though, because AFAIK Linux doesn't expose it.

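The "free plus cached plus buffered" sum described above can be sketched in Python by parsing `/proc/meminfo`-style text. The function names are hypothetical, and preferring the kernel's own `MemAvailable` field when it is present is an addition of this sketch, not part of the comment:

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field name: integer value}.

    Most fields are reported in kB; the unit suffix is ignored here.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value" pairs
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])
    return fields


def usable_memory_kb(fields: dict) -> int:
    """Estimate memory available to applications, in kB."""
    # Newer kernels (3.14 and later) publish their own estimate.
    if "MemAvailable" in fields:
        return fields["MemAvailable"]
    # Otherwise stack free, buffered and cached memory, as the comment suggests.
    return fields["MemFree"] + fields["Buffers"] + fields["Cached"]
```

On a Linux host, `usable_memory_kb(parse_meminfo(open("/proc/meminfo").read()))` gives a figure that is usually far more useful to graph than `MemFree` alone.
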
bradleyland, over 11 years ago
Interestingly, an out-of-the-box Munin configuration on Debian contains nearly all of these. I recommend setting up Munin and having a look at what it monitors by default, even if you don't intend to use it as your monitoring solution.

tantalor, over 11 years ago
Some people, when confronted with a problem, think “I know, I'll send an email whenever it happens.” Now they have two problems.

dredmorbius, over 11 years ago
The corollary of this post is "things we've been monitoring and/or alerting on which we shouldn't have been".

Starting at a new shop, one of the first things I'll do is:

1. Set up a high-level "is the app / service / system responding sanely" check which lets me know, from the top of the stack, whether or not everything else is or isn't functioning properly.

2. Go through the various alerting and alarming systems and generally dial the alerts *way* back. If it's broken at the top, or if some vital resource is headed to the red, let me know. But if you're going to alert based on a cascade of prior failures (and DoS my phone, email, pager, whatever), then STFU.

In Nagios, setting relationships between services and systems, for alerting services, setting thresholds appropriately, etc., is key.

For a lot of thresholds you're going to want to find out *why* they were set to what they were and what historical reason there was for that. It's like the old pot roast recipe where Mom cut off the ends of the roast 'coz that's how Grandma did it. Not realizing it was because Grandma's oven was too small for a full-sized roast....

Sadly, that level of technical annotation is often lacking in shops, especially where there's been significant staff turnover through the years.

I'm also a fan of some simple system tools such as sysstat which log data that can then be graphed for visualization.

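The point about not alerting on a cascade of prior failures can be illustrated with a small sketch. This is not Nagios configuration (Nagios expresses the same idea through its own dependency definitions); it is only a Python illustration of the principle, and the check names used in the usage note are hypothetical:

```python
def alerts_to_send(failed: set, depends_on: dict) -> set:
    """Return only the failed checks that have no failed dependency.

    `depends_on` maps a check name to the checks it relies on.
    A failed check is suppressed when anything it depends on,
    directly or transitively, has also failed.
    """

    def has_failed_ancestor(check, seen):
        for parent in depends_on.get(check, ()):
            if parent in seen:
                continue  # already visited; also guards against cycles
            seen.add(parent)
            if parent in failed or has_failed_ancestor(parent, seen):
                return True
        return False

    return {check for check in failed if not has_failed_ancestor(check, set())}
```

With `depends_on = {"web": ["db", "network"], "db": ["network"]}` and all three checks failing, only `network` is reported, so the pager gets one alert for the root cause instead of three.
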
jlgaddis, over 11 years ago
Be sure to monitor your monitoring system as well (preferably from outside your network/datacenters)! If you don't have anything else in place, you can use Pingdom to monitor one website/server for free [0].

I was off work for a few months recently (motorcycle wreck) and removed my e-mail accounts from my phone. Now, I have all my alerts go to a specific e-mail address and those are the only mails I receive on my phone. It has really helped me overcome the problem of ignoring messages.

[0]: https://www.pingdom.com/free/

comice, over 11 years ago
We monitor outgoing smtp and http connections from anything that requires those services.

And the best general advice I have is split your alerts into "stuff that I need to know is broken" and "stuff that just helps me diagnose other problems". You don't want to be disturbing your on-call people for stuff that doesn't directly affect your service (or isn't even something you can fix).

mnw21cam, over 11 years ago
Also, are your backups working?

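One cheap proxy for "are the backups working" is checking that a recent backup file exists at all. A minimal sketch in Python, assuming backups land as regular files in a single directory (the directory layout and the age limit are assumptions for illustration):

```python
import os
import time


def newest_mtime(directory: str):
    """Return the most recent modification time among files in `directory`, or None."""
    newest = None
    for entry in os.scandir(directory):
        if entry.is_file():
            mtime = entry.stat().st_mtime
            if newest is None or mtime > newest:
                newest = mtime
    return newest


def backup_is_fresh(directory: str, max_age_seconds: float, now=None) -> bool:
    """True if the newest file in `directory` is at most `max_age_seconds` old."""
    now = time.time() if now is None else now
    newest = newest_mtime(directory)
    if newest is None:
        return False  # no backup files at all counts as stale
    return (now - newest) <= max_age_seconds
```

A fresh file only shows that the backup job ran; it says nothing about whether the backup can be restored, which needs a separate test restore.
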
jsmeaton, over 11 years ago
We had a perfect storm of problems only 2 weeks ago.

1. A vendor tomcat application had a memory leak, consumed all the RAM on a box, and crashed with an OOM

2. The warm standby application was slightly misconfigured, and was unable to take over when the primary app crashed

3. Our nagios was configured to email us, but something had gone wrong with ssmtp 2 days prior, and was unable to contact google apps

3a. No one was paying any attention to our server metric graphs / We didn't have good enough "pay attention to these specific graphs because they are currently outside the norm"

A very embarrassing day for us that one.

We're now working on better graphing, and have set up a basic ssmtp check to SMS us if there is an issue. Monitoring is hard.

sp332, over 11 years ago
You're using icanhazip.com in production? I see from a quick Google search that Puppy Linux seems to use it in some scripts, but how reliable is it?

baruch, over 11 years ago
About reboot monitoring, I suggest using kdump to dump the oops information and save it for later debugging and understanding of the issue. It may even be an uncorrectable memory or PCIe error you are seeing, and the info is logged in the oops but is hard to figure out otherwise. Also, if you consistently hit a single kernel bug you may want to fix it or work around it.

lincolnpark, over 11 years ago
Also, are your API endpoints working properly?

jlgaddis, over 11 years ago
I have gear in three different facilities and I'm typically not visiting any of them unless I'm installing hardware or replacing it. Shortly after starting at $job, I realized there was no monitoring of the RAID arrays in the servers we have. That could have ended badly.

herokusaki, over 11 years ago
How oversold your VPS provider's server is. It is commonly blamed for slowdowns but rarely measured.

stephengillie, over 11 years ago
Between PRTG and Windows, almost all of that is handled for us. And PRTG can call OMSA by SNMP.