First 5 Minutes Troubleshooting A Server

244 points by balou about 12 years ago

21 comments

antirez about 12 years ago
That's what I often do; however, it is clear that most of these tasks can be done automatically, so somebody should start a 'linux-doctor' open source project that tries to identify issues automatically. I'm assuming it does not exist, since I have never seen one.
bashtoni about 12 years ago
I wrote a simple bash script which is a good starting point for checking server issues.

https://github.com/BashtonLtd/whatswrong

The idea isn't to tell you the problem exactly, but more to stop you missing things that are obviously wrong.
rachelbythebay about 12 years ago
I wish I had recorded some of my better moments when I worked web hosting tech support. Being able to jump onto a box, poke around at two or three things, notice something wrong from that, and come up with a solution was magic. It's hard to believe unless you actually see it happen, though.

I have the recording capability now (and, more importantly, playback too), but the constant influx of broken boxes is gone. Funny how that works.
fduran about 12 years ago
This is pretty similar (we even start with the same two commands) to what I typically do in my checklist :-)

http://www.fduran.com/blog/quick-linux-server-review-for-mortals/
peterwwillis about 12 years ago
One of the most useful pieces of information for me when troubleshooting is network activity. Both network monitors and traffic flows can often tell you exactly what the problem is so you don't have to spend five minutes collecting data samples.
falcolas about 12 years ago
Personally, I think that top and vmstat are a bit low on the list; they're typically the first things that I run. While they're too general to provide good troubleshooting of the actual problem, they do a great job of pointing out to me where the problem probably is.

User experience reports are nice, but rarely indicate something other than "load is high", or "server is unresponsive". vmstat and top not only can tell you that, they can start telling you why and where to look for your problem.
rmc about 12 years ago
I've written a command (that we're still playing around with here) to take a 'snapshot' of things that a server is doing (many of the things mentioned in this article). This can allow you to look at it later to see what's going on.

https://github.com/rory/SystemAutopsy
w0ts0n about 12 years ago
One key thing missing imo. "history" can be a huge help. Even more so if there are multiple admins on call.
kondor6c about 12 years ago
doing "ps auxf" will put "ps aux" in forest mode and give you the same effect as "pstree -a" while giving you all the information that "ps aux" will give.
viddy about 12 years ago
Something that's been missed: atop. It's a similar tool to top/htop and friends, but has an additional system process snapshot daemon, allowing us to answer the perennial question of how a server got into a particular state. Something particularly useful is the wide view (see here: http://www.atoptool.nl/screenshots.php) and average disk response times in ms.
sanotehu about 12 years ago
What I found interesting while reading this article was the parallel to what a doctor does when diagnosing problems.

In medicine, it's commonly known that the interview with the patient (the 'history') is the first thing a doctor should be doing. Not just because it establishes a relationship with the patient, but because the diagnosis of most illnesses is guided primarily by the history [1] - even with modern MRI machines and DNA amplification techniques! At the very least the chat with the client provides context for the problem that you are investigating - you are now putting flesh on a skeleton of meaning rather than trying to create it on your own.

This article stresses the importance of first getting a verbal 'history' from the client - what the problem is, characteristics of the problem, time-course of the problem and co-incidence with other events (like software upgrades). There is also a parallel to medicine in that in this field a skilled practitioner may be able to diagnose the problem based on the history alone [2].

The second thing I noticed was the fault-finding mindset. As a medical student halfway through his second year of hospital placements this is something I took some time to learn. The initial approach to finding the reason for a problem is usually to (1) think of a possible reason for the problem, (2) try to fix that reason, and (3) if that doesn't work, goto 1. While this is good because it shows you are actually thinking about the cause of the problem rather than its effects, it's not the most efficient way of going about things. One way doctors can narrow down problems is by restricting them to systems such as the cardiovascular system or the neurological system. A searing pain in your chest is more likely to be due to a problem with your heart or lungs than due to a problem with your kidneys or gonads.

This article takes exactly the same view of servers, classifying the individual hardware and software components that make up the vast majority of (linux) servers in the wild.

I don't fiddle around with servers much any more, but I'm bookmarking this page because it is such a useful illustration of a fault-finding mentality.

[1] http://archinte.jamanetwork.com/article.aspx?articleid=1105870
[2] http://blogs.msdn.com/b/oldnewthing/archive/2012/08/29/10344405.aspx
tseeling about 12 years ago
I always cringe when I see shell code like this:

> cat /etc/passwd | cut -f1 -d:

Usually this comes as

> ps -ef | grep something | grep -v grep | grep -v $myownpid

Why not use *one* simple and concise awk statement which does it all in one go?

awk -F: '{print$1}' /etc/passwd
ps -ef | awk '/[h]ttpd/{print$2}'

But apart from that: very nice summary of things to consider and the sequence for analysis.
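
A rough sketch of the two awk one-liners above, run against made-up sample input so the result does not depend on the machine. The function names `usernames` and `httpd_pids` and the sample data are assumptions for illustration, not part of the comment. The `[h]ttpd` pattern still matches the text `httpd`, but the awk process's own command line contains the literal text `[h]ttpd`, which the pattern does not match, so no `grep -v grep` is needed.

```shell
# Print the first colon-separated field of each line on stdin
# (the same job as `cat /etc/passwd | cut -f1 -d:`).
usernames() {
    awk -F: '{ print $1 }'
}

# Print the PID (second column of `ps -ef` output) for lines mentioning httpd.
# The bracket in [h]ttpd keeps the pattern from matching its own command line.
httpd_pids() {
    awk '/[h]ttpd/ { print $2 }'
}

# Hypothetical sample data standing in for /etc/passwd and `ps -ef` output.
sample_passwd='root:x:0:0:root:/root:/bin/bash
www-data:x:33:33:www-data:/var/www:/usr/sbin/nologin'

sample_ps='root      1201     1  0 10:00 ?     00:00:01 /usr/sbin/httpd -k start
alice     4242  4000  0 10:05 pts/0 00:00:00 awk /[h]ttpd/{print$2}'

printf '%s\n' "$sample_passwd" | usernames
printf '%s\n' "$sample_ps" | httpd_pids
```

On a real box the same functions would be fed with `usernames < /etc/passwd` and `ps -ef | httpd_pids`.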
niggler about 12 years ago
On OSX (even though the HN title says linux, the original article title doesn't say "Linux" and many of the same steps apply to OSX server), you can't run dmesg as a normal user -- you have to run it as root or within sudo
growt about 12 years ago
It's a nice checklist. I personally would do dmesg and log-checks right at the beginning. Also checking sw-raid and drives is missing (cat /proc/mdstat, hdparm, fdisk -l, smartctl ...)
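
As a minimal sketch of the software-RAID check: in `/proc/mdstat` each array carries a member map such as `[UU]`, and an underscore in that map (for example `[U_]`) marks a missing or failed member. The function name `degraded_arrays` and the sample text are assumptions made up for illustration; the sample only imitates the mdstat layout and is not real output.

```shell
# Print the names of md arrays whose member map (e.g. [U_]) contains an
# underscore, which marks a missing or failed member.
# Reads mdstat-formatted text on stdin.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }    # remember the array being described
        /\[[U_]*_[U_]*\]/ && name != "" { print name; name = "" }
    '
}

# Made-up sample in the layout of /proc/mdstat.
sample_mdstat='Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      1048512 blocks [2/2] [UU]

md1 : active raid1 sda2[0]
      2096064 blocks [2/1] [U_]

unused devices: <none>'

printf '%s\n' "$sample_mdstat" | degraded_arrays
```

On a server with software RAID the real check would be `degraded_arrays < /proc/mdstat`; empty output means no array reported a missing member.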
belorn about 12 years ago
A quite good checklist/advice on server troubleshooting. What I mostly miss in the list are network tools such as telnet/dig/ip/fping/mtr.
dredmorbius about 12 years ago
vmstat and iostat are also useful for tracking memory issues.

sysstat ('sar') reporting can also provide some of that much-needed history. Sar output is pretty readily visualized with utilities such as gnuplot.
coin about 12 years ago
Yet another site that disables pinch zoom for iOS devices. Pointless..
drew510 about 12 years ago
the crontab business is unnecessary. ls /var/spool/cron?

other than that, I learned some new tools - ss is pretty awesome!
ptman about 12 years ago
df -i
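
`df -i` reports inode usage rather than block usage; a filesystem can run out of inodes while still showing free space. A small sketch of turning that output into a check follows. It assumes the column layout GNU df prints for `df -iP` on Linux (Filesystem, Inodes, IUsed, IFree, IUse%, Mounted on); the function name `inode_hogs`, the threshold, and the sample text are hypothetical. Mount points containing spaces are not handled.

```shell
# Print mount points whose inode usage is at or above a percentage threshold.
# Expects `df -iP`-style text on stdin: one header line, then
# Filesystem Inodes IUsed IFree IUse% Mounted-on.
inode_hogs() {
    awk -v limit="$1" '
        NR > 1 {
            pct = $5
            sub(/%/, "", pct)    # strip the percent sign
            # skip rows without a numeric percentage (some filesystems show "-")
            if (pct ~ /^[0-9]+$/ && pct + 0 >= limit + 0) print $6, pct "%"
        }
    '
}

# Made-up sample imitating `df -iP` output.
sample_df='Filesystem      Inodes  IUsed   IFree IUse% Mounted on
/dev/sda1      6553600 412345 6141255    7% /
/dev/sdb1       655360 655360       0  100% /var/spool
tmpfs           505274      1  505273    1% /run'

printf '%s\n' "$sample_df" | inode_hogs 90
```

Against a live system this would be run as `df -iP | inode_hogs 90`.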
tlarkworthy about 12 years ago
network cable light
martinced about 12 years ago
All these commands are very fine and handy but...

Typically they're better when you can put them in context: you *should* run all these regularly on fully working servers which you know are operating normally so you have "something" you can compare your results to when the shit hits the fan.
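
One way to keep that "something" around, sketched under assumptions: save the output of each command into a directory while the server is healthy, then diff a fresh capture against it during an incident. The helper names `snapshot` and `compare_snapshots` and the directory layout are hypothetical, and the demo uses `echo` as a stand-in command so it behaves the same everywhere.

```shell
# Run a command and save its output under a snapshot directory, one file per label.
# Usage: snapshot DIR LABEL COMMAND [ARGS...]
snapshot() {
    dir=$1; label=$2; shift 2
    mkdir -p "$dir" || return 1
    "$@" > "$dir/$label.txt" 2>&1
}

# Compare two snapshot directories; diff exits non-zero when they differ.
compare_snapshots() {
    diff -r "$1" "$2"
}

# Demo with a stand-in command in place of real diagnostics.
base=$(mktemp -d)
now=$(mktemp -d)
snapshot "$base" listeners echo "sshd:22 nginx:80"
snapshot "$now"  listeners echo "sshd:22 nginx:80 nc:31337"
compare_snapshots "$base" "$now" || echo "differs from baseline"
rm -rf "$base" "$now"
```

On a real server the stand-in would be replaced by commands from the thread, for example `snapshot /var/tmp/baseline ps ps auxf` or `snapshot /var/tmp/baseline inodes df -i`. Output that changes from run to run, such as PIDs and timestamps, will always show up in the diff, so the comparison is a starting point rather than a verdict.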