TechEcho

Ask HN: How do you troubleshoot network problems?

3 points by ilurk about 10 years ago
Is there a good source of information on how to tackle network problems, or is this a hopeless ad-hoc skill?

After reading this recent thread [0] I have to wonder: how the heck would I troubleshoot this!?

I know the basics:
- check that the NIC module is loaded
- check that you have an IP
- check that you can ping the gateway
- check that you can ping your target
- check that you can telnet to the port
- check iptables (or temporarily flush all rules and set the default policy to ACCEPT)
- look for anything that stands out in Wireshark

But how do you go from there?

For example, last Friday, during the wee hours, I lost the connection to the target host. I was multi-hopping over SSH, and for some reason my immediate thought was "the target host 'died'". "But wait a sec, let's go one step at a time."

It was in fact the connection to my first hop on our local network. So I went to the physical machine. I had an IP, and every time I restarted the network I got an IP again. But I was unable to ping the gateway... 90% of the time. Then I remembered that, some months ago, other people on the same subnet had complained about temporary network loss. I restarted the switch, but it didn't help. My current theory is a faulty network switch, but this is yet to be confirmed.

The problem I had (have?) looks way simpler than the one in [0], but networks feel a bit like dark magic.

So, any advice on improving your network detective skills?

[0] https://news.ycombinator.com/item?id=9555057
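A minimal Python sketch of the "telnet to the port" step from the checklist above, using only the standard library. probe_tcp is a hypothetical helper, not an existing tool. The useful part is telling the failure modes apart: "refused" usually means packets got there and something (the host, or a firewall rejecting) actively said no, while "timeout" usually means they were silently dropped somewhere along the path.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Try a TCP handshake to (host, port), roughly what "telnet host port" does.

    Returns "open", "refused", "timeout", or "error: <reason>".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"  # handshake completed; says nothing about app health
    except ConnectionRefusedError:
        # Got a RST or ICMP port-unreachable: the path works, but nothing is
        # listening there, or a firewall REJECTs the connection.
        return "refused"
    except socket.timeout:
        # No answer at all: typically a silent DROP somewhere on the path.
        return "timeout"
    except OSError as exc:
        # e.g. "No route to host" (often failed ARP on the local segment)
        # or a name-resolution failure (socket.gaierror).
        return f"error: {exc}"
```

Running it against each hop in turn (first the gateway, then the next machine in the SSH chain) narrows down which leg is failing, in the same one-step-at-a-time spirit as the post.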

2 comments

mobiplayer about 10 years ago
Always, always, always:

Simultaneous captures on both client and server. I've said it before and I'll say it again: the truth is on the wire. Once you have those, you can work your way along the path with captures on intermediate devices until you catch where the packets are dropped.

Once you're there, it really depends on the product/OS/service/filter that's dropping the packet.

If you're really experienced, you can start building the house from the roof, i.e. trying this or that change because in your head the symptoms match another case you've seen before. If you're not, just take baby steps and you'll get there sooner than you think.

Of course you can check basic connectivity first, but when I hear "network problem" I assume that has already been checked.

Edit: by the way, your problem will keep looking like dark magic until you dig into what's really going on. Do you know how a packet is delivered to a host in your subnet? I'm guessing not, but just so you know: your machine will first try to learn the destination's MAC address via ARP (again, if it's inside the same subnet; for destinations outside it, the MAC it needs is the gateway's). Can you confirm that you resolve your gateway's MAC address when you try to ping it? Can you confirm that the resolved MAC address is the correct one? Someone might be spoofing it to intercept all the traffic leaving the subnet, or maybe you've got a pair of routers/firewalls whose HA setup is not working as expected. In any case, go a layer down and check that.
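On Linux, the kernel's ARP cache answers both questions above (did the gateway's MAC resolve, and to what?). It is exposed as /proc/net/arp, and ip neigh shows the same table. A minimal sketch assuming Linux and that six-column format; parse_proc_net_arp, shared_macs and read_arp_cache are hypothetical helpers. Ping the gateway first so the kernel actually ARPs for it, then compare the result with a MAC you got from somewhere you trust (e.g. the router's own interface listing).

```python
from __future__ import annotations

from collections import defaultdict

INCOMPLETE_MAC = "00:00:00:00:00:00"
ATF_COM = 0x2  # kernel flag: entry completed (the neighbour answered)


def parse_proc_net_arp(text: str) -> dict[str, tuple[str, str]]:
    """Map IP -> (MAC, device) for completed entries in /proc/net/arp text.

    Columns: IP address, HW type, Flags, HW address, Mask, Device.
    """
    entries: dict[str, tuple[str, str]] = {}
    for line in text.splitlines()[1:]:  # skip the column header
        fields = line.split()
        if len(fields) < 6:
            continue
        ip, _hw_type, flags, mac, _mask, device = fields[:6]
        # Unanswered ARP requests show flags 0x0 and an all-zero MAC.
        if int(flags, 16) & ATF_COM and mac != INCOMPLETE_MAC:
            entries[ip] = (mac.lower(), device)
    return entries


def shared_macs(entries: dict[str, tuple[str, str]]) -> dict[str, list[str]]:
    """Return MACs claimed by more than one IP.

    Not proof of spoofing (proxy ARP does this legitimately), but worth a
    closer look if the gateway's MAC shows up here.
    """
    by_mac: dict[str, list[str]] = defaultdict(list)
    for ip, (mac, _device) in entries.items():
        by_mac[mac].append(ip)
    return {mac: sorted(ips) for mac, ips in by_mac.items() if len(ips) > 1}


def read_arp_cache(path: str = "/proc/net/arp") -> dict[str, tuple[str, str]]:
    """Read and parse the live ARP cache (Linux only)."""
    with open(path) as f:
        return parse_proc_net_arp(f.read())
```

If the gateway's IP is missing from read_arp_cache() right after a failed ping, ARP itself is failing, which points at layer 2 (cable, port, switch) rather than routing or firewalls.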
floppydisk about 10 years ago
Start simple and work your way up. Networks can be complex beasts with a lot of moving parts, especially if you're crossing subnets or have high traffic volume.

For your immediate problem, traceroute would be a good place to start to help figure out where you're dying on the local network. If the switch is eating packets and connections, the traceroute should stop at the switch (a plain layer-2 switch never shows up as a hop itself, so what you'll see is the trace failing before the first router beyond it).

Before you blame faulty hardware, check the bandwidth load you're pushing across the switch. If you're saturating the link and trying to push more traffic through the switch than it can support, you'll get what looks like connection loss. Also check for a loop somewhere, i.e. someone plugging a cable back into the same switch and creating a packet storm that doesn't die.

Learning networks takes time and experience, and each network is a different beast with its own usage patterns, hardware, and characteristics. A few rules of thumb I've found helpful when debugging networks:

1) Start simple. ping/traceroute/tcpdump/nc (netcat) are your best friends on Linux and should be the first place you start when you're debugging by hand. nc is a pretty sweet program because it lets you set up arbitrary TCP connections between two machines without having to stand up a full software stack: nc -l <portnum> sets up a listener (some netcat variants want nc -l -p <portnum>). This is incredibly helpful when you have software that's supposed to connect over the network and isn't working: run an nc listener on the target port and see whether a connection attempt arrives. If it does, the bug is further up the stack.

2) Networks can be complicated beasts with a lot of moving parts that interact in complex and sometimes unpredictable ways. Start simple and work your way up in complexity when trying to pin down the cause.

3) Make sure your system logging and monitoring are watching network activity and raise warnings when things like connectivity outages happen. Decent monitoring software can save a lot of debugging time because it'll tell you what happened. Our IT guys used Icinga to monitor everything. It's a bit of a PITA to set up, but worth it to be able to see what was going wrong and where.

4) Bandwidth isn't infinite. Check whether your hardware is saturated. We had several issues where our internal network traffic was saturating the original network design, giving everyone on the network absolutely horrid performance. After we rewired the network and isolated the chatty boxes on their own switches, network performance improved drastically. Don't assume the network hasn't outgrown its original design, or that burst traffic isn't overloading the setup. It's possible, and it would also explain intermittent outages.
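The nc -l check from point 1 can also be done from Python when netcat isn't installed on the target box. A minimal sketch, standard library only; open_listener and wait_for_connection are hypothetical names. Like nc, it only proves that a TCP connection reached this host and port, and the real service has to be stopped first so the port is free.

```python
import socket


def open_listener(host: str = "0.0.0.0", port: int = 0) -> socket.socket:
    """Bind and listen on (host, port), like the setup half of "nc -l <port>"."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # Lets you rerun the check right away without "Address already in use".
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind((host, port))
        srv.listen(1)
    except OSError:
        srv.close()
        raise
    return srv


def wait_for_connection(srv: socket.socket, timeout: float = 30.0):
    """Wait for one client; return (peer_address, first_bytes) or None on timeout."""
    srv.settimeout(timeout)
    try:
        conn, peer = srv.accept()
    except socket.timeout:
        return None  # nothing arrived: suspect the network/firewall, not the app
    with conn:
        conn.settimeout(timeout)
        try:
            data = conn.recv(1024)  # whatever the client sends first, if anything
        except socket.timeout:
            data = b""
        return peer, data
```

For example, wait_for_connection(open_listener(port=8080)) with the misbehaving client pointed at this host (8080 is only a placeholder). A returned peer address means the network delivered the connection and the bug is further up the stack; None means it never arrived.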
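The saturation check in point 4 comes down to simple arithmetic: utilization is the bits moved divided by what the link can carry in the same time, so a gigabit link moving 125,000,000 bytes in one second is at 100%. A minimal sketch assuming Linux, where /sys/class/net/<iface>/statistics/rx_bytes and tx_bytes are cumulative byte counters and /sys/class/net/<iface>/speed reports the negotiated speed in Mb/s; utilization, read_counters and sample_interface are hypothetical helpers.

```python
from __future__ import annotations

import time
from pathlib import Path

SYSFS_NET = Path("/sys/class/net")


def utilization(bytes_before: int, bytes_after: int, interval_s: float, link_mbps: float) -> float:
    """Fraction of link capacity used in one direction over interval_s seconds."""
    bits = (bytes_after - bytes_before) * 8
    return bits / (interval_s * link_mbps * 1_000_000)


def read_counters(iface: str) -> dict[str, int]:
    """Cumulative rx/tx byte counters for iface from sysfs (Linux only)."""
    stats = SYSFS_NET / iface / "statistics"
    return {d: int((stats / f"{d}_bytes").read_text()) for d in ("rx", "tx")}


def sample_interface(iface: str, interval_s: float = 1.0) -> dict[str, float]:
    """Measure rx/tx utilization of iface over roughly interval_s seconds."""
    # Reading "speed" fails or gives -1 on interfaces that are down or
    # don't report a speed (many virtual and wireless ones).
    speed_mbps = int((SYSFS_NET / iface / "speed").read_text())
    if speed_mbps <= 0:
        raise ValueError(f"{iface} does not report a link speed")
    start = time.monotonic()
    before = read_counters(iface)
    time.sleep(interval_s)
    after = read_counters(iface)
    elapsed = time.monotonic() - start  # actual interval, not the requested one
    return {d: utilization(before[d], after[d], elapsed, speed_mbps) for d in before}
```

This only sees the host's own link. The switch's uplinks need the switch's own counters (e.g. a managed switch's interface stats), and any average over several seconds smooths out the short bursts point 4 warns about, so keep the interval short when chasing intermittent drops.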