As an operations dude (or, at least, a recovering one) I've spent much of my professional life fighting post-deploy issues at the coal face, which almost always involves (when it's an app issue) digging into the code. Even when we do that, though, an important part is observing what the boxes are doing in real time (we're not all dumb "INSTALL RPM AND RESTART" guys!) so we can explain the impact and mitigate it.

Usually, my goal is then to either:

1) Find a configuration/infra issue we can solve (best outcome for everyone), or

2) Give the most info to dev to enable a code fix, and roll back/mitigate in the interim.

In the last few years, people have paid me lots of money to build these really cool ELK or Splunk log-chewing systems for them, which, I have to admit, are utterly useless to me. There are also really great monitoring tools which give me historical graphs of stuff I usually don't care about... but I, and most of the folks I run with, don't really reach for these tools as a first resort when we hit a prod issue.

Let's say, hypothetically, a customer of mine has an issue where some users are getting timeouts on some API or another. We got alerted through some monitoring or whatever, and so we start taking a look.

First step for me is *always* to log in to a webserver at random (or all of them) and look at the obvious. Logs. Errors. dmesg. IO. Mem. Processes. The pretty-graph ELK tools can tell me this info, but whatever I want to look at next is easier to jump to when I'm already there than trying to locate IOWAIT in something like Splunk (there's a rough sketch of this first pass below).

All looks good on the web servers. OK, let's check out the DBs in one term and the apps in another. You follow the request through the infra. Splunk or ELK can tell me one of my apps is eating 25,000 FDs, but then what? I need to log in anyway. Next on the list are tools like strace/truss, iostat, netstat and so on, which will immediately tell you if it's an infra/load/config issue, and we go from there. Down into the rabbit hole.
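To make that first pass concrete, here's roughly what those first few minutes on a suspect web server look like for me. The log path and exact flags are examples (Linux with sysstat and GNU ps assumed); treat it as a sketch, not a runbook:

    # the obvious, in rough order
    tail -n 200 /var/log/nginx/error.log   # recent app/web errors (path is an example)
    dmesg | tail -n 50                     # kernel complaints: OOM kills, disk errors
    iostat -x 1 5                          # per-device IO utilisation and await times
    free -m                                # memory and swap pressure
    ps aux --sort=-%cpu | head -n 15       # what's actually chewing CPU right now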
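And when the dashboard has already told you "this app holds 25,000 FDs", the follow-up questions can only be answered on the box. A sketch of that dig, with PID 1234 standing in as a placeholder for the greedy process:

    # confirm the FD count (Linux)
    ls /proc/1234/fd | wc -l
    # what are they? sockets, pipes, plain files?
    lsof -p 1234 | awk '{print $5}' | sort | uniq -c | sort -rn
    # attach for a few seconds, then ^C for a syscall summary
    strace -c -p 1234
    # TCP states for this PID: piles of CLOSE_WAIT are a classic leak
    netstat -tnp 2>/dev/null | grep '1234/' | awk '{print $6}' | sort | uniq -c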
The point I'm trying to make is: for me at least (and my crew), the tools we're deploying now, and being paid well to deploy, like Dataloop, New Relic, Splunk and so on, are actually useless for solving real-time prod issues, because they only expose a very small amount of info, and almost regardless of the issue I'll need to be on the box looking at something unique to the problem, either to explain its impact or to mitigate it.

As I said though, I'm a recovering ops person and I'm doing dev these days. I still tend to use print statements when I hit a bug; although since I'm now mostly doing Erlang, bugs are rare and usually trivial to track down.