From 2003 to 2005: Co-created our database release process, integrated it with our code release process, developed our production problem management process, and built a bunch of our production monitoring tools.<p>Many of those things have been superseded in the intervening 15 years, but it still pleases me to walk by the NOC and see tools I wrote 10-15 years ago <i>still running</i> (now maintained by others, but still running).<p>One of the most useful and longest-lived tools is also one of the simplest (I literally built the essence of it in 4 hours, 6-10 PM one evening). It graphs a timeline: 1 second per pixel in X, logarithmic dollar value in Y, one point plotted for every order. That was the first version.<p>It has since evolved to show a bunch of per-minute summary data on the screen (AOV, CR%, errors/info/warning/404s, total bookings, paid vs unpaid orders, database connections in use, idle connections available, long-running transactions, long-running pages, etc.), and it records to a database so you can "play back" outages or go exploring. It's not the best tool for deep digging, but when you want a fast-reacting "quick check" that the entire site is working post-release or post-outage, it shows unambiguously whether people are getting all the way through checkout. You might be surprised what you can learn from such a simple tool.
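<p>For anyone curious what the first version's plotting amounts to, here is a rough Python sketch of the coordinate mapping only (1 second per pixel in X, log dollar value in Y). It is not the original code: the canvas size, the dollar range, and the name order_to_pixel are all made up for illustration, and I'm assuming the newest second sits at the right edge.

```python
import math

# Hypothetical canvas size and Y-axis dollar range (not from the original tool).
WIDTH, HEIGHT = 600, 201
MIN_DOLLARS, MAX_DOLLARS = 1.0, 10_000.0


def order_to_pixel(order_time, amount, now):
    """Map one order to an (x, y) pixel, or None if it falls outside the window.

    order_time and now are Unix timestamps in seconds; amount is in dollars.
    """
    age = int(now - order_time)          # whole seconds since the order
    if age < 0 or age >= WIDTH:
        return None                      # in the future, or scrolled off the left edge
    x = WIDTH - 1 - age                  # 1 second per pixel, newest at the right

    # Clamp so zero-dollar or huge orders still land on the canvas
    # and we never take the log of zero.
    clamped = min(max(amount, MIN_DOLLARS), MAX_DOLLARS)
    frac = math.log10(clamped / MIN_DOLLARS) / math.log10(MAX_DOLLARS / MIN_DOLLARS)
    y = round((HEIGHT - 1) * (1.0 - frac))   # y = 0 is the top row
    return x, y
```

<p>With these assumed settings a $100 order placed right now lands halfway up the right-hand edge, since 100 is halfway between 1 and 10,000 on a log scale. That is the point of the log axis: small and large orders stay visible on the same screen.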