Yay for automated monitoring software. Nagios (<a href="http://www.nagios.org/" rel="nofollow">http://www.nagios.org/</a>) does this for networks (and is extensible to some other things). At my old job we used Hobbit (<a href="http://hobbitmon.sourceforge.net/" rel="nofollow">http://hobbitmon.sourceforge.net/</a>) to watch our Java server instances (memory usage, etc.). There’s no reason these monitoring programs couldn’t also monitor internal program statistics, as long as those stats are made available.<p>Generally you run the monitor on your internal network and give it some hook for fetching information that’s only accessible from there (SSH, a limited-access URL, etc.).<p>Monitoring programs are super-powerful and generally complex. Check them out: knowing your way around one is a good skill to have when working with production software.<p>(I also posted this on the article.)
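To make the "limited-access URL" idea concrete, here is a minimal sketch of a program exposing its own counters for a monitor to poll. It uses only the Python standard library; the counter names, the `/stats` path, and the helper names are all made up for illustration, and binding to 127.0.0.1 is just one assumed way of keeping the endpoint internal.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counters; a real program would update these as it runs.
STATS = {"requests_served": 0, "errors": 0, "queue_depth": 0}
STATS_LOCK = threading.Lock()


def record(name, delta=1):
    """Increment one counter in a thread-safe way."""
    with STATS_LOCK:
        STATS[name] = STATS.get(name, 0) + delta


def snapshot():
    """Return a copy of the counters so callers never see a partial update."""
    with STATS_LOCK:
        return dict(STATS)


class StatsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/stats":  # hypothetical path, not a Nagios or Hobbit convention
            self.send_error(404)
            return
        body = json.dumps(snapshot()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the monitor's polling out of the program's own log


def start_stats_server(host="127.0.0.1", port=0):
    """Serve the counters on a background thread; port 0 picks a free port."""
    server = HTTPServer((host, port), StatsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

A monitor check would then fetch `/stats` on whatever port the server reports in `server.server_address` and compare the numbers against its own limits. How that check gets wired into Nagios or Hobbit is left out here, since it depends on the tool.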
I've worked with threshold logic like that for collecting and analyzing traffic on telephone switches, where an alarm or notification would be generated whenever a threshold was exceeded.<p>Personally I would never want to debug something like that based on a statistical probability that something <i>might</i> have gone wrong. Better to fail gracefully with something like multiple chains, so that when a request chain goes down it gets logged, cleaned up, and recreated.<p>Worst case, the user gets a request timeout warning.
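Roughly what I mean by log, clean up, recreate, as a sketch only. `Chain` and `ChainPool` are hypothetical stand-ins, since I don't know what a request chain looks like in the article's system; the point is just the shape of the recovery path.

```python
import logging

log = logging.getLogger("chains")


class Chain:
    """Hypothetical stand-in for one request chain."""

    def __init__(self, chain_id):
        self.chain_id = chain_id
        self.alive = True

    def handle(self, request):
        if not self.alive:
            raise RuntimeError(f"chain {self.chain_id} is down")
        return f"chain {self.chain_id} handled {request}"

    def close(self):
        """Release whatever the chain was holding (sockets, threads, ...)."""
        self.alive = False


class ChainPool:
    def __init__(self, size):
        self.chains = [Chain(i) for i in range(size)]
        self.next_id = size

    def _replace(self, index, error):
        dead = self.chains[index]
        log.warning("chain %s failed: %s", dead.chain_id, error)  # logged
        dead.close()                                               # cleaned up
        self.chains[index] = Chain(self.next_id)                   # recreated
        self.next_id += 1

    def dispatch(self, request):
        """Try each chain once; replace any that fail along the way."""
        for index in range(len(self.chains)):
            try:
                return self.chains[index].handle(request)
            except Exception as error:
                self._replace(index, error)
        # Every chain failed for this request: the caller just sees a timeout.
        raise TimeoutError("request timed out")
```

The failure is handled at the moment it happens and leaves a log line behind, so there is nothing to infer from statistics afterwards.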