If you guys are writing post-mortem blog posts because you ran out of disk space, your solution should really be to hire a sysadmin or ops-focused engineer. Disk-related issues are among the easiest to diagnose, so take this experience as a wake-up call that your current team is in over its head. If you can't afford a sysadmin or don't want to bring that kind of talent in-house, you can try using a hosted solution. But make sure to really test out different hosted services before committing to one, since they can vary tremendously in terms of quality and reliability.
I have used Monit for years for basic server monitoring. It's a very tiny daemon with a single, simple config file. Basically I can tell it "when disk usage exceeds X, or RAM usage exceeds Y, or CPU usage exceeds Z, or the process identified by pidfile foo.pid isn't running, or I can't ping something, email me". No monitoring servers, no network polling, no SNMP, no monthly fees. Sounds like five lines of Monit config would have saved these guys. See the config file docs at <a href="http://mmonit.com/monit/documentation/monit.html" rel="nofollow">http://mmonit.com/monit/documentation/monit.html</a>.
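To make the "when disk usage exceeds X, email me" rule concrete, here is a minimal Python sketch of that kind of threshold check. It is not Monit and not its config syntax; the function name `disk_alert`, the `usage_fn` parameter, and the 90% limit are all made up for illustration, and the actual emailing is left out.

```python
import shutil
from typing import Callable, Optional


def disk_alert(path: str, max_percent: float,
               usage_fn: Callable = shutil.disk_usage) -> Optional[str]:
    """Return an alert message if usage of `path` exceeds max_percent, else None.

    `usage_fn` is a hypothetical hook so the check can be exercised without
    touching a real filesystem; it defaults to shutil.disk_usage.
    """
    usage = usage_fn(path)
    percent = usage.used * 100 / usage.total
    if percent > max_percent:
        return f"disk usage on {path} is {percent:.1f}% (limit {max_percent}%)"
    return None


if __name__ == "__main__":
    # Assumed threshold of 90%; a real setup would send the message by
    # email (for example with smtplib) instead of printing it.
    message = disk_alert("/", 90)
    if message:
        print(message)
```

Run from cron every few minutes, something like this covers the disk case only; the appeal of a daemon like Monit is that the same idea extends to RAM, CPU, pidfiles and pings without writing any of it yourself.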
This issue is oddly similar to issues seen at a prior gig, where MSSQL and MySQL transaction logs (replication bin logs for MySQL) consumed excess disk space when large operations did not fully replicate (for various reasons), and the log volume filled.<p>Monitoring helps, but unless your Ops staff knows what to do with a misbehaving database (RDBMS or other), it falls on the DBA or equivalent.
I'm no server admin, but it seems to be a recurring theme where big issues are narrowed down to disk space running out. Is there not something that can automatically check this and send out alerts?
Reading that site is like the way I imagine having cataracts must be.<p>Please, more contrast between text and background. It's like reading through a haze.