The short version: the filesystem got corrupted, and the backup was just a file sync over a FireWire network to another machine, meaning the bad data was backed up and presumably overwrote the older, good data. They had RAID, but the problem was at the filesystem layer in software, so the errors were faithfully stored.<p>He seems to understand how terrible a design decision he made regarding the backup system, and he appears physically affected when having to admit, publicly, the details of the infrastructure (or lack thereof) that caused this.
If you have any kind of staging/testing server, I'd highly recommend populating it from your production backups on a regular basis. That way you test new code releases against real data, and you know that your backups actually restore.
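A rough sketch of what that refresh could look like (the dump path, database name, and restore command here are made up for illustration):

```shell
#!/bin/sh
# Refresh a staging database from the most recent production dump.
# Refuses to run if the dump is missing or empty, so a broken backup
# pipeline surfaces as a loud failure instead of a quiet stale copy.
refresh_staging() {
    dump="$1"        # e.g. /backups/latest.sql.gz (hypothetical path)
    restore_cmd="$2" # loads SQL from stdin, e.g. "mysql -u deploy staging"
    if [ ! -s "$dump" ]; then
        echo "refusing to refresh: dump missing or empty: $dump" >&2
        return 1
    fi
    gunzip -c "$dump" | $restore_cmd
}

# Example (hypothetical names):
# refresh_staging /backups/latest.sql.gz "mysql -u deploy staging"
```

Run it from cron after each backup; if the restore ever fails, you've found out your backups are broken before you actually need them.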
He had RAID and was doing filesystem-level backups, i.e. copying over the entire MySQL DB file. When filesystem-level corruption occurred, the backup script overwrote a good (perhaps one-day-old) backup file with a corrupted one, so his backup was worthless.<p>The first thing that comes to mind is that he could have used application-level backup, i.e. going through MySQL itself. The script would have noticed that the DB was corrupted because reads (SELECT) would have failed, and the backup script would have stopped and sent him an email so he could restore the last good backup.<p>If he had used a cloud service like Amazon SimpleDB, he wouldn't have had to worry about filesystem-level corruption, because that's abstracted away by Amazon. (And it's replicated.)<p>This is still not enough, though. What if the site gets hacked and the hacker issues DELETE statements? Then all your data is gone, and even application-level backup will succeed (it will happily read the empty DB), thus overwriting your old backup.<p>I guess the conclusion is to keep around several copies of the data, and to have sanity checks in place to avoid overwriting good backups. In his case it was hard (given it's a homegrown application) to keep around many copies, because his DB was 500 GB in size.
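The "don't overwrite a good backup with a bad one" part is cheap to implement: dump to a temporary file and only rotate it in if the dump command exited cleanly. A minimal sketch (the dump command and paths are illustrative):

```shell
#!/bin/sh
# Dump to a temp file first; replace the previous backup only if the
# dump succeeded. A corrupted DB makes the dump fail, so the last good
# backup survives and you get an alert instead.
safe_backup() {
    dump_cmd="$1"  # e.g. "mysqldump --all-databases -u backup"
    out="$2"       # e.g. /backups/db.sql
    if $dump_cmd > "$out.tmp"; then
        mv "$out.tmp" "$out"          # rotate only on success
    else
        rm -f "$out.tmp"
        echo "backup failed; kept previous $out" >&2
        # here you'd alert yourself, e.g. via mail(1)
        return 1
    fi
}
```

Combine this with keeping several dated copies and the hacked-with-DELETEs scenario above at least leaves you older snapshots to fall back on.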
A simple tip for those that run any kind of database:
Be sure to replicate them in master-slave (or master-master), and base your backups on taking a slave down.<p>Hot backups only work for very small databases - even those based on LVM snapshots, tarsnap, InnoDB hot backups, etc. With big databases you will most likely be IO bound, and a backup will take your site down.<p>If you have lots of load and lots of data, then re-creating a slave requires lots of downtime. At Plurk.com we have had a 4-hour downtime due to re-creating a slave, so be sure to run a master-slave setup and have fresh slaves replicated at all times (we learned this the hard way :)).
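The slave-based backup boils down to three steps: pause the replication SQL thread so the data stops changing, dump, then resume. A sketch, with the client and dump commands passed in (real values might be "mysql -h slave1" and "mysqldump -h slave1 --all-databases"):

```shell
#!/bin/sh
# Pause replication on a slave, dump while the data is consistent,
# then resume. Replication is restarted even if the dump fails, so a
# bad backup run doesn't leave the slave falling behind forever.
slave_backup() {
    mysql_cmd="$1"  # client pointed at the slave
    dump_cmd="$2"   # dump command pointed at the slave
    out="$3"        # e.g. /backups/slave.sql
    $mysql_cmd -e "STOP SLAVE SQL_THREAD;"
    if $dump_cmd > "$out.tmp"; then
        mv "$out.tmp" "$out"; status=0
    else
        rm -f "$out.tmp"; status=1
    fi
    $mysql_cmd -e "START SLAVE SQL_THREAD;"  # always resume
    return $status
}
```

The master never sees the IO load, and the slave catches up from the relay log once the dump finishes.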
My recommendation for basic backup needs: rsnapshot. I use it to back up our public server to our internal network, as well as my desktop machine to an encrypted portable drive: <a href="http://www.rsnapshot.org/" rel="nofollow">http://www.rsnapshot.org/</a><p>It's probably similar to rdiff-backup, which I haven't used. If you're fine with daily or hourly backups and don't have too much data (<100 GB), rsnapshot together with regular SQL dumps works fine.
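For anyone who hasn't seen it, an rsnapshot config boils down to a snapshot root, retention levels, and backup sources. A fragment along these lines (the host and paths are made up; note that rsnapshot.conf fields must be separated by tabs, not spaces):

```
# rsnapshot.conf fragment (fields are TAB-separated)
snapshot_root	/backups/snapshots/

retain	hourly	6
retain	daily	7
retain	weekly	4

# pull the web root and the directory where SQL dumps are written
backup	root@example.com:/var/www/	example.com/
backup	root@example.com:/var/backups/sql/	example.com/
```

Cron then drives the rotation, e.g. `rsnapshot hourly` every four hours and `rsnapshot daily` each night; dump your databases into the backed-up directory shortly before the snapshot runs. Unchanged files are hard-linked between snapshots, which is why several days of retention costs little extra space.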