So if you were backing up your data to Tarsnap, then you'd be up and running as quickly as you could launch a new instance and redownload everything. And a $500 credit is enough to power a micro droplet for 100 months, or a small droplet for 50 months. DO handled this well.<p><a href="http://www.tarsnap.com" rel="nofollow">http://www.tarsnap.com</a><p>EDIT: s/years/months/g. Thanks.
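For anyone who hasn't used it, a minimal sketch of the Tarsnap round trip (archive name and paths are just placeholders, adjust for your box):

    # Create a dated archive of whatever you care about.
    tarsnap -c -f "www-$(date +%Y-%m-%d)" /var/www /etc/nginx

    # On the replacement instance: reinstall tarsnap, put your key file back
    # (default location is /root/tarsnap.key), then pull everything down.
    tarsnap --list-archives
    tarsnap -x -f "www-2014-01-15"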
So this is a technical problem I am having right now that's preventing me from backing up a Postgres database completely (hope someone here can help).<p>I have a master Postgres database that is receiving a TON of transactions per second (I'm talking thousands of concurrent transactions). We tried running pg_dump on this database, but the DB is just too huge, and it took more than 4 days to completely dump out everything. Not only that, but it impacted performance to the point where backing it up was just not feasible.<p>No problem... just create a slave DB and run pg_dump on that, right? We did just that, but the problem is that you can't run long-running queries on a hot standby (queries that take more than a minute).<p>What would you do in my scenario? With the hot standby, I technically am backing up my data, but I would have 100% peace of mind if I could take daily backups in case someone accidentally ran a "DROP DATABASE X", which would delete the hot standby/slave DB as well.
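For what it's worth, the workaround that keeps coming up (sketched below with placeholder host/database names) is to pause WAL replay on the standby while the dump runs, so recovery conflicts can't cancel it; I'd love to hear whether anyone runs this in production:

    # Pause replay on the 9.x standby so long queries survive.
    psql -h standby-host -U postgres -c "SELECT pg_xlog_replay_pause();"

    # The custom-format dump can now run as long as it needs.
    pg_dump -h standby-host -U postgres -Fc -f /backups/mydb-$(date +%F).dump mydb

    # Let the standby catch up again afterwards.
    psql -h standby-host -U postgres -c "SELECT pg_xlog_replay_resume();"

    # Alternative: set max_standby_streaming_delay = -1 in the standby's
    # postgresql.conf, so queries win over replay (the standby just lags).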
The abrasive headline is kind of unfortunate, as the actual moral of the story given at the end is exactly the right takeaway: never assume your hardware is infallible, and always have backups that you know you can use when your server experiences a wildly improbable catastrophe.<p>Also, very impressed by Digital Ocean's response here. Given their reputation as a budget host, they really do put a lot of effort into service.
That's <i>way</i> more compensation than I would have expected. AWS usually won't even notify you until after the node has gone down.<p>Hardware failures happen; an application needs to be tolerant of them.
It's great you had backups, but why a write-up? Is it an attempt to smear DO's otherwise good name? It's an unmanaged VPS, so it's your responsibility to keep backups of your box, not theirs. And hardware fails all the time, so you can expect this to happen anywhere.
> And if you just launched and have a single instance running, let your alpha users know that there will probably be some downtime.<p>That's true. But there's no reason for extended downtime even if that instance goes down. Make sure your whole setup is described in chef/puppet/salt/ansible/cf/whatever and even a rebuild from scratch takes only minutes then. There's really little reason to skip that these days.
DO is affordable enough that the minimum you should run is two droplets. Having said that, I'm actually fairly impressed with the $500 credit, and now you have no excuse not to run two VMs. Consider it a lesson learnt.
DigitalOcean's pricing page indicates that "All cloud hosting plans include automated backups". (<a href="https://www.digitalocean.com/pricing" rel="nofollow">https://www.digitalocean.com/pricing</a>) From the email you received, it sounds like this is clearly not the case. I wonder what other claims DigitalOcean is making that are not true.
This might sound a bit glib, but RAID 5 shouldn't really be used in modern storage.<p>Even if you ignore the performance issues (which can vary by device), it's just not safe: depending on the size of the drives, a rebuild can take anywhere up to 30+ hours.<p>Bear in mind that you tend to use disks that are all from the same batch; that leaves you in the danger zone for far too long.<p>Your options are (a quick sketch follows the list):
Some sort of clever RAID (ZFS-type thing)
Another type of clever RAID (like the LSI chunk thingy in the DCS37000)
RAID 10
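If you go the ZFS route, a rough sketch (device names are placeholders):

    # Striped mirrors (the RAID 10 shape) instead of RAID 5; a resilver only
    # has to read the surviving half of one mirror, not the whole array.
    zpool create tank mirror /dev/sdb /dev/sdc mirror /dev/sdd /dev/sde

    # Periodic scrubs catch latent read errors before a rebuild trips over them.
    zpool scrub tank
    zpool status tank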
Was this really a dual drive failure, or was this the rather common single drive failure plus undetected errors on a backup drive, that show up when trying to rebuild?<p>Because that happens a lot, and it's very important to do a <i>full</i> read of every drive in the array at least weekly! You have two options for doing that:<p>If you are using Linux md RAID, run the "check" command, which automatically does the test using background I/O (but does still impact things). On Debian, and perhaps other distros too, the mdadm package will do it every month by default. Make sure to set a minimum speed or it might never finish if you have a busy system.<p>You can also use the built-in SMART on the disk to do a long self test. This also uses background I/O, and I think it has a bit less impact on existing operations. (But you have to have some idle time on the disk or it will never finish.) If you install smartmontools you can set smartd to do this test for you every week and keep an eye on the results.<p>I personally do both, plus a short self test of the disk every night.
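Concretely, something like this (md0 and sda stand in for your array and disks):

    # Kick off a full verify of the md array; progress shows up in /proc/mdstat.
    echo check > /sys/block/md0/md/sync_action

    # Raise the floor on background resync speed (KB/s) so it actually finishes.
    echo 50000 > /proc/sys/dev/raid/speed_limit_min

    # One-off SMART long self-test, then review the results later.
    smartctl -t long /dev/sda
    smartctl -a /dev/sda

    # Or let smartd schedule it: a line like this in /etc/smartd.conf runs the
    # long self-test every Saturday at 3am and keeps monitoring attributes.
    #   /dev/sda -a -s L/../../6/03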
I truly believe that we did the best we could in this instance. Drive failures are always unfortunate; even with backups, downtime exists.<p>That being said, we're always genuinely looking to improve, and I'd welcome your feedback on how you feel we did and how you feel we could do better. Please do reach out to me personally at john@do! Thanks. :)
Good thing you had backups.<p>With that being said, these days it's a good idea to use a deployment tool or configuration management system like puppet/salt/ansible/chef/etc, especially in a virtualized environment. This will help with scalability as well as situations such as these.
This is the reason why I moved all data away from my server instances. My images are hosted by Cloudinary (with an S3 bucket backup) and my databases are Amazon RDS instances.
I don't care if a server goes down; I can launch a new one in a matter of minutes (with Ansible) without any data loss.
The author is sweet; his conclusion was "always back up your data". If it were me, I would probably say "I'm moving away, and will never trust them with my data again".
The $500 credit from DO is quite reassuring. Usually if the HD fails and your data is lost, you're out of luck. I hear the "horror" stories of some hosts reusing consumer hard drives between servers, so I learned: your data is your responsibility. I'm glad the OP had backups, but these failures happen; thankfully DO had the business sense to compensate them.<p>Seems like good advertising for DO, as any knowledgeable system admin knows drives fail. DO could have done nothing at all.
> And if you just launched and have a single instance running, let your alpha users know that there will probably be some downtime.<p>How about instead "alpha users should know that there will probably be some downtime". Multiple instances don't really fix that.
Nice move from DO to give everyone $500 credit. As I remember, they don't guarantee data safety (you still need backups even if they did). Double disk failure is a rare thing, but it happens.