"Amazon EBS sucks. I just lost all my data"

111 points by cemerick, about 15 years ago

16 comments

brk, about 15 years ago

Although I sometimes get downvoted for this, I'll say it again:

You can't outsource your liability.

If your product is a webapp, then the underlying messy bits of backups, hardware, availability and redundancy also require some amount of conscious thought on your part. Not every site/app needs its own mini-datacenter, and you might not even need your own dedicated server (though you probably do when you reach a certain minimal amount of scale). But you DO need to have someone who is thinking about backups and availability, and it is not a valid solution to assume that the smart folks at Amazon or Rackspace or any other hosting provider are going to be completely and consistently working with your best interests and uptime in mind.

EVERYTHING fails at some point. Every server, every generator, every upstream connection, every hosting provider big or small. And in this case I mean fail as in goes dark for some period of time not covered by backups or hot-spares.

So, plan accordingly.

dogas, about 15 years ago

Why is this on HN? AWS provides a great way to back up EBS volumes called snapshots. Snapshots only store the deltas from the previous snapshot, and all the work to create one is done by AWS, not the server it is attached to.

This guy didn't read the docs and did not use AWS snapshots. It was the equivalent of not having a backup strategy for your local hard drive.

jasonkester, about 15 years ago

This post happens about once a week on the Amazon forums. I've watched it play out dozens of times on the S3 and CloudFront forums too, and every single time it turns out to be operator error.

In this case, the guy didn't realize he needed to take snapshots of his volumes. It's not surprising, really, since the documentation isn't so great for AWS, and it's probably even more painful knowing that it would have been a single button click to back up his volume using Amazon's tools.

But in the end, there's nothing to see here. Just like the guy who wakes up in the morning to find all his S3 files mysteriously gone (after he 'renamed' his bucket the previous night by dropping and recreating it), it always turns out to be the user shooting himself in the foot.

And in the cases when Amazon actually does something wrong, they're always on top of it immediately and back with a public explanation within hours. (from my experience)

jwr, about 15 years ago

> "expect an annual failure rate (AFR) of between 0.1% – 0.5%, where failure refers to a complete loss of the volume"

Well, I think the OP has just experienced a sample from the probability distribution characterized above.
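A rough sketch of the arithmetic behind that remark, assuming each volume fails independently at the quoted AFR. The independence assumption, the function name, and the fleet size of 1,000 volumes are illustrative choices, not figures from Amazon's documentation.

```python
def prob_at_least_one_loss(afr: float, volumes: int) -> float:
    """Probability that at least one of `volumes` independent volumes is lost in a year."""
    # Each volume survives the year with probability (1 - afr);
    # all of them survive with probability (1 - afr) ** volumes.
    return 1.0 - (1.0 - afr) ** volumes


# The two ends of the quoted AFR range, for a hypothetical fleet of 1,000 volumes.
for afr in (0.001, 0.005):
    p = prob_at_least_one_loss(afr, 1000)
    print(f"AFR {afr:.1%}: P(at least one volume lost among 1000) = {p:.1%}")
```

Under these assumptions, a total loss somewhere in a large user base each year is close to certain even though any single volume is very unlikely to fail.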

garnaat, about 15 years ago

There was a pretty lively exchange on Twitter last night regarding this. I strongly disagree with the AWS forum poster. EBS does not suck. In fact, EBS and other services from AWS and Rackspace provide the building blocks to allow you to construct incredibly scalable, available systems.

However, you have to accept that when you use IaaS you are taking on some of the operational responsibility and you have to know what you are doing or find someone who does. If this user had been snapshotting regularly to S3, the worst thing they would have experienced is a couple of hours of downtime. All of their data would have been safe and easily recovered.

They didn't do that, and the worst-case scenario that AWS clearly describes in its docs (failure of MULTIPLE devices) happened. And it will happen again, someday. Accept that, and accept that failure is a feature when systems are designed properly.

tkaemming, about 15 years ago

The Amazon EBS page states (which the author quotes):

> As an example, volumes that operate with 20 GB or less of modified data since their most recent Amazon EBS snapshot can expect an annual failure rate (AFR) of between 0.1% – 0.5%, where failure refers to a complete loss of the volume. This compares with commodity hard disks that will typically fail with an AFR of around 4%, making EBS volumes 10 times more reliable than typical commodity disk drives.

Nowhere within that does it say 0.00% failure rate, and later in the page they even describe how to mitigate the risk of losing data due to disk failure using snapshots, mirrored across availability zones.

jrockway, about 15 years ago

Hard drives suck just as bad. I have a RAID-1 built from three disks out of separate batches. Somehow, I wasn't paying attention to bad sectors the RAID software couldn't fix, and all the disks failed.

Cheap 1TB disks and cheap cloud storage like EBS mean that it's now cheaper than ever to lose a shit-ton of data. (I didn't actually lose anything important; the corrupted areas were not important files. But still: three drive failures in a week!)

My fatal mistake, BTW, was ordering from Newegg. Apparently they do not ship OEM drives correctly, and they are almost guaranteed to fail. I was a little suspicious when I saw a raw drive in a plastic shell with some packing peanuts around it. When I had the drives replaced, they did not come from the factory that way!

vl, about 15 years ago

Most people who don't have extended experience with large-scale data stores do not understand a basic principle: redundancy decreases the probability of data loss, but it never eliminates it completely. All massive data stores slowly bleed data; it's just that they bleed it so slowly that it's acceptable for most scenarios. In the case of this specific example, once the number of users is large enough, there will always be somebody who lost their volume.

To illustrate this: think about an à la GFS data store that randomly triplicates data across 1000 nodes. Once enough data is put in (let's say 100M blobs), there will always be a blob unique to any given triplet. In other words, simultaneous loss of any 3 nodes out of 1000 will always result in data loss. (Simultaneous is in the sense of "faster than the time to detect the failure and recover".) Of course failures are not limited to node loss; there is also corruption in transit, hard drive loss, bad sectors, and rack-level failures. As the volume of the data and the number of nodes grow it all adds up, so even if the mean time to data loss for each particular blob is astronomically high, the probability of losing some blob on any given day is very real.
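A back-of-the-envelope check of the triplet argument, as a minimal sketch. It assumes each blob's three replicas land on a uniformly random set of 3 distinct nodes, independently of every other blob. That placement model and the function name are assumptions for illustration, not a description of GFS or EBS.

```python
import math


def triplet_coverage(nodes: int, blobs: int) -> float:
    """Chance that one specific set of 3 nodes holds every replica of at least one blob."""
    triplets = math.comb(nodes, 3)  # number of distinct 3-node sets
    # Probability that none of the blobs chose this exact triplet:
    # (1 - 1/triplets) ** blobs, computed via log1p for numerical stability.
    p_untouched = math.exp(blobs * math.log1p(-1.0 / triplets))
    return 1.0 - p_untouched


print(math.comb(1000, 3))  # how many triplets exist among 1000 nodes
for blobs in (100_000_000, 1_000_000_000):
    print(blobs, round(triplet_coverage(1000, blobs), 3))
```

Under this model there are about 166 million possible triplets, so at 100M blobs a given triplet holds some blob's only copies a bit under half the time, and at 1B blobs it does so almost certainly. The qualitative point stands either way: once the blob count is comparable to the number of triplets, losing any 3 nodes at once is likely to lose data.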

imp, about 15 years ago

...because I didn't have my own backup."

mattew, about 15 years ago

Is there any way to set EBS to auto-snapshot on a specified time period through the existing control panel interface? Are snapshots possible through the API?
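On the API half of that question: the EC2 API does expose a CreateSnapshot action. The sketch below shows one way to call it from Python, assuming the boto3 SDK (which postdates this thread) and a client passed in by the caller. The helper name, the description text, and the idea of running it from a scheduler such as cron are illustrative assumptions. It says nothing about what the control panel offers.

```python
from datetime import datetime, timezone


def snapshot_volume(ec2_client, volume_id: str) -> str:
    """Request a snapshot of one EBS volume and return the new snapshot's ID."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%MZ")
    # create_snapshot is the boto3 EC2 client wrapper around the CreateSnapshot action.
    response = ec2_client.create_snapshot(
        VolumeId=volume_id,
        Description=f"scheduled backup of {volume_id} at {stamp}",
    )
    return response["SnapshotId"]


# Hypothetical usage (needs boto3 and AWS credentials; the volume ID is a placeholder):
#   import boto3
#   snapshot_volume(boto3.client("ec2"), "vol-0123456789abcdef0")
```

Taking the client as a parameter keeps the helper easy to exercise with a stand-in object before pointing it at a real account.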

hipsterelitist, about 15 years ago

At least you didn't have to pay thousands of dollars to delete your data!

mml, about 15 years ago

Odd, my car lost my coffee when I put it on the roof on the way to work. Good thing there was backup coffee at the office.

mark_l_watson, about 15 years ago

Sounds like he did not make S3 snapshots of his EBS volumes. Ouch. I feel very confident about the robustness of data that I store on AWS because I can make an S3 snapshot, and recover from that snapshot on a fresh EC2 instance to test the backup. BTW, I changed the way I use AWS: now I always make bootable EBS images, increasing the size beyond the 10 GB limit, so I snapshot my OS setup, data, and apps all at the same time.

Bjoern, about 15 years ago

There is an easy rule of thumb.

Make a backup of your important stuff often and regularly, no matter how many redundancies are in place (see Murphy's law).

Right now? Yes, like really right now if you didn't.

EDIT: Spelling

braindead_in, about 15 years ago

Can this also happen to an S3 bucket? How do I back up an S3 bucket? Any ideas?

rodh257, about 15 years ago

Another reason to http://lookafteryourdata.com