"Amazon's EBSs are a barrel of laughs in terms of performance and reliability"

270 points by quilby, about 14 years ago

30 comments

snorkel, about 14 years ago

Having been at a startup that used hundreds of EC2 instances and EBS volumes, I can assure you all that Amazon EBS performance is downright terrible, and Amazon didn't inspire any confidence that they could solve it.

Even worse than the EBS performance is that Amazon does not offer any shared storage solution between EC2 instances. You have to cobble together your own shared storage using NFS and EBS volumes, making it sucky to the Nth power.

EC2 is fine for Hadoop-style distributed workloads and distributed data stores that can tolerate eventual consistency; that's all good. But for production database applications requiring constant and reliable performance, forget it.

SemanticFog, about 14 years ago

We had consistent, serious problems related to EBS for a several-month streak about a year ago, and I heard almost identical stories from other EC2 users around the same time. Instances with EBS attached would suddenly become completely unreachable via the network. Sometimes we had to terminate the instances, but usually we could revive them by detaching all (or most) of the EBS volumes, then reattaching and rebooting. Amazon seems to have fixed this problem, but I wouldn't be surprised if we suffered in the future the way reddit has.

Overall, EC2 is a very impressive offering, for which I commend Amazon. At times, I've been so frustrated that I'm ready to switch, but they fix things just quickly enough that I never quite get around to it. In the end, I'm willing to accept that what they're doing is hard, there will be mistakes, and it's worth suffering to get the flexibility and cost-effectiveness that EC2 offers.

jameskilton, about 14 years ago

This comment further down, supposedly from an Amazon employee, paints a grim picture for EBS: http://www.reddit.com/r/blog/comments/g66f0/why_reddit_was_down_for_6_of_the_last_24_hours/c1l7vy1

rlpb, about 14 years ago

RAIDing together multiple EBS volumes feels like a massive hack to me. I can't help but wonder if this compounds the problem at Amazon's end. If EBS performance is a problem, Amazon need to fix it. For example, if some way of tying together multiple EBS volumes is a reasonable way of working around the problem, then why aren't Amazon providing "high performance" EBS volumes which do that under the hood?

If I were faced with EBS performance issues, I would see this as a big red flag, consider EBS unsuitable for the application and avoid it, rather than carrying on with such a workaround.

parasubvert, about 14 years ago

Generally speaking, this is the sort of thing that people warn about when they say "if you want to run on a cloud, you need to design your application for a cloud". Meaning, you can't presume your infrastructure is dedicated and carries an MTBF similar to that of (say) an enterprise hard drive, which is upwards of 1 million hours.

Amazon provides plenty of opportunities to mitigate this, such as multiple availability zones. Reddit, if you read the original blog post, wasn't designed for that; it was designed for a single data centre.

OTOH, the variability of EBS performance is real, and frustrating. If you do a RAID0 stripe across 4 drives, you can expect around 100 MB/sec sustained, modulo hiccups that can bring it down by a factor of 5. On a compute cluster instance (cc1.4xlarge) it's more like up to 300 MB/sec if you go up to 8 drives, since they provision more network bandwidth and seem to be able to cordon it off better with a placement group.
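
For reference, the figures quoted in the comment above can be worked through as below. This is only back-of-the-envelope arithmetic on the numbers given there (about 100 MB/sec across 4 striped volumes, up to 300 MB/sec across 8, and hiccups of a factor of 5). The function names are made up for illustration, and nothing here measures real EBS behaviour.

```python
def per_volume_mb_s(aggregate_mb_s: float, volumes: int) -> float:
    """Average throughput each volume contributes to a RAID0 stripe."""
    return aggregate_mb_s / volumes


def degraded_mb_s(aggregate_mb_s: float, slowdown_factor: float) -> float:
    """Aggregate throughput during a hiccup that slows the stripe down."""
    return aggregate_mb_s / slowdown_factor


# Figures taken from the comment; everything else is plain division.
four_volume_share = per_volume_mb_s(100, 4)   # 4-volume stripe
eight_volume_share = per_volume_mb_s(300, 8)  # 8-volume stripe on cc1.4xlarge
hiccup_rate = degraded_mb_s(100, 5)           # 4-volume stripe during a hiccup

print(f"4-volume stripe: {four_volume_share:.1f} MB/sec per volume")
print(f"8-volume stripe: {eight_volume_share:.1f} MB/sec per volume")
print(f"4-volume stripe during a hiccup: {hiccup_rate:.1f} MB/sec total")
```

On these numbers a hiccup leaves the whole 4-volume stripe slower than a single volume's usual share, which is why the variability matters more than the headline rate.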

jedsmith, about 14 years ago

Never fails: a cloud provider has issues with a specific cloud product, so clearly the cloud is an illusion that will crash down on you[1]. Any discussion about any cloud provider's product is obviously a chance to soapbox about the industry as a whole.

[1]: http://www.reddit.com/r/blog/comments/g66f0/why_reddit_was_down_for_6_of_the_last_24_hours/c1l7531

tzs, about 14 years ago

We've been looking at moving some or all of our stuff to either Amazon EC2/EBS/S3 or Rackspace cloud hosting, and it has been interesting.

Amazon seems more flexible, since you buy block storage (EBS) independent of instances. If you have an application that needs a massive amount of data, but only a little RAM and CPU, you can do it.

Rackspace, on the other hand, ties storage to instances. If you only need the RAM and CPU of the smallest instance (256 MB RAM) but need more than the 10 GB of disk space that provides, you need to go for a bigger instance, and so you'll probably end up with a bigger base price than at Amazon.

On the other hand, the storage at Rackspace is actual RAID storage directly attached to the machine your instance is on, so it is going to totally kick Amazon's butt for performance. Also, at Amazon you pay for I/O (something like $0.10 per million operations).

Looking at our existing main database and its usage, at Amazon we'd be paying more just for the I/O than we now pay for colo and bandwidth for the servers we own (not just the database servers...our whole setup!).

The big lesson we've taken away from our investigation so far is that Amazon is different from Rackspace, and both are different from running your own servers. Each of these three has a different set of capabilities and constraints, and so a solution designed for one will probably not work well if you just try to map it isomorphically to one of the others. You don't migrate to the cloud--you re-architect and rewrite to the cloud.
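
To give a feel for how per-operation I/O billing adds up, here is a minimal sketch. The price of $0.10 per million operations is the approximate figure from the comment above; the 1,000 operations per second average load and the 30-day month are hypothetical values chosen for illustration, not numbers from the commenter's database.

```python
SECONDS_PER_DAY = 86_400


def monthly_io_cost(avg_ops_per_sec: float,
                    price_per_million_ops: float = 0.10,
                    days: int = 30) -> float:
    """Estimated monthly I/O charge in dollars for a steady average load."""
    total_ops = avg_ops_per_sec * SECONDS_PER_DAY * days
    return total_ops / 1_000_000 * price_per_million_ops


# Hypothetical load: a database averaging 1,000 I/O operations per second.
cost = monthly_io_cost(1_000)
print(f"Estimated I/O charge: ${cost:.2f} per month")
```

The charge scales linearly with the average operation rate, so a busy database can turn a small per-operation price into a significant monthly line item.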

mithaler, about 14 years ago

We were bitten by EBS' slowness at my company recently, when moving an existing project to AWS. You effectively can't get decent performance off of a single EBS volume with PostgreSQL; you need to set up 10 or so of them and make a software RAID to remove the bottleneck. It's a fairly large time commitment to build and maintain, but it's pretty fast and reliable once it's up and running (cases like the recent downtime notwithstanding).

Can anyone tell me if MySQL fares any better than Postgres on a single EBS volume? I wouldn't assume it does, but I shouldn't be making assumptions.

hemancuso, about 14 years ago

I've never understood how people can use EBS in production. The durability numbers they quote are bad, and they wave their hands around about increased durability with snapshots but never quantify what that means.

Hard drives are unreliable and they certainly don't fail independently of one another, but their failures are much more independent than those of EBS volumes.

With physical drives and n-parity RAID you drastically reduce the rate of data loss. This is because although failures are often correlated, it's quite unlikely to have permanent failure of 3 drives out of a pool of 7 within 24 hours. It happens, but it is very rare.

With EBS, your 7 volumes might very well be on the same underlying RAID array. So you have no greater durability by building software RAID on top of that. If anything, it potentially decreases durability.

You could utilize snapshots to S3, but is that really a good solution? It seems that deploying onto EBS at any meaningful scale is a recipe for guaranteed data loss. RAID on physical disks isn't a great solution either, and there is no substitute for backups, but at least you can build a 9-disk RAID-Z3 array that will experience pool failure so rarely that you can more safely worry about things like memory and data bus corruption.
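
The "3 drives out of 7 within 24 hours" argument above can be sketched with a simple binomial model. Two loud assumptions: the 5% annual failure rate is a hypothetical number, not one from the comment, and the model treats drive failures as fully independent, which the comment itself says is not quite true. The independent result is therefore an optimistic floor. The shared-array case is modelled, equally crudely, as a single unit whose failure takes out all 7 volumes at once.

```python
from math import comb


def daily_failure_prob(annual_failure_rate: float) -> float:
    """Per-day failure probability implied by an annual failure rate."""
    return 1 - (1 - annual_failure_rate) ** (1 / 365)


def prob_at_least(k: int, n: int, p: float) -> float:
    """Probability that at least k of n independent units fail (binomial)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


ASSUMED_ANNUAL_FAILURE_RATE = 0.05  # hypothetical, for illustration only
p_day = daily_failure_prob(ASSUMED_ANNUAL_FAILURE_RATE)

# Seven physical drives failing independently: need 3 or more in one day.
independent_loss = prob_at_least(3, 7, p_day)

# Seven volumes on one shared array: one array failure loses them all.
shared_array_loss = p_day

print(f"independent drives, >=3 of 7 in a day: {independent_loss:.3e}")
print(f"shared array, all 7 volumes in a day:  {shared_array_loss:.3e}")
```

Under these assumptions the shared-array case is worse by several orders of magnitude, which is the point the comment makes about software RAID on top of volumes that may share hardware.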

prakash, about 14 years ago

We (Cedexis) presented our findings on how EC2's East, West, EU & APAC zones compare (PDF): http://www.cloudconnectevent.com/2011/presentations/free/76-marty-kagan.pdf

If you would like to know more, please send me an email: prakash [at] cedexis.com

gruseom, about 14 years ago

Anybody care to comment on using EC2 with local (what Amazon calls ephemeral) storage and backup to S3? Seems to me the advantages are: it's cheaper and you avoid the performance and reliability problems with EBS. The disadvantages?

Kilimanjaro, about 14 years ago

Lesson for startups: start in the cloud, grow your business, build your own cloud.

Never trust critical parts of your business to others.

floodfx, about 14 years ago

I'll probably be downvoted for this, but it seems to me the root cause of this problem is Reddit's architectural decision to remain in a single availability zone. If it wasn't EBS, it could have been some other issue related to the single AZ that brought the site down. Blaming EBS, particularly if you knew it to be a potential weakness in your architecture, seems like a deflection of responsibility.

bmurphy, about 14 years ago

Having been running a 200 GB, millions-of-transactions-per-day Postgres cluster on Amazon's EC2 cloud for two years now, I can attest to the fact that EBS performance and reliability SUCKS. It is our SINGLE biggest problem with EC2.

200 GB really isn't all that big of a database. It shouldn't have to be this hard.

steve918, about 14 years ago

This very moment our team is restoring Postgres volumes because the EBS volumes our primary and secondary were on both failed simultaneously.

absconditus, about 14 years ago

How is it that Amazon.com is so reliable if there are so many problems with their "cloud" products? Do they not use the same software to run their site?

jread, about 14 years ago

I was at the Cloud Connect conference last week. In a session on cloud performance, Adrian Cockcroft (Netflix's Cloud Architect) spoke and said they do not use EBS because of performance and reliability issues. They initially had some bad experiences with EBS and because of this decided to stick with ephemeral storage almost exclusively.

The guys from Reddit also spoke about their use of EC2. Apparently they are running entirely on m1 instances, which suffer from notoriously poor EBS performance relative to m2 and cc1/cg1 instances.

danielrhodes, about 14 years ago

What's the failure rate of EBS versus having direct access to physical disks? My guess is that at scale, it's probably similar.

Although you would hope that the storage components of AWS's cloud were highly reliable, I think the main benefit is not single-instance reliability but being able to recover faster because of quickly available hardware.

ck2, about 14 years ago

I firmly believe "the cloud" is a fad, unless for some reason you own and operate all the hardware yourself (i.e. Google).

Like other technical fads, everyone will probably come back to servers they can reach out and touch when needed, sooner or later.

obfuscate, about 14 years ago

For a data set in the mere tens to hundreds of GB (in MongoDB, if anyone's curious), is there any reason I shouldn't conclude from this that I should use instance storage only (with multi-AZ replication and backups to S3, both of which I would be doing in any case)? Moderately slower recovery in the rare event of an instance failure seems better than the constant possibility of incurable, killing performance degradation.

(Edit: I hadn't considered the possibility of somehow killing all my instances through human error. Ouch. That probably warrants one slave on EBS per AZ.)

Zak, about 14 years ago

I recently had an EBS volume lose data for no apparent reason. I'm not a heavy EC2 user at all. I was just doing some memory/CPU-heavy stuff that wouldn't fit into RAM on my laptop, and using EBS as a temporary store so I could transfer data using a cheap micro instance and only spin up the big expensive instances when everything was in place. I ended up downloading files on an m2.4xlarge because the files I had just downloaded to the EBS volume vanished.

cpg, about 14 years ago

This seems too much of a coincidence.

We released a Dropbox-like product to sync, and the back-end is on EBS. Yesterday we twice saw a device fill up to 7 GB, and as it got closer it became slower and slower and slower. We did not have any instrumentation/monitoring in place, and we immediately suspected it was something on our end.

We (wrongly?) assumed reliability and (decent) performance from AWS.

j_s, about 14 years ago

Being totally new to AWS, why does everyone skip right past using ZFS?

http://blogs.sun.com/marchamilton/entry/a_brilliant_argument_for_zfs

"Cloud Storage Will Be Limited By Drive Reliability, Bandwidth ... The key feature of ZFS enabling data integrity is the 256-bit checksum that protects your data."

PaulHoule, about 14 years ago

I love the idea behind EBS (a SAN makes life so much easier), but I too find that EBS glitches are the largest cause of unreliability in AWS.

I'm not immediately planning to move out of AWS, but the trouble with EBS has certainly got me thinking about other options and has made me much less inclined to make an increased commitment to AWS.

natch, about 14 years ago

Isn't EBS intended for stuff like Hadoop job temporary data used during processing?

This kind of complaint reminds me of people who buy a product that does A very well, but then they trash it in reviews for not doing B. It was never advertised as doing B, but you'd never know that from the complaining.

amitraman1, about 14 years ago

We used Amazon and got bad performance in the beginning too. It is bad when you pull files out of S3. By bad I mean the latency is high.

We tried GoGrid and they lost or crashed our server instance.

I've personally used Rackspace. So far so good, but I've only been doing development on it.

jclouds-fan, about 14 years ago

Why is reddit relying on only one cloud provider? AWS can/should do better, but service providers the size of reddit should be using multi-vendor set-ups for sure.

yuhong, about 14 years ago

On the comment itself, I have this: http://news.ycombinator.com/item?id=2339715

lurker17, about 14 years ago

EMR is a mess too. The Amazon-blessed Pig is almost a year and 2 major releases behind, and the official EMR documentation seems to describe a version of EMR that doesn't even exist.

"Elastic" is AWS's claim to fame, but I am not seeing it.

Trying to resize an EMR cluster (which is half the point of having an EMR cluster instead of buying our own hardware) generates the cryptic error "Error: Cannot add instance groups to a master only job flow", which is not documented anywhere.

(Why would Amazon even implement a "master only job flow", which serves no purpose at all?)

Andys, about 14 years ago

The AWS business model is to sell shared hosting on commodity hardware. Cloud is a cool buzzword, but it is still sharing hardware. Cheap, commodity hardware is the magic that lets you scale up so big and so fast for a highly accessible price.

But you're still sharing the same hardware as everyone else, and it's still just commodity hardware.
