I am an active user of EBS on a highly trafficked Web property, and I come from a long and tedious background in enterprise software.<p>I really think that one paragraph in his blog post summed everything up quite nicely. It could not ring more true:<p><i>My opinion is that the only reason the big enterprise storage vendors have gotten away with network block storage for the last decade is that they can afford to over-engineer the hell out of them and have the luxury of running enterprise workloads, which is a code phrase for “consolidated idle workloads.” When the going gets tough in enterprise storage systems, you do capacity planning and make sure your hot apps are on dedicated spindles, controllers, and network ports.</i>
This awesome entry perfectly captures why I have always hated NFS. I can deal with the possibility that if a machine's hard drive dies, my system is going to have a very hard time continuing to operate normally. But then NFS comes along, and you realize that all sorts of I/O operations that previously relied on a piece of equipment that failed once every two and a half years now depend on a working network with a working NFS server on it, and that combination is orders of magnitude less reliable.<p>And now you regularly have situations where you type "ls", your shell hangs, and not even "kill -9" is going to save you. And you go back to using FTP or some other abstraction that does not apply 40,000-hour-MTBF thinking to equipment that disappears for coffee breaks daily.
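(A side note on why "kill -9" can't save you: a process blocked on a hard NFS mount sits in uninterruptible sleep, where signals never reach it. A minimal sketch for spotting those processes, assuming a Linux-style /proc; this is illustrative, not anything from the post:)

```python
import os

def uninterruptible_processes():
    """Return (pid, comm) pairs for processes in uninterruptible sleep
    (state 'D'), which is where processes blocked on a dead NFS server
    tend to end up."""
    stuck = []
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open("/proc/%s/stat" % pid) as f:
                stat = f.read()
        except IOError:  # process exited between listdir() and open()
            continue
        # Format: pid (comm) state ... -- comm may contain spaces,
        # so locate the state character relative to the closing paren.
        comm = stat[stat.index("(") + 1 : stat.rindex(")")]
        state = stat[stat.rindex(")") + 2]
        if state == "D":
            stuck.append((int(pid), comm))
    return stuck
```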
He didn't touch on Joyent's 2+ day partial outage a couple months ago: <a href="http://news.ycombinator.com/item?id=2269329" rel="nofollow">http://news.ycombinator.com/item?id=2269329</a>
Many things in software are impossible magic, until they are not. His argument boils down to "it is a hard problem that nobody has solved yet." That doesn't mean nobody ever will.<p>Regardless, I do agree that building your application today as if it were already a solved problem is the wrong way to do it.
It's funny how disk abstractions get you every time.<p>We used to store and process all of our uploads from our rails app on a GFS partition. GFS behaved like a normal disk <i>most</i> of the time, but we started having trouble processing concurrent uploads and couldn't replicate the problem in dev.<p>It turned out that, for GFS to work at all, its locking had to differ from a regular disk's: every time you created a new file it had to lock the containing folder. We solved it by splitting our upload folder into 1000 sequential buckets and writing each upload to the next folder along... but it took us a long time to stop assuming it was a regular disk.
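(A rough sketch of that kind of workaround, not the original code; the mount point and bucket count below are just illustrative. Spreading writes round-robin across many subdirectories keeps any single containing folder, and therefore its lock, from becoming the hot spot.)

```python
import itertools
import os

UPLOAD_ROOT = "/mnt/gfs/uploads"  # hypothetical mount point
NUM_BUCKETS = 1000

# Round-robin counter: consecutive uploads land in different directories,
# so concurrent writers don't keep taking the same directory lock.
_next_bucket = itertools.count()

def bucket_dir():
    bucket = next(_next_bucket) % NUM_BUCKETS
    path = os.path.join(UPLOAD_ROOT, "%04d" % bucket)
    os.makedirs(path, exist_ok=True)
    return path

def save_upload(filename, data):
    path = os.path.join(bucket_dir(), filename)
    with open(path, "wb") as f:
        f.write(data)
    return path
```

In practice the counter would need to be shared across app processes (a database sequence or similar); the point is just that no single directory sees every write.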
It's really fascinating to watch Amazon re-learn/re-implement the lessons IBM baked into mainframes decades ago. Once you get out of shared-nothing/web-scripting land, you realize that I/O is much more important and difficult than CPU. What Amazon calls EBS, IBM has been calling "DASD" forever. I wonder if there are any crossover lessons they haven't taken advantage of because there just aren't any old IBM'ers working at Amazon.
<i>Trying to use a tool like iostat against a shared, network-provided block device to figure out what level of service your database is getting from the filesystem below it is an exercise in frustration that will get you nowhere.</i><p>This may be true under Solaris. Since 2.5, Linux has had /proc/diskstats and an iostat that shows the average I/O request latency (await) for a disk, network-backed or otherwise. For EBS it's 40ms or less on a good day. On a bad day it's 500ms or more, if your I/O requests get serviced at all.
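(For reference, a minimal sketch of where that number comes from, using the /proc/diskstats field layout from the kernel's Documentation/iostats.txt; the device name and sampling interval are placeholders:)

```python
import time

def read_diskstats(device):
    """Return (ios_completed, ms_spent) summed over reads and writes for one device."""
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if fields[2] == device:
                reads, read_ms = int(fields[3]), int(fields[6])
                writes, write_ms = int(fields[7]), int(fields[10])
                return reads + writes, read_ms + write_ms
    raise ValueError("device %s not found" % device)

def average_await_ms(device, interval=5.0):
    """Average milliseconds per completed request over the interval,
    i.e. roughly what iostat reports as await."""
    ios1, ms1 = read_diskstats(device)
    time.sleep(interval)
    ios2, ms2 = read_diskstats(device)
    ios = ios2 - ios1
    return (ms2 - ms1) / float(ios) if ios else 0.0

# e.g. print(average_await_ms("xvdf"))  # EBS volumes often appear as xvd* devices
```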
Amazon Six Sigma "Blackbelts", meet Mr. Black Swan.<p>Edit: my point is that you can't capture unexpected/unknown events in statistical models; we should know better, coming from CS.
> It’s commonly believed that EBS is built on DRBD with a dose of S3-derived replication logic.<p>Actually, it was discovered some time ago (<a href="http://openfoo.org/blog/amazon_ec2_underlying_architecture.html" rel="nofollow">http://openfoo.org/blog/amazon_ec2_underlying_architecture.h...</a>) that EBS probably used Red Hat's open-source GNBD: <a href="http://sourceware.org/cluster/gnbd/" rel="nofollow">http://sourceware.org/cluster/gnbd/</a>
He only gets it half right. A filesystem interface instead of a block interface is the right choice IMO. Private storage instead of distributed storage is the wrong choice for capacity, performance, and (most importantly) availability reasons. They didn't go with a ZFS-based solution because it was the best fit to requirements. They went with it because they had ZFS experts and advocates on staff.<p>As Schopenhauer said, every man mistakes the limits of his own vision for the limits of the world, and these are people who've failed to Get It when it comes to distributed storage ever since they tried and failed to make ZFS distributed (leading to the enlistment of the Lustre crew who have also largely failed at the same task). If they can't solve a problem they're arrogant enough to believe nobody can, so they position DAS and SAN as the only possible alternatives.<p>Disclaimers: I'm the project lead for CloudFS, which is IMO exactly the kind of distributed storage people should be using for this sort of thing. I've also had some fairly public disputes with Bryan "Jackass" Cantrill, formerly of Sun and now of Joyent, about ZFS FUD.