A few months ago I got my Hadoop cluster stabilized (64 cores with 64 TB of HDFS space). It wasn't a pleasant experience. The most memorable frustrations of the setup:<p>Searching for help online turns up solutions for ancient versions or incomplete wiki pages ("we'll finish this soon!" from three years ago).<p>If Apple is one extreme of the user-friendly spectrum, Hadoop is at the polar opposite end -- the error conditions and error messages can be downright hostile. The naming conventions are wonky too: <i>namenode</i> and <i>secondary namenode</i>, but the secondary isn't a failover backup, it's a checkpointing helper. And don't get me started on <i>tasktracker</i> versus <i>jobtracker</i> (primarily because I can never remember the difference).<p>Restarting a tracker doesn't make the stale entry vanish from the namenode, so you have to restart the namenode too (at least in my CDH3 setup).<p>Everything is held together with duct-tape shell scripts.<p>On the good side, I got everything Hadoop-related managed in Puppet. All I need to do for a cluster upgrade is load a new CDH repo, reboot the cluster, then make sure nothing is borked.<p>If I didn't have to deal with isomorphic SQL<->Hadoop queries, I'd start over using <a href="http://discoproject.org/" rel="nofollow">http://discoproject.org/</a>
It's good to have critical reviews. But perhaps it would have been better if the tone were more respectful towards a <i>free and open source</i> project.<p>The approach of Hadoop is not my cup of tea, but I praise them for giving away, for free, a working product that solves such a hard problem.<p>Playing Hadoop's side: where are the test cases, the patches, or the bug reports? Or even the missing documentation blurbs you mention?
> This means that the Hadoop process which was using 1GB will temporarily "use" 2GB when it shells out.<p>This is exactly not what happens: the new process copies only the parent's address mappings (now marked read-only for copy-on-write), which represent vastly less than 1 GB of physical memory.<p>I think only a single 4 KB page or two will ultimately be copied, representing the chunk of the calling thread's stack used to prepare the new child before it finally calls execve() or similar.
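For illustration, here's a minimal sketch (my own addition, not from the parent comment) of what this looks like from a JVM on Linux, assuming the JDK launches children via fork+exec (the actual mechanism varies by JDK version and the jdk.lang.Process.launchMechanism setting). Run it with something like -Xmx2g and watch RSS in top: it does not double when the child is spawned, because the fork only duplicates page tables marked copy-on-write.<p><pre><code>import java.io.IOException;

// Hypothetical demo: hold ~1 GB of heap, then shell out to /bin/true.
// Physical memory does not jump to ~2 GB, because fork() copies only the
// parent's page tables (copy-on-write) and the child calls execve()
// almost immediately.
public class ForkCostDemo {
    public static void main(String[] args) throws IOException, InterruptedException {
        byte[] ballast = new byte[1 << 30]; // ~1 GB of heap ballast

        // Shell out; under the hood the JVM forks and execs /bin/true.
        Process p = new ProcessBuilder("/bin/true").start();
        System.out.println("ballast bytes: " + ballast.length
                + ", child exited with status " + p.waitFor());
    }
}</code></pre>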
I would be interested to hear why Riak, Disco, etc. aren't viable alternatives. I've seen very few good comparisons of the various options (the Mozilla data blog being the only one that comes to mind).
We've seen all of these issues, but the primary causes appear to be either problems with the configuration of the cluster (which you typically don't see until you're doing serious work, or working with an overstressed cluster at scale) or problems with the underlying code quality of the jobs (e.g., processes hanging because tasks don't terminate, usually from an inability to handle unexpected input gracefully -- infinite loops or non-terminating operations). If you're working with big data, particularly data that isn't the cleanest, you'll start to see these issues and others arise.<p>Despite those issues, the most remarkable thing about Hadoop is the out-of-the-box resilience it brings to getting the work done. The strategy of no side-effects and a write-once approach (with failed tasks discarding work in progress) ensures predictable results -- even if the time it takes to get those results can't be guaranteed.<p>The documentation isn't the greatest, and it's very confusing sorting out the sedimentary nature of the APIs and configuration (knowing which APIs and config options match up across versions such as 0.15, 0.17, 0.20.2, 0.21, etc., not to mention the various distributions from the Cloudera, Apache and Yahoo branches), but things are finally starting to converge. You're probably better off starting with one of the later, curated releases (such as the recent Cloudera distribution) where some work has been done to cherry-pick features and patches from the main branches.
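To make the "no side-effects, write-once" point concrete, here's a sketch of the kind of mapper that benefits from it (the 0.20+ org.apache.hadoop.mapreduce API is assumed; the class and variable names are illustrative, not from the parent comment). Everything is emitted through context.write(), which the framework stages in a per-attempt temporary directory and commits only when the attempt succeeds, so a hung, failed, or speculative duplicate attempt can be killed and re-run without corrupting the output.<p><pre><code>import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Side-effect-free mapper: no writes outside the framework, so rerunning
// a failed or speculative attempt is always safe.
public class TokenCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (token.isEmpty()) continue; // trivial guard against messy input
            word.set(token);
            context.write(word, ONE);      // staged per attempt, committed on success
        }
    }
}</code></pre>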
I've been using Hadoop in production for about three years on a small 48-node cluster. I've seen many issues, but none of the ones mentioned in the blog post.<p>My general theory is that if it's an important tool for your business, you need at least one person who is an expert on it.
The alternative is to pay Cloudera a significant amount per node for support.
Another possible alternative is <a href="http://www.MapR.com/" rel="nofollow">http://www.MapR.com/</a>: they are in beta and claim to be API-compatible with Hadoop, but they are not free.
Memory; it's Java. Java is great, but it's a memory hog. That doesn't matter so much these days (well, it does, but what are you going to do... there isn't much OSS competition for Hadoop). The documentation is horrible, though. I'm not sure if this is a 'new' OSS trend, but now that I'm working a lot with Rails and gems I notice that on that side of OSS it's pretty normal to produce no documentation, or completely horrible documentation; "use the force, read the source" and that kind of hilarious tripe. The Hadoop-related Apache projects (Hadoop, Pig, HBase) all suffer from this; you are hard pressed to find anything remotely helpful that isn't incredibly outdated. At least for Rails you can find tons of examples (no docs, though) of how to achieve things; for Hadoop/HBase everything is outdated or non-functional and requires you to jump into tons of code to get stuff done.<p>Again: there is not much competition for the tasks you would accomplish with Hadoop (and HBase) at the scale it has been tested (by Yahoo, StumbleUpon and many others).
If you want general distributed process management, check out Mesos: <a href="http://www.mesosproject.org/" rel="nofollow">http://www.mesosproject.org/</a><p>The guys working on it are over at Twitter nowadays.
Yeah, Hadoop is a bear to get working. One of the benefits of working at a more established company like Quantcast, Google, or presumably Facebook is that there are internal infrastructure teams to smooth all that over. I pretty much get to type submitJob and it more or less just works...
<p><pre><code> Hadoop is so old school...
Even Google is moving away from Map/Reduce.
Prepare for new shiny things,
which will be better, faster and cheaper to operate!</code></pre>