A few months ago I got my Hadoop cluster stabilized (64 cores with 64 TB of HDFS space). It wasn't a pleasant experience. The most memorable frustrations of the setup:<p>Searching for help online turns up solutions for ancient versions or incomplete wiki pages ("we'll finish this soon!" from three years ago).<p>If Apple is one extreme of the user-friendly spectrum, Hadoop is at the polar opposite end -- the error conditions and error messages can be downright hostile. The naming conventions are wonky too: <i>namenode</i> and <i>secondary namenode</i>, but the secondary isn't a failover backup, it's a checkpointing helper. And don't get me started on <i>tasktracker</i> versus <i>jobtracker</i> (primarily because I can never remember the difference).<p>Restarting a tracker doesn't make the stale entry vanish from the namenode, so you have to restart the namenode too (at least in my CDH3 setup).<p>Everything is held together with duct-tape shell scripts.<p>On the good side, I got everything Hadoop-related managed in Puppet. All I need to do for a cluster upgrade is load a new CDH repo, reboot the cluster, then make sure nothing is borked.<p>If I didn't have to deal with isomorphic SQL<->Hadoop queries, I'd start over using <a href="http://discoproject.org/" rel="nofollow">http://discoproject.org/</a>
It's good to have critical reviews. But perhaps it would have been better if the tone were more respectful towards a <i>free and open source</i> project.<p>The approach of Hadoop is not my cup of tea, but I praise them for giving away, for free, a working product that solves such a hard problem.<p>Playing Hadoop's side: where are the test cases, the patches, or the bug reports? Or even the missing documentation blurbs you mention?
> This means that the Hadoop process which was using 1GB will temporarily "use" 2GB when it shells out.<p>This is exactly not what happens: the new process copies only the parent's address mappings (now marked read-only for copy-on-write), which represent vastly less than 1 GB of physical memory.<p>I think only a single 4 KB page or two will ultimately be copied, representing the chunk of the calling thread's stack used to prepare the new child before it finally calls execve() or similar.
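For illustration, here's a minimal sketch (my own addition, not from the parent comment) of what this looks like from a JVM on Linux, assuming the JDK launches children via fork+exec (the actual mechanism varies by JDK version and the jdk.lang.Process.launchMechanism setting). Run it with something like -Xmx2g and watch RSS in top: it does not double when the child is spawned, because the fork only duplicates page tables marked copy-on-write.<p><pre><code>import java.io.IOException;

// Hypothetical demo: hold ~1 GB of heap, then shell out to /bin/true.
// Physical memory does not jump to ~2 GB, because fork() copies only the
// parent's page tables (copy-on-write) and the child calls execve()
// almost immediately.
public class ForkCostDemo {
    public static void main(String[] args) throws IOException, InterruptedException {
        byte[] ballast = new byte[1 << 30]; // ~1 GB of heap ballast

        // Shell out; under the hood the JVM forks and execs /bin/true.
        Process p = new ProcessBuilder("/bin/true").start();
        System.out.println("ballast bytes: " + ballast.length
                + ", child exited with status " + p.waitFor());
    }
}</code></pre>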
I would be interested to hear why Riak, Disco, etc. aren't viable alternatives. I've seen very few good comparisons of the various options (the Mozilla data blog being the only one that comes to mind).
We've seen all of these issues, but the primary causes appear to be either problems with the configuration of the cluster (which you typically don't see until you're doing serious work, or working with an overstressed cluster at scale) or problems with the underlying code quality of the jobs (e.g., processes hanging because tasks don't terminate, usually from an inability to handle unexpected input gracefully -- infinite loops or non-terminating operations). If you're working with big data, particularly data that isn't the cleanest, you'll start to see these issues and others arise.<p>Despite those issues, the most remarkable thing about Hadoop is the out-of-the-box resilience it brings to getting the work done. The strategy of no side-effects and a write-once approach (with failed tasks discarding work in progress) ensures predictable results -- even if the time it takes to get those results can't be guaranteed.<p>The documentation isn't the greatest, and it's very confusing sorting out the sedimentary nature of the APIs and configuration (knowing which APIs and config options match up across versions such as 0.15, 0.17, 0.20.2, 0.21, etc., not to mention the various distributions from the Cloudera, Apache and Yahoo branches), but things are finally starting to converge. You're probably better off starting with one of the later, curated releases (such as the recent Cloudera distribution) where some work has been done to cherry-pick features and patches from the main branches.
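To make the "no side-effects, write-once" point concrete, here's a sketch of the kind of mapper that benefits from it (the 0.20+ org.apache.hadoop.mapreduce API is assumed; the class and variable names are illustrative, not from the parent comment). Everything is emitted through context.write(), which the framework stages in a per-attempt temporary directory and commits only when the attempt succeeds, so a hung, failed, or speculative duplicate attempt can be killed and re-run without corrupting the output.<p><pre><code>import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Side-effect-free mapper: no writes outside the framework, so rerunning
// a failed or speculative attempt is always safe.
public class TokenCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (token.isEmpty()) continue; // trivial guard against messy input
            word.set(token);
            context.write(word, ONE);      // staged per attempt, committed on success
        }
    }
}</code></pre>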
I've been using Hadoop in production for about three years on a small 48-node cluster. I've seen many issues, but none of the ones mentioned in the blog post.<p>My general theory is that if it's an important tool for your business, you need at least one person who is an expert on it.
The alternative is to pay Cloudera a significant amount per node for support.
Another possible alternative is <a href="http://www.MapR.com/" rel="nofollow">http://www.MapR.com/</a>: they are in beta and claim to be API-compatible with Hadoop, but they are not free.
Memory; it's Java. Java is great, but it's a memory hog. That doesn't matter so much these days (well, it does, but what are you going to do... there isn't much OSS competition for Hadoop). The documentation is horrible, though. I'm not sure if this is a 'new' OSS trend, but now that I'm working a lot with Rails and gems I notice that on that side of OSS it's pretty normal to produce no documentation, or completely horrible documentation; "use the force, read the source" and that kind of hilarious tripe. The Hadoop-related Apache projects (Hadoop, Pig, HBase) all suffer from this; you are hard pressed to find anything remotely helpful that isn't incredibly outdated. At least for Rails you can find tons of examples (no docs, though) of how to achieve things; for Hadoop/HBase everything is outdated or non-functional and requires you to jump into tons of code to get stuff done.<p>Again: there is not much competition for the tasks you would accomplish with Hadoop (and HBase) at the scale it has been tested (by Yahoo, StumbleUpon and many others).
If you want general distributed process management, check out Mesos: <a href="http://www.mesosproject.org/" rel="nofollow">http://www.mesosproject.org/</a><p>The guys working on it are over at Twitter nowadays.
Yeah, Hadoop is a bear to get working. One of the benefits of working at a more established company like Quantcast, Google, or presumably Facebook is that there are internal infrastructure teams to smooth all that over. I pretty much get to type submitJob and it more or less just works...
<p><pre><code> Hadoop is so old school...
Even Google is moving away from Map/Reduce.
Prepare for new shiny things,
which will be better, faster and cheaper to operate!</code></pre>