We've seen all of these issues, but the primary causes appear to be either problems with the configuration of the cluster (which you typically don't see until you're doing serious work, or working with an overstressed cluster at scale) or problems with the underlying code quality of the jobs themselves (e.g., tasks that hang because they can't handle unexpected input gracefully -- infinite loops or operations that never terminate). If you're working with big data, particularly data that isn't the cleanest, you'll start to see these issues and others arise; a defensive mapper like the one sketched below goes a long way.

Despite those issues, the most remarkable thing about Hadoop is its out-of-the-box resilience in getting the work done. The strategy of side-effect-free tasks and write-once output (with failed tasks discarding their work in progress) ensures predictable results -- even if the time it takes to get those results can't be guaranteed. The timeout and retry behavior behind this is ordinary job configuration (second sketch below).

The documentation isn't the greatest, and sorting out the sedimentary nature of the APIs and configuration is confusing (knowing which APIs and config options match up across versions such as 0.15, 0.17, 0.20.2, and 0.21, not to mention the various distributions from the Cloudera, Apache, and Yahoo branches), but things are finally starting to converge. You're probably better off starting with one of the later, curated releases (such as the recent Cloudera distribution), where some work has been done to cherry-pick features and patches from the main branches.
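To make "handle unexpected input gracefully" concrete, here's a minimal sketch against the old (0.20-era) org.apache.hadoop.mapred API. The tab-delimited "userId&lt;TAB&gt;count" record format and the Quality/BadRecords counter name are illustrative assumptions of mine, not from any particular job discussed above:

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Defensive mapper: malformed lines are counted and skipped instead of
    // throwing an exception that kills the task attempt (and, after enough
    // failed attempts, the whole job).
    public class DefensiveMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, LongWritable> {

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, LongWritable> output,
                      Reporter reporter) throws IOException {
        try {
          // Hypothetical record format: "userId<TAB>count" -- an assumption
          // for illustration only.
          String[] fields = value.toString().split("\t");
          long count = Long.parseLong(fields[1]);  // may throw on dirty data
          output.collect(new Text(fields[0]), new LongWritable(count));
        } catch (RuntimeException e) {
          // Don't let one bad record kill the task: count it and move on.
          reporter.incrCounter("Quality", "BadRecords", 1);
        }
      }
    }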
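And on the resilience side, the knobs that govern how Hadoop times out hung tasks and retries failed ones are plain job configuration. A sketch, again using 0.20-era names (mapred.task.timeout and friends were later renamed under mapreduce.*), with values that happen to match the defaults:

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SkipBadRecords;

    public class JobSetup {
      public static JobConf configure() {
        // References the DefensiveMapper class sketched above.
        JobConf conf = new JobConf(DefensiveMapper.class);

        // Kill any task that reports no progress for 10 minutes rather than
        // letting an infinite loop run forever.
        conf.setLong("mapred.task.timeout", 10 * 60 * 1000);

        // Retry each failed task up to 4 times before failing the job; the
        // half-written output of a failed attempt is simply discarded, which
        // is what keeps the results predictable.
        conf.setMaxMapAttempts(4);
        conf.setMaxReduceAttempts(4);

        // 0.19+ can also skip individual records that repeatedly crash a task.
        SkipBadRecords.setMapperMaxSkipRecords(conf, 1);

        return conf;
      }
    }

The discard-on-failure behavior is why a retried attempt can safely start from scratch: nothing a failed attempt wrote is ever visible to the rest of the job.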