Let's Build a Modern Hadoop

146 points by jaz46, over 10 years ago

19 comments

jandrewrogers, over 10 years ago
The article makes many good points but also misses on a few in my opinion. Hadoop was designed to solve a pretty narrow problem a long time ago; modern use cases tend to be far outside its intended design envelope.

Being more explicitly Linux-centric and dropping the JVM paves the way for some major performance optimizations. In practice, virtually everyone uses Linux for these kinds of applications, so portability is not a significant concern. Efficiency in the data center *is* becoming a major concern, both in terms of CapEx and OpEx, and the basic Hadoop stack can be exceedingly suboptimal in this regard. Free software is not cost effective if you require 10-100x the hardware (realistic) of a more optimal implementation.

However, I would use a different architecture generally rather than reimplementing the Hadoop one. The Achilles' heel of Hadoop (and Spark) is its modularity. Throughput for database-like workloads, which matches a lot of modern big data applications, is completely dependent on tightly scheduling execution, network, and storage operations without any hand-offs to other subsystems. If the architecture requires network and storage I/O to be scheduled by other processes, performance falls off a cliff for well-understood reasons (see also: mmap()-ed databases). People know how to design I/O-intensive data processing engines; it was just outside the original architectural case for Hadoop. We can do better.

A properly designed server kernel running on cheap, modern hardware and a 10 GbE network can do the following concurrently on the same data model:

- true stream/event processing on the ingest path at wire speed (none of that "fast batch" nonsense)
- drive that ingest stream all the way through indexing and *disk* storage with the data fully online
- execute hundreds of concurrent ad hoc queries on that live data model

These are not intrinsically separate systems; it is just that we've traditionally designed data platforms to be single-purpose. However, it does require a single scheduler across the complete set of operations required in order to deliver that workload.

From a software engineering standpoint it is inconvenient that good database kernels are essentially monolithic and incredibly dense, but it is unavoidable if performance matters.
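[Editor's sketch, not part of the thread: a minimal, toy illustration of the "single scheduler" idea the comment describes, using Python's asyncio. One event loop cooperatively schedules ingest, indexing, and ad hoc queries over the same live data model, with no hand-offs to separate processes. All names and numbers are illustrative.]

```python
import asyncio, random

index = {}  # toy "live" data model, shared by all tasks

async def ingest(queue, n_events):
    # Stream/event processing on the ingest path.
    for i in range(n_events):
        await queue.put(("key%d" % (i % 10), i))
    await queue.put(None)  # end-of-stream marker

async def indexer(queue):
    # Drive the ingest stream straight into the (toy) index,
    # keeping the data online for queries as it lands.
    while (item := await queue.get()) is not None:
        key, value = item
        index.setdefault(key, []).append(value)

async def query(qid):
    # Ad hoc queries run concurrently against the live index.
    await asyncio.sleep(random.random() / 100)
    key = "key%d" % random.randrange(10)
    return qid, key, len(index.get(key, []))

async def main():
    # One event loop acts as the single scheduler for ingest,
    # indexing, and queries -- no cross-process hand-offs.
    queue = asyncio.Queue(maxsize=1024)
    results = await asyncio.gather(
        ingest(queue, 10_000), indexer(queue),
        *(query(q) for q in range(100)))
    print(results[2:5])  # a few query results

asyncio.run(main())
```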
batbomb, over 10 years ago
> Second, Etcd and Fleet are themselves designed to be modular, so it's easy to support a variety of other deployment methods as well.

Zookeeper isn't modular? No mention of HTCondor+DAGMan? No mention of Spark?

I've written a general DAG processor/workflow engine/metascheduler/whatever you want to call it. It's used by various physics and astronomy experiments. I've interfaced it with the Grid/DIRAC, Amazon, LSF, Torque, PBS, SGE, and Crays. There's nothing in it that precludes Docker jobs from running. I've implemented a mixed-mode (streaming + batch processing + more DAG) version of it which just uses job barriers to do the setup and ZeroMQ for inter-job communication. I think something like Spark's resilient distributed dataset would be nice here as well.

We don't use Hadoop because, as was said, it's narrow in scope and it's not a good general-purpose DAG/workflow engine/pipeline.

I think this is a small improvement, but I don't really see it being much better than Hadoop, or HTCondor, or Spark.

HTCondor is pretty amazing. I think somebody should be building a modern HTCondor, not a modern Hadoop.
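[Editor's sketch, not part of the thread: a toy illustration of the "job barriers" pattern mentioned above — execute a DAG level by level, where each level acts as a barrier that must fully complete before the next level starts. The DAG and job bodies are made up.]

```python
from concurrent.futures import ThreadPoolExecutor

# Toy DAG: each job names the jobs it depends on.
dag = {
    "fetch_a": [], "fetch_b": [],
    "merge":   ["fetch_a", "fetch_b"],
    "report":  ["merge"],
}

def run_job(name):
    print("running", name)

def topological_levels(dag):
    # Group jobs into levels; each level is a barrier.
    done, levels = set(), []
    while len(done) < len(dag):
        level = [j for j, deps in dag.items()
                 if j not in done and all(d in done for d in deps)]
        if not level:
            raise ValueError("cycle in DAG")
        levels.append(level)
        done.update(level)
    return levels

with ThreadPoolExecutor() as pool:
    for level in topological_levels(dag):
        # Barrier: wait for every job in this level before moving on.
        list(pool.map(run_job, level))
```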
lmm, over 10 years ago
The JVM runs every language I care about, in a mature system with a simple, stable interface. It has its problems (startup time), but for the kind of jobs you use Hadoop for it's rarely an issue.

Switching to Docker feels like a real step backwards. Rather than a jar containing strictly specified bytecode, cluster jobs are going to be random Linux binaries that could do any combination of system calls. You need another layer of build/management system to make your Docker containers usable. Worst of all, rather than defining common shared services for e.g. job scheduling, this is going to encourage users to put any random component in their Docker image. We'll end up with a more fragmented ecosystem, not less.
rkwasny, over 10 years ago
Scheduling jobs by distributing Docker containers? Looks really heavyweight.

There is a good reason most Map-Reduce frameworks are written on top of the JVM: it is VERY easy to serialize Java/Scala code, ship it to a remote host, and execute it there. This is not trivial in compiled languages (I believe some hacks for Haskell exist).

What we need in the Big Data space is more people who understand how computers really work. Some effort must be put into supporting TCP/IP stacks in userspace using new 10 GbE cards, or support for IB verbs. Also the long-forgotten art of writing zero-copy code.
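[Editor's sketch, not part of the thread: the Python analogue of that "serialize code and ship it" trick, using the third-party cloudpickle library (plain pickle only references module-level functions by name; cloudpickle serializes the actual code object, which is why frameworks like PySpark use it).]

```python
import cloudpickle, pickle

# "Ship" a closure as bytes: in a real system these bytes would travel
# over the network to a worker; here we just round-trip them locally.
factor = 3
payload = cloudpickle.dumps(lambda x: x * factor)

# On the "remote" side: deserialize and execute.
fn = pickle.loads(payload)
print(fn(14))  # 42
```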
ericfrenkiel, over 10 years ago
I find it very odd that the author completely glossed over the recent Spark developments in the Hadoop ecosystem. In many ways, Spark is meant to replace the Map-Reduce paradigm to enable easier access to the data.
CurtMonash, over 10 years ago
This sounds bizarre. Spark is the replacement for MapReduce, with Map and Reduce being just 2 of its 15 (?) primitives. Who needs another MapReduce engine?

The bit about "We'll get around to Hive and Pig eventually" is also odd, as they're approaching legacy status too. (Albeit not there yet; please recall that I'm somewhat of an Impala skeptic, and of course there's the Tez booster to Hive.)

Also bizarre are some of the history claims in the article, e.g. that YARN is part of more or less original Hadoop.
trhway, over 10 years ago
Writing a new implementation of batch processing misses the main point: batch processing in general is just a trade-off made in order to get things started. This is why new generations of systems are moving to online and/or stream processing.
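[Editor's sketch, not part of the thread: a minimal illustration of that trade-off. A batch job recomputes an aggregate from scratch over stored data, while a streaming version keeps the aggregate online and updates it per event. Names and data are made up.]

```python
from collections import defaultdict

events = [("clicks", 1), ("views", 3), ("clicks", 2)]

# Batch: periodically recompute the whole aggregate from stored data.
def batch_totals(stored_events):
    totals = defaultdict(int)
    for key, n in stored_events:
        totals[key] += n
    return dict(totals)

# Stream: keep the aggregate online and update it per event.
totals = defaultdict(int)
def on_event(key, n):
    totals[key] += n  # result is always current, no batch window

for key, n in events:
    on_event(key, n)

assert batch_totals(events) == dict(totals)
print(dict(totals))  # {'clicks': 3, 'views': 3}
```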
kev009, over 10 years ago
No disclosure of inspiration from Manta (https://github.com/joyent/manta)?

Also, why yet another DFS that makes excuses for shitty local filesystems instead of leveraging ZFS? Use ZFS as the base for integrity, snapshots, send/receive, and make a small service on top for managing cluster metadata.
bipin_nag, over 10 years ago
I agree with your findings on Hadoop. It is a very complicated ecosystem. The core is tied to the JVM, and the map-reduce abstractions are heavy and difficult to use. There are multiple domain-specific languages/APIs to do the same thing (Pig, Cascading), yet if you pick one you can't use the other. Hadoop feels like a tower of Babel.

But I am not in favour of creating another stack as a Hadoop replacement. Spark fills the role of distributed computation framework very well. It uses Scala, a functional language which provides neat abstractions for map, reduce, filter, and other operations. It introduces the RDD to abstract data. All the components (GraphX, Spark SQL, BlinkDB) use Scala.

Also, building a new map-reduce engine is a difficult task. The computation engine will lie in the middle of the stack, not on top as shown. Libraries will sit on top of that, so if you build an engine you will have to build everything on top of that.

See the Spark ecosystem: https://amplab.cs.berkeley.edu/software/
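[Editor's sketch, not part of the thread: the kind of abstraction being praised — a classic word count in PySpark, where map/filter/reduce-style operations compose over an RDD. Assumes a local Spark installation; input strings are made up.]

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "wordcount")

counts = (
    sc.parallelize(["a modern hadoop", "a modern htcondor"])
      .flatMap(lambda line: line.split())  # split lines into words
      .map(lambda word: (word, 1))         # pair each word with a count
      .reduceByKey(lambda a, b: a + b)     # sum counts per word
)
print(counts.collect())  # e.g. [('a', 2), ('modern', 2), ...]
sc.stop()
```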
eaxitect, over 10 years ago
TL;DR: Although I'm not a fan of the entire Hadoop system, some judgements about it here are totally wrong.

1) Do one thing well - Hadoop is so stacked up because of this (which has pros and cons). Every single component of Hadoop is designed specific to a job (there are many overlapping components, though).

2) Hadoop's main idea is to be data-local, yet this Pachy thing is not: "Data is POSTed to the container over HTTP and the results are stored back in the file system". Data-locality (almost) always wins.

3) There is no scheduler like YARN specified; fleet and etcd are not schedulers.

4) You may not like Java or JVM-based things (like I do), but there is no problem with being Java-based or JVM-based.

…
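[Editor's sketch, not part of the thread: a tiny, hypothetical illustration of what "data-local" means in scheduling terms — given a block-to-node map, prefer running a task on a free node that already holds its input block, and pay a network read only when it must. All names are invented.]

```python
# Which nodes hold a replica of each input block (hypothetical layout).
block_locations = {"block1": {"nodeA", "nodeB"}, "block2": {"nodeC"}}
free_nodes = {"nodeB", "nodeC"}

def place_task(block):
    # Prefer a free node that already holds the block: no network copy.
    local = block_locations[block] & free_nodes
    if local:
        return local.pop(), "node-local"
    # Otherwise run anywhere and stream the block over the network.
    return next(iter(free_nodes)), "remote-read"

print(place_task("block1"))  # ('nodeB', 'node-local')
print(place_task("block2"))  # ('nodeC', 'node-local')
```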
spennant, over 10 years ago
Funny... I just put up a Hadoop cluster last week and named it pachyderm.mydomain.com (I'm probably not the first to do that either).
sytse, over 10 years ago
Sounds awesome, do you have any performance stats?
thinkmoore, over 10 years ago
Still trying to wrap my head around the idea that 10-year-old software is not "modern."

That said, I have no opinion on this article.
amelius, over 10 years ago
Personally, I'd like to run purely functional jobs, and get progress information while they are running (this is not as easy as it sounds). Automatic memoization of jobs. And automatic killing of jobs and removal of data when results are no longer required (this should work transitively/recursively).
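[Editor's sketch, not part of the thread: a bare-bones illustration of job memoization for purely functional jobs. Because a pure job is fully determined by its code and inputs, a hash of both can key a result cache, so re-submitting the same job returns instantly. The cache and job are toys.]

```python
import hashlib, inspect, json

cache = {}  # in a real system: a shared, persistent store

def memoized_job(fn, *args):
    # Pure jobs are determined by their code and inputs; hash both.
    key = hashlib.sha256(
        (inspect.getsource(fn) + json.dumps(args)).encode()
    ).hexdigest()
    if key not in cache:
        cache[key] = fn(*args)  # run the job once
    return cache[key]

def job(xs):
    return sum(x * x for x in xs)

print(memoized_job(job, [1, 2, 3]))  # computed: 14
print(memoized_job(job, [1, 2, 3]))  # served from cache
```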
pjmlp, over 10 years ago
It is not modern if GNU/Linux is the only deployment option.

I love the ability to run the cluster on any OS that has a JVM available, from big commercial UNIXes, Windows, *BSD and GNU/Linux distributions.
KaiserPro, over 10 years ago
Sorry, but using HTTP for transport is just terrible.

Hadoop is a poor imitation of a mainframe. When you are processing large lumps of data you need the following:

- a task scheduler
- a message-passing system (although in some workloads this can be proxied by the scheduler or filesystem)
- file storage

In mainframe land, you have a large machine with many processors all linked to the same kernel. As it's one large machine, there is a scheduler that's so good you don't even know it's there. However, mainframes require specialist skills.

If you want a shared-memory cluster, you're going to need InfiniBand (which is actually quite cheap now, and has 40-gig-ethernet-type majigger in it as well), but people are scared of that, so that leaves us with your standard commodity cluster.

In VFX we have this nailed. Before leaving the industry I was looking after a ~25k-CPU cluster backed by 15 PB of storage. No fancy map-reduce bollocks, just plain files and scripts. Everything was connected by ten-gig ethernet. Sustained write rate of 25 gigabytes a second.

Firstly, task placement is dead simple: https://github.com/mikrosimage/OpenRenderManagement/wiki/Introduction-to-job-submission - look how simple that task mapping is. It is legitimate to point out that there are no primitives for IPC, but then your IPC needs are varied. Most log-churn apps don't really need IPC; they just need a catcher to aggregate the results. Map:reduce has fooled people into thinking that everything must be that way.

This is designed to run on bare metal. Anything else is a waste of CPU/IO. We used cgroups to limit process memory, so that we could run concurrent tasks on the same box. But everything ran the same image; anything else is just a plain arse ache.

The last item is file storage. HDFS is just terrible. The reason that it's able to scale is that it provides literally nothing. It's just a key:value store, based on a bunch of key:value stores.

If you want a fast, super-scalable POSIX filesystem, then use GPFS (or Elastic Storage). The new release has a kind of software RAID that allows you to recover from a dead disk in less than an hour (think ZFS with shards).

Failing that, you can use Lustre (although that requires decent hardware RAIDs, as it's a glorified network RAID0), or if you are lucky you can split your filesystem into namespaces that reside on different NFS servers.

Either way, Hadoop and Hadoop-inspired infrastructures are normally not the right answer. Mesos is also probably not the answer unless you want a shared-memory cluster, but then if you want this you need InfiniBand...
bra-ket, over 10 years ago
fantastic idea, unfortunate name (Pachyderm, seriously?)
TeMPOraL, over 10 years ago
Excuse me for going off-topic, but heck, Hadoop itself was released 3 years ago. It's a testament to how insanely fast things change on the web that Hadoop is no longer considered "modern"...
michaelochurch, over 10 years ago
This sounds promising.

I'd love to see a Haskell-centric (performance and robust code) but ultimately language-agnostic alternative to the JVM-heavy stack that's currently in vogue. (Spark is a promising and powerful technology.) I choose Haskell because pretty much anything that the JVM does well (e.g. concurrency) Haskell also does well and, while I like Clojure and can enjoy Scala when written by competent people, Java is ugly and its community has worse aesthetic sense than a taint mole.