
Don't use Hadoop when your data isn't that big

711 points, by gcoleman, over 11 years ago

63 comments

w_t_payne, over 11 years ago

Hooray! Some sense at last.

I have worked for at least 3 different employers that claimed to be using "Big Data". Only one of them was really telling the truth.

All of them wanted to feel like they were doing something special.

The sad thing is, they were all special, each in their own particular way, but none of what made each company magic and special had anything to do with the size of the data they were handling.

Hadoop was required in exactly zero of these cases.

Funnily enough, after I left one of them, they started to build lots of Hadoop-based systems for reasons which, as far as I could fathom, had more to do with the resumes of the engineers involved than the actual technical merits of the case.

Sad, but 'tis the way of the world.

mightybyte, over 11 years ago

I agree with the general thrust of this article. But Hadoop isn't just for scaling up the absolute size of the data set. It is also useful for scaling up the absolute amount of CPU power you can throw at a problem. If I have a 1 GB data set, but the computations I need to do on that data set are complex enough that a single machine would take a long time to do them, then Hadoop is still useful. I gain tremendously by being able to fire up 100 extra-large EC2 servers and run my computation much more quickly than I could with SQL or Python on a single machine.

Now some might counter this point with the observation others have made here that using Hadoop imposes a ~10x slowdown. But even then, my 100 EC2 servers will get the job done 10x faster. Running a job in 1 hour with Hadoop is MUCH better than running the same job in 10 hours without it, especially when you're doing data analysis and you need to iterate rapidly.

So there is a point where using Hadoop is not productive. But that limit is not 5 TB, and it depends on a lot more variables. Oversimplification makes for catchy blog posts, but is rarely the way to make good engineering decisions.

bhauer, over 11 years ago

The point of the article that resonates with me is how frequently a technology that is a poor fit for a problem domain is selected because of conventional wisdom rather than data.

Relatedly, it is remarkable how we developers routinely cite Knuth's advice about premature optimization to justify our decision when the shoe fits, and then turn around and flatly ignore the advice when it doesn't fit.

Selecting Hadoop before you have a specific and concrete need for it -- or see that need approaching rapidly on the horizon -- is in my experience often and surprisingly coupled with a disdain for other performance characteristics (because Knuth!). The developers prematurely selecting Hadoop as their data management platform will routinely be the same developers who believe it's reasonable for a web application with modest functionality to require dozens of application nodes to service a concurrent request load measured in the mere thousands. The sad thing here is that application platforms and frameworks are not all that dissimilar; today, selecting something with higher performance in the application tier is relatively low-friction from a learning and effort perspective. But it's often not done. Meanwhile, selecting Hadoop on the data tier is a substantially different paradigm versus the alternative (as the article points out), so you have some debt incoming once you make that decision. And yet, this is done often enough for many of us to recognize the problem.

In my experience, for a modest web application, it's better to focus resources and effort on avoiding the common and stupid mistakes that lead to performance and scale pain. Selecting Hadoop too early doesn't really do a whole lot to move the performance needle for a modest web application.

Trouble is, many web businesses are blind to the fact that they are a modest concern and not the next Facebook.

hackula1, over 11 years ago

In 99% of the cases I have seen where people are working with tables in the 5+ TB range for analysis, there is some obvious way to compress the data that they have overlooked. Most analysts find some way to aggregate a dataset once, then do the actual work on that aggregated dataset rather than the raw data. In geospatial analytics, for example, a trillion records can be aggregated down to census blocks/block groups so you only have a few million records to deal with. The initial aggregation often takes several days, but after that you can calculate most things in a few seconds with reasonable hardware.

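A minimal pandas sketch of the aggregate-once pattern described above. The file name and columns (lat, lon, value) are hypothetical, and a simple rounded-coordinate grid stands in for a real census-block join:

```python
import pandas as pd

# One expensive pass over the raw data: stream it in chunks and aggregate
# each chunk down to coarse spatial buckets (a stand-in for census blocks).
buckets = []
for chunk in pd.read_csv("raw_points.csv", chunksize=1_000_000):
    chunk["block"] = (
        chunk["lat"].round(2).astype(str) + "," + chunk["lon"].round(2).astype(str)
    )
    buckets.append(chunk.groupby("block")["value"].agg(["sum", "count"]))

# Combine the per-chunk partial aggregates into one small table and persist it.
agg = pd.concat(buckets).groupby(level=0).sum()
agg["mean"] = agg["sum"] / agg["count"]
agg.to_csv("aggregated_by_block.csv")

# All subsequent analysis runs against the small aggregated table.
```
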
davidmr, over 11 years ago

While I couldn't agree more with the general point of the article, I have some small additional comments.

Just as a bit of background, I think that Chris would very much agree that I am not the intended recipient of this advice, and so my comments probably aren't keeping in the spirit of the article. I've spent the last 10+ years exclusively in very large HPC environments where the average size of the problem set is somewhere between 500TB and 10PB, and usually much closer to the latter than the former.

I think that, for the types of problems Chris mentions, for small data sets, Hadoop is as silly a solution as he claims, and for the large map-reduce problem set (divide and conquer using simple arithmetic) of 5TB+, he's clearly in the right. Periodically I peruse job postings to see what is out there, and I'm personally ashamed at what many people call "big data". But just because your problem set doesn't fit the traditional model of big data (incidentally, I'm having trouble thinking of a canonical example of big data -- perhaps genome sequencing? astronomical survey data?) doesn't mean that a) Hadoop is not the right solution, or that b) it's best done on a box with a hard drive and a Postgres install, pandas/scipy, whatever.

Take for example a 4TB data set. It is defined such that it would fit on a 4TB hard drive, but if your problem involves reading the entire set of data and not just the indexes of a well-ordered schema, you're still going to have a bad time if you want it done quickly. And if you have a parameterization model that requires each permutation to be applied across the entire sequence of the data, rather than chunks you can load into memory and then move on from, you're going to have a really bad time.

We'll be generous and say that the 4TB drive can do 150MB/s. A single run through the data at maximum efficiency will cost you ~8 hours. Since we've restricted ourselves to a single box, we're also not going to be able to keep the data in memory for subsequent calculations/simulations/whatever.

I suppose all of this is to say that the amount of parallelization a problem requires isn't only related to the size of the problem set, as the article mostly suggests, but also to the inherent CPU and IO characteristics of the problem. Some small problems are great for large-scale map-reduce clusters; some huge problems are horrible for even bigger-scale map-reduce clusters (think fluid dynamics, or anything that requires each subdivision of the problem space to communicate with its neighbors).

I've had a quote printed on my door for years: Supercomputers are an expensive tool for turning CPU-bound problems into IO-bound problems.

beagle3, over 11 years ago

Indeed.

A rule of thumb that I've inferred from many installations: just introducing Hadoop makes everything 10 times slower AND more expensive than an efficient tool set (e.g. pandas).

So it only makes sense to start hadooping when you are getting close to the limit of what you can do with pandas -- everything you do before that is a horrible waste of resources.

And when you do get there -- often, a slightly smarter distribution among servers and staying with e.g. pandas will let you keep scaling up without introducing the /10 factor in productivity. Although it might be unavoidable at some point.

rdtsc, over 11 years ago

"We don't have big data" or "our data is rather small" -- said no dev team ever.

"Big data" is like "cloud": it's a cool label everyone applies to their system, just like OO was in its time. Well, once they've applied the label they feel they need to live up to it, so "we gotta use what big data companies use" -- and they pick Hadoop. I've heard of Hadoop being used when MySQL, SQLite or even flat files would have worked.

mattjaynes, over 11 years ago

Novelty Driven Development (NDD)

Chris points out a great example of NDD here with Hadoop.

I do a lot of client work and I see this mistake CONSTANTLY. So often, in fact, that I recently wrote up a story to illustrate the problem. Rather than use a tech example, I use a restaurant and plumbing to drive the point home. When the same scenario is put into the context of something more concrete like physical plumbing, it shows how ridiculous NDD really is.

http://devopsu.com/blog/boring-systems-build-badass-businesses/

eksith, over 11 years ago

"A 2 terabyte hard drive costs $94.99, 4 terabytes is $169.99. Buy one and stick it in a desktop computer or server. Then install Postgres on it."

Done! Although with more drives and a backup server. Right now we're pushing 15TB with no loss in performance.

jamii, over 11 years ago

The paper introducing GraphChi has dozens of examples of GraphChi on a Mac Mini running rings around 10-100 machine Hadoop clusters: http://graphlab.org/graphchi/

I worked on a project earlier this year that was envisioned as a Hadoop setup. The end result was a 200-LOC Python script that runs as a daily batch job: https://github.com/jamii/springer-recommendations

I'm tempted to set up a business selling 'medium data' solutions for people who think they need Hadoop.

alanctgardner2, over 11 years ago

I'm always torn by these headlines: yes, many organizations lack the size of data required to take advantage of Hadoop. But few of the articles bother explaining the advantages of Hadoop, and how what you're doing really moves the break-even point in terms of data size:

- 3x replication: if the data needs to be retained long-term, slapping it on one hard drive isn't going to cut it. This is pretty poor justification by itself, but it's nice to have.

- Working set: if you only pull 1GB out of your data set for your computations, it makes sense to pull data from a database and run Python locally. If you need to run a batch job across your full, multi-TB data set every day, then Hadoop starts looking more attractive.

- Data growth: a company may only have 10GB of data now, but how much do they expect to have in a year? It's important to forecast how much data you'll accumulate in the future, especially if you want to throw all your logs/clickstreams/whatever into storage.

So, if you're expecting explosive growth, you want to hang on to every piece of data ever, or you're going to do a lot of computation across the whole dataset, it makes sense to adopt Hadoop even if your dataset isn't 'big' to start.

As for this article, the author undersells MapReduce a bit. Human-written MR jobs can jam a lot of work into those two operations (and a free sort, which is often useful). Using a tool like Crunch can turn really complicated jobs into one or two phases of MR. Once Tez is widely available, people won't even write MR anymore; they'll likely all write a 'high-level' language and compile it down to Tez.

noelwelsh, over 11 years ago

In these discussions it is mandatory to quote this paper: http://research.microsoft.com/pubs/163083/hotcbp12%20final.pdf

"We completely agree that Hadoop on a cluster is the right solution for jobs where the input data is multi-terabyte or larger. However, in this position paper we ask if this is the right path for general purpose data analytics? Evidence suggests that many MapReduce-like jobs process relatively small input data sets (less than 14 GB). Memory has reached a GB/$ ratio such that it is now technically and financially feasible to have servers with 100s GB of DRAM. We therefore ask, should we be scaling by using single machines with very large memories rather than clusters? We conjecture that, in terms of hardware and programmer time, this may be a better option for the majority of data processing jobs."

Their data is based on Hadoop jobs running at Yahoo, Facebook, and Microsoft -- companies most would agree do have real Big Data -- and they find the median job size is <14GB.

PaulHoule, over 11 years ago

A study of jobs submitted to the Yahoo! cluster showed that the median job involved 12GB of data.

There's really nothing wrong with that at all, because, breaking on 64MB blocks, that 12GB can be processed in parallel, which means turning an answer around really quickly -- say 30 seconds or so. Usually the work can be scheduled on machines that already have the necessary input, so the network cost is low.

Now, it might not be worth it for one hacker to build a Hadoop cluster to do that one job, but if you have a department-wide or company-wide cluster you can just submit your jobs, get quick answers, and let somebody else sysadmin.

Sure, the M/R model is limited, but it's a powerful model that is simple to program. You can write unit tests for Mappers and Reducers that don't involve initializing Hadoop at all, and THAT speeds up development.

Yes, it is easy to translate SQL jobs to M/R, but M/R can do things that SQL can't do. For instance, an arbitrary CPU- or internet-intensive job can easily be embedded in the map or in the reduce, so you can do parameter scans over ray tracing or crack codes or whatever.

I built my own Map/Reduce framework optimized for SMP machines and ultimately had my 'shuffle' implementation break with increasing input size. At that point I switched to Hadoop because I didn't plan to have time to deal with scalability problems.

https://github.com/paulhoule/infovore/wiki

With cloud provisioning, you can run a Hadoop cluster for as little as 7.5 cents, so it's a very sane answer for how to get weekly batch jobs done.

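The testability point above is about Hadoop's Java API, but the idea carries over to any map-reduce code: keep the map and reduce steps as pure functions and unit-test them with no framework running. A minimal Python sketch, using a hypothetical word-count example:

```python
def map_words(line):
    # Mapper: emit (word, 1) pairs for one input line.
    for word in line.lower().split():
        yield word, 1

def reduce_counts(word, counts):
    # Reducer: combine all the counts emitted for a single key.
    return word, sum(counts)

def test_mapper_and_reducer():
    # No cluster, no job runner -- just plain function calls.
    assert list(map_words("the cat the hat")) == [
        ("the", 1), ("cat", 1), ("the", 1), ("hat", 1)
    ]
    assert reduce_counts("the", [1, 1]) == ("the", 2)

test_mapper_and_reducer()
```
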
mrcactu5, over 11 years ago

I remember starting to grasp what "big data" meant when I had a phone interview with Twitter.

@ Imagine you have some numbers spread over some computers -- too many to fit in one computer. Find the median.

▪ Uhh, sort them?

@ Can you find the median on a single computer without sorting them?

▪ :-(

@ We'll call you back tomorrow.

I was promptly rejected, but it set the tone for my later studies.

The criterion for Big Data seems to be that it fits on thousands of computers -- perhaps several TB or a PB. Then I had to think of some examples:

* A million YouTube videos

* All the tweets in the US in the past 15 minutes

* All US tax records

I still think the map-reduce philosophy is really cool. And I know at that scale there are special counting algorithms (like Bloom filters) that may lead to some improvements at the GB or MB scales.

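One standard approach to that interview question, sketched below: binary-search over the value range, and have each machine report only a count of its elements below the pivot, so no machine ever sorts or ships its data. The lists stand in for per-machine shards; the data and names are hypothetical:

```python
def count_below(shard, pivot):
    # Each "machine" answers a tiny query: how many of my numbers are < pivot?
    return sum(1 for x in shard if x < pivot)

def distributed_median(shards, tolerance=1e-7):
    n = sum(len(s) for s in shards)
    target = (n - 1) // 2                      # rank of the lower median
    lo = min(min(s) for s in shards)
    hi = max(max(s) for s in shards)
    while hi - lo > tolerance:
        mid = (lo + hi) / 2
        below = sum(count_below(s, mid) for s in shards)
        if below <= target:
            lo = mid                           # median is at or above mid
        else:
            hi = mid                           # median is below mid
    return lo

shards = [[9, 1, 7], [3, 5], [2, 8, 6, 4]]     # 9 numbers "spread over" 3 machines
print(distributed_median(shards))              # converges to ~5, the true median
```
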
gfodor, over 11 years ago

The author has a point, but dataset size is one dimension of several you need to consider when making the choice to use Hadoop.

A local Python script is great, but what if it takes 2 or 3 hours to run? Now you need to set up a server to run Python scripts. What if the data is generated somewhere that would have high locality to a Hadoop cluster? Now you need to pull that data down to your laptop to run your job. What if there are a dozen people running similar jobs? Now your Python script server is a highly stressed single point of failure. What if the data is growing 100% month-over-month? Your Python scripts are going in the trash soon, since they were not written in a way that can easily be translated to map-reduce, and Hadoop-sized workloads are inevitable.

The next step up is a centralized database, but in my experience running your own (large, heavily used) database is a whole lot harder than just throwing files on S3 and spinning up Hadoop clusters on EC2, if you have people who can write Pig jobs.

A solution like Elastic MapReduce removes a lot of practical problems such as data distribution, resource management, and system operations, beyond the fact that it makes it possible to easily run jobs over terabytes of data at a time.

geertj, over 11 years ago

Amen! Finally someone talking sense. Apart from being a hype, Hadoop is also a wet dream for VCs that want to have "exposure" to "big data".

Hadoop is a solution for cases where you have multiple petabytes of data, with queries that need to touch a significant portion of your data. Roughly speaking, in that case your execution time will scale with the number of nodes in your cluster. The classic example is creating the inverted word list for a search engine.

For most other use cases, including all cases where you can index your data, you do not need Hadoop.

ivanprado, over 11 years ago

You are right that the proper tool should be used for each particular problem. And the Hadoop world is harder than single-machine systems (like pandas). So you shouldn't use Hadoop if you can do the job with simpler systems.

But I have something to add. Hadoop is not only introducing new techniques for distributed storage and computation. Hadoop is also proposing a methodological change in the way a data project is approached.

I'm not talking only about doing some analytics over the data, but about building an entire data-driven system. A good example would be building a vertical search engine, for example for classified ads around the world. You can try to build the system just using a database and some workers dealing with the data. But soon you'll find a lot of problems managing the system.

Hadoop provides all the storage and processing power that you want (it is a matter of money). So why not build your system in a way where you always recompute everything from the raw input data? That can be seen as something stupid: why do that if you can run the system with fewer resources?

The answer is that with this approach you can:

- Be human-fault-tolerant. If somebody introduces a bug in the system, you just have to fix the code and relaunch the computation. That is not possible with stateful systems, like those based on doing updates over a database.

- Be very agile in developing evolutions. Changing the whole system is not traumatic, as you just have to change the code and relaunch the process with the new code, without much impact on the system. That is not something simple in database-backed systems.

The following page shows how a vertical search engine would be built using Hadoop and what its advantages would be: http://www.datasalt.com/2011/10/scalable-vertical-search-engine-with-hadoop/

bitL, over 11 years ago

I couldn't disagree more with some of the statements in the article.

Hadoop is not a database! It's a parallel computing platform for MapReduce-style problems that can preserve locality. If your problem fits this, Hadoop absolutely rocks. If your problem is different, then please use another tool. If your problem deals, for example, with high-resolution geographic or LiDAR data that can easily be processed independently (and that would easily give you a few petabytes per scan/flight, so you can't stuff it into a GPU), Hadoop is about the only open thing you can use to process it reliably (imagine having Earth surface data at a resolution of 1cm and needing to prepare multiple levels of detail, simplify geometry, perform object recognition, identify roads, etc.). Even when your data is smaller, if your problem fits the MapReduce model well, Hadoop is a pretty convenient way to be future-proof while enjoying an already mature application infrastructure. Why would you even bother working with toys that put everything into memory and then fail miserably in production (HW failure, needing to reseed data after a crash, etc.)?

I worked at a company that routinely processed these kinds of data; usually, people who accept the thinking in this article hit a wall someday in production, couldn't guarantee reliability, and ended up writing endless hacks for their algorithms that didn't scale when it was needed and became frustrating bottlenecks for everyone.

Yes, I have also seen some ridiculous uses of Hadoop (a database that didn't have a chance of ever growing past 20M records, problems not fitting MapReduce that needed custom messaging between jobs, etc.). Just use your reason properly: whatever has the potential to handle large data in the future, do it with Hadoop or any appropriate system that supports your algorithmic model (S4, Kafka, OrientDB, Storm, etc.) straight away.

Make your software future-proof now, or you'll have to rewrite it from scratch when you are under huge pressure. Don't become complacent with what you "know" now.

jdk, over 11 years ago

Back in college in '97 or so, in our databases class on the first day, the professor asked, "Who's worked with databases before?" A bunch of hands went up. "Oh, sorry, let me rephrase: who's worked with databases larger than a few dozen gigs?" Only one or two hands remained up. "If it's smaller than that, just save yourself the effort and use a flat file instead."

15 years later and it's the same thing, plus a few orders of magnitude.

zenbowman, over 11 years ago

For a lot of people, what the author says is absolutely right, but I think a lot of the comments here suggesting that only a handful of institutions are solving Hadoop-scale problems are simply inaccurate.

Yes, there are companies that are trying to appear more attractive by using Hadoop, but there are plenty of cases where Hadoop is replacing ad-hoc file storage on multiple machines.

Its primary use is as a large-scale filesystem, so if you are running up against problems storing and analyzing data on a single box, and you feel the amount of data you have will continue to accelerate, it is a good option for file storage. It doesn't replace your database, it complements it, and there's work being done to allow large-scale databases on top of Hadoop, although the existing ones aren't mature yet. But there are a lot of institutions taking on problems that a single-box setup cannot handle.

And MapReduce isn't a bad programming model, but it should be thought of as the assembly language of Hadoop. If you are solving a particular problem on Hadoop, writing a DSL for it is the way to go, or see if one of the existing DSLs fits your needs (Hive, Pig, etc.).

dbecker, over 11 years ago

I'd love to understand how the Hadoop hype and marketing machine generated so much unwarranted interest in Hadoop.

I'm witnessing a feeding frenzy for Hadoop talent in situations where there's absolutely no need for Hadoop, and I can't recall anything like this for any other software.

bborud, over 11 years ago

Hehe, most of the Hadoop installations I've seen chew through data amounts that I used to process 10 times as quickly using dirty, rotten Perl scripts, sort, cat and flat files :)

lmm, over 11 years ago

So the OP *claimed Hadoop skills*, the interviewer *asked him to use Hadoop* and gave him a small example problem. He then *didn't use Hadoop*, and thinks there's something wrong with the interviewer for objecting to this?

Interview problems are sometimes kinda artificial, no shit. Given the impracticality of giving every candidate the kind of dataset Hadoop would be needed for, how would the OP suggest an employer test for Hadoop skills?

mindcrime, over 11 years ago

On a related note, I find it amusing that people seem to think that "Big Data == Hadoop". In actuality, there are plenty of other approaches to scaling out clusters to handle large jobs, including MPI [1] and OpenMP [2], as well as BSP (Bulk Synchronous Parallel) [3] frameworks.

[1]: http://en.wikipedia.org/wiki/Message_Passing_Interface

[2]: http://en.wikipedia.org/wiki/OpenMP

[3]: http://en.wikipedia.org/wiki/Bulk_synchronous_parallel

rozim, over 11 years ago

See 'Nobody ever got fired for using Hadoop on a cluster': http://research.microsoft.com/pubs/163083/hotcbp12%20final.pdf

jacquesm, over 11 years ago

Spot on. I recently audited a project that was using an over-the-top technical solution for a problem that would -- with only minor nuts-and-bolts work -- have fit easily on a single machine instead of a cluster, and it would have run much faster too. Demonstrating this made the case, and they've since happily converted. You can buy off-the-shelf machines with 256G of RAM at reasonable (for large values of reasonable) cost, with IO speeds to match if you equip them with SSDs.

Big Data to me means at minimum 10s of TB, and what big data means changes over time, so today's big data will fit on the laptop of the day after tomorrow.

paul_f, over 11 years ago

This can be a confusing topic. Hadoop is several things: a NoSQL data store, map-reduce, and a global file system. NoSQL and MapReduce can be quite valuable, even on a single server. CouchDB runs on Android, for example.

If you don't need a global file system, use MariaDB, CouchDB or Mongo, depending on your use case.

naiquevin, over 11 years ago

I have no experience with Hadoop at all, and this may be slightly off topic, but it reminds me of a post titled "Taco Bell Programming" [1], after reading which I started learning and using Unix tools and commands much more than before, instead of writing silly Python scripts for almost anything that needed automation.

[1]: http://web.archive.org/web/20110220110013/http://teddziuba.com/2010/10/taco-bell-programming.html

sitkack, over 11 years ago

Hadoop is the problem; MapReduce is not the problem. Having used both Hadoop and Disco, I can say that Disco has been by far a net positive on all the projects I used it on. And the overhead of coding in Disco vs. single-node is about an extra 30 minutes. You can start with working single-node code and go multi-node without much effort.

http://discoproject.org

Hadoop, on the other hand, is a huge, massive pain in the ass. And I am a Hadoop consultant. I recommend that most customers NOT use it.

lgieron, over 11 years ago

The article focuses on data sizes and completely ignores per-row computation time requirements. In my case, our dataset is just 1-2 TB, but we need hundreds of cores to process it within a reasonable timeframe -- hence Hadoop.

deathflute, over 11 years ago

Well-written article. I think most people who do not have a background in data are unaware of the various options out there and fall for the marketing behind Hadoop-like tools.

I would urge people doing analytics to take a look at kdb+ from Kx. Unless you have ridiculously large amounts of data (>200 TB), I can bet that you would be better off with kdb. The only downside is that it costs a lot of money, which is a pity.

karl_gluck, over 11 years ago

I love how whoever mods this site now doesn't even have to follow their own rule about not editorializing headlines.

Mods: Don't be hypocrites. If you're going to enforce your "only use the source title" trash on us, follow it yourself.

Original title: "Don't use Hadoop - your data isn't that big"

Mod-invented title: "Don't use Hadoop when your data isn't that big"

acidity, over 11 years ago

I have been looking into building a simple recommendation engine (I have at most a million data rows) using Python. I looked into Crab (https://github.com/muricoca/crab) and it seems not to have been updated for 2 years.

Any suggestions for libraries, or should I just use basic numpy/scipy and implement the algorithms?

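For the plain numpy/scipy route, a minimal item-based collaborative-filtering sketch is below. The toy ratings matrix and names are hypothetical; at a million rows you would build the same item-item cosine similarities from a scipy.sparse matrix instead of a dense array:

```python
import numpy as np

# Rows = users, columns = items; 0 means "not rated". Hypothetical toy data.
ratings = np.array([
    [5, 4, 0, 0],
    [4, 5, 1, 0],
    [0, 1, 5, 4],
    [1, 0, 4, 5],
], dtype=float)

# Cosine similarity between item columns.
norms = np.linalg.norm(ratings, axis=0)
norms[norms == 0] = 1.0
item_sim = (ratings.T @ ratings) / np.outer(norms, norms)

def recommend(user_idx, top_n=2):
    # Score unseen items by similarity-weighted ratings of the user's seen items.
    user = ratings[user_idx]
    scores = item_sim @ user
    scores[user > 0] = -np.inf          # never re-recommend already-rated items
    return np.argsort(scores)[::-1][:top_n]

print(recommend(0))                     # item indices recommended for user 0
```
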
jimbokun, over 11 years ago

Does anyone use Hadoop for job management?

We have millions of XML documents in a document database. Many of the questions we want to ask about those documents can be answered through the native database querying capabilities.

But there are always questions falling outside the scope of the query capabilities that could be answered by a simple map function applied to each document, with a reduce to combine the results.

It seems like a pain to always query for the documents you want to process, find somewhere to store them on disk, then run a program locally to get the result, vs. writing a map and reduce job and pointing it at the documents in the database (this document store has a Hadoop integration API). Hadoop also seems to have a lot of nice job frameworks, monitoring tools, and APIs for tracking job progress.

Does anyone have a similar situation where you used Hadoop just to get job management, tracking, and flexibility in performing data analysis tasks? Are there easier ways to accomplish this goal?

rcavezza, over 11 years ago

I understand and agree with the author's main point that many companies using big data technologies do not need them.

I do not agree that the tools are inferior to SQL. Hive is really close to SQL, and Pig is extremely powerful. I would take a look at a few of the recent updates to these tools before declaring them inferior to SQL.

ianstallings, over 11 years ago

I used a big data option on my last project because marketing expectations were gigantic and the hype around the project was also enormous. Looking back, it was a poor choice because the expectations never panned out and we could have saved time and effort using a more traditional, well-known SQL database like PostgreSQL. Before that I had a fairly large project with ~1M user profiles running with no problems on PostgreSQL. I think it would have handled the latest project with ease, and could have been sharded and scaled to handle the growth it's seeing now.

But marketing insisted we use *big data* because "regular" databases couldn't handle such enormous possibilities. I'll never believe that nonsense again. It wasn't a terrible ending, but it was more hassle than it was worth IMHO. At least I got some resume material out of it.

aheilbut, over 11 years ago

And even if it is, consider using Spark/Shark instead.

sologoub, over 11 years ago

This goes back to the old adage: "the right tool for the job".

As @davidmr points out, there are jobs on smaller data sets that can still benefit from the distributed nature of HPC.

That said, my own experience with startups echoes much more what the OP writes -- Python scripts and CSV processing have saved me days of headaches in resource-constrained environments. I was able to quickly produce analysis and make crucial scaling decisions using data that would have taken days of engineering resources to produce otherwise. I happened to have the right tool for that job handy, and it worked out great.

You really have to think things through before restricting yourself to any specific direction.

iblaine, over 11 years ago

I guess I haven't been around enough small startups (fewer than 100 people), because I hardly get the sense that people are haphazardly spinning up Hadoop clusters. People generally pick the right tools for the right jobs. This is particularly true in the data warehousing world.

That being said, I can see problems where people pick Hadoop without knowing how it's going to integrate into their systems 1-3 years down the road. Particularly with cloud computing these days, you can easily bring large, complex systems online with little effort. It's cool and scary at the same time.

agibsonccc, over 11 years ago

Very good points. I liked learning how to use Hadoop academically, but it's not an end-all be-all tool.

If you want to do map-reduce, it's perfectly reasonable just to use something smaller scale on multiple cores to do some kind of data processing.

Another thing is real-time processing: using something like Storm (http://storm-project.net/) or even parallelism-based systems like Akka on the JVM, or Go, will give you adequate performance. Hadoop has a lot of overhead, in not only operations but also job startup.

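A minimal sketch of that smaller-scale option: the same map/shuffle/reduce shape on one machine's cores using only the standard library, no cluster involved. The input file names are hypothetical:

```python
from collections import Counter
from functools import reduce
from multiprocessing import Pool

def map_chunk(path):
    # "Map": count words in one input file.
    with open(path) as f:
        return Counter(f.read().lower().split())

def merge(a, b):
    # "Reduce": merge two partial counts into one.
    a.update(b)
    return a

if __name__ == "__main__":
    paths = ["logs/part-0.txt", "logs/part-1.txt", "logs/part-2.txt"]
    with Pool() as pool:
        partials = pool.map(map_chunk, paths)    # map phase, one core per file
    totals = reduce(merge, partials, Counter())  # reduce phase on the driver
    print(totals.most_common(10))
```
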
capkutay, over 11 years ago

This isn't directly related, but is Hadoop the only Java-based solution for parallel computing? I've seen some examples of people attempting to do work in Java MPI [0] again. It seems like dealing with the GC and memory management in general has been an issue when trying to do high-performance computing in Java, especially at a distributed scale.

[0]: http://blogs.cisco.com/performance/mpi-and-java-redux/

calinet6, over 11 years ago

My company's data is that big, thank you very much.

But if yours isn't, sure, don't use a system designed for handling astronomical data. Should be common sense; probably isn't.

CurtMonash, over 11 years ago

While the point in the headline is fine, the supporting reasoning is in places dubious.

Don't use SQL for anything over 5 TB? Huh? You can put a lot more data than that on a node with a nice open-source columnar analytic DBMS, and of course there are a lot of MPP relational analytic DBMSs as well.

SQL on Hadoop requiring full table scans? Well, that's what Impala is for. Hadapt is more mature than Impala. Stinger is coming along, and is open source.

Nice marketing line, however.

liranz, over 11 years ago

So true! I would upvote it twice if I could.

Too many startups move to Hadoop/NoSQL solutions before the overhead is actually justified. SQL for most of the data, with a bit of Redis and numpy for background processing, will take you much further than most people assume.

It's fun to think you must have DynamoDB, Hadoop or a Cassandra backend, but in real life you're better off investing in more features (or analytics!).

fuziontech, over 11 years ago

I agree with this guy's main point of using the right tool for the job, but he hates on Hadoop waaay too much -- even overlooking how simple you can make building out MapReduce jobs in Python using something like MRJob, or just using Hive if SQL is really your fancy. Hadoop has its place and, as with any tool, can be the hammer that makes everything look like a nail.

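For context, the kind of MRJob job being alluded to is roughly the canonical word-count shape below -- a sketch, not production code. The file name is hypothetical; such a script runs locally by default (python wordcount.py input.txt) and can be pointed at a cluster with -r hadoop or -r emr:

```python
from mrjob.job import MRJob

class MRWordCount(MRJob):

    def mapper(self, _, line):
        # One input line in, many (word, 1) pairs out.
        for word in line.split():
            yield word.lower(), 1

    def reducer(self, word, counts):
        # All counts for one word arrive together; sum them.
        yield word, sum(counts)

if __name__ == "__main__":
    MRWordCount.run()
```
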
nfa_backward, over 11 years ago

The author is missing a big gap between 5TB and 1PB. For most workloads, I would not look to Hadoop at the 5TB+ scale of data. I would first look at Impala or Redshift.

http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-time-queries-in-apache-hadoop-for-real/

leif, over 11 years ago

More than 5TB doesn't mean you need Hadoop either; it means you need a better storage system, like TokuDB/TokuMX or LevelDB. These technologies can index the data so that you can run selective queries instead of reading all the data for every query like Hadoop would, and they can compress it so that you can keep everything on one disk a while longer.

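To illustrate the selective-query point (with the standard library's sqlite3 standing in for TokuDB/LevelDB, and a hypothetical events table): once the data is indexed, a query reads only the matching rows rather than scanning everything.

```python
import sqlite3

conn = sqlite3.connect("events.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS events (
        ts     INTEGER,
        user   TEXT,
        action TEXT
    )
""")
# The index is what lets later queries touch a tiny slice of the data.
conn.execute("CREATE INDEX IF NOT EXISTS idx_events_user ON events (user)")

# A selective query: only rows for one user are read, not the whole table.
rows = conn.execute(
    "SELECT ts, action FROM events WHERE user = ? ORDER BY ts LIMIT 10",
    ("alice",),
).fetchall()
```
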
antonmks, over 11 years ago

A big advantage of Hadoop is that you do not need to pay license fees, which in the case of relational databases can reach into the tens of thousands of dollars.

The big disadvantage is that Hadoop is two orders of magnitude slower than relational databases. Also, Hadoop clusters are not what one would call a "green" solution. More like a terrible waste of computing resources.

fiatmoney, over 11 years ago

Hadoop / MapReduce was invented for situations where the data is being generated on the machines (e.g., via a distributed web crawl). If you're not generating the data in situ and have to ETL it anyway, it makes just as much sense to load the data onto one Monster Box with a terabyte of RAM and 48 CPU cores. You massively save on complexity.

CmonDev, over 11 years ago

There is no problem with CV-driven development.

TeeWEE, over 11 years ago

I sort of agree. However, when you think you will grow into the TB range, you might as well do it in Hadoop right away.

We are using Hive and HiveQL and have SQL-like queries which generate the correct output. The result: we don't have to hassle with Hadoop mappers and reducers, and we can write our "queries" in a human-readable fashion.

krosaen, over 11 years ago

Great points in this article -- though using something like Cascalog makes Hadoop suck a lot less, e.g. composable, more complex queries closer in power to SQL. It wouldn't be quite as crazy to use on smaller datasets, be it just for fun or to prove you are ready if/when your dataset grows large enough.

progx, over 11 years ago

Thanks! As always: use the right tool for the job. But many customers don't understand this simple rule.

lemmsjid, over 11 years ago

While there is a point to be made here, this article does not make it. Or perhaps it goes too far in attempting to make it, to the point where I feel it might tip people in the wrong direction.

The point of the article is taken if:

A) Your data is not large.
B) You aren't creating large intermediary datasets with the data.
C) You aren't running an increasingly large number of analysis jobs on the data.
D) Your computational overhead is small.
E) Your memory overhead is small (this requires an asterisk, because some tasks that require extreme amounts of memory will not work well in Hadoop and should be brought outside).
F) You don't need or want a system to track the increasingly large number of analysis jobs you're running.
G) You can guarantee you won't outgrow A, B, and C, which would force you to rewrite all your code.

G is especially difficult because it's hard to predict. F is always underestimated at the beginning of a project and bites you later. Yes, you can write analysis scripts -- but what happens when there are a hundred of them, written by different developers? Time to write a job-tracking system, with counters, retries, notification, etc. Like Hadoop.

To further D and E: there are workloads that are relatively straightforward across terabytes of data, and there are workloads that are expensive over gigabytes of data (especially those involving the creation of intermediate indices, which is where MR itself speeds things up considerably, especially if done in parallel).

Also, in a critique of Hadoop the article obsesses over MapReduce (in a way, conflating Hadoop and MapReduce, just as it conflates 'SQL' with a 'SQL database'), ignoring the increasingly powerful tools that can be used, such as Hive, Pig, Cascading, etc. Do those tools beat a SQL database in flexibility? The answer is that the question is not really relevant. If you already understand the nature of your data, and you've gone through the very difficult act of designing a normalized schema that fits what you need, then you're in a good place. If you have a chunk of data whose potential has not yet been unlocked, or to which writes happen too quickly to justify the live indexing implied by a database, then Hadoop is an essential tool. They really sit next to one another.

None of this is to knock writing analysis scripts against local data. I do that all the time. In fact, often I'll ship data from HDFS to the local system so I can write and run a script. I just think it's important at a company to make sure your people have access to good tools so there aren't hurdles in front of them, and when it comes to data analysis I've come to the opinion that you really want a Hadoop cluster set up next to your SQL databases and your other tooling, because it will become useful in sometimes unpredictable ways.

Yes, if there are a few hundred megabytes in front of you and you need to analyze them, then write a script -- and were I interviewing someone for a job I would not hesitate to accept a script that solves a data analysis task, so clearly the people the author interacted with were somewhat myopic. But most companies require that an ecosystem be built to handle the increasing complexity that will ensue over the years. And Hadoop is a huge bootstrap to that ecosystem, regardless of data size.

akanet, over 11 years ago

Perhaps this is the beginning of the resurgence of the "Small Data Expert".

trimbo, over 11 years ago

Another chance to plug my favorite command-line small-data toolset, Crush Tools. I used it often at Groupon.

https://code.google.com/p/crush-tools/

samspenc, over 11 years ago

Great article. I would say 1 TB is really the absolute minimum at which you should start considering Hadoop -- for anything less, use Python or variants (for flat files) or MySQL or Postgres (for relational data).

rburhum, over 11 years ago

You would be surprised how much processing 400GB of geodata sometimes needs. Seven days on a normal machine to do some non-trivial analysis of OpenStreetMap data. Hadoop can reduce that to hours.

sgt101, over 11 years ago

Key point: Hadoop storage is bloody cheap compared with a SAN -- as in $1k/TB vs $15k/TB for enterprise storage -- and it costs a *pittance* per year compared with licensing costs.

The only problem is backing the bugger up.

angeladur, over 11 years ago

Yes, the distinction is important: is it a "Big Data" problem, or is it a "Big" data problem?

mufumbo, over 11 years ago

Was that a job interview? You probably shouldn't use your favorite super-scripting language in a job interview that was asking for a more robust answer. They were maybe looking for someone to transform their hacky platform into something more robust.

richardlblair, over 11 years ago

Actually, it is. Thanks for the blanket statement, though.

aaron695, over 11 years ago

Why is the title "Don't use Hadoop when your data isn't that big" when the article is titled "Don't use Hadoop - your data isn't that big"?

Two totally different points.

And I'm sure it was correct to begin with.