While there is a point to be made here, this article does not make it. Or perhaps it goes too far in attempting to make it, to the point where I feel like it might tip people in the wrong direction.

The point of the article is taken if:

A) Your data is not large.
B) You aren't creating large intermediary datasets with the data.
C) You aren't running an increasingly large number of analysis jobs on the data.
D) Your computational overhead is small.
E) Your memory overhead is small (this requires an asterisk, because some tasks that require extreme amounts of memory will not work well in Hadoop and should be run outside it).
F) You don't need or want a system to track the increasingly large number of analysis jobs you're running.
G) You can guarantee you won't outgrow A, B, or C, which would force you to rewrite all your code.

G is especially difficult because it's hard to predict. F is always underestimated at the beginning of a project and bites you later. Yes, you can write analysis scripts--but what happens when there are a hundred of them, written by different developers? Time to write a job tracking system, with counters, retries, notification, etc. Like Hadoop.

To expand on D and E: there are workloads that are relatively straightforward across terabytes of data, and there are workloads that are expensive over gigabytes of data (especially those involving the creation of intermediate indices, which is where MapReduce itself speeds things up considerably, especially when done in parallel).

Also, in a critique of Hadoop the article obsesses over MapReduce (in a way, conflating Hadoop and MapReduce, just as it conflates 'SQL' with a 'SQL database'), ignoring the increasingly powerful tools built on top of it, such as Hive, Pig, Cascading, etc. Do those tools beat a SQL database in flexibility? The question is not really relevant. If you already understand the nature of your data, and you've gone through the very difficult act of designing a normalized schema that fits what you need, then you're in a good place. If you have a chunk of data whose potential has not yet been unlocked, or to which writes happen too quickly to justify the live indexing implied by a database, then Hadoop is an essential tool. They really sit next to one another.

None of this is to knock writing analysis scripts against local data. I do that all the time. In fact, often I'll ship data from HDFS to the local system so I can write and run a script. I just think it's important at a company to make sure your people have access to good tools so there aren't hurdles in front of them, and when it comes to data analysis I've come to the opinion that you really want a Hadoop cluster set up next to your SQL databases and your other tooling, because it will become useful in sometimes unpredictable ways.

Yes, if there are a few hundred megabytes in front of you and you need to analyze them, then write a script--and were I interviewing someone for a job, I would not hesitate to accept a script that solves a data analysis task, so the people the author interacted with were clearly somewhat myopic. But most companies will need to build an ecosystem to handle the increasing complexity that ensues over the years. And Hadoop is a huge bootstrap to that ecosystem, regardless of data size.
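To make the "write a script" advice concrete, here is a minimal sketch of what I have in mind, in Python. The file name, the tab-separated field layout, and the events-per-user counting task are all made up for illustration; the point is that the same handful of lines of per-record logic can run locally against a few hundred megabytes, or serve as a Hadoop Streaming mapper once the data outgrows one machine and you want the cluster handling scheduling, retries, and parallelism for you.

    #!/usr/bin/env python
    # count_events.py -- hypothetical example: count events per user from
    # tab-separated log lines of the form "<user_id>\t<event>\t<timestamp>".
    import sys
    from collections import Counter

    def records(stream):
        """Yield (user_id, event) pairs from tab-separated lines, skipping junk."""
        for line in stream:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 2:
                yield fields[0], fields[1]

    def local_report(path, top_n=20):
        """Local-script mode: aggregate in memory, print the busiest users."""
        with open(path) as f:
            counts = Counter(user for user, _ in records(f))
        for user, n in counts.most_common(top_n):
            print("%s\t%d" % (user, n))

    def streaming_map(stream):
        """Hadoop Streaming mapper mode: emit 'user_id<TAB>1' per record;
        a trivial summing reducer finishes the job."""
        for user, _ in records(stream):
            print("%s\t1" % user)

    if __name__ == "__main__":
        if len(sys.argv) > 1:
            local_report(sys.argv[1])   # e.g. ./count_events.py events.tsv
        else:
            streaming_map(sys.stdin)    # roughly: hadoop jar hadoop-streaming.jar
                                        #   -mapper count_events.py ...

That's the kind of thing I'm happy to see in an interview; the Hadoop cluster earns its keep later, when there are a hundred scripts like this written by different people and something has to schedule, track, and retry them.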