Good article, but the point is that you should roll your own when you gain a competitive advantage by doing so, and you have the talent to actually execute. I would argue that if your situation matches these criteria, you'll already know it. If you had to read a blog post to start thinking you should write your own big data stack, you probably shouldn't.
I could not understand several points made in this article:<p>1: He mentions Infochimps, but as far as I know it's more of an eBay for datasets than a provider of (or support vendor for) a big data stack. Is it even successful? I am unsure how Infochimps relates to the big data stack at all.<p>2: From what I remember reading about 80legs, it uses distributed grid computing to run its crawlers (something like SETI@home), and I doubt Hadoop was ever designed for such applications. So this surely isn't a Hadoop use case.<p>3: Quoting:<p><pre><code> While the standard big data stack has made huge strides in making big data more accessible to everyone, it will always fall short against our stack when it comes to the cost of collecting data. We actually don’t store that much data. Because 80legs users can filter their data on the nodes, they’re able to return the minimum amount of data in their result sets. The processing (or reduction, pardon the pun) is done on the nodes. Actual result sets are very small relative to the size of the input set.
</code></pre>
Again, I am unsure how this is different from Hadoop. Hadoop is built on the same principle, "move computation closer to the data", so a crawler implemented on Hadoop (not that Hadoop was ever intended for crawling) would also process and store data locally rather than on some other node (see the sketch at the end of this comment).<p>He also mentions "We have about 50,000 computers using their excess bandwidth." 50,000?? The biggest Hadoop cluster I know of (Yahoo's) has ~10-20k nodes, and Hadoop was never meant to run at 50k-node scale for crawling. So they had no option other than building their own system, and that would still be true if they were starting today.<p>4: Quoting:<p><pre><code> One advantage is optimization — an “off-the-shelf” system is going to have some generalities built into it that can’t be optimized to fit your needs. The opportunity cost of going “standard” is a slew of competitive advantages.
</code></pre>
The only issue I can think of with Hadoop is that it's written in Java; otherwise it's an extremely extensible piece of software. Unless you are designing a real-time messaging system or a distributed system for high-frequency trading, Hadoop is good enough for most applications. And what about the cost of finding programmers good enough to build such a system? Another advantage of Hadoop is that under low load the idle nodes can be put to other work, such as processing data; with a homegrown solution that is harder to do. Also, your IP and your secret sauce aren't worth much if you don't hold solid patents on them; otherwise they mostly end up a maintenance nightmare after the original engineers cash out. And what if the big company already has a Hadoop cluster? It would be even more difficult for them to integrate with your custom computing infrastructure.<p>While I agree with the author's conclusion that a highly focused startup should build its own proprietary solution, I cannot agree with the evidence behind that argument. A grid-based crawler with 50k machines isn't something Hadoop was ever designed to support.
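To make the data-locality point above concrete: the node-side filtering the quoted passage describes is roughly what a map-only Hadoop job gives you out of the box. Here is a minimal sketch in Java; the class names, the assumed "url TAB page-content" record format, and the keyword predicate are all invented for illustration, not anything 80legs actually runs.<p><pre><code>// Hypothetical map-only job: each mapper reads pages where they are
// stored and emits only the URLs matching a predicate, so the result
// set stays tiny relative to the input. This is the same "filter on
// the nodes" idea described in the quoted passage.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FilteringCrawlJob {

  public static class FilterMapper
      extends Mapper<LongWritable, Text, Text, NullWritable> {

    // Invented predicate: keep only pages that mention a keyword.
    private static final String KEYWORD = "hadoop";

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      // Assumed record format: "url \t page-content", one per line.
      String[] parts = line.toString().split("\t", 2);
      if (parts.length == 2 && parts[1].contains(KEYWORD)) {
        ctx.write(new Text(parts[0]), NullWritable.get());
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "filtering-crawl");
    job.setJarByClass(FilteringCrawlJob.class);
    job.setMapperClass(FilterMapper.class);
    job.setNumReduceTasks(0); // map-only: no shuffle, no reducers
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
</code></pre>
With zero reducers, each mapper writes its small filtered output straight to the output directory, so almost nothing crosses the network. None of which changes my point: Hadoop handles this pattern fine; a 50k-node volunteer grid is still not what it was built for.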
<i>most true competitive advantages are operational and cultural ones, contrary to popular thinking</i><p>That's true. Technology people tend to focus on the development side, that is, on creating something new and novel rather than running and maintaining something that already exists. Business people do pay more attention to operational concerns, but from a cost-cutting perspective. In other words, both sides tend to see operations as a loss rather than an opportunity.<p>This is probably changing, though. It'll take a bit longer to know for sure, but that could be what devops is really about.
If you were considering rolling your own large storage infrastructure, a reasonable place to start might be <a href="https://spideroak.com/diy/" rel="nofollow">https://spideroak.com/diy/</a> (entirely open source).