I was recently part of a large migration that moved a significant amount of data from MongoDB to HBase. Along the way, I also spent significant time digging into Cassandra and Riak.<p>I appreciate the effort and intention behind this article, but for all practical purposes, such numbers are really not helpful. From experience, if you really want performance from HBase, you need to spend a significant amount of time working out the right way to structure your data in HBase. To name a few things:<p>* choosing the right row key that optimizes bulk scans<p>* setting optimum client and server caching based on the size of each row<p>* pre-splitting regions, and setting custom region sizes<p>You will also run into various cluster-related issues. Things don't really scale linearly as you add more nodes. You also need to consider maintenance, upgrades, backups, replication and so on.<p>If you want to choose a NoSQL database (or, for that matter, any database), spend some time thinking about whether it fits your data model and your own understanding of the technology. Performance is rarely gained by simply turning a few knobs.
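To make the row-key and pre-splitting points above a bit more concrete, here is a minimal Python sketch. It does not talk to HBase at all; it only builds key bytes and split boundaries. The key layout (`<salt>|<user_id>|<reversed timestamp>`), the bucket count and the function names are all assumptions made up for illustration, not something from the article or from any HBase API.

```python
import hashlib
from typing import List

NUM_BUCKETS = 16  # hypothetical number of salt buckets (one initial region each)
MAX_LONG = 2**63 - 1  # largest signed 64-bit value, used to reverse timestamps


def salted_row_key(user_id: str, timestamp_ms: int, buckets: int = NUM_BUCKETS) -> bytes:
    """Build a row key of the (assumed) form <salt>|<user_id>|<reversed timestamp>.

    The salt is derived from the user id, so all rows for one user share a
    prefix and can be read with a single range scan, while different users
    are spread across buckets instead of piling onto one region.
    """
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    salt = int(digest, 16) % buckets
    # Reversing the timestamp makes newer rows sort first within a user.
    reversed_ts = MAX_LONG - timestamp_ms
    return f"{salt:02d}|{user_id}|{reversed_ts:019d}".encode("utf-8")


def split_points(buckets: int = NUM_BUCKETS) -> List[bytes]:
    """Return region boundaries that place each salt bucket in its own region."""
    return [f"{b:02d}".encode("utf-8") for b in range(1, buckets)]
```

The boundaries from `split_points()` are the kind of values you would hand to whatever table-creation tool you use when pre-splitting. Whether salting by user is the right call depends entirely on your scan patterns, which is the point of the comment above.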
Note that the HBase numbers are so good because the client wasn't actually hitting the database: <a href="http://brianoneill.blogspot.com/2012/10/solid-nosql-benchmarks-from-ycsb-w-side.html" rel="nofollow">http://brianoneill.blogspot.com/2012/10/solid-nosql-benchmar...</a>
Not sure I buy the paper's analysis. They don't seem to know much about load testing -- they speak as though throughput is 'load' (rather than the multiprogramming level) and don't compute bandwidth-delay products.<p>Also, rather than using M/M/1 or some reasonable analytic model, they deliberately throttled their request rates to hold throughput constant (thereby guaranteeing different loadings for different 'benchmarks').<p>Just reading the first graph, for example, and applying Little's Law, it's pretty evident that Cassandra was loaded more heavily than the two MySQL systems, with HBase and Riak trailing.<p>Looks like HBase and Cassandra lead the pack to me, with different characteristics for different purposes.<p>Advice to authors: buy a book by Neil Gunther.
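For anyone who wants to redo the Little's Law reading themselves: the law says the mean number of requests in flight equals throughput times mean response time (N = X * R). A minimal sketch follows; the numbers in the usage note are hypothetical and are not taken from the paper's graphs.

```python
def in_flight_requests(throughput_ops_per_s: float, mean_latency_ms: float) -> float:
    """Little's Law: N = X * R.

    throughput_ops_per_s is X (completed operations per second) and
    mean_latency_ms is R (mean response time in milliseconds). The result
    is the mean number of requests in the system, i.e. the effective load.
    """
    return throughput_ops_per_s * mean_latency_ms / 1000.0
```

With throughput pinned at, say, 2000 ops/s, a system answering in 5 ms on average holds `in_flight_requests(2000, 5)` requests in flight, while one answering in 20 ms holds `in_flight_requests(2000, 20)`, four times as many. Same throughput, very different loading, which is why holding throughput constant does not make the comparison even.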
A more rigorous set of tests (including datasets that don't fit in memory, for instance) was presented at VLDB this year: <a href="http://vldb.org/pvldb/vol5/p1724_tilmannrabl_vldb2012.pdf" rel="nofollow">http://vldb.org/pvldb/vol5/p1724_tilmannrabl_vldb2012.pdf</a>
I wonder why they picked MyISAM for MySQL, since basically no one sane is using that anymore. It isn't even the default anymore, so that can't be the excuse.<p>"MyISAM caches index blocks but not data blocks."
A note for people doing benchmarks - please post your configurations for the services with your post as well - it would be even better if your tests were posted on GitHub or something similar so others can reproduce your tests and validate/update them.<p>Also, unless you're running all of the tests concurrently, we really need to see your IOPS records during the tests, since a single lagging EBS volume in a RAID-0 array will still negatively affect disk performance (either that or spring for dedicated IOPS), and thus skew your benchmarks.<p>Making sure you're not getting a spike in CPU steal time would be great as well; same reason.
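On the steal-time point, here is a rough Python sketch of how one might check it on a Linux guest by sampling the aggregate `cpu` line of `/proc/stat` before and after a run. It assumes a kernel that reports the steal column (the eighth counter on that line); the function names are made up for illustration.

```python
from typing import List


def parse_cpu_line(line: str) -> List[int]:
    """Parse the aggregate 'cpu' line of /proc/stat into its first 8 counters.

    Order: user, nice, system, idle, iowait, irq, softirq, steal.
    The trailing guest columns are skipped because guest time is already
    counted inside user time.
    """
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    counters = [int(v) for v in fields[1:9]]
    if len(counters) < 8:
        raise ValueError("this kernel does not report a steal column")
    return counters


def steal_fraction(before: str, after: str) -> float:
    """Fraction of CPU time stolen by the hypervisor between two samples."""
    deltas = [b - a for a, b in zip(parse_cpu_line(before), parse_cpu_line(after))]
    total = sum(deltas)
    return deltas[7] / total if total > 0 else 0.0


def read_cpu_line(path: str = "/proc/stat") -> str:
    """Read the first line of /proc/stat (Linux only)."""
    with open(path) as f:
        return f.readline()
```

Take one `read_cpu_line()` sample before the benchmark and one after, then pass both to `steal_fraction`. If the result is noticeably above zero, the run was sharing the physical CPU with noisy neighbours and the numbers deserve a caveat.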
Well, the data never leaves the JVM, that is why it is so "fast" - it never fsyncs.<p>What if the JVM instance crashes under load? The data is lost, but, see, it is not our fault - our code is OK.
How would you compare SQL benchmarks? Oracle, MySQL, SQL Server, PostgreSQL? You can find a specific use case where any one of them will outperform the others. A lot of DBAs assume Oracle is the most powerful, but it is also harder to manage and VERY expensive to run. I know that most NoSQL databases are free, except for support or multiple-datacenter usage - but what about the cost of DevOps? Or backup?
I guess the conclusion pretty much sums it up. Every db solution has its own advantages and disadvantages. I find many people jump on the latest bandwagon and later regret the choice, followed by a "Why we moved away from xyz database" post on their blog. Please make sure to analyze your application and future strategy fully before choosing one.
"After some of the results had been presented to the public, some observers said MongoDB should not be compared to other NoSQL databases because it is more targeted at working with memory directly. We certainly understand this, but the aim of this investigation is to determine the best use cases for different NoSQL products. Therefore, the databases were tested under the same conditions, regardless of their specifics."<p>Oh, great.
It's a shame they didn't test Couchbase 2.0 (or at least 1.8)... and really kinda silly, given that it is pretty widely used commercially. I think Couchbase may be the most successful NoSQL database when it comes to commercial installations (for large customers, that is; maybe MongoDB has more total customers).<p>Plus, I think it would have scored very well here.