Would love to know whether indexes and a sane schema were used for the RDBMS case. I've built extremely large reporting databases (using Kimball's dimensional modeling techniques) that perform exceedingly well for very ad hoc queries. If your query patterns are even somewhat predictable and occur frequently, it's far better to have a properly structured and indexed database than to use the "let's analyze every single data element every single query!" approach that is implicit in Hadoop and MR.<p>Not to mention the massive cost savings from using the right technology with a small footprint versus using a brute-force approach and a large cluster of machines.
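To make the indexing point concrete, here is a minimal sketch, assuming SQLite (via Python's `sqlite3`) as a stand-in for the RDBMS. The `sales` table, its columns, and the `idx_sales_region` index are hypothetical names made up for illustration, not anything from the benchmark. It only shows how the query plan changes from a full scan to an index lookup, and says nothing about the actual MySQL or Hive numbers.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the human-readable 'detail' column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[-1] for row in rows]


# Hypothetical fact table: 10,000 rows spread over 100 regions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("region_%d" % (i % 100), float(i)) for i in range(10000)],
)

sql = "SELECT SUM(amount) FROM sales WHERE region = ?"

# Without an index, the planner has to read every row of the table.
plan_before = query_plan(conn, sql, ("region_7",))

# With an index on the filter column, it can jump to the matching rows.
conn.execute("CREATE INDEX idx_sales_region ON sales (region)")
plan_after = query_plan(conn, sql, ("region_7",))

print("before:", plan_before)
print("after: ", plan_after)
```

The exact wording of the plan text varies between SQLite versions, but the first plan should report a scan of `sales` and the second a search using `idx_sales_region`.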
I'm confused by the assertion that Hive was "much slower than using MySQL with the same dataset." The author makes this claim, and then provides a table that shows Hive performing ~50% better than MySQL on a variety of datasets (none of which really flex the muscle of Hadoop, since none goes beyond single-digit GB).<p>Regardless, Impala sounds like it could be pretty sweet!
"These aren’t scientific benchmarks by any means (nothing’s been especially tuned or optimized)..."<p>I had to smile when I read that. Working with data, sometimes optimization or redesign can yield significant performance gains. (Especially when reworking some of my colleages queries or code...)