TechEcho

meritt over 12 years ago

Would love to see if indexes and a sane schema were used for the RDBMS case. I've built extremely large reporting databases (Dimensional Modeling techniques from Kimball) that perform exceedingly well for very ad hoc queries. If your query patterns are even somewhat predictable and occur frequently, it's far better to have a properly structured and indexed database than to use the "let's analyze every single data element every single query!" approach that is implicit with Hadoop and MR.

Not to mention the massive cost savings from using the right technology with a small footprint versus using a brute-force approach and a large cluster of machines.
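A minimal sketch of the indexing point, not the setup from the article: it uses Python's built-in `sqlite3` with a hypothetical `fact_sales` table and a hypothetical `idx_sales_date` index, and compares the query plan for the same aggregate before and after the index exists. The table, column names, and row counts are assumptions made up for illustration.

```python
import sqlite3


def plan(conn, sql, params=()):
    """Return the 'detail' column of SQLite's EXPLAIN QUERY PLAN output."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[3] for row in rows]


conn = sqlite3.connect(":memory:")

# Hypothetical fact table in the spirit of a dimensional model:
# one row per (day, product) with a single numeric measure.
conn.execute(
    "CREATE TABLE fact_sales (date_id INTEGER, product_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO fact_sales VALUES (?, ?, ?)",
    [(d, p, 1.0) for d in range(365) for p in range(20)],
)

query = "SELECT SUM(amount) FROM fact_sales WHERE date_id = ?"

# Without an index, the only option is to scan every row.
plan_before = plan(conn, query, (42,))

# With an index on the filter column, SQLite can seek to the matching rows.
conn.execute("CREATE INDEX idx_sales_date ON fact_sales (date_id)")
plan_after = plan(conn, query, (42,))

total = conn.execute(query, (42,)).fetchone()[0]

print(plan_before)
print(plan_after)
print(total)
```

The exact wording of the plan text varies between SQLite versions, but the first plan should describe a scan of `fact_sales` and the second a search using `idx_sales_date`. That is the small-scale version of a predictable query pattern being served by structure instead of brute force.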


zachrose over 12 years ago

Naive question: What does analyzing big data sets get you that sampling doesn't?


zwass over 12 years ago

I'm confused by the assertion that Hive was "much slower than using MySQL with the same dataset." The author makes this claim, and then provides a table that shows Hive performing ~50% better than MySQL on a variety of datasets (none of which really flex the muscle of Hadoop in operating on data sets going beyond single-digit GB).

Regardless, Impala sounds like it could be pretty sweet!

xradionut over 12 years ago

"These aren’t scientific benchmarks by any means (nothing’s been especially tuned or optimized)..."

I had to smile when I read that. Working with data, sometimes optimization or redesign can yield significant performance gains. (Especially when reworking some of my colleagues' queries or code...)

How I came to love big data

4 comments
