Processing 6 billion records in 4 seconds on a home PC

96 points by rck over 11 years ago

9 comments

aiiane over 11 years ago

Hm. I'm looking at this, and "6 billion in 4 seconds" seems misleading - the test it appears to refer to is "Query 6", which (a) only examines records from 1994, (b) on a table with entries sorted by timestamp, such that (c) only the portions of the data that fall in the correct time range are actually sent for processing.

In other words, it's not actually looking at a full 6 billion records for that query. More accurate would be the next query discussed, "Query 1", which takes 72 seconds to look over a much more significant portion of those 6 billion records.

It's still a pretty impressive set of numbers (as one would expect from GPU SIMD processing), but it irks me when short descriptions bend the facts to sound more significant. (Not to mention the disk-time subtraction.)
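To spell out the sortedness point, here is an illustrative Python sketch (not the article's code; the predicate is the standard TPC-H Q6 filter). On a table pre-sorted by ship date, two binary searches bound the 1994 slice, so rows outside it are never examined at all:

```python
from bisect import bisect_left

def query6_on_sorted(ship_dates, prices, discounts, quantities,
                     lo="1994-01-01", hi="1995-01-01"):
    """TPC-H-Q6-style aggregate over columns pre-sorted by ship date.

    ISO date strings compare chronologically, so two binary searches
    bound the slice; rows outside [lo, hi) are never touched.
    """
    start = bisect_left(ship_dates, lo)
    end = bisect_left(ship_dates, hi)
    revenue = 0.0
    for i in range(start, end):
        # Standard Q6 predicate: discount in [0.05, 0.07], quantity < 24.
        if 0.05 <= discounts[i] <= 0.07 and quantities[i] < 24:
            revenue += prices[i] * discounts[i]
    return revenue, end - start  # revenue plus rows actually examined
```

Only `end - start` rows are scanned, which is the commenter's point: on sorted data, "6 billion records" shrinks to the 1994 slice before any per-row work happens.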
MichaelGG over 11 years ago

"time is counted as total processing time minus disk time."

Can anyone explain why that's a valid benchmark for him to use? Surely the Hadoop version had significant disk access time?
idle_processor over 11 years ago

> NVidia Titan GPU: it is a relatively cheap, massively parallel GPU

Relatively cheap compared to having to build clusters, perhaps, but $1000 isn't cheap for a desktop computing GPU. A mid-high-tier card (GeForce GTX 770) is closer to $400; a mid-range gaming card (GTX 760) is closer to $260.

Those finding the topic link interesting may also be interested in this CUDA radix sorting article[0] from 2010, which featured "one billion 32-bit keys sorted per second."

[0] https://code.google.com/p/back40computing/wiki/RadixSorting
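For the curious, the core of the linked article can be sketched on the CPU as a plain LSD radix sort over 8-bit digits (illustrative Python only; the GPU version parallelizes the histogram, prefix-sum, and scatter phases across thread blocks):

```python
def radix_sort_u32(keys):
    """LSD radix sort of unsigned 32-bit keys: four counting-sort
    passes, one per 8-bit digit, from least to most significant."""
    for shift in (0, 8, 16, 24):
        # Histogram: count occurrences of each digit value.
        counts = [0] * 256
        for k in keys:
            counts[(k >> shift) & 0xFF] += 1
        # Exclusive prefix sum: starting offset of each digit's bucket.
        total = 0
        for d in range(256):
            counts[d], total = total, total + counts[d]
        # Stable scatter into a new array by digit.
        out = [0] * len(keys)
        for k in keys:
            d = (k >> shift) & 0xFF
            out[counts[d]] = k
            counts[d] += 1
        keys = out
    return keys
```

Each pass is stable, so after the fourth (most significant) digit pass the keys are fully sorted; the O(n) passes with no comparisons are what make the approach such a good fit for SIMD hardware.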
DannyBee over 11 years ago

"* - time is counted as total processing time minus disk time."

So, in other words: I subtracted time that both systems *actually* have to spend, for no good reason. And to make the results look even better, I only subtracted it from my database, instead of running the tests myself and subtracting it from both.
staunch over 11 years ago

There's nothing about the Hadoop model that precludes the use of GPUs instead of CPUs. Hadoop solves the problem of storing massive quantities of data and processing it using a large number of machines. There's no reason the processing can't be done using GPUs.
cartick over 11 years ago

First, Hadoop has two parts: HDFS and MapReduce. This so-called benchmark compares only the computation part. People who say Hadoop is slow never really understood what Hadoop is: MapReduce is meant for processing big data in a batch-oriented way, not for real-time analytics. There are, however, many technologies on top of Hadoop that provide real-time analytics capabilities, like HBase and Impala. Column-oriented storage is available in Hadoop too (Parquet). Also, with Hadoop, the real power comes from the availability of UDFs and streaming. Please don't do a stupid benchmark like this without getting to know what you are comparing against.
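To make the distinction concrete, MapReduce's programming model fits in a few lines. This is a toy in-memory sketch; real Hadoop distributes these same phases across machines and spills intermediate data to disk between them, which is exactly why it is batch-oriented rather than interactive:

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    """Toy in-memory MapReduce: map each record to (key, value) pairs,
    shuffle (group) by key, then reduce each key's value list."""
    shuffled = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):   # map phase
            shuffled[key].append(value)      # shuffle phase
    return {key: reduce_fn(key, values)      # reduce phase
            for key, values in shuffled.items()}

# The classic word count expressed in the model:
counts = map_reduce(
    ["big data", "big batch"],
    map_fn=lambda line: [(w, 1) for w in line.split()],
    reduce_fn=lambda word, ones: sum(ones),
)
```

The benchmark in the article times only an analogue of the map/reduce computation, none of the distributed storage, shuffle, or fault-tolerance machinery that the model exists to provide.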
brandynwhite over 11 years ago

I've been using Hadoop for 4 years now (author of hadoopy.com), so I'll chime in. I'll state the use cases Hadoop/MapReduce (and, to a close approximation, the ecosystem around them) were developed for, so that we're on the same page: 1) save developer time at the expense of inefficiency (compared to custom systems), 2) really huge data (several petabytes), 3) unstructured data (e.g., webpages), 4) fault tolerance, 5) shared cluster resources, and 6) horizontal scalability. Basically, people already had that and wanted easier queries, so it's been pulled that way for the second generation: 1) Pig/Hive and 2) Impala and others.

Of the 6 design considerations I listed, none are really addressed here. If you outgrow a single GPU, you take a huge performance penalty to grow (that's vertical growth). If you want to write your own operations (very common), this would be impractical.

It's a nice idea, but it'd be better to compare against things like MemSQL, which have been designed from first principles for fast SQL processing. I'd recommend dropping the Hadoop/HBase comparisons and comparing within the same class; Hive is embarrassingly slow even in the class it's in (compare it to Google's Dremel/F1 or Apache Impala).
quizotic over 11 years ago

The two most interesting things about this article, to me, were unstated.

1. The TPC-H benchmark is measured in price-for-performance ($/QphH, or dollars per query-per-hour). At 4 seconds for Q6, he's getting ~900 queries per hour. The cost of his rig is probably ~$2k, so he's around $2 per QphH. The top TPC-H scores are around $0.10, but <$10 is pretty good for a first go.

2. The standard knock against GPU processing is the time it takes to load GPU memory. GPU processing may be blazing once the data is in memory, but there was an MIT paper last year claiming you couldn't load the GPU fast enough to keep up. Evidently, he's keeping up.

With regard to comparing his performance to Hadoop/Hive - yeah, it's apples and oranges, but he's in good company. Hadapt, Hortonworks Stinger, Cloudera Impala, Spark/Shark and others all rate themselves on how many times faster they are than Hive.

And frankly, I don't buy the whole "the point of MR is huge, horizontally scaling networks" argument. If you factor out Yahoo!, Facebook, Amazon, LinkedIn and a few others, the largest remaining Hadoop clusters are all WELL south of 1000 nodes. And most run on homogeneous high-end hardware.
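Spelling out the arithmetic in point 1 (rough numbers: real $/QphH uses TPC-H's full composite metric and audited system pricing, and the $2k rig cost is the commenter's own estimate):

```python
# Back-of-the-envelope price/performance from the comment's figures.
SECONDS_PER_HOUR = 3600
q6_seconds = 4          # reported Q6 runtime
rig_cost_usd = 2000     # commenter's estimate of the hardware cost

queries_per_hour = SECONDS_PER_HOUR / q6_seconds     # 900.0
dollars_per_qph = rig_cost_usd / queries_per_hour    # ~2.22

print(f"{queries_per_hour:.0f} QphH, ${dollars_per_qph:.2f}/QphH")
```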
shousper over 11 years ago

So, I found this from back in 2011: http://www.tomshardware.com/news/ibm-patent-gpu-accelerated-database-cuda,13866.html However, I couldn't find any commercial or even active open-source projects on this topic. It seems like something that would be valuable to businesses working with big data, so what's the hold-up? Has nobody reached this scale yet? Is it still too expensive? I don't get it... maybe I'm overthinking it.