Parallel DBMSs faster than MapReduce

32 points by finin about 16 years ago

9 comments

andr about 16 years ago
They are different beasts really. It'd be hard to implement something complex, like the PageRank algorithm, in SQL. On the other hand, MapReduce is not designed for tasks such as "SELECT * FROM table WHERE field > 5", which is one of the benchmarks the paper uses.

Also, parallel DBMSs don't handle failure well. Ask any of the MapReduce developers at Google and they'll tell you failure handling is the hardest part of the problem.

With all that said, I'd be very interested to see a comparison between Vertica, DBMS-X and MySQL, running in sharded or MySQL Cluster mode.

PS The paper is coauthored by Microsoft.
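The selection benchmark andr mentions really is awkward in MapReduce: it degenerates into a map-only job. A minimal sketch of that shape, with hypothetical field names and a plain in-memory "table" standing in for a real MapReduce framework:

```python
# Minimal sketch: the "SELECT * FROM table WHERE field > 5" benchmark
# expressed as a map-only job. The record layout and field names are
# hypothetical; a real job would run under Hadoop Streaming or similar.

def map_select(record: dict) -> list:
    """Map step: emit the record unchanged if it passes the predicate,
    emit nothing otherwise. No reduce step is needed for a pure filter."""
    return [record] if record["field"] > 5 else []

# Simulate the map phase over an in-memory "table".
table = [{"id": 1, "field": 3}, {"id": 2, "field": 7}, {"id": 3, "field": 9}]
selected = [out for rec in table for out in map_select(rec)]
```

The point of the sketch is how little of the MapReduce machinery a simple selection actually uses, which is why a parallel DBMS wins that benchmark.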
codeslinger about 16 years ago
Did no one else notice that this paper was authored by the C-Store guys, the ones that went on to create and found Vertica (specifically Stonebraker)? They have an intense bias against map/reduce, as could be expected.

As well, one important fact that people seem to have missed is that they admit that m/r was better at loading and transforming the data, a task which under its usual name (ETL) dominates most data warehousing workloads.

Finally, please note the price tags on the solutions advocated by this paper. They are *far* out of the reach of any startup founders reading this paper. I looked at them at both Commerce360 and Invite Media and they were just nowhere near reasonable enough for us to replace our existing Hadoop infrastructure with.
vicaya about 16 years ago
The paper is written to sell Vertica, which is quickly becoming less appealing given the rise of open source parallel DBs like HBase and Hypertable.

Hadoop is built for high throughput, not low latency. If you read the paper, Hadoop runs circles around Vertica etc. for data loading.

It's entirely possible to build a low-latency map-reduce implementation.
strlen about 16 years ago
Keep in mind that the article itself admits that loading the data into an RDBMS took *much longer* than loading it onto a Hadoop cluster.

That's the main crux: if your data isn't relational to start with (e.g. messages in a message queue, being written to files in HDFS), you're better off with Map/Reduce. If your data is relational to start with, you're better off doing traditional OLAP on an RDBMS cluster.

When I worked for a start-up ad network, we couldn't afford the latency/contention implications of writing to a database each time we served an ad. Initially, we moved the model to "log impressions to text files, rsync the text files over, run a cron job to enter them into an RDBMS". The problem was that processing the data (on a single node) was taking so long that it would no longer be meaningful by the time it was in the RDBMS (and the OLAP queries had finished running).

I thought about how to ensure that a specific machine only processes a specific log file, or a specific section of a log file -- but then I realized that's what the map part of map/reduce is. In the end I set up a Hadoop cluster, which made both "soft real time" (hourly/bi-hourly) data processing and long-term queries (running on multi-month, multi-terabyte datasets) much faster and easier.

Theoretically, yes: an RDBMS is the most efficient way to run complex analytics queries. Practically, we have to handle issues such as asynchronicity, contention, ease of scalability, and getting the data into a relational format (which itself is something that can be done using Hadoop: one of the tasks we used it for was transforming all sorts of text data -- crawled webpages, log files -- into CSV that could be loaded onto the OLAP/OLTP clusters with "LOAD DATA INFILE").
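The log-processing pattern strlen describes ("a specific machine only processes a specific log file" is the map; the summing is the reduce) can be sketched in a few lines of Python. The log-line format and ad IDs here are hypothetical, and a dict stands in for the shuffle/reduce machinery a real Hadoop job would provide:

```python
# Sketch of per-ad impression counting as map/reduce. Each mapper would
# own one slice of the logs; the reduce step sums (ad_id, 1) pairs.
from collections import defaultdict
from itertools import chain

def map_impressions(log_line):
    """Map: parse one log line ("timestamp ad_id country") into (ad_id, 1)."""
    ad_id = log_line.split()[1]
    yield (ad_id, 1)

def reduce_counts(pairs):
    """Reduce: sum the counts emitted for each ad_id."""
    totals = defaultdict(int)
    for ad_id, n in pairs:
        totals[ad_id] += n
    return dict(totals)

logs = ["t1 ad42 us", "t2 ad7 uk", "t3 ad42 de"]
counts = reduce_counts(chain.from_iterable(map_impressions(l) for l in logs))
```

Because `map_impressions` touches each line independently, the input can be split across machines arbitrarily, which is exactly the property that removed the single-node bottleneck in the cron-job pipeline.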
strlen about 16 years ago
Another issue:

CAP Theorem: Consistency, Availability and (network) Partition tolerance: pick two.

RDBMSs are excellent at providing consistency and availability with low latency. Yet imagine building a massively parallel web map/crawling infrastructure on top of an RDBMS -- particularly with multiple datacenters and high-latency links between them: it may be feasible for an enterprise/intranet crawler, but not for something at the scale of Yahoo or Google search (both of which build up their *datasets/indices* using map/reduce and then distribute those indices to search nodes for low-latency querying via more traditional approaches).
mjs about 16 years ago
The conclusion is a pretty good summary, if you don't want to read the whole lot.

A big thing that's missing from parallel DBMSs is fault tolerance, which the paper cops to (but also diminishes):

"MR is able to recover from faults in the middle of query execution in a way most parallel database systems cannot. ... If a MR system needs 1,000 nodes to match the performance of a 100 node parallel database system, it is ten times more likely that a node will fail while a query is executing. That said, better tolerance to failures is a capability that any database user would appreciate."

Also: "Although we were not surprised by the relative performance advantages provided by the two parallel database systems, we were impressed by how easy Hadoop was to set up and use in comparison to the databases."
wmf about 16 years ago
Previous HN discussions:

http://news.ycombinator.com/item?id=562324

http://news.ycombinator.com/item?id=562732
rgrieselhuber about 16 years ago
Speed isn't necessarily the primary advantage that MapReduce provides; it's more about shifting the burden of scaling backend processes from the programmer to the infrastructure.

That said (caveat: I don't have a lot of real-world MapReduce experience), in my line of work I do a lot of analysis at less than web scale (there's still a lot of business value there too, folks!), and the thought of writing a bunch of MapReduce jobs just to achieve the analytical capabilities of vanilla SQL is not a pleasant one.

If people have experience getting around this, I'd love to hear about it.
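For concreteness, the "vanilla SQL" capability rgrieselhuber is reluctant to reimplement as MapReduce jobs is typically a one-line aggregate. A sketch using the standard-library sqlite3 module, with a hypothetical table and columns:

```python
import sqlite3

# A grouped aggregate that is one line of SQL but would otherwise need a
# custom map/reduce job (map: emit (campaign, revenue); reduce: sum).
# Table name, columns, and sample rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clicks (campaign TEXT, revenue REAL)")
conn.executemany("INSERT INTO clicks VALUES (?, ?)",
                 [("a", 1.0), ("a", 2.0), ("b", 5.0)])
rows = conn.execute(
    "SELECT campaign, SUM(revenue) FROM clicks "
    "GROUP BY campaign ORDER BY campaign").fetchall()
```

Tools that compile SQL-like queries down to MapReduce jobs (Hive and Pig were the usual answers at the time) exist precisely to close this gap.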
cnlwsu about 16 years ago
The empirical evidence Google provides gives a strong argument that map reduce can perform pretty well: "Results 1 - 10 of about 1,990,000,000 ... (0.13 seconds)". I'd say it really comes down to preference, the data set, and how you plan to handle scaling/development.