Data-Processing Frameworks Benchmark: Redshift, Hive, Shark, Impala

100 points by ceyhunkazel over 11 years ago

5 comments

AmiiJewels over 11 years ago
I applaud the effort, but is this really "big data"? The largest data sets they seem to test are ~150GB, which would fit comfortably on my MacBook Pro a number of times over. Many of the systems being tested are designed to scale efficiently once the data grows beyond 5TB, so I am dubious about the median response time results: things that work well for small datasets (where small is defined as < 1TB) easily fall apart when you scale them up a little bit more.
CurtMonash over 11 years ago
The author had a terrible brain cramp in the sentence "Redshift uses columnar compression which allows it to bypass a field which is not used in the query."

That totally confuses columnar compression with columnar I/O, an error I've been railing against for several years, e.g. in http://www.dbms2.com/2011/02/06/columnar-compression-database-storage/ (I.e., ever since Oracle tried to popularize the confusion.) But this is a particularly bad instance.
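To make the distinction concrete, here is a minimal sketch. It is not taken from the benchmark or from any real engine, and every name in it (`to_columns`, `rle_encode`, the sample rows) is hypothetical. Storing values column by column is what lets a query skip unused fields (columnar I/O); compressing an individual column, here with simple run-length encoding, is a separate step that only shrinks whatever column it is applied to.

```python
# Toy illustration of columnar I/O versus columnar compression.
# All names and data are hypothetical; this models no real database engine.

rows = [
    {"url": "a.com", "country": "US", "revenue": 10},
    {"url": "b.com", "country": "US", "revenue": 20},
    {"url": "c.com", "country": "DE", "revenue": 5},
    {"url": "d.com", "country": "DE", "revenue": 7},
]


def to_columns(records):
    """Columnar layout: keep each field's values together in one list."""
    return {name: [r[name] for r in records] for name in records[0]}


def rle_encode(values):
    """Run-length encode one column into [value, run_length] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs


def rle_decode(runs):
    """Expand [value, run_length] pairs back into the original column."""
    out = []
    for v, n in runs:
        out.extend([v] * n)
    return out


columns = to_columns(rows)

# Columnar I/O: SUM(revenue) reads only the "revenue" column.
# The "url" and "country" columns are never touched, compressed or not.
total_revenue = sum(columns["revenue"])

# Columnar compression: a separate concern that makes one column smaller.
country_rle = rle_encode(columns["country"])

print(total_revenue)
print(country_rle)
```

In this sketch the query skips `url` and `country` purely because of the layout; run-length encoding `country` changes how many bytes that column occupies, not which columns a query has to read.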
espeed over 11 years ago
Spark is a big deal (http://spark.incubator.apache.org/). It's a next-gen open source cluster-computing system that is part of the Berkeley Data Analytics Stack (BDAS - https://amplab.cs.berkeley.edu/software/), which includes Mesos, Spark, SparkStreaming, Shark, and GraphX (to name a few).

Mesos is the foundation of the stack, and Spark started out as a research project because they needed something to run on Mesos. But you can also run Hadoop on Mesos, and you can run Spark and Hadoop on the same Mesos cluster. Twitter runs almost everything on Mesos and works directly with AMPLab on the project.

See Benjamin Hindman's presentation on "Managing Twitter Clusters with Mesos" (http://www.youtube.com/watch?v=37OMbAjnJn0&list=PL9F5093F238695612&index=5).

SparkStreaming replaces the need for Storm and handles failures/stragglers better. As Nathan Marz, the creator of Storm, said, "Spark is interesting because it extends MapReduce with a new primitive that allows Pregel to be built on top of it. So Spark is both Hadoop and Pregel" (http://nathanmarz.com/blog/thrift-graphs-strong-flexible-schemas-on-hadoop.html#comment-334743458).

GraphX (https://amplab.cs.berkeley.edu/publication/graphx-grades/) is new, and it's GraphLab2 built on Spark, which enables fast processing of Pregel-like algorithms. GraphLab2 (http://graphlab.org/) includes a suite of machine learning tools (similar to Mahout).

Berkeley's "Analyzing Big Data with Twitter" series (http://www.youtube.com/playlist?list=PLE8C1256A28C1487F) includes a couple of presentations related to the project.

The last presentation is by the Spark lead Matei Zaharia (http://www.cs.berkeley.edu/~matei/), and he gives a good high-level overview: "Analyzing Big Data with Twitter: Spark" (http://www.youtube.com/watch?v=rpXxsp1vSEs&list=PLE8C1256A28C1487F&index=15).

There is another presentation in the series by GraphLab lead Joey Gonzalez (http://www.cs.cmu.edu/~jegonzal/): "GraphLab: Big Learning with Graphs" (http://www.youtube.com/watch?v=E1LwqtBdPYs).

See also the GraphLab paper "PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs" (https://www.usenix.org/system/files/conference/osdi12/osdi12-final-167.pdf) and presentation (https://www.usenix.org/conference/osdi12/167-powergraph-distributed-graph-parallel-computation-natural-graphs).

AMPLab has plenty of sponsors (https://amplab.cs.berkeley.edu/sponsors/); both Twitter and Yahoo are adopting it, and evidently Facebook and Amazon may too (http://www.wired.com/wiredenterprise/2013/06/yahoo-amazon-amplab-spark/all/). The more I learn about the BDAS stack, the more I think it's going to usurp Hadoop/Storm.

For more on AMPLab, see...

AMPLab Stack Presentations: http://www.youtube.com/user/BerkeleyAMPLab/feed?activity_view=5

Slides/Summaries: http://ampcamp.berkeley.edu/amp-camp-one-berkeley-2012/
ceyhunkazel over 11 years ago
For a starter guide to Amazon Redshift, there is a book: http://www.amazon.com/Getting-Started-Amazon-Redshift-Stefan/dp/1782178082/
ceyhunkazel over 11 years ago
SAP HANA would be a better option than Redshift. You can get a cloud version of HANA. It supports R, JavaScript, ArcGIS, and more SQL data types.