TechEcho

8 comments

TYabout 12 years ago

Redshift is based on ParAccel, not on Postgres. ParAccel uses APIs similar to Postgres due to historical reasons, but not the technology.For a basic overview: <a href="http://en.wikipedia.org/wiki/Paraccel" rel="nofollow">http://en.wikipedia.org/wiki/Paraccel</a>As for the rest of the article, it feels like a basic Data Warehousing 101 re-discovered. It should have been titled "Analytics: Back To The Future" :-)

评论 #5281421 未加载

评论 #5281920 未加载

评论 #5293829 未加载

panarkyabout 12 years ago

The article mentions this briefly, but it should be emphasized: parallel loading from S3 is MUCH faster.This weekend I loaded 2 billion rows from S3 both ways:- From a single gzipped object: 4 hours 42 minutes- From 2000 gzipped slices of 1M rows each: 17 minutes(Loading from gzipped files is considerably faster, in addition to saving S3 charges.)The article notes that choice of distribution key is critical. I'd add that choice of sort key is equally important. In my testing, a better sort key improved compression from 1.5:1 to 4:1, and also made common queries 5x faster.Unfortunately, you only get one dist key and one sort key per table, so less common queries could get slower.

评论 #5284492 未加载

csarvaabout 12 years ago

This is the second article I've seen where the authors forget to multiply by the number of redshift nodes. A single XL node is $0.85/hr so 16 nodes would be $13.60/hr. Still cheaper than their Hive configuration obviously but less than a buck?

评论 #5281673 未加载

评论 #5281661 未加载

monstradoabout 12 years ago

Which storage format did you use for Hive? This is very important to how performance plays out, are you using snappy or LZO compression? Also, this is a relevant comment from a Hive committer (<a href="http://news.ycombinator.com/item?id=5248485" rel="nofollow">http://news.ycombinator.com/item?id=5248485</a>).

twogabout 12 years ago

Still running on posterous. Weird to think that this is probably the last posterous blog post I will read.

polskibusabout 12 years ago

It's good to see a Redshift evaluation. I'm wondering how does Redshift compare to hadoop airbnb setup when taking data loading and transformation into consideration as well as running aggregate queries? I mean if you want to run analysis fairly often, do you need to reload everything in Redshift? From maintenance point of view, is Hadoop setup more flexible and cheaper than Redshift?

评论 #5294159 未加载

serbautabout 12 years ago

The first query seems awfully slow. I have a six node vertica cluster with a 100 column table with 7Bn rows in it and a similar query takes less than 3 seconds.

评论 #5282463 未加载

jacques_chesterabout 12 years ago

Dumb question: was the data ETL'd into a star schema first? That can make a big difference, especially in columnar stores.

评论 #5294486 未加载

8 comments

TYabout 12 years ago

评论 #5281421 未加载

评论 #5281920 未加载

评论 #5293829 未加载

panarkyabout 12 years ago

评论 #5284492 未加载

csarvaabout 12 years ago

评论 #5281673 未加载

评论 #5281661 未加载

monstradoabout 12 years ago

twogabout 12 years ago

Still running on posterous. Weird to think that this is probably the last posterous blog post I will read.

polskibusabout 12 years ago

评论 #5294159 未加载

serbautabout 12 years ago

The first query seems awfully slow. I have a six node vertica cluster with a 100 column table with 7Bn rows in it and a similar query takes less than 3 seconds.

评论 #5282463 未加载

jacques_chesterabout 12 years ago

Dumb question: was the data ETL'd into a star schema first? That can make a big difference, especially in columnar stores.

评论 #5294486 未加载

Redshift Performance and Cost

8 comments

Redshift Performance and Cost

8 comments