We have ~80TB of (compressed) data in Snowflake at Celtra, and I work with Snowflake on a daily basis.
We've been running it in production for about a year. Overall maintenance is minimal and the product is
very stable.<p>Pros:<p><pre><code> - Support for semi-structured nested data (think JSON, Avro, Parquet) and for querying it in-database
with custom operators (first sketch after this list)
 - Separation of compute from storage. Since S3 is used for storage, you can spawn as many compute
clusters as needed - no contention for resources.
 - CLONE capability. Snowflake lets you do a zero-copy CLONE, which copies only the metadata,
not the actual data (you can clone a whole database, a particular schema, or a particular table). This is
particularly useful for QA scenarios, because you don't need to retain/backup/copy over a large table -
you just CLONE it and run your ALTERs on the clone (second sketch below). Truth be told, there are some
privilege bugs there, but I've already reported those and Snowflake is working on them.
 - Support for UDFs, including JavaScript UDFs (third sketch below). We had to do a full rewrite of an
~80TB table, and being able to do that without copying data out of Snowflake was a massive gain.
 - Pricing model. We did not like BigQuery's query-based model much, because it makes costs harder to
control. Storage on Snowflake costs the same as S3 ($27/TB compressed), whereas BigQuery charges for
scans of uncompressed data.
- Database-level atomicity and transactions (instead of table-level on BigQuery)
- Seamless S3 integration. With BigQuery, we'd have to copy all data over to GCS first.
 - JDBC/ODBC connectivity. At the time we were evaluating Snowflake vs. BigQuery (~1.5 years ago),
BigQuery didn't support JDBC.
- You can define separate ACLs for storage and compute
 - Snowflake was faster when the amount of data scanned was small (GBs)
 - Concurrent DML (inserting into the same table from multiple processes - locking happens at the
partition level)
- Vendor support
- ADD COLUMN, DROP COLUMN, RENAME all work as you would expect from a columnar database
 - Some cool in-database analytics functions, like aggregatable HyperLogLog sketches (last sketch below)
</code></pre>
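<p>To make the semi-structured point concrete, here's a minimal sketch of querying nested JSON, assuming a
hypothetical "events" table with a VARIANT column "payload" (all names are made up for illustration):<p><pre><code> -- The colon operator drills into nested fields; :: casts to a SQL type.
 SELECT payload:session.id::STRING AS session_id,
        f.value:name::STRING      AS event_name
 FROM events,
      LATERAL FLATTEN(input => payload:items) f  -- un-nests the items array
 WHERE payload:platform::STRING = 'ios';
</code></pre>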
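<p>The QA flow with CLONE looks roughly like this (a sketch; database/table/column names are hypothetical):<p><pre><code> -- Zero-copy clone: only metadata is copied, so it's near-instant and uses
 -- no extra storage until the clone and the original start to diverge.
 CREATE TABLE analytics.events_qa CLONE analytics.events;

 -- Run the destructive experiments on the clone, not the original.
 ALTER TABLE analytics.events_qa DROP COLUMN legacy_flag;
</code></pre>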
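<p>And a minimal JavaScript UDF sketch (a made-up function, not the actual rewrite we ran; note that inside
the body the argument is referenced in uppercase):<p><pre><code> CREATE OR REPLACE FUNCTION strip_trailing_slash(url STRING)
 RETURNS STRING
 LANGUAGE JAVASCRIPT
 AS $$
   // Argument "url" is exposed to JavaScript as URL.
   return URL.replace(/\/+$/, "");
 $$;

 SELECT strip_trailing_slash('https://example.com/path///');  -- => https://example.com/path
</code></pre>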
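<p>The useful part of the HyperLogLog support is that the sketches are mergeable: you can pre-aggregate one
per day and still get cross-day distinct counts later. A sketch of the idea, assuming Snowflake's
HLL_ACCUMULATE/HLL_COMBINE/HLL_ESTIMATE family of functions (table/column names are made up):<p><pre><code> -- Pre-aggregate one HLL state per day...
 CREATE TABLE daily_uniques AS
 SELECT event_date, HLL_ACCUMULATE(user_id) AS hll_state
 FROM events
 GROUP BY event_date;

 -- ...then merge the states for an approximate cross-day distinct count.
 SELECT HLL_ESTIMATE(HLL_COMBINE(hll_state)) AS total_uniques
 FROM daily_uniques;
</code></pre>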
Cons:<p><pre><code> - Nested data is not first-class. It's supported via the semi-structured VARIANT data type, but there is
no schema when you use it. So you can't have nested data and a defined schema at the same time; you
have to pick one.
 - Snowflake uses a proprietary storage format and you can't access the data directly (even though it
sits on S3). For example, when using the Snowflake-Spark connector, a lot of data copying goes on:
S3 -> Snowflake -> S3 -> Spark cluster, instead of just S3 -> Spark cluster.
- BigQuery was faster for full table scans (TBs)
 - Does not release locks if the connection drops. It's a pain to handle that yourself, especially if
you don't control the clients that get killed.
 - No indexes, and no materialized views. Snowflake lets you define clustering keys, which retain sort
order (not global!), but the feature has certain bugs and we haven't been using it seriously yet (DDL
sketch below). In particular, it doesn't seem suited to small tables or to tables with frequent small
inserts, as it doesn't do file compaction (the number of files just grows, which hurts performance).
 - Which brings me to the next point: if your use case is more streaming in nature (more frequent but
smaller inserts), I don't think Snowflake would handle it well. For one use case, we insert every
minute, and we're having problems with the number of files. For another, we ingest once per hour, and
that works okay.
</code></pre>
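<p>For reference, defining a clustering key is plain DDL (column names here are hypothetical):<p><pre><code> -- Snowflake then keeps micro-partitions approximately sorted by this key
 -- in the background; the sort is best-effort, not global.
 ALTER TABLE events CLUSTER BY (event_date, campaign_id);
</code></pre>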
Some (non-obvious) limitations:<p><pre><code> - 50 concurrent queries/user
- 150 concurrent queries/account
 - streaming use cases (see above)
 - 1s connection times with the ODBC driver (JDBC seems to be better)
</code></pre>
If you decide on Snowflake, or if you have more questions, I'm happy to help with specific use cases.