We have ~80TB of (compressed) data in Snowflake at Celtra, and I work with Snowflake on a daily basis.
We've been running it in production for about a year. Overall maintenance is minimal and the product is
very stable.<p>Pros:<p><pre><code> - Support for semi-structured nested data (think JSON, Avro, Parquet) and for querying it in-database
with custom operators (first sketch after this list)
 - Separation of compute from storage. Since S3 is used for storage, you can spawn as many compute
clusters as needed - no contention for resources.
 - CLONE capability. Snowflake lets you do a zero-copy CLONE, which copies only the metadata,
not the actual data (you can clone a whole database, a particular schema, or a particular table). This is
particularly useful for QA scenarios, because you don't need to retain/backup/copy over a large table -
you just CLONE it and run your ALTERs on the clone (second sketch below). Truth be told, there are some
privilege bugs there, but I've already reported those and Snowflake is working on them.
 - Support for UDFs, including JavaScript UDFs (third sketch below). We had to do a full rewrite of an
~80TB table, and being able to do that without copying data out of Snowflake was a massive gain.
 - Pricing model. We did not like BigQuery's query-based model much, because it makes costs harder to
control. Storage on Snowflake costs the same as S3 ($27/TB compressed), whereas BigQuery charges for
scans of uncompressed data.
- Database-level atomicity and transactions (instead of table-level on BigQuery)
- Seamless S3 integration. With BigQuery, we'd have to copy all data over to GCS first.
 - JDBC/ODBC connectivity. At the time we were evaluating Snowflake vs. BigQuery (~1.5 years ago),
BigQuery didn't support JDBC.
- You can define separate ACLs for storage and compute
 - Snowflake was faster when the amount of data scanned was small (GBs)
 - Concurrent DML (inserting into the same table from multiple processes - locking happens at the
partition level)
- Vendor support
- ADD COLUMN, DROP COLUMN, RENAME all work as you would expect from a columnar database
 - Some cool in-database analytics functions, like aggregatable HyperLogLog sketches (last sketch below)
</code></pre>
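<p>To make the semi-structured point concrete, here's a minimal sketch of querying nested JSON, assuming a
hypothetical "events" table with a VARIANT column "payload" (all names are made up for illustration):<p><pre><code> -- The colon operator drills into nested fields; :: casts to a SQL type.
 SELECT payload:session.id::STRING AS session_id,
        f.value:name::STRING      AS event_name
 FROM events,
      LATERAL FLATTEN(input => payload:items) f  -- un-nests the items array
 WHERE payload:platform::STRING = 'ios';
</code></pre>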
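<p>The QA flow with CLONE looks roughly like this (a sketch; database/table/column names are hypothetical):<p><pre><code> -- Zero-copy clone: only metadata is copied, so it's near-instant and uses
 -- no extra storage until the clone and the original start to diverge.
 CREATE TABLE analytics.events_qa CLONE analytics.events;

 -- Run the destructive experiments on the clone, not the original.
 ALTER TABLE analytics.events_qa DROP COLUMN legacy_flag;
</code></pre>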
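<p>And a minimal JavaScript UDF sketch (a made-up function, not the actual rewrite we ran; note that inside
the body the argument is referenced in uppercase):<p><pre><code> CREATE OR REPLACE FUNCTION strip_trailing_slash(url STRING)
 RETURNS STRING
 LANGUAGE JAVASCRIPT
 AS $$
   // Argument "url" is exposed to JavaScript as URL.
   return URL.replace(/\/+$/, "");
 $$;

 SELECT strip_trailing_slash('https://example.com/path///');  -- => https://example.com/path
</code></pre>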
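<p>The useful part of the HyperLogLog support is that the sketches are mergeable: you can pre-aggregate one
per day and still get cross-day distinct counts later. A sketch of the idea, assuming Snowflake's
HLL_ACCUMULATE/HLL_COMBINE/HLL_ESTIMATE family of functions (table/column names are made up):<p><pre><code> -- Pre-aggregate one HLL state per day...
 CREATE TABLE daily_uniques AS
 SELECT event_date, HLL_ACCUMULATE(user_id) AS hll_state
 FROM events
 GROUP BY event_date;

 -- ...then merge the states for an approximate cross-day distinct count.
 SELECT HLL_ESTIMATE(HLL_COMBINE(hll_state)) AS total_uniques
 FROM daily_uniques;
</code></pre>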
Cons:<p><pre><code> - Nested data is not first-class. It's supported via the semi-structured VARIANT data type, but there is
no schema when you use it. So you can't have nested data and a defined schema at the same time; you
have to pick one.
 - Snowflake uses a proprietary storage format and you can't access the data directly (even though it
sits on S3). For example, when using the Snowflake-Spark connector, a lot of data copying goes on:
S3 -> Snowflake -> S3 -> Spark cluster, instead of just S3 -> Spark cluster.
- BigQuery was faster for full table scans (TBs)
 - Does not release locks if the connection drops. It's a pain to handle that yourself, especially if
you don't control the clients that get killed.
 - No indexes, and no materialized views. Snowflake lets you define clustering keys, which retain sort
order (not global!), but the feature has certain bugs and we haven't been using it seriously yet (DDL
sketch below). In particular, it doesn't seem suited to small tables or to tables with frequent small
inserts, as it doesn't do file compaction (the number of files just grows, which hurts performance).
 - Which brings me to the next point: if your use case is more streaming in nature (more frequent but
smaller inserts), I don't think Snowflake would handle it well. For one use case, we insert every
minute, and we're having problems with the number of files. For another, we ingest once per hour, and
that works okay.
</code></pre>
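<p>For reference, defining a clustering key is plain DDL (column names here are hypothetical):<p><pre><code> -- Snowflake then keeps micro-partitions approximately sorted by this key
 -- in the background; the sort is best-effort, not global.
 ALTER TABLE events CLUSTER BY (event_date, campaign_id);
</code></pre>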
Some (non-obvious) limitations:<p><pre><code> - 50 concurrent queries/user
- 150 concurrent queries/account
 - streaming use cases (see above)
 - 1s connection times with the ODBC driver (JDBC seems to be better)
</code></pre>
If you decide on Snowflake, or if you have more questions, I'm happy to help with specific use cases.