Related post (2 days ago, 95 comments): [Snowflake’s response to Databricks’ TPC-DS post](https://news.ycombinator.com/item?id=29206959)
What I find hilarious is that companies argue who can query 100 TB faster and try to sell this to people. I've been on the receiving end of offers by both of the companies in question and used both platforms (and sadly migrated some data jobs to them).<p>While they can crunch large datasets, they are laughably slow for the datasets most people have. So while I did propose we use these solutions for our big-ish data projects, management kept pushing for us to migrate our tiny datasets (tens of gigabytes or smaller) and the perf expectedly tanked compared to our other solutions (Postgres, Redshift, pandas etc.), never mind the immense costs to migrate everything and train everyone up.<p>Yes, these are very good products. But PLEASE, for the love of god, don't migrate to them unless you know you need them (and by 'need' I don't mean pimping your resume).
The irony here is that what Databricks is doing to Snowflake is exactly what Snowflake did to AWS and Redshift.<p>Same playbook - show that you’re better in a key metric that’s easy to understand (performance) to get the attention, but then pitch the paradigm change.<p>In Snowflake’s case, that was separation of storage and compute.<p>In Databricks’ case, it’s the Lakehouse Architecture.<p>I think the reason Snowflake is so nervous is that they know they can’t win this game.
I've used both products in production. Both are good++.<p>The blog wars seem extremely ridiculous to me. I don't recall ever choosing one over another based on how fast it runs on some imaginary arbitrary dataset.
Snowflake accuses other companies of lacking integrity?<p>I really wish I could block all of Snowflake's domain from my inbox. Sadly, Google encourages spammers to just create a new email address. So I get a few emails each month from Snowflake who ask me to try their products. I've never done business with them and there's no unsubscribe link.<p>Fuck Snowflake for thinking it has any room to talk about integrity.
Snowflake must be kicking themselves hard now for letting a story that was “Databricks is a viable alternative” turn into “Snowflake has absolutely no integrity and will fling mud even while they are gaming the statistics”.<p>Really can’t see what they can do now short of “bending” to Databricks and entering the competition. And naturally it’s no longer enough that they just show comparable performance. They have to hit their gamed stats somehow, otherwise any news, even if they beat Databricks, will be reported as “see, we told you they were cheating”.
Before the Snowflake blog post, I did not know what Snowflake or Databricks were. I can only imagine that this rivalry is great for both of them, even if Databricks somewhat has the advantage, at least from a tactical standpoint; I admit though that they seem to be a bit unnecessarily defensive considering the position they're in with the exchange.<p>In general though, I'm still not complaining. It's interesting to see a dispute like this unfold.
I would say that TPC-DS and TPC-H are really table stakes benchmarks for data warehouses at this point in time (maybe they weren't 10 years ago). How to build a database that does well on them is well documented in the literature now[1][2][3][4] (maybe a few other papers). It's not easy to build such a database, but it's "just" hard work and many companies have the $$ necessary to do that work. There isn't any magic or technical moat in the results for Databricks (or Snowflake, or Redshift, etc.).<p>I think Databricks is overly enthusiastic about their results as they have been trying to be competitive with cloud DWs on these benchmarks for a number of years now. They have finally caught up (by building Delta Lake and their Photon query engine, which implement a number of standard DW features).<p><pre><code> [1] http://www.vldb.org/pvldb/vol13/p1206-dreseler.pdf
[2] https://stratos.seas.harvard.edu/files/stratos/files/columnstoresfntdbs.pdf
[3] https://web.stanford.edu/class/cs245/readings/c-store.pdf
[4] http://sites.computer.org/debull/A12mar/vectorwise.pdf</code></pre>
As much as I love seeing competition in the space and am enjoying my popcorn, I really don't understand what Databricks is doing here: this feels like a childish foodfight rather than an obsession with the customer...
I've been following this and it’s kind of embarrassing to watch.<p>I love working with Databricks and Snowflake. They both knock it out of the park for their respective use cases. They’re amazing products.<p>It makes no sense to fall out about this though.<p>For a 100TB dataset with a funky calculation, Spark will trounce Snowflake. For a 1 row dataset, Snowflake will return before the Spark job has been serialised.
Instead of blog posts written by experts in app A based on their experience with app B, I wish there were a platform for this kind of comparison.<p>Some objective third party sets the goal and then each company submits automation (selenium?) that configures their own app to achieve the goal. Entrants are scored by:<p>- time<p>- storage<p>- compute<p>- config complexity<p>No need to waste time making your opponent look bad, just focus on making yourself look good, and do it on a level playing field.
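The scoring idea above (time, storage, compute, config complexity) could be sketched roughly like this. It's only a sketch under my own assumptions: the `Entry` fields, the "divide by the best entrant and average" rule, and using a step count as a stand-in for config complexity are all made up for illustration, and a real referee would presumably pick its own units and weights.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """One vendor's submission. All fields are hypothetical placeholders."""
    name: str
    time_s: float         # wall-clock time to reach the goal
    storage_gb: float     # storage consumed
    compute_units: float  # billed compute, in whatever unit the referee picks
    config_steps: int     # crude proxy for config complexity


METRICS = ("time_s", "storage_gb", "compute_units", "config_steps")


def score(entries):
    """Return {name: score}, lower is better.

    Each metric is divided by the best (lowest) value among the entrants,
    then the four ratios are averaged with equal weight. A score of 1.0
    means the entrant was best on every metric. Assumes every metric is
    strictly positive.
    """
    best = {m: min(getattr(e, m) for e in entries) for m in METRICS}
    return {
        e.name: sum(getattr(e, m) / best[m] for m in METRICS) / len(METRICS)
        for e in entries
    }


def rank(entries):
    """Entrant names ordered from best to worst score."""
    scores = score(entries)
    return sorted(scores, key=scores.get)
```

With two invented entrants, say `Entry("vendor_a", 10, 100, 4, 5)` and `Entry("vendor_b", 20, 50, 4, 10)`, `rank` puts `vendor_a` first: it's faster and simpler to configure, which outweighs using twice the storage under equal weights. Whether equal weights are fair is exactly the kind of thing the third party would have to decide up front.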
This reminds me of the old performance ads of Oracle where they would show you how everything ran better on Oracle. They used to put those ads at airports, in business lounges and on the back covers of newspapers and magazines read by non-technical executives, like the FT and Economist.<p>Everyone technical knew they would game every environment to come out with superior results. I suppose it worked, as the top executives buy big system software and ignore the IT crowd who could easily point out the flaws in the methodology of the "studies".<p>Breakdown of one of those example ads:<p><a href="https://db2news.wordpress.com/2011/06/08/a-closer-examination-of-oracles-database-performance-advertisement/" rel="nofollow">https://db2news.wordpress.com/2011/06/08/a-closer-examinatio...</a>
tl;dr: The data warehouse company used a pre-baked TPC-DS dataset and claimed they have similar performance to Databricks. Turns out if you use the official TPC-DS data generation scripts, you get much worse performance.
Serious question: Databricks, Snowflake, Dremio. All these "Data" platform companies => which one do you have for your Data Lake and Data Warehouse solution?<p>I'm sick and tired of these companies Snake Oiling the Data industry by offering "the easiest" platform to satisfy your Data Lake + Warehouse solution only to fall hard whenever you hook it up with your production data (big dataset).<p>PS: Anyone selling Data Lakehouse (Data Lake + Warehouse as one platform) is on meth.