Should you ditch Spark for DuckDB or Polars?

169 points by RobinL · 5 months ago

24 comments

buremba · 5 months ago
Great post, but it seems like you still rely on Fabric to run Spark NEE. If you're on AWS or GCP, you should probably not ditch Spark but combine both. DuckDB's gotcha is that it can't scale horizontally (multi-node), unlike Databricks. A single node can get you far, though: you can rent 2TB of memory + 20TB of NVMe on AWS, and if you use PySpark, you can run DuckDB through its Spark integration (https://duckdb.org/docs/api/python/spark_api.html) until it doesn't scale, then switch to Databricks if you need to scale out. That way, you get the best of both worlds.

DuckDB on AWS EC2's price-performance is 10x that of Databricks and Snowflake with its native file format, so it's a better deal if you're not processing petabyte-level data. That's unsurprising, given that DuckDB operates on a single node (no need for distributed shuffles) and works primarily with NVMe (no use of object stores such as S3 for intermediate data). Thus, it can optimize workloads much better than the other data warehouses.

If you use SQL, another gotcha is that DuckDB doesn't have the advanced catalog features of cloud data warehouses. Still, it's possible to combine DuckDB compute with Snowflake Horizon / Databricks Unity Catalog thanks to Apache Iceberg, which enables multi-engine support in the same catalog. I'm experimenting with this multi-stack idea with DuckDB <> Snowflake, and it works well so far: https://github.com/buremba/universql
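A minimal sketch of what that Spark-integration escape hatch looks like, assuming DuckDB's experimental PySpark-compatible module (the data is made up, and not every PySpark method is implemented):

    # Same DataFrame code, but executed by DuckDB on one node instead of a JVM cluster.
    import pandas as pd
    from duckdb.experimental.spark.sql import SparkSession
    from duckdb.experimental.spark.sql.functions import col

    spark = SparkSession.builder.getOrCreate()

    orders = spark.createDataFrame(
        pd.DataFrame({"region": ["eu", "us", "us"], "amount": [50, 120, 300]})
    )
    orders.filter(col("amount") > 100).groupBy("region").count().show()

    # If the workload outgrows one node, switch the imports back to pyspark.sql
    # and run the same code against a real Spark cluster.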
kianN · 5 months ago
I went through this trade-off at my last job. I started off migrating my ad-hoc queries to DuckDB directly from Delta tables. Over time, I used DuckDB enough to do some performance tuning. I found that migrating from Delta to DuckDB's native file format provided substantial speed wins.

The author focuses on read/write performance on Delta (makes sense for the scope of the comparison). I think if an engineer is considering switching from Spark to DuckDB/Polars for their data warehouse, they would likely be open to data formats other than Delta, which is tightly coupled to Spark (and even more so to the closed-source Databricks implementation). In my use case, we saw enough speed wins and cost savings that it made sense to fully migrate our data warehouse to a self-managed DuckDB warehouse using DuckDB's native file format.
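A hedged sketch of that migration path, with a made-up bucket path and assuming DuckDB's delta extension:

    # One-time copy of a Delta table into DuckDB's native single-file format;
    # subsequent queries hit the native table.
    import duckdb

    con = duckdb.connect("warehouse.duckdb")   # native DuckDB file format
    con.execute("INSTALL delta; LOAD delta;")
    con.execute("""
        CREATE TABLE events AS
        SELECT * FROM delta_scan('s3://my-bucket/events_delta')
    """)
    print(con.execute("SELECT count(*) FROM events").fetchone())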
serjester · 5 months ago
Polars is much more useful if you're doing complex transformations instead of basic ETL.

Something underappreciated about Polars is how easy it is to build a plugin. I recently took a Rust crate that reimplemented the H3 geospatial coordinate system, exposed it as a Polars plugin, and achieved performance 5x faster than the DuckDB version.

Knowing zero Rust, and with some help from AI, it only took me two-ish days. I can't imagine doing this in C++ (DuckDB).

[1] https://github.com/Filimoa/polars-h3
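For context, using such a plugin looks like any other Polars expression; this is an illustrative sketch of polars-h3, and the exact function names and signatures may differ by release:

    # Hypothetical call shape for the polars-h3 plugin linked above.
    import polars as pl
    import polars_h3  # Rust plugin surfaced as Polars expressions

    df = pl.DataFrame({"lat": [40.7128], "lng": [-74.0060]})
    df = df.with_columns(
        polars_h3.latlng_to_cell("lat", "lng", resolution=9).alias("h3_cell")
    )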
moandcompany · 5 months ago
My opinion: the high prevalence of implementations using Spark, Pandas, etc. is mostly driven by (1) people's tendency to work with tools whose APIs they are already familiar with, (2) resume-driven development, and/or, to a much lesser degree, (3) sustainability with regard to future maintainers, versus what may be technically sensible with regard to performance. A decade ago we saw similar articles referencing misapplications of Hadoop/MapReduce, and today it is Spark as its successor.

Pandas' use of dataframe concepts and APIs was informed by R and a desire to provide something familiar and accessible to R users (i.e. ease of user adoption).

Likewise, when the Spark development community, somewhere around the version 0.11 days, began implementing the dataframe abstraction over its original native RDD abstractions, it understood the need to provide a robust Python API similar to the Pandas APIs for accessibility (i.e. ease of user adoption).

At some point those familiar APIs also became a burden, or were not great to begin with, in several ways, and we see new tools emerge like DuckDB and Polars.

However, we now have a non-unique issue where people are learning and applying specific tools versus general problem-solving skills and tradecraft in the related domain (i.e. the common pattern of people with hammers seeing everything as nails). Note all of the "learn these -n- tools/packages to become a great ____ engineer and make xyz dollars" type tutorials and starter packs on the internet today.
pnut · 5 months ago
Maybe this says more about the company I work at, but it is just incomprehensible to me to casually contemplate dumping a generally comparable, installed production capability.

All I think when I read this is: standing up new environments, observability, dev/QA training, change control, data migration, mitigating risks to business continuity, integrating with data sources and sinks, and on and on...

I've got enough headaches already without another one of those projects.
RobinL · 5 months ago
I submitted this because I thought it was a good, high-effort post, but I must admit I was surprised by the conclusion. In my experience, admittedly on different workloads, DuckDB is both faster and easier to use than Spark, and requires significantly less tuning and less complex infrastructure. I've been trying to transition as much as possible over to DuckDB.

There are also some interesting points in the following podcast about the ease of use and transactional capabilities of DuckDB, which are easy to overlook (you can skip the first 10 mins): https://open.spotify.com/episode/7zBdJurLfWBilCi6DQ2eYb

Of course, if you have truly massive data, you probably still need Spark.
rapatel0 · 5 months ago
The author disparages Ibis, but I really think this is short-sighted. Ibis does a great job of mixing SQL with dataframes to perform complex queries; it abstracts away a lot of the underlying logic and allows for query optimization.

Example:

    from ibis import _  # Ibis deferred-expression helper

    df = (
        df.mutate(new_column=df.old_column.dosomething())  # placeholder method
          .alias('temp_table')
          .sql('SELECT db_only_function(new_column) AS newer_column FROM temp_table')
          .mutate(other_new_column=_.newer_column.do_other_stuff())  # placeholder method
    )

It's super flexible, and DuckDB makes it very performant. The general vice I experience is creating overly complex transforms, but otherwise it's super useful and really easy to mix dataframes and SQL. Finally, it supports pretty much every backend, including PySpark and Polars.
memhole · 5 months ago
Nice write-up. I don't think the comments about DuckDB spilling to disk are correct: I believe that if you create a temp or persistent db, DuckDB will spill to disk.

I might have missed it, but the integration of DuckDB and the Arrow library makes mixing and matching dataframes and SQL syntax fairly seamless.

I'm convinced the simplicity of DuckDB is worth a performance penalty compared to Spark for most workloads. In my experience, people struggle to fully utilize Spark.
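A minimal sketch of both points, with made-up paths and limits:

    # With a persistent (or temp) database attached, DuckDB can spill
    # larger-than-memory operators to disk instead of OOMing.
    import duckdb
    import pyarrow as pa

    con = duckdb.connect("warehouse.duckdb")        # persistent database
    con.execute("SET memory_limit = '8GB'")
    con.execute("SET temp_directory = '/tmp/duckdb_spill'")

    # Arrow integration: query a pyarrow Table by variable name, get Arrow back.
    tbl = pa.table({"x": [1, 2, 3]})
    total = con.execute("SELECT sum(x) AS total FROM tbl").arrow()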
pm90 · 5 months ago
> My name is Miles, I'm a Principal Program Manager at Microsoft. While a Spark specialist by role

JFYI. I think the article itself is pretty unbiased, but I feel it's worth surfacing this disclaimer from the author.
steveBK123 · 5 months ago
I do wonder if some new-tech adoption will actually be slowed by the prevalence of LLM-assisted coding.

That is, all these code assistants are going to be 10x as useful on Spark/Pandas as they would be on DuckDB/Polars, due to the age of the former and the continued rate of change in the latter.
sroerick · 5 months ago
I like Spark a lot. When I was doing a lot of data work, Databricks was a great tool for a team that wasn't elbow-deep into data engineering to be able to get a lot of stuff done.

DuckDB is fantastic, though. I've never really built big-data streaming setups, so I can accomplish anything I've needed with DuckDB. I'm not sure about building full data pipelines with it. Any time I've tried, it feels a little "duck-tapey", but the ecosystem has matured tremendously in the past couple of years.

Polars never got a lot of love from me, though I love what they're doing. I used to do a lot of work in Pandas and Python, but I kind of moved on to greener pastures. I really just prefer doing any kind of ETL work in SQL.

Compute was always kind of secondary to developer experience for me. It kills me to say this, but my favorite tool for data exploration is still Power BI. If I have a strange CSV, it's tough to beat dragging it into a BI tool and exploring it that way. I'd love something like DuckDB's Harlequin but for BI / data visualization. I don't really love all the SaaS BI platforms I've explored. I did really like Plotly.

Totally open to hearing other folks' experiences or suggestions. Nothing here is an indictment of any particular tool, just my own ADD and the pressure of needing to ship.
nicornk · 5 months ago
The blog post echoes my experience that DuckDB just works (due to superior disk-spilling capabilities) and Polars OOMs a lot.
ibgeek · 5 months ago
Good write up. The only real bias I can detect is that the author seems to conflate their (lack of) familiarity with ease of use. I bet if they spent a few months using DuckDB and Polars on a daily basis, they might find some of the tasks just as easy or easier to implement.
ZeroCool2u · 5 months ago
Would love to see Lake Sail overtake Spark, so we could generally dodge tuning the JVM for big Spark jobs.

https://docs.lakesail.com/sail/latest/
shcheklein · 5 months ago
Another alternative to consider is https://www.getdaft.io/ . AFAIU it is a more direct competitor to Spark (distributed mode).
Epa095 · 5 months ago
One thing that's not clear to me about the "use DuckDB instead" proposals is how to orchestrate the batches. In Databricks/Spark there are two components:

- Auto Loader / cloudFiles: it can attach to blob storage of e.g. CSV or JSON files and deliver them as batches. As new files come in, you get batches containing only the new files.

- Structured Streaming and its checkpoints: it keeps track across runs of how far into the source it has read (including cloudFiles sources), and it's easy to either continue the job with only the new data, or delete the checkpoint and rebuild everything.

How can you do something similar with DuckDB, if you have e.g. a growing blob store of CSV/Avro/JSON files? Just read everything every day? Create some homegrown setup?

I guess what I describe above is independent of the actual compute library; you could use any transformation library to do the actual batches (and with foreachBatch you can actually use DuckDB in Spark like this).
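A homegrown setup can stay fairly small; here is a sketch with a hypothetical bucket and an existing events table, assuming the httpfs extension is configured for S3 access:

    # Track ingested paths in a DuckDB table; each run reads only new files.
    import duckdb

    con = duckdb.connect("pipeline.duckdb")
    con.execute("CREATE TABLE IF NOT EXISTS ingested_files (path VARCHAR PRIMARY KEY)")

    all_files = [r[0] for r in con.execute(
        "SELECT file FROM glob('s3://my-bucket/events/*.json')").fetchall()]
    seen = {r[0] for r in con.execute("SELECT path FROM ingested_files").fetchall()}
    new_files = [f for f in all_files if f not in seen]

    if new_files:
        file_list = ", ".join(f"'{f}'" for f in new_files)
        # assumes the events table already exists with a matching schema
        con.execute(f"INSERT INTO events SELECT * FROM read_json_auto([{file_list}])")
        con.executemany("INSERT INTO ingested_files VALUES (?)",
                        [[f] for f in new_files])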
adsharma · 5 months ago
I'm a bit confused by the claim that DuckDB doesn't support dataframes.

This blog post suggests that it has been supported since 2021, and that matches my experience: https://duckdb.org/2021/05/14/sql-on-pandas.html
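That support is the replacement-scan feature the linked post describes; a minimal example:

    # DuckDB resolves 'df' to the pandas DataFrame in the local scope.
    import duckdb
    import pandas as pd

    df = pd.DataFrame({"x": [1, 2, 3]})
    duckdb.sql("SELECT sum(x) AS total FROM df").show()
    out = duckdb.sql("SELECT x * 2 AS doubled FROM df").df()  # back to pandas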
craydandy · 5 months ago
Interesting and well-written article; thanks to the author for writing it. Replacing Spark with these single-machine tools seems to be the current hype, and Spark is not en vogue anymore.

The author ran Spark in Fabric, which has V-Order write enabled by default. DuckDB and Polars don't have this, as it's an MS-proprietary algorithm. V-Order adds about 15% overhead to writes, so it does change the result a bit.

The data sizes were a bit on the large side, at least for the data amounts I see daily. There definitely are tables in the 10GB, 100GB, and even 1TB size range, but most tables traveling through data pipelines are much smaller.
ritchie46 · 5 months ago
I do think code should be shared when you are benchmarking. He could be using Polars' eager API, for instance, which would not be apples to apples.
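For readers unfamiliar with the distinction, here is the same query in both APIs (file name made up); only the lazy form lets the optimizer see the whole plan:

    import polars as pl

    # Eager: each step executes immediately.
    eager = pl.read_csv("data.csv").group_by("key").agg(pl.col("v").sum())

    # Lazy: builds a query plan, applies optimizations such as projection and
    # predicate pushdown, then executes on collect().
    lazy = (
        pl.scan_csv("data.csv")
        .group_by("key")
        .agg(pl.col("v").sum())
        .collect()
    )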
OutOfHere · 5 months ago
Isn't Spark extremely memory-inefficient due to the use of Java?
tessierashpool9 · 5 months ago
What is the current maximum ballpark amount of data one can realistically handle on a single-machine setup on AWS / GCP / Azure?

Realistically means keeping in mind that the processing itself also requires memory, as do prerequisites like indexes, which also need to be kept in memory.

Maximum memory at AWS would be 1.5TB using r8g.metal-48xl. So, assuming 50% is usable for the raw data, that means about 750GB is realistic.
jimmyl02 · 5 months ago
Pretty new to the large-scale data processing space, so not sure if this is a known question, but isn't the purpose of Spark that it can be distributed across many workers and parallelized?

I guess the scale of data here (~100GB) is manageable with something like DuckDB, but once data gets past a certain scale, wouldn't a single machine have no way of matching a distributed Spark cluster?
code51 · 5 months ago
Why isn't Apache DataFusion in there as an alternative?
downrightmike · 5 months ago
Or use https://lancedb.com/

"LanceDB is a developer-friendly, open source database for AI. From hyper-scalable vector search and advanced retrieval for RAG, to streaming training data and interactive exploration of large-scale AI datasets, LanceDB is the best foundation for your AI application."

Recent podcast: https://talkpython.fm/episodes/show/488/multimodal-data-with-lancedb