Great post, but it seems like you still rely on Fabric to run Spark NEE. If you're on AWS or GCP, you should probably not ditch Spark but combine both. DuckDB's gotcha is that it can't scale horizontally (multi-node), unlike Databricks. A single node can get you far (you can rent 2 TB of memory and 20 TB of NVMe on AWS), and if you use PySpark, you can run DuckDB through its Spark integration (https://duckdb.org/docs/api/python/spark_api.html) until it stops scaling, then switch to Databricks if you need to scale out (see the sketches below). That way, you get the best of both worlds.

DuckDB on AWS EC2 with its native file format gives roughly 10x the price-performance of Databricks and Snowflake, so it's a better deal if you're not processing petabyte-scale data. That's unsurprising, given that DuckDB runs on a single node (no distributed shuffles) and works primarily against NVMe (no object store such as S3 for intermediate data), so it can optimize workloads much better than those data warehouses.

If you use SQL, another gotcha is that DuckDB doesn't have the advanced catalog features of cloud data warehouses. Still, it's possible to combine DuckDB compute with Snowflake Horizon or Databricks Unity Catalog thanks to Apache Iceberg, which lets multiple engines share the same catalog. I'm experimenting with this multi-stack idea for DuckDB <> Snowflake, and it works well so far: https://github.com/buremba/universql
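To make the Spark-integration path concrete, here's a minimal sketch of DuckDB's PySpark-compatible API (it currently lives under duckdb.experimental.spark, so treat the import path as subject to change). The same DataFrame code runs on real PySpark by swapping the imports:

    # DuckDB's PySpark-compatible API; swap these two imports for
    # pyspark.sql / pyspark.sql.functions when you outgrow one node.
    from duckdb.experimental.spark.sql import SparkSession
    from duckdb.experimental.spark.sql.functions import col
    import pandas as pd

    spark = SparkSession.builder.getOrCreate()

    # Build a DataFrame from pandas and run PySpark-style transforms,
    # executed by DuckDB on the local node.
    orders = spark.createDataFrame(pd.DataFrame({
        "order_id": [1, 2, 3],
        "amount": [10.0, 25.5, 7.25],
    }))
    orders.filter(col("amount") > 10).show()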
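On the price-performance point, this is roughly the pattern I mean: land the raw Parquet from S3 once, keep the working set in DuckDB's native format on the instance's NVMe, and let intermediates stay local (the bucket and paths below are made up; S3 credentials still need to be configured):

    import duckdb

    # Database file on the instance's local NVMe (hypothetical path).
    con = duckdb.connect("/mnt/nvme/warehouse.duckdb")

    # httpfs gives DuckDB S3 access; credentials/region come from your
    # DuckDB configuration or environment.
    con.execute("INSTALL httpfs")
    con.execute("LOAD httpfs")

    # Materialize once into DuckDB's native format; later queries and
    # their intermediate data never touch the object store.
    con.sql("""
        CREATE TABLE IF NOT EXISTS events AS
        SELECT * FROM read_parquet('s3://my-bucket/events/*.parquet')
    """)
    con.sql("SELECT count(*) AS n FROM events").show()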
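And on the catalog point, a hand-rolled sketch of the DuckDB side: the iceberg extension can read the same Iceberg tables the catalog manages, assuming you've already resolved the table's storage location through Horizon / Unity Catalog (the S3 path below is made up):

    import duckdb

    con = duckdb.connect()
    # Extensions for Iceberg metadata and S3 access.
    for ext in ("iceberg", "httpfs"):
        con.execute(f"INSTALL {ext}")
        con.execute(f"LOAD {ext}")

    # Read an Iceberg table by its warehouse location; in practice the
    # location comes from the catalog rather than being hard-coded.
    con.sql("""
        SELECT count(*) AS n
        FROM iceberg_scan('s3://my-bucket/warehouse/db/orders',
                          allow_moved_paths = true)
    """).show()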