Props to the DuckDB team. I've been using DuckDB for the last little while since I discovered it on HN and it's been simply amazing.

Before, I would reach for Apache Spark to run queries on local Parquet datasets (on a single machine), but I've started using DuckDB for that and it's super fast (much faster than Pandas) and unfussy to integrate into Python code (with PySpark you need all kinds of boilerplate).

DuckDB is so lightweight that it's also great for quick interactive work in Jupyter or IPython.

I also use it to do cross-format joins between Parquet (immutable) and CSV files (mutable) -- DuckDB can load both into the same environment -- which makes it easy to solve different kinds of programming problems: the dynamic stuff goes into the CSV files while the Parquet dataset remains static. For instance, when processing a large Parquet dataset, my code keeps track of groups that I've already processed in a CSV file. If the program is interrupted and I need to resume from where I left off, I just do a DuckDB join between Parquet and CSV and exclude the already-processed groups (rough sketch at the end of this comment). This is particularly handy when there's no single group key and you're grouping by several fields. Yes, you can do all this with Spark too, but the DuckDB code is so much simpler and more compact.

For large Parquet datasets, I currently roll my own code to chunk them so I can process them out-of-core, but it sounds like this latest streaming feature in DuckDB takes care of that detail.

Sure, DuckDB doesn't do distributed compute like Spark, but as a SQL engine for Parquet, I find it's so much more ergonomic than Spark.
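
For anyone who hasn't tried it, the ergonomics point is roughly this (the file path and column names are made up, just to show the shape of it):

    import duckdb

    # Query Parquet files in place and get a pandas DataFrame back.
    # 'events/*.parquet', user_id etc. are hypothetical -- substitute your own.
    df = duckdb.sql("""
        SELECT user_id, count(*) AS n_events
        FROM 'events/*.parquet'
        GROUP BY user_id
        ORDER BY n_events DESC
    """).df()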
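
And the Parquet + CSV resume trick is more or less this, assuming a made-up layout where 'events/*.parquet' is the static dataset and 'processed.csv' (with a "region,day" header) is the checkpoint file:

    import duckdb

    # Find the groups not yet recorded in the checkpoint CSV.
    # The LEFT JOIN plus IS NULL filter drops already-processed groups.
    remaining = duckdb.sql("""
        SELECT e.region, e.day
        FROM 'events/*.parquet' AS e
        LEFT JOIN read_csv_auto('processed.csv') AS p
               ON e.region = p.region AND e.day = p.day
        WHERE p.region IS NULL
        GROUP BY e.region, e.day
    """).df()

    for region, day in remaining.itertuples(index=False):
        # ... process this (region, day) group ...
        with open("processed.csv", "a") as f:  # then checkpoint it
            f.write(f"{region},{day}\n")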