Iceberg (https://iceberg.apache.org) is an open source alternative to Delta Lake that I cannot recommend enough.
It organizes your Parquet files (or files in other serialization formats) into a logical structure with snapshots, enabling time travel, git-like semantics for data management, and Write-Audit-Publish strategies.
My favorite recent use is idempotent change data capture, which eases replication in the event of failures: when your publishing job fails, you can simply replay the same diff between two snapshots and pick up where you left off.
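For the curious, here's a minimal sketch of that replay using Iceberg's incremental read in PySpark. The table name, snapshot IDs, and sink path are illustrative assumptions, not anything from my actual setup:

```python
# Sketch: replay the same diff between two Iceberg snapshots (idempotent CDC).
# Assumes a SparkSession with an Iceberg catalog configured; the table name,
# snapshot IDs, and sink path below are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

last_published_id = 1111111111111111111  # snapshot the failed publish started from
target_id = 2222222222222222222          # snapshot the failed publish aimed at

# Incremental read of the rows appended between the two snapshots.
# Re-running with the same IDs produces the same diff, so a failed
# publish can be replayed safely.
diff = (
    spark.read.format("iceberg")
    .option("start-snapshot-id", last_published_id)  # exclusive
    .option("end-snapshot-id", target_id)            # inclusive
    .load("db.events")
)

diff.write.mode("append").parquet("s3://bucket/published/events/")
```

Note that Iceberg's incremental read currently covers append snapshots, which is exactly the shape of an append-style CDC feed.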
Comparing Delta Lake to Parquet is a bit nonsensical, isn't it? Like comparing Postgres to a zip file. After trying all of the major open table formats, I think Iceberg is the future. Delta is great if you use Databricks, but otherwise I don't see a compelling reason to use it over Iceberg.
More comparisons (from a competitor?):

"Apache Hudi vs Delta Lake vs Apache Iceberg - Data Lakehouse Feature Comparison"

https://www.onehouse.ai/blog/apache-hudi-vs-delta-lake-vs-apache-iceberg-lakehouse-feature-comparison
I'm not well versed in these things, but at this point, aren't you re-inventing database systems? Talking about things like ACID transactions, schema evolution, dropping columns, ... in the context of a file format feels bizarre to me.
Isn't Delta Lake using Parquet files? I don't understand the comparison.

Also:

> Parquet tables are OK when data is in a single file but are hard to manage and unnecessarily slow when data is in many files

This is not true. Having worked with Spark, it's much better to have multiple "small" files than only one big file.
Delta is pretty great; it lets you do upserts into tables in Databricks much more easily than without it.

I think the website is here: https://delta.io
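For reference, the upsert is a MERGE under the hood. A minimal sketch with the delta-spark Python API, where the paths and the join key are made up for illustration:

```python
# Sketch of a Delta upsert (MERGE) via the delta-spark Python API.
# Assumes a SparkSession `spark` with the Delta extension enabled;
# the paths and the customer_id join key are hypothetical.
from delta.tables import DeltaTable

target = DeltaTable.forPath(spark, "/mnt/tables/customers")
updates = spark.read.parquet("/mnt/staging/customer_updates")

(
    target.alias("t")
    .merge(updates.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedUpdateAll()     # matching rows: overwrite with the new values
    .whenNotMatchedInsertAll()  # new rows: insert them
    .execute()
)
```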
Delta is nice, but a lot of features are missing from the FOSS version.

Hudi is nice, but they are in the middle of a big format change right now.

Iceberg is nice, but it is the most conservative and slow-moving of the three formats.
Databricks has been struggling to defend Delta against the fast-moving improvements and widening adoption of Iceberg, championed by two of its major competitors, AWS and Snowflake. This article seems like a bizarre, and maybe even misleading, artifact, given that no one in the industry is comparing Parquet to Delta. They’re weighing Iceberg, which like Delta, can organize and structure groups of parquet (or other format) files…
Data Lakes (i.e. Parquet files in storage without a metadata layer) don't support transactions, require expensive file listing operations, and don't support basic DML operations like deleting rows.

Delta Lake stores data in Parquet files and adds a metadata layer to provide support for ACID transactions, schema enforcement, versioned data, and full DML support. Delta Lake also offers concurrency protection.

This post explains all the features offered by Delta Lake in comparison to a plain vanilla Parquet data lake.
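As a rough illustration of the DML and versioning described above, here's a sketch with the delta-spark Python API; the table path and delete predicate are assumptions for the example:

```python
# Sketch: row-level DELETE plus time travel on a Delta table, operations a
# plain directory of Parquet files can't do in place. Assumes a SparkSession
# `spark` with Delta enabled; the path and predicate are hypothetical.
from delta.tables import DeltaTable

events = DeltaTable.forPath(spark, "/mnt/tables/events")

# Row-level DML: recorded in the transaction log as a new table version.
events.delete("event_date < '2020-01-01'")

# Versioned data: read the table as it was before the delete.
before = (
    spark.read.format("delta")
    .option("versionAsOf", 0)
    .load("/mnt/tables/events")
)
```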