
Understanding Parquet, Iceberg and Data Lakehouses

312 points | by davidgomes | over 1 year ago

24 comments

twoodfin (over 1 year ago)
I often hear references to Apache Iceberg and Delta Lake as if they're two peas in the Open Table Formats pod. Yet…

Here's the Apache Iceberg table format specification:

https://iceberg.apache.org/spec/

As they like to say in patent law, anyone "skilled in the art" of database systems could use this to build and query Iceberg tables without too much difficulty.

This is nominally the Delta Lake equivalent:

https://github.com/delta-io/delta/blob/master/PROTOCOL.md

I defy anyone to even scope out what level of effort would be required to fully implement the current spec, let alone what would be involved in keeping up to date as this beast evolves.

Frankly, the Delta Lake spec reads like a reverse engineering of whatever implementation tradeoffs Databricks is making as they race to build out a lakehouse for every Fortune 1000 company burned by Hadoop (which is to say, most of them).

My point is that I've yet to be convinced that buying into Delta Lake is actually buying into an open ecosystem. Would appreciate any reassurance on this front!

Editing to append this GitHub history, which is unfortunately not reassuring:

https://github.com/delta-io/delta/commits/master/PROTOCOL.md

Random features and tweaks just popping up, PR'd by Databricks engineers and promptly approved by Databricks senior engineers…
wenc (over 1 year ago)
Great article. I've worked with Parquet files on S3 for years, but I didn't quite understand what Iceberg was; the article explained it well. It's a database metadata format for an underlying set of data, describing its schema, partitioning, etc.

Most people use the Hive partitioning convention (i.e. directory names like /key3=000/key2=002/), but Iceberg goes farther than this by exposing even more structure to the query engine.

In a traditional DBMS like Postgres, the schema, the query engine and the storage format come as a single package.

But with big data, we're building database components from scratch, and we can mix and match. We can use Iceberg as a metadata format, DuckDB as the query engine, Parquet as the storage format, and S3 as the storage medium.
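A minimal sketch of that mix-and-match stack, with DuckDB as the query engine over Hive-partitioned Parquet on S3. The bucket path and column names here are hypothetical, and S3 credentials still have to be configured separately:

```python
# DuckDB reading Hive-partitioned Parquet straight from S3
# (hypothetical bucket and columns).
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")  # S3/HTTP support
con.execute("LOAD httpfs")

# hive_partitioning=true turns /key3=000/key2=002/ path segments
# into ordinary columns the engine can filter and group on.
result = con.sql("""
    SELECT key3, key2, count(*) AS n
    FROM read_parquet('s3://my-bucket/events/*/*/*.parquet',
                      hive_partitioning = true)
    GROUP BY key3, key2
    ORDER BY n DESC
""")
print(result.fetchall())
```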
benjaminwootton (over 1 year ago)
This is a big deal in the database world, as Delta, Iceberg and Hudi mean that data is being stored in an open source format, often on S3.

It means that the storage and much of the processing are being standardised, so you can move between databases easily, and almost all tools will eventually be able to work with the same set of files in a transactionally sound way.

For instance, Snowflake could be writing to a file, a data scientist could be querying the data live from a Jupyter notebook, and ClickHouse could be serving user-facing analytics against the same data, with consistency guarantees.

If the business then decides to switch from Snowflake to Databricks, it isn't such a big deal.

Right now it isn't quite as fast to query these formats on S3 as a native ingestion would be, but every database vendor will be forced by the market to optimise for performance, so they will tend towards the performance of natively ingested data.

It's a great win for openness and open source, and for businesses to have their data in open and portable formats.

Lakehouse has the same implications. Lots of companies have data lakes and data warehouses and end up copying data between the two. Querying the same set of data and having just one system to manage is equally impactful.

It's a very interesting time to be in the data engineering world.
jamesblonde (over 1 year ago)
I disagree with this strongly: "The best way to store Apache Arrow dataframes in files on disk is with Feather. However, it's also possible to convert to Apache Parquet format and others."

The best way to build your own non-JVM lakehouse is to use Iceberg for the metadata, Parquet for the data, query with DuckDB using Arrow tables (reading Parquet directly into Arrow is very low cost), and then use Arrow -> Pandas or Polars (either directly or via a service with Arrow Flight).

If you put Feather in the mix, the whole Python lakehouse stack doesn't currently work.
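A rough sketch of that stack, under a couple of assumptions: a PyIceberg REST catalog at a made-up address and a made-up table name, with Arrow as the interchange layer into DuckDB and Polars.

```python
# Sketch of the non-JVM Python lakehouse stack: Iceberg for metadata,
# Parquet for data, Arrow in the middle, DuckDB and Polars on top.
# The catalog settings and table name below are hypothetical.
import duckdb
import polars as pl
from pyiceberg.catalog import load_catalog

catalog = load_catalog("demo", uri="http://localhost:8181")  # e.g. a REST catalog
table = catalog.load_table("analytics.events")

# PyIceberg plans the scan from Iceberg metadata and reads the underlying
# Parquet files into an Arrow table.
arrow_tbl = table.scan().to_arrow()

# Query the Arrow table with DuckDB (picked up by its replacement scan)...
duckdb.sql("SELECT count(*) AS n FROM arrow_tbl").show()

# ...or hand the same Arrow data to Polars without copying.
df = pl.from_arrow(arrow_tbl)
print(df.head())
```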
debo_ (over 1 year ago)
I've heard of data lakes, but "data lakehouse" sounds like where upper-class data goes in the summer to take their data-boats data-fishing.
alentred (over 1 year ago)
I am very excited about Iceberg specifically (because it's open source), but the last time I looked into it the only implementation was a Spark library, and Trino's (formerly Presto, an SQL engine) Iceberg connector had a hard dependency on Hive! It is as if the entire industry had a hard time divorcing its MapReduce, Hive, and dare I say Spark, legacy.

I haven't looked into Iceberg since, but plan to, and I am really looking forward to seeing this develop. We have the tools and the compute power today to deal with data without legacy tech, and not all data is big data either. Consequently, "data engineering", thankfully, resembles regular back-end development more and more, with its regular development practices being put in place.

So, here's to the hope of having a pure Python Iceberg lib some day very soon!
throwitaway222 (over 1 year ago)
Why is no one able to describe all this with more concrete ideas? For example: this is how you store data, this is how you connect and query, and this is how fast those queries will be (i.e. transactional speed vs "analytics" speed).
lysecret (over 1 year ago)
I am currently working with about 100TB of data on GCP, with BigQuery as the query engine and simple Hive partitioning like /key3=000/key2=002/. We are happy because we can run all the queries we want and it is insanely cheap. But latency is reaching quite high levels (it doesn't matter so much for us), so I was wondering: would implementing Iceberg improve this? Has anyone experience with this?

Overall this kind of architecture is just awesome.
Lyngbakr (over 1 year ago)
> However, this blog post won't be 100% comprehensive, or even the best starting point for most people. That's because I'm writing this for myself. I find that the best way to learn new things is by "forcing myself" to re-explain them to others.

I really like this attitude and have started embracing it myself, both on paper and in notes on my website.
lmeyerov (over 1 year ago)
We have been excited to dig into the Iceberg era of more managed Parquet storage... but they are still years behind on supporting fast GPU IO (GPUDirect/cuFile). So every time we look at bringing them to a customer for powering AI workloads... we hit that wall.

It seems inevitable, more of a when vs if. Being able to have our cake & eat it too will be very cool :)
hawaiianSpork (over 1 year ago)
Parquet has been the lakehouse file format of choice for nearly half a decade. But we are starting to see other contenders that are optimized more for lower latency, like Lance: https://github.com/lancedb/lance
berniedurfee (over 1 year ago)
No mention of Hudi? I really liked using Hudi in a recent project. It feels so close to hitting that maturity level where it's viable for a small team to maintain without introducing too many moving parts.

Overall, I like the whole concept of the Lakehouse because it can be done cheaply.

Most data lakes turn into swamps pretty quickly, so cheaper is better.

Let it sit unused for a while in S3 and then quietly nuke it without burning money on a big compute environment.
albert_e (over 1 year ago)
Sorry, genuine question: what does the phrase "at Broad" at the end of the blog post's title mean or refer to? Maybe a phrase that I am unfamiliar with? I first wondered if it is the name of an organization or team, and this post is describing what they did in that team, but that doesn't seem to be the case?

>> Understanding Parquet, Iceberg and Data Lakehouses at Broad
alexott (over 1 year ago)
Unity Catalog isn't comparable with Iceberg catalogs. It's not required for Delta to function…

There was a paper at VLDB about Delta Lake: https://www.vldb.org/pvldb/vol13/p3411-armbrust.pdf. It describes why it was created, plus details of the implementation.
pitah1 (over 1 year ago)
Good in-depth insights into each format. This complements nicely a site I created called tech-diff (https://tech-diff.com/file/), which provides a summary of the file formats.
fancy_pantser (over 1 year ago)
I know it's a newcomer still under heavy development, but I'm surprised not to see Lance (and LanceDB atop it) mentioned. It crushes ORC and Parquet for most real-world data scenarios and has cheap data versioning.
Nelkins (over 1 year ago)
In every benchmark I've looked at online, the Delta Lake format seems to have drastically better performance than Iceberg. Is this fundamental to the spec, or is it possible for Iceberg to close the gap?
Boxxed (over 1 year ago)
One thing I'm confused about: why does Iceberg need a Spark deployment to function? Or am I wrong about that? I would rather avoid that ecosystem if I can.
jbmsf (over 1 year ago)
I appreciate the clarity of this article. I know it was written by the author for themselves, but it feels like it could have been written for me!
aejm (over 1 year ago)
I really liked your article.

Is this a typo: "Hive, Delta Lake and Iceberg all support support of schema registry or metastore."?
mulmen (over 1 year ago)
How do dependencies work in this type of data lakehouse? Does the orchestration layer handle that or is there metadata within the data lake that provides completeness information?
plopz (over 1 year ago)
Are these formats appropriate for multi-dimensional gridded data, or are HDF/NetCDF still what people use for those?
wokwokwok (over 1 year ago)
It's really easy to get lost in the technical jargon that the *vendors* who are *selling products* throw around, but this article has missed the important part, and spent all its time talking about the relatively unimportant part (data formats).

You need to step back and look from a broader perspective to understand this domain.

Talking about Arrow/Parquet/Iceberg is like talking about InnoDB vs MyISAM when you're talking about databases; yes, those are technically storage engines for MySQL/MariaDB, but no, you probably do not care about them until you need them, and you most certainly do not care about them when you want to understand what a relational DB vs. a NoSQL DB is.

They are *technical details*.

...

So, if you step back, what you need to read about is *STAR SCHEMAS*. Here are some links: (1), (2).

This is what people used before data lakes.

So the tl;dr: you have a big database which contains *condensed and annotated* versions of your data, which is easy to query, and structured in a way that is suitable for visualization tools such as PowerBI, Tableau, MicroStrategy (ugh, but people do use it), etc.

This means you can generate *reports and insights* from your data.

Great.

...The problem is that generating this structured data from absolutely massive amounts of unstructured data involves a truly colossal amount of engineering work, and it's never realtime.

That's because the process of turning *raw data* into *a star schema* was traditionally done via ETL tools that were slow and terrible. 'Were'. These tools are still slow and terrible.

Basically, the output you get is very valuable, but *getting it* is very difficult and very expensive, and both of those problems scale as the data size scales.

So...

Data lakes.

Data lakes are the solution to this problem: you don't transform the data. You just ingest it and store it, basically raw, and *on the fly*, when you need the data for something, you process it.

The idea was something like a dependency graph: what if, instead of processing all your data every day/hour/whatever, you defined what data you needed, and then, when you need it, you rebuild just that part of the database?

Certainly you don't get the nice star schema, but... you can handle a lot of data, and what you need to process it 'ad hoc' is mostly pretty trivial, so you don't need a huge engineering effort to support it; you just need some smart *table formats*, a *lot of storage* and on-demand compute.

...Great?

No. Totally rubbish.

Turns out this is a stupid idea, and what you get is a lot of data you can't get any insights from.

So along comes the 'next-gen' batch of BI companies like Databricks, and they invent this idea of a 'lake house' (3), (4).

What is it? Take a wild guess. I'll give you a hint: having no tables was a stupid idea.

Yes! Correct: they've invented a layer that sits on top of a data lake and presents a 'virtual database' with ACID transactions, which you then build a star schema in/on.

Since the underlying implementation is (magic here, etc. etc. technical details), this approach supports output in the form we originally had (structured data suitable for analytics tools), but it has some nice features, like streaming, that make it capable of handling very large volumes of data; it's not a 'real' database, though, so it does have some limitations which are difficult to resolve (like security and RBAC).

...

Of course, the promise that you just pour all your data in and, magic!, you have insights, is still just as much nonsense as it ever was.

If you use any of these tools now, you'll see that they require you to transform your data, usually as some kind of batch process.

If you closed your eyes and said "ETL?", you'd win a cookie.

All a 'lake house' is, is a traditional BI data warehouse built on a different type of database.

Almost without exception, everything else is marketing fluff.

* Exception: Kafka and streaming are actually fundamentally different for real-time aggregated metrics, but they're also fabulously difficult to do well, so most people still don't, as far as I'm aware.

...And I'll go out on a limb here and say that really, you probably do not care whether your implementation uses Delta tables or Iceberg; that's an implementation detail.

I *guarantee* that correctly understanding your domain data and modelling a form of it suitable for reporting and insights is more important and more valuable than which storage engine you use.

[1] https://learn.microsoft.com/en-us/power-bi/guidance/star-schema
[2] https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/books/data-warehouse-dw-toolkit/
[3] https://www.snowflake.com/guides/what-data-lakehouse
[4] https://www.databricks.com/glossary/data-lakehouse
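To make the star-schema point concrete, here is a tiny sketch in DuckDB. The fact table, dimension tables and report query are invented for illustration; a BI tool sitting on top would issue queries of roughly this shape.

```python
# Tiny star-schema sketch (invented tables): one fact table of sales events
# plus two dimension tables, and the kind of rollup query a BI tool would run.
import duckdb

con = duckdb.connect()
con.execute("CREATE TABLE dim_customer (customer_id INTEGER, region TEXT)")
con.execute("CREATE TABLE dim_product (product_id INTEGER, category TEXT)")
con.execute("""
    CREATE TABLE fact_sales (
        customer_id INTEGER,   -- points into dim_customer
        product_id  INTEGER,   -- points into dim_product
        sold_at     TIMESTAMP,
        amount      DECIMAL(10, 2)
    )
""")

# A typical report: revenue by region and product category.
con.sql("""
    SELECT c.region, p.category, sum(f.amount) AS revenue
    FROM fact_sales f
    JOIN dim_customer c USING (customer_id)
    JOIN dim_product  p USING (product_id)
    GROUP BY c.region, p.category
    ORDER BY revenue DESC
""").show()
```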
meehai (over 1 year ago)
I can confirm that it is a nice thing to work with Parquet files. Before this, we worked for ~1 year with CSVs (I know, the horror) and we made an effort to port all the 'legacy' code to Parquet files.

We interface with BigQuery (via Airflow) mostly, and except for one very annoying situation it's a big improvement in terms of speed (parsing floats after querying the DB is NEVER a good option).

---

In case anyone's wondering, the annoying situation is basically storing and loading native numpy arrays in BigQuery via the Python client(s).

You have a bunch of options (assume you have one or more columns with float32 numpy arrays):

- dataframe -> to_parquet -> upload to GCS -> GCSToBigQueryOperator (https://airflow.apache.org/docs/apache-airflow-providers-google/stable/_api/airflow/providers/google/cloud/transfers/gcs_to_bigquery/index.html)
  - Instead of being stored as FLOAT, REPEATED, the column will be stored as a STRUCT with a structure of list > item OR list > element (pyarrow==11 OR pyarrow==13). This requires manual parsing from the 'JSON structure' you get when querying the DB back to np.array: slow, and basically you are using CSVs again. Read more: https://stackoverflow.com/questions/68303327/unnecessary-list-item-nesting-in-bigquery-schemas-from-pyarrow-upload-dataframe
  - Set the schema before uploading? Nope, all values will be uploaded as null in BQ.

- dataframe -> bigquery.Client -> upload the dataframe from Python
  - Very slow: you need to batch your data (imagine 24h vs 5 minutes kind of slow as dataframe sizes increase, plus the need to keep all data in memory or batch it, so an extra save/load of each batch before uploading).
  - Arrays are stored properly.

- Solution: you must do two things, one on the pyarrow side and one on the BigQuery side.
  - df.to_parquet(..., use_compliant_nested_type=True) (in pyarrow==14 it's True by default, but Airflow needs pyarrow==11, where it's False by default).
  - Use enable_list_inference=True (link: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-parquet#list_logical_type).
  - When both of these are set (i.e. save Parquet files [to GCS] with the first flag and load Parquet files [from GCS to BQ] with the second), arrays can be stored as (FLOAT, REPEATED) and queried as numpy arrays out of the box, without any manual handling.

This took me like a week of debugging and reading source code, obscure SO comments, GH issues, etc.
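A condensed sketch of that two-flag fix: the bucket, dataset and table names are placeholders, writing to gs:// from pandas assumes gcsfs is installed, and the exact client setup will vary.

```python
# Sketch of the two-flag fix described above (placeholder paths and table names).
# Requires pandas, pyarrow, gcsfs and google-cloud-bigquery.
import numpy as np
import pandas as pd
from google.cloud import bigquery
from google.cloud.bigquery.format_options import ParquetOptions

df = pd.DataFrame({
    "id": [1, 2],
    "embedding": [np.array([0.1, 0.2], dtype=np.float32),
                  np.array([0.3, 0.4], dtype=np.float32)],
})

# 1) pyarrow side: write compliant nested types
#    (the default only from pyarrow>=14, so set it explicitly on older versions).
df.to_parquet("gs://my-bucket/tmp/embeddings.parquet",
              use_compliant_nested_type=True)

# 2) BigQuery side: enable list inference so the list<element> column loads
#    as a FLOAT, REPEATED field instead of a nested STRUCT.
parquet_options = ParquetOptions()
parquet_options.enable_list_inference = True
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    parquet_options=parquet_options,
)
client = bigquery.Client()
client.load_table_from_uri(
    "gs://my-bucket/tmp/embeddings.parquet",
    "my_project.my_dataset.embeddings",
    job_config=job_config,
).result()
```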