It's really easy to get lost in the technical jargon that the <i>vendors</i> who are <i>selling products</i> throw around, but this article has missed the important part and spent all its time talking about the relatively unimportant part (data formats).<p>You need to step back and look from a broader perspective to understand this domain.<p>Talking about arrow/parquet/iceberg is like talking about InnoDB vs MyISAM when you're talking about databases; yes, those are technically storage engines for mysql/mariadb, but no, you probably do not care about them until you need them, and you most certainly do not care about them when you want to understand what a relational DB vs. a no-SQL DB is.<p>They are <i>technical details</i>.<p>...<p>So, if you step back, what you need to read about is <i>STAR SCHEMAS</i>. Here are some links (1), (2).<p>This is what people used to build before data lakes.<p>So the tldr: you have a big database which contains <i>condensed and annotated</i> versions of your data, which is easy to query and structured in a way that visualization tools such as PowerBI, Tableau, MicroStrategy (ugh, but people do use it), etc. can consume.<p>This means you can generate <i>reports and insights</i> from your data.<p>Great.<p>...the problem is that generating this structured data from absolutely massive amounts of unstructured data involves a truly colossal amount of engineering work; and it's never real time.<p>That's because the process of turning <i>raw data</i> into <i>a star schema</i> was traditionally done via ETL tools that were slow and terrible. 'Were'. These tools are still slow and terrible.<p>Basically, the output you get is very valuable, but <i>getting it</i> is very difficult and very expensive, and both of those problems scale as the data size scales.<p>So...<p>Datalakes.<p>Datalakes are the solution to this problem: you don't transform the data. You just ingest it and store it, basically raw, and <i>on the fly</i>, when you need the data for something, you can process it.<p>The idea was something like a dependency graph: what if, instead of processing all your data every day/hour/whatever, you defined what data you needed, and then, when you actually need it, you rebuild just that part of the database?<p>Sure, you don't get the nice star schema, but... you can handle a lot of data, and what you need in order to process it 'ad hoc' is mostly pretty trivial, so you don't need a huge engineering effort to support it; you just need some smart <i>table formats</i>, a <i>lot of storage</i> and on-demand compute.<p>...Great?<p>No. Totally rubbish.<p>Turns out this is a stupid idea, and what you get is a lot of data you can't get any insights from.<p>So along comes the 'nextgen' batch of BI companies like Databricks, and they invent this idea of a 'lake house' (3), (4).<p>What is it? Take a wild guess. I'll give you a hint: having no tables was a stupid idea.<p>Yes! Correct: they've invented a layer that sits on top of a data lake and presents a 'virtual database' with ACID transactions, which you then build a star schema in/on.<p>Since the underlying implementation is (magic here, etc. etc. technical details), this approach supports output in the form we originally had (structured data suitable for analytics tools), but adds some nice features, like streaming, that make it capable of handling very large volumes of data. It's not a 'real' database, though, so it does have some limitations that are difficult to resolve (like security and RBAC).
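<p>To make that a bit more concrete, here's a minimal sketch of the 'transactional table sitting on top of files in object storage' idea. It uses the open-source <i>deltalake</i> Python package (delta-rs) as one possible implementation; the path and columns are made up for illustration, and a real setup would point at S3/ADLS/GCS rather than a local directory:<p><pre><code>import pandas as pd
from deltalake import DeltaTable, write_deltalake  # pip install deltalake pandas

# Raw-ish events arrive as an ordinary dataframe (or files you've just ingested).
# The table name and columns here are invented for the example.
events = pd.DataFrame({
    "order_id": [1, 2, 3],
    "customer_id": [10, 11, 10],
    "amount": [9.99, 24.50, 3.25],
})

# Commit them to a Delta table: an atomic, versioned write that lives as plain
# Parquet files plus a JSON transaction log under the target directory.
write_deltalake("/tmp/lake/orders", events, mode="append")

# Later, any engine that understands the table format reads a consistent snapshot.
dt = DeltaTable("/tmp/lake/orders")
print(dt.version())    # commit version, increments with every write
print(dt.to_pandas())  # the rows back as structured, queryable data
</code></pre><p>The point is just that the 'table' is nothing more than data files plus a transaction log; the ACID-ish behaviour comes from the log, not from a database server.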
<p>...<p>Of course, the promise, that you just pour all your data in and 'magic!' you have insights, is still just as much nonsense as it ever was.<p>If you use any of these tools now, you'll see that they require you to transform your data, usually as some kind of batch process (there's a toy sketch of what that step looks like at the very bottom, after the links).<p>If you closed your eyes and said "ETL?", you'd win a cookie.<p>All a 'lake house' is, is a traditional BI data warehouse built on a different type of database.<p>Almost without exception, everything else is marketing fluff.<p>* exception: Kafka and streaming are actually fundamentally different for real-time aggregated metrics, but they're also fabulously difficult to do well, so most people still don't, as far as I'm aware.<p>...and I'll go out on a limb here and say that, really, you probably do not care whether your implementation uses Delta tables or Iceberg; that's an implementation detail.<p>I <i>guarantee</i> that correctly understanding your domain data and modelling a form of it suitable for reporting and insights is more important and more valuable than what storage engine you use.<p>[1] - <a href="https://learn.microsoft.com/en-us/power-bi/guidance/star-schema" rel="nofollow">https://learn.microsoft.com/en-us/power-bi/guidance/star-sch...</a>
[2] - <a href="https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/books/data-warehouse-dw-toolkit/" rel="nofollow">https://www.kimballgroup.com/data-warehouse-business-intelli...</a><p>[3] - <a href="https://www.snowflake.com/guides/what-data-lakehouse" rel="nofollow">https://www.snowflake.com/guides/what-data-lakehouse</a>
[4] - <a href="https://www.databricks.com/glossary/data-lakehouse" rel="nofollow">https://www.databricks.com/glossary/data-lakehouse</a>
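<p>For anyone who hasn't actually seen the batch step mentioned above, here's a toy sketch of 'ETL into a star schema': raw order events sitting in the lake get condensed into one fact table plus one dimension table that a BI tool can point at. The file paths, column names and the use of DuckDB are purely illustrative assumptions on my part, not anything a specific product mandates:<p><pre><code>import duckdb  # pip install duckdb

con = duckdb.connect("warehouse.db")

# Dimension table: one row per customer, descriptive attributes only.
con.execute("""
    CREATE OR REPLACE TABLE dim_customer AS
    SELECT DISTINCT customer_id, country, segment
    FROM read_parquet('lake/raw/orders/*.parquet')
""")

# Fact table: one narrow row per order, with measures and a foreign key
# into the dimension.
con.execute("""
    CREATE OR REPLACE TABLE fact_orders AS
    SELECT order_id, customer_id, order_date, amount
    FROM read_parquet('lake/raw/orders/*.parquet')
""")

# This is the shape the visualization tools want: join facts to dimensions
# and aggregate.
print(con.execute("""
    SELECT d.country, SUM(f.amount) AS revenue
    FROM fact_orders f
    JOIN dim_customer d USING (customer_id)
    GROUP BY d.country
    ORDER BY revenue DESC
""").fetchall())
</code></pre><p>Swap DuckDB for Spark, the Parquet glob for a Delta or Iceberg table, schedule it nightly, and you have, more or less, the batch transform the comment above is talking about.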