
Apache Iceberg: the Hadoop of the modern data stack?

115 points by samrohn 3 months ago

12 comments

hendiatris 3 months ago
This is a huge challenge with Iceberg. I have found that there is substantial bang for your buck in tuning how Parquet files are written, particularly in terms of row group size and column-level bloom filters. In addition to that, I make heavy use of the encoding options (dictionary/RLE) while denormalizing data into as few files as possible. This has allowed me to rely on DuckDB for querying terabytes of data at low cost and acceptable performance.

What we are lacking now is tooling that gives you insight into how you should configure Iceberg. Does something like this exist? I have been looking for something that would show me the query plan that is developed from Iceberg metadata, but didn't find anything. It would go a long way toward showing where the bottleneck is for queries.
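A minimal sketch of the kind of write-side tuning described above, using pyarrow to control row group size and dictionary encoding and DuckDB on the query side; the file name, column names, and sizes are illustrative assumptions, and bloom-filter writing (mentioned in the comment) depends on writer support, so it is omitted here.

```python
import duckdb
import pyarrow as pa
import pyarrow.parquet as pq

# Illustrative table; in practice this is your denormalized data.
table = pa.table({
    "sensor_id": pa.array(["a", "b"] * 500_000),
    "reading": pa.array(range(1_000_000)),
})

# Smaller row groups let engines skip more data via min/max statistics;
# dictionary encoding (RLE under the hood) shrinks low-cardinality columns.
pq.write_table(
    table,
    "readings.parquet",
    row_group_size=128_000,        # rows per group; tune per workload
    use_dictionary=["sensor_id"],  # dictionary/RLE-encode only this column
    compression="zstd",
)

# DuckDB queries the file directly, pruning row groups by statistics.
print(duckdb.sql(
    "SELECT count(*) FROM 'readings.parquet' WHERE sensor_id = 'a'"
).fetchall())
```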
mritchie712 3 months ago
This is a bit overblown.

Is Iceberg "easy" to set up? No.

Can you get set up in a week? Yes.

If you really need a data lake, spending a week setting it up is not so bad. We have a guide[0] here that will get you started in under an hour.

For smaller (e.g. under 10 TB) data where you don't need real-time, DuckDB is becoming a really solid option. Here's one setup[1] we've played around with using Arrow Flight.

If you don't want to mess with any of this, we[2] spin it all up for you.

0 - https://www.definite.app/blog/cloud-iceberg-duckdb-aws

1 - https://www.definite.app/blog/duck-takes-flight

2 - https://www.definite.app/
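For a sense of what that first setup looks like, here is a minimal sketch using pyiceberg against a REST catalog; the catalog URI, warehouse path, and table name are placeholder assumptions, not taken from the linked guides.

```python
import pyarrow as pa
from pyiceberg.catalog import load_catalog

# Placeholder connection details; a real setup points at your own
# catalog service and object store.
catalog = load_catalog(
    "default",
    **{
        "uri": "http://localhost:8181",           # REST catalog endpoint (assumed)
        "warehouse": "s3://my-bucket/warehouse",  # object-store warehouse (assumed)
    },
)

# Create a table from an Arrow schema and append a batch of rows.
data = pa.table({"id": [1, 2, 3], "event": ["a", "b", "c"]})
tbl = catalog.create_table("demo.events", schema=data.schema)
tbl.append(data)

# Scans come back as Arrow, so DuckDB or pandas can pick up from here.
print(tbl.scan().to_arrow())
```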
simlevesque 3 months ago
I'm working on an alternative Iceberg client that works better in write-heavy use cases. Instead of writing many smaller files, it keeps writing to the same file until it reaches 1 MB in size, but gives it a new name on each write. Then I update the manifest with the new filename and checksum. I keep old files on disk for 60 seconds so pending queries can finish. I'm also working on auto-compaction: when I have ten 1 MB files I compact them, same with ten 10 MB files, and so on.

I feel like this could be a game changer for the ecosystem. It's more CPU- and network-heavy for writes, but the reads are always fast. And the writes are still faster than pyiceberg.

I want to hear opinions, or reasons why this could never work.
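A rough sketch of the size-tiered compaction trigger described above; the tier sizes, fanout, and directory layout are assumptions for illustration, not the commenter's actual implementation.

```python
from pathlib import Path

TIER_BYTES = 1_000_000  # 1 MB base tier (assumed from the comment)
FANOUT = 10             # compact when ten files share a tier

def tier_of(size: int) -> int:
    """Tier 0 holds ~1 MB files, tier 1 ~10 MB files, and so on."""
    tier = 0
    while size >= TIER_BYTES * (FANOUT ** (tier + 1)):
        tier += 1
    return tier

def plan_compactions(data_dir: Path) -> list[list[Path]]:
    """Group data files by size tier; any tier with FANOUT files is a job."""
    tiers: dict[int, list[Path]] = {}
    for f in data_dir.glob("*.parquet"):
        tiers.setdefault(tier_of(f.stat().st_size), []).append(f)
    return [files[:FANOUT] for files in tiers.values() if len(files) >= FANOUT]

# A real client would rewrite each group as one larger file under a new
# name, swap it into the Iceberg manifest, and delete the inputs after a
# grace period (60 seconds in the comment) so in-flight queries can finish.
for group in plan_compactions(Path("warehouse/data")):
    print("would compact:", [p.name for p in group])
```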
robertkoss 3 months ago
This article is just shameless advertising for Estuary Flow, a company the author works for. "Operational maturity", as if Iceberg, Delta, or Hudi were not mature. These are battle-tested frameworks that have been in production for years. The "small files problem" is not really a problem, because every framework supports some way of compacting smaller files. Just run a nightly job that compacts the small files and you're good to go.
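For reference, Iceberg ships exactly such a compaction job as a Spark procedure; a minimal sketch, assuming a Spark session already configured with an Iceberg catalog named `my_catalog` and a table `db.events` (both placeholders):

```python
from pyspark.sql import SparkSession

# Assumes spark.sql.catalog.my_catalog is already configured for Iceberg.
spark = SparkSession.builder.appName("nightly-compaction").getOrCreate()

# Iceberg's rewrite_data_files procedure bin-packs small files into larger ones.
spark.sql("""
    CALL my_catalog.system.rewrite_data_files(
        table   => 'db.events',
        options => map('target-file-size-bytes', '536870912')  -- 512 MB target
    )
""").show()
```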
Gasp0de 3 months ago
Does anyone have a good alternative for storing large amounts of very small files that need to be individually queryable? We are dealing with a large volume of sensor readings that we need to query per sensor and over a timespan, and we are running into the problem mentioned in the article: storing millions of small files in S3 is expensive.
alienreborn 3 months ago
Better article (imo) on a similar topic: https://www.dataengineeringweekly.com/p/is-apache-iceberg-the-new-hadoop
alexmorley 3 months ago
Most of these issues will ring true for lots of folks using Iceberg at the moment. But this does not:

> Yet, competing table formats like Delta Lake and Hudi mirror this fragmentation. [ ... ] Just as Spark emerged as the dominant engine in the Hadoop ecosystem, a dominant table format and catalog may appear in the Iceberg era.

I think extremely few people are making bets on any other open-source table format now; that consolidation already happened in 2023-2024 (see e.g. Databricks, who have their own competing format, leaning heavily into Iceberg, or adoption from all of the major data warehouse providers).
datax2 3 months ago
"Hadoop's meteoric rise led many organizations to implement it without understanding its complexities, often resulting in underutilized clusters or over-engineered architectures. Iceberg is walking a similar path."

This pain is too real, and too close to home. I've seen this outcome turn the entire business off of consuming their data via Hadoop, because it turns into a wasteland of delayed deliveries, broken datasets, ops teams who cannot scale, and architects overselling overly robust designs.

I've tried to scale Hadoop down to the business user with visual ETL tools like Alteryx, but there again compatibility between Alteryx and Hadoop sucks via ODBC connectors. I came from an AWS-based stack into a poorly leapfrogged data stack, and it's hard not to pull my hair out between the business struggling to use it and infra + ops not keeping up. Now these teams want to push to Iceberg or BigQuery while ignoring the mountains of tech debt they have created.

Don't get me wrong, Hadoop isn't a bad idea; it's just complex and a time suck, and unless you have time to dedicate to properly deploying these solutions, which most businesses do not, your implementation will suffer and your business will suffer.

"While the parallels to Hadoop are striking, we also have the opportunity to avoid its pitfalls." No one in IT learns from their failures unless they are writing the checks; most will flip before they feel the pain.
zhousun 3 months ago
The only part of the data stack Iceberg (or the lakehouse) will never replace is OLTP systems; for high-concurrency updates, optimistic concurrency control on top of an object store is simply a no-go.

Iceberg out of the box is not good at streaming use cases: unlike formats such as Hudi or Paimon, the table format has no concept of a merge or an index. However, the beauty of Iceberg is that it is very unopinionated, so it is indeed possible to design an engine that stream-writes to Iceberg. As far as I know, this is how engines like Upsolver were implemented: 1. Keep an in-memory buffer that tracks incoming rows before flushing a version to Iceberg (every 10 seconds to a few minutes). 2. Build an indexing structure to write position deletes/deletion vectors instead of equality deletes. 3. Have the writer also try to merge small files and optimize the table.

And stay tuned: we at https://www.mooncake.dev/ are working on a solution to mirror a Postgres table to Iceberg and keep them always up to date.
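A minimal sketch of the buffer-then-flush pattern in step 1, using pyiceberg; the flush interval, catalog configuration, and table name are assumptions, and real engines layer the indexing and delete handling of steps 2 and 3 on top.

```python
import time
import pyarrow as pa
from pyiceberg.catalog import load_catalog

FLUSH_INTERVAL_S = 10  # assumed from the comment's "every 10s to a few minutes"

catalog = load_catalog("default")          # assumes catalog config in ~/.pyiceberg.yaml
table = catalog.load_table("demo.events")  # hypothetical existing table

buffer: list[dict] = []
last_flush = time.monotonic()

def write(row: dict) -> None:
    """Buffer rows in memory; commit one Iceberg snapshot per flush interval."""
    global last_flush
    buffer.append(row)
    if time.monotonic() - last_flush >= FLUSH_INTERVAL_S:
        # Each flush is a single append commit, i.e. one new table snapshot.
        table.append(pa.Table.from_pylist(buffer, schema=table.schema().as_arrow()))
        buffer.clear()
        last_flush = time.monotonic()
```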
orthoxerox 3 months ago
I think the complexity of Iceberg is overblown. It's just a table format, and it's strictly better than the Hive-style /schema/table/partition_key=partition_value/one_of_many_files.parquet.

It has a lot of knobs to fiddle with (more than Delta Lake, which tries very hard to come up with good defaults), but even if you don't touch any of them, you already end up with tables that are as good as Hive's, except now your writers don't break your readers.

This alone is a massive boon that lets you escape the rigidity of a timetable schedule for your data pipelines. Anything else you can come up with (switching your table to MOR and rewriting it as a separate step, etc.) is a further improvement.
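A small sketch of the "writers don't break readers" point, assuming a pyiceberg table named `demo.events` already exists (a placeholder): a reader pins a snapshot, so a concurrent append commits a new snapshot without disturbing it.

```python
import pyarrow as pa
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")          # assumed catalog configuration
table = catalog.load_table("demo.events")  # hypothetical existing table

# A reader pins the current snapshot; its view is stable from here on.
snapshot_id = table.current_snapshot().snapshot_id

# A writer commits new data meanwhile; this creates a *new* snapshot
# rather than mutating files the reader is scanning.
table.append(pa.Table.from_pylist([{"id": 99, "event": "late"}],
                                  schema=table.schema().as_arrow()))

# The reader still sees exactly the data as of its pinned snapshot.
old_view = table.scan(snapshot_id=snapshot_id).to_arrow()
print(old_view.num_rows)
```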
paulsutter 3 months ago
Does this feel about 3x too verbose, like it’s generated?
theyinwhy 3 months ago
What's a good alternative? Google BigQuery?