Apache Iceberg

207 points by jacobmarble 4 months ago

20 comments

If you're looking to give Iceberg a spin, here's how to get it running locally, on AWS[0] and on GCP[1]. The posts use DuckDB as the query engine, but you could swap in Trino (or even chdb / ClickHouse).

[0] - https://www.definite.app/blog/cloud-iceberg-duckdb-aws

[1] - https://www.definite.app/blog/cloud-iceberg-duckdb

dm03514 4 months ago

I think Iceberg solves a lot of big data problems for handling huge amounts of data on blob storage, including partitioning, compaction and ACID semantics.

I really like the way the catalog standard can decouple underlying storage as well.

My biggest concern is how inaccessible the implementations are: Java / Spark has the only mature implementation right now. Even DuckDB doesn't support writing yet.

I built out a tool to stream data to Iceberg which uses the Python Iceberg client:

https://www.linkedin.com/pulse/streaming-iceberg-using-sqlflow-turbolytics-d71pe/

gopalv 4 months ago

Hidden partitioning is the most interesting Iceberg feature, because most of the very large datasets are timeseries fact tables.

I don't remember seeing that in Delta Lake [1], which is probably because the industry standard benchmarks use date as a column (TPC-H) or join date as a dimension table (TPC-DS) and do not use timestamp ranges instead of dates.

[1] - https://github.com/delta-io/delta/issues/490
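
To make the idea concrete, here is a minimal Python sketch of hidden partitioning, assuming a day-granularity transform on a timestamp column. It is not Iceberg's implementation: `day_transform`, `write_rows`, `scan` and the `event_ts` column are hypothetical names used only for illustration. The point is that the writer and the reader only ever see the timestamp, while the partition value is derived from it and used to skip files.

```python
from collections import defaultdict
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def day_transform(ts: datetime) -> int:
    """Map a timestamp to whole days since the Unix epoch (hypothetical helper)."""
    return (ts - EPOCH).days


def write_rows(rows):
    """Group rows into per-day 'files', keyed by the derived partition value."""
    files = defaultdict(list)
    for row in rows:
        files[day_transform(row["event_ts"])].append(row)
    return dict(files)


def scan(files, start: datetime, end: datetime):
    """Return rows with start <= event_ts < end, skipping days outside the range."""
    first_day, last_day = day_transform(start), day_transform(end)
    # Pruning step: the timestamp predicate is rewritten as a day-range check.
    scanned = [day for day in files if first_day <= day <= last_day]
    rows = [r for day in scanned for r in files[day] if start <= r["event_ts"] < end]
    return rows, scanned


def utc(*args):
    """Build a timezone-aware UTC datetime."""
    return datetime(*args, tzinfo=timezone.utc)


rows = [
    {"id": 1, "event_ts": utc(2024, 1, 1, 23, 30)},
    {"id": 2, "event_ts": utc(2024, 1, 2, 0, 15)},
    {"id": 3, "event_ts": utc(2024, 1, 3, 12, 0)},
]
files = write_rows(rows)
matched, scanned_days = scan(files, utc(2024, 1, 1, 23, 0), utc(2024, 1, 2, 1, 0))
print([r["id"] for r in matched], len(scanned_days), "of", len(files), "files scanned")
```

The query filters on a two-hour timestamp range that crosses midnight, and no separate date column is ever mentioned by the caller.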

teleforce 4 months ago

Apache Iceberg is one of the emerging Open Table Formats in addition to Delta Lake and Apache Hudi [1].

[1] Open Table Formats: https://www.starburst.io/data-glossary/open-table-formats/

pradeepchhetri 4 months ago

ClickHouse has a solid Iceberg integration. It has an Iceberg table function[0] and Iceberg table engine[1] for interacting with Iceberg data stored in S3, GCS, Azure, Hadoop, etc.

[0] https://clickhouse.com/docs/en/sql-reference/table-functions/iceberg

[1] https://clickhouse.com/docs/en/engines/table-engines/integrations/iceberg

volderette 4 months ago

How do you query your Iceberg tables? We are looking into moving away from BigQuery and StarRocks [1] looks like a good option.

[1] https://www.starrocks.io/

crorella 4 months ago

What I like about Iceberg is that the partitions of a table are not tightly coupled to the subfolder structure of the storage layer. At the end of the day the partitions are still subfolders with files, but the metadata is not tied to that layout, so you can change a table's partitioning going forward and still query a mix of old and new partition time ranges.

On the other hand, since one of the use cases it was created for at Netflix was consuming directly from real-time systems, managing file creation when data is updated is less trivial (the CoW vs MoR problem and how to compact small files). This becomes important on multi-petabyte tables with lots of users and frequent updates. It is something I assume not a lot of companies pay much attention to (heck, not even Netflix), and it has big performance and cost implications.
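
A rough sketch of why changing the partitioning does not break old data, assuming each data file records which partition spec it was written under. This is an illustration rather than Iceberg's actual metadata model: the `DataFile` class, the `TRANSFORMS` table, the `plan` function and the file paths are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical transforms; a real table format defines its own set.
TRANSFORMS = {
    "month": lambda d: (d.year, d.month),
    "day": lambda d: (d.year, d.month, d.day),
}


@dataclass
class DataFile:
    path: str
    spec: str          # partition spec the file was written under
    partition: tuple   # partition value under that spec


def plan(files, start: date, end: date):
    """Keep files whose partition value can overlap [start, end], per spec."""
    selected = []
    for f in files:
        transform = TRANSFORMS[f.spec]
        # The same date predicate is projected through each file's own transform.
        if transform(start) <= f.partition <= transform(end):
            selected.append(f.path)
    return selected


files = [
    DataFile("old/2023-11.parquet", "month", (2023, 11)),
    DataFile("old/2023-12.parquet", "month", (2023, 12)),
    DataFile("new/2024-01-01.parquet", "day", (2024, 1, 1)),
    DataFile("new/2024-01-02.parquet", "day", (2024, 1, 2)),
]
selected = plan(files, date(2023, 12, 20), date(2024, 1, 1))
print(selected)
```

Here the table switched from monthly to daily partitions, and one date-range query still prunes both the old and the new files without rewriting anything.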

varsketiz 4 months ago

I'm somewhat surprised to see it here. Iceberg has been around for some time already.

nikolatt 4 months ago

I've been looking at Iceberg for a while, but in the end went with Delta Lake because it doesn't have a dependency on a catalog. It also has good support for reading and writing without needing Spark.

Does anyone know if Iceberg has plans to support similar use cases?

apwell23 4 months ago

I am a stockholder in Snowflake, and Iceberg's ascendance seems to coincide with SNOW's downfall.

Does the query engine's value-add justify Snowflake's valuation? Their data marketplace thing doesn't seem to have actually worked.

mkl95 4 months ago

Iceberg on S3 tables is going to be a hot topic in the next few years.

npalli 4 months ago

Are there robust non-JVM implementations of Iceberg currently? Sorry to say, but recommending JVM ecosystems around large data just feels like professional malpractice at this point. Whether it's deployment complexity, resource overhead, tool sprawl or operational complexity, the ecosystem seems to attract people who solve only 50% of the problem and have another tool to solve the rest, which in turn only solves 50%, etc., ad infinitum. The popularity of solutions like Snowflake, ClickHouse, or DuckDB is not an accident and is the direction everything should go. I hear Snowflake will adopt this in the future; that is good news.

rdegges 4 months ago

OneHouse also has a fantastic Iceberg implementation (they're the team behind Apache Hudi) and does a ton of great interop work: https://www.onehouse.ai/blog/comprehensive-data-catalog-comparison and https://www.onehouse.ai/blog/open-data-foundations-with-apache-xtable-hudi-delta-and-iceberg-interoperability

chehai 4 months ago

To get good query performance from Iceberg, we have to run compaction frequently, and compaction turns out to be very expensive. Any tips for minimizing compaction while keeping queries fast?
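
For context on where the cost comes from: one common way to bound it is to rewrite only files below a size threshold, bin-pack them toward a target size, and skip groups with too few inputs to be worth the I/O. Below is a minimal Python sketch of that planning step. The function name, parameter names and thresholds are hypothetical and are not taken from any Iceberg API; the engine you use exposes its own options for this.

```python
def plan_compaction(file_sizes_mb, target_mb=256, small_mb=128, min_inputs=3):
    """Bin-pack small files into rewrite groups; leave everything else alone."""
    # Only files under the small-file threshold are candidates for rewriting.
    small = sorted((s for s in file_sizes_mb if s < small_mb), reverse=True)
    groups = []
    for size in small:
        # First-fit decreasing: place the file in the first group with room.
        for group in groups:
            if sum(group) + size <= target_mb:
                group.append(size)
                break
        else:
            groups.append([size])
    # Rewriting a group with very few inputs costs I/O for little benefit.
    return [g for g in groups if len(g) >= min_inputs]


sizes = [600, 500, 100, 90, 80, 60, 40, 20, 10]
groups = plan_compaction(sizes)
print(groups)
```

In this example the two large files are never touched, so the rewrite reads and writes 400 MB instead of the full 1,500 MB.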

vonnik 4 months ago

Curious to what extent Iceberg enables data composability and what the best complements and alternatives are.

jmakov 4 months ago

Why would one choose this instead of Delta Lake?

jeffhuys 4 months ago

Looks good, but come on… at least try to open your website on a mobile device.

dangoodmanUT 4 months ago

Iceberg is plagued by the problems it tries to solve, like being too tied to Spark just to write data.

honestSysAdmin 4 months ago

Iceberg is a pretty cool guy, he consolidates the Parquet and doesn't afraid of anything.

rubenvanwyk 4 months ago

And yet there's still no straightforward way to write directly to Iceberg tables from JavaScript, as far as I know.
