Debezium to olake.io – PhysicsWallah switch for CDC

3 points by pkhodiyar 22 days ago
We recently hosted a small online meetup at OLake where a Data Engineer at PhysicsWallah walked through why his team dropped Debezium and moved to OLake’s “MongoDB → Iceberg” pipeline.

Video (29 min): https://www.youtube.com/watch?v=qqtE_BrjVkM

If you prefer text, here’s the quick TL;DR.

Why Debezium became a drag for them:
1. Long full loads on multi-million-row MongoDB collections, and any failure meant restarting from scratch
2. Kafka and Connect infrastructure felt heavy when the end goal was “Parquet/Iceberg on S3”
3. Handling heterogeneous arrays required custom SMTs
4. Continuous streaming only; they still had to glue together ad-hoc batch pulls for some workflows
5. Ongoing schema drift demanded extra code to keep Iceberg tables aligned

What changed with OLake?
-> Writes directly from MongoDB (and friends) into Apache Iceberg, no message broker in between
-> Two modes: full load for the initial dump, then CDC for ongoing changes, exposed by a single flag in the job config
-> Automatic schema evolution: new MongoDB fields appear as nullable columns; complex sub-docs land as JSON strings you can parse later
-> Resumable, chunked full loads: after a pod crash the load resumes instead of restarting
-> Runs as either a Kubernetes CronJob or an Airflow task; config is one YAML/JSON file

Their stack in one line: MongoDB → OLake writer → Iceberg on S3 → Spark jobs → Trino / occasional Redshift, all orchestrated by Airflow and/or K8s.

Posting here because many of us still bolt Kafka onto CDC just to land files. If you only need Iceberg tables, a simpler path might exist now. Curious to hear others’ experiences with broker-less CDC tools.

(Disclaimer: I work on OLake and hosted the meetup, but the talk is purely technical.)

Check out the GitHub repo: https://github.com/datazip-inc/olake
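To make the schema-evolution point a bit more concrete, here is a rough Python sketch of the behavior described above: unseen fields become new nullable columns, and nested sub-docs or arrays are stored as JSON strings. This is not OLake's code or API. The function names (`flatten_document`, `evolve_schema`, `align_row`) and the sample documents are made up for illustration, and a plain list of column names stands in for an Iceberg schema.

```python
import json
from typing import Any


def flatten_document(doc: dict[str, Any]) -> dict[str, Any]:
    """Turn one MongoDB-style document into a flat row.

    Scalars pass through unchanged; nested dicts and lists are
    serialized to JSON strings so they can be parsed later.
    """
    row: dict[str, Any] = {}
    for key, value in doc.items():
        if isinstance(value, (dict, list)):
            row[key] = json.dumps(value, sort_keys=True)
        else:
            row[key] = value
    return row


def evolve_schema(schema: list[str], row: dict[str, Any]) -> list[str]:
    """Append any field not seen before as a new column (hypothetical helper)."""
    return schema + [key for key in row if key not in schema]


def align_row(schema: list[str], row: dict[str, Any]) -> dict[str, Any]:
    """Project a row onto the schema; columns the row lacks become None (null)."""
    return {column: row.get(column) for column in schema}


# Illustrative sample data: the second document introduces two new fields.
docs = [
    {"_id": 1, "name": "a"},
    {"_id": 2, "name": "b", "tags": ["x", 1], "address": {"city": "Pune"}},
]

schema: list[str] = []
rows = []
for doc in docs:
    row = flatten_document(doc)
    schema = evolve_schema(schema, row)
    rows.append(row)

# Older rows are back-filled with nulls for columns added later.
table = [align_row(schema, row) for row in rows]
```

In this sketch the first row ends up with `None` in the `tags` and `address` columns, which is why newly added columns have to be nullable: rows written before the field existed have no value for it.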

1 comment

R_khameshra 21 days ago
This looks great!! Do you guys support schema evolution as well? Syncing MongoDB data comes with that challenge.