Debezium to olake.io – PhysicsWallah switch for CDC

3 points by pkhodiyar 22 days ago
We recently hosted a small online meetup at OLake where a Data Engineer at PhysicsWallah walked through why his team dropped Debezium and moved to OLake’s “MongoDB → Iceberg” pipeline.

Video (29 min): https://www.youtube.com/watch?v=qqtE_BrjVkM

If you prefer text, here’s the quick TL;DR.

Why Debezium became a drag for them:
1. Long full loads on multi-million-row MongoDB collections, and any failure meant restarting from scratch
2. Kafka and Connect infrastructure felt heavy when the end goal was “Parquet/Iceberg on S3”
3. Handling heterogeneous arrays required custom SMTs
4. Continuous streaming only; they still had to glue together ad-hoc batch pulls for some workflows
5. Ongoing schema drift demanded extra code to keep Iceberg tables aligned

What changed with OLake?
-> Writes directly from MongoDB (and friends) into Apache Iceberg, no message broker in between
-> Two modes: full load for the initial dump, then CDC for ongoing changes, selected by a single flag in the job config
-> Automatic schema evolution: new MongoDB fields appear as nullable columns; complex sub-docs land as JSON strings you can parse later
-> Resumable, chunked full loads: after a pod crash, the load resumes instead of restarting
-> Runs as either a Kubernetes CronJob or an Airflow task; config is one YAML/JSON file

Their stack in one line: MongoDB → OLake writer → Iceberg on S3 → Spark jobs → Trino / occasional Redshift, all orchestrated by Airflow and/or K8s.

Posting here because many of us still bolt Kafka onto CDC just to land files. If you only need Iceberg tables, a simpler path might exist now. Curious to hear others’ experiences with broker-less CDC tools.

(Disclaimer: I work on OLake and hosted the meetup, but the talk is purely technical.)

Check out the GitHub repo: https://github.com/datazip-inc/olake
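For readers who want to picture the schema-evolution behavior described above (new fields become nullable columns, complex sub-docs land as JSON strings), here is a rough sketch in plain Python. This is not OLake's code or API; the function names `flatten_document`, `evolve_schema`, and `align_rows` are made up for illustration, and a real writer would also have to track column types and commit schema changes to Iceberg.

```python
import json
from typing import Any, Dict, List, Tuple


def flatten_document(doc: Dict[str, Any]) -> Dict[str, Any]:
    """Map one MongoDB-style document to a flat row.

    Scalar values pass through unchanged; nested dicts and lists are
    serialized to JSON strings so they can be parsed later downstream.
    """
    row: Dict[str, Any] = {}
    for key, value in doc.items():
        if isinstance(value, (dict, list)):
            row[key] = json.dumps(value, sort_keys=True)
        else:
            row[key] = value
    return row


def evolve_schema(schema: List[str], row: Dict[str, Any]) -> List[str]:
    """Return the schema extended with any columns first seen in this row."""
    known = set(schema)
    return schema + [col for col in row if col not in known]


def align_rows(docs: List[Dict[str, Any]]) -> Tuple[List[str], List[Dict[str, Any]]]:
    """Flatten all documents and align them to one evolving schema.

    Columns missing from a document are filled with None, which is the
    "nullable column" behavior in this toy model.
    """
    schema: List[str] = []
    rows: List[Dict[str, Any]] = []
    for doc in docs:
        row = flatten_document(doc)
        schema = evolve_schema(schema, row)
        rows.append(row)
    aligned = [{col: row.get(col) for col in schema} for row in rows]
    return schema, aligned


if __name__ == "__main__":
    # Hypothetical sample documents; the second one introduces new fields.
    sample = [
        {"_id": 1, "name": "a"},
        {"_id": 2, "name": "b", "tags": ["x", 1], "address": {"city": "Pune"}},
    ]
    columns, table = align_rows(sample)
    print(columns)
    for record in table:
        print(record)
```

The point of the sketch is only that older rows do not need rewriting when a new field shows up: they simply read as null in the new column, and the heterogeneous array stays intact as a JSON string.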

1 comment

R_khameshra 20 days ago
This looks great!! Do you guys support schema evolution as well? Syncing MongoDB data comes with that challenge.