We recently hosted a small online meetup at OLake where a Data Engineer at PhysicsWallah walked through why his team dropped Debezium and moved to OLake’s “MongoDB → Iceberg” pipeline.<p>Video (29 min): https://www.youtube.com/watch?v=qqtE_BrjVkM<p>If you prefer text, here’s the quick TL;DR.<p>Why Debezium became a drag for them:
1. Long full loads on multi-million-row MongoDB collections, and any failure meant restarting from scratch
2. Kafka and Connect infrastructure felt heavy when the end goal was “Parquet/Iceberg on S3”
3. Handling heterogeneous arrays required custom SMTs
4. Continuous streaming only; they still had to glue together ad-hoc batch pulls for some workflows
5. Ongoing schema drift demanded extra code to keep Iceberg tables aligned<p>What changed with OLake?
-> Writes directly from MongoDB (and friends) into Apache Iceberg, no message broker in between<p>-> Two modes: full load for the initial dump, then CDC for ongoing changes — exposed by a single flag in the job config
-> Automatic schema evolution: new MongoDB fields appear as nullable columns; complex sub-docs land as JSON strings you can parse later<p>-> Resumable, chunked full loads: a pod crash resumes instead of restarting<p>-> Runs as either a Kubernetes CronJob or an Airflow task; config is one YAML/JSON file.<p>Their stack in one line: MongoDB → OLake writer → Iceberg on S3 → Spark jobs → Trino / occasional Redshift, all orchestrated by Airflow and/or K8s.<p>Posting here because many of us still bolt Kafka onto CDC just to land files. If you only need Iceberg tables, a simpler path might exist now. Curious to hear others’ experiences with broker-less CDC tools.<p>(Disclaimer: I work on OLake and hosted the meetup, but the talk is purely technical.)<p>GitHub repo: https://github.com/datazip-inc/olake
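<p>P.S. for anyone curious what the “sub-docs land as JSON strings” behavior means downstream: here’s a rough Python sketch (the record and field names like "address" are made up for illustration; in their stack this parsing would typically happen in a Spark or Trino job over the Iceberg table):

```python
import json

# Hypothetical row read back from the Iceberg table: top-level MongoDB
# fields became ordinary columns, while the nested sub-document was
# serialized into a JSON string column instead of forcing a rigid schema.
row = {
    "_id": "64f1c2a0e4b0a1b2c3d4e5f6",
    "name": "Asha",
    "address": '{"city": "Bengaluru", "pin": "560001"}',
}

# When you actually need the nested fields, parse the JSON string back
# into a structure on the read side.
address = json.loads(row["address"])
print(address["city"])  # Bengaluru
```

The upside of this approach is that heterogeneous or drifting sub-documents never break ingestion; the cost is that nested fields need an explicit parse step at query time.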