We recently hosted a small online meetup at OLake where a Data Engineer at PhysicsWallah walked through why his team dropped Debezium and moved to OLake’s “MongoDB → Iceberg” pipeline.<p>Video (29 min): https://www.youtube.com/watch?v=qqtE_BrjVkM<p>If you prefer text, here’s the quick TL;DR.<p>Why Debezium became a drag for them:
1. Long full loads on multi-million-row MongoDB collections, and any failure meant restarting from scratch
2. Kafka and Connect infrastructure felt heavy when the end goal was “Parquet/Iceberg on S3”
3. Handling heterogeneous arrays required custom SMTs
4. Continuous streaming only; they still had to glue together ad-hoc batch pulls for some workflows
5. Ongoing schema drift demanded extra code to keep Iceberg tables aligned<p>What changed with OLake?
-> Writes directly from MongoDB (and friends) into Apache Iceberg, no message broker in between<p>-> Two modes: full load for the initial dump, then CDC for ongoing changes — exposed by a single flag in the job config
-> Automatic schema evolution: new MongoDB fields appear as nullable columns; complex sub-docs land as JSON strings you can parse later<p>-> Resumable, chunked full loads: a pod crash resumes instead of restarting<p>-> Runs as either a Kubernetes CronJob or an Airflow task; config is one YAML/JSON file.<p>Their stack in one line: MongoDB → OLake writer → Iceberg on S3 → Spark jobs → Trino / occasional Redshift, all orchestrated by Airflow and/or K8s.<p>Posting here because many of us still bolt Kafka onto CDC just to land files. If you only need Iceberg tables, a simpler path might exist now. Curious to hear others’ experiences with broker-less CDC tools.<p>(Disclaimer: I work on OLake and hosted the meetup, but the talk is purely technical.)<p>GitHub repo: https://github.com/datazip-inc/olake
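<p>P.S. for anyone curious what the “sub-docs land as JSON strings” behavior means downstream: here’s a rough Python sketch (the record and field names like "address" are made up for illustration; in their stack this parsing would typically happen in a Spark or Trino job over the Iceberg table):

```python
import json

# Hypothetical row read back from the Iceberg table: top-level MongoDB
# fields became ordinary columns, while the nested sub-document was
# serialized into a JSON string column instead of forcing a rigid schema.
row = {
    "_id": "64f1c2a0e4b0a1b2c3d4e5f6",
    "name": "Asha",
    "address": '{"city": "Bengaluru", "pin": "560001"}',
}

# When you actually need the nested fields, parse the JSON string back
# into a structure on the read side.
address = json.loads(row["address"])
print(address["city"])  # Bengaluru
```

The upside of this approach is that heterogeneous or drifting sub-documents never break ingestion; the cost is that nested fields need an explicit parse step at query time.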