Debezium is a useful tool, but it requires a lot of babysitting. If the DB connection blips or DNS changes (say, because you just rebuilt your prod db), or any of several other hiccups happen, it'll die and present this exact problem. Fortunately, it's easy to enable a "heartbeat" topic to alert on, so the connector can be restarted before the db disk fills (of course, db size growth alerts are critical too).

We've found that for most use cases it's worth switching to a vanilla JDBC Kafka Connector with frequent polling. That also covers cases such as emitting joined data.

Debezium aside, Postgres + Kafka + Kafka Connect makes a pretty stable system for sending data around all our different dbs, apps, and data lakes.
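For anyone curious what the heartbeat setup looks like, here is a rough sketch of registering a Debezium Postgres connector with a heartbeat via the Kafka Connect REST API. The connector name, hostnames, credentials, and the Connect URL are placeholders, and you should verify the exact config keys against your Debezium version; heartbeat.interval.ms is the property that turns the heartbeat on.

    # Sketch: register a Debezium Postgres connector with a heartbeat enabled,
    # using the Kafka Connect REST API. All names/credentials are placeholders.
    import requests

    connector = {
        "name": "orders-cdc",  # hypothetical connector name
        "config": {
            "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
            "database.hostname": "prod-db.internal",
            "database.port": "5432",
            "database.user": "debezium",
            "database.password": "********",
            "database.dbname": "orders",
            "topic.prefix": "orders",
            # Emit a heartbeat event every 60s. Alert when the heartbeat topic
            # goes quiet, then restart the connector before WAL piles up.
            "heartbeat.interval.ms": "60000",
        },
    }

    resp = requests.post("http://kafka-connect:8083/connectors", json=connector)
    resp.raise_for_status()

The alert side is then just "no messages on the heartbeat topic for N minutes", which any Kafka liveness/lag monitor can express.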
The write activity every 5 minutes is standard Postgres checkpointing; the default value for checkpoint_timeout is 5 minutes. This is not limited to RDS.

Background processes like vacuum and analyze also write to WAL.
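If you want to confirm that on a given instance, a quick sketch (psycopg2 assumed; the connection string is a placeholder):

    # Check the checkpoint-related settings behind that 5-minute write pattern.
    import psycopg2

    conn = psycopg2.connect("host=localhost dbname=postgres user=postgres")
    with conn, conn.cursor() as cur:
        cur.execute("""
            SELECT name, setting, unit
            FROM pg_settings
            WHERE name IN ('checkpoint_timeout',
                           'max_wal_size',
                           'checkpoint_completion_target')
        """)
        for name, setting, unit in cur.fetchall():
            print(f"{name} = {setting} {unit or ''}")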
Had this exact thing happen in production when we turned off an audit DB replication slot. We got lucky and caught it before our entire app went down. It’s one of the many foot-guns we have found with Postgres.
I heard an interesting comment recently from Derek Collison (creator of NATS[1]) that durability and delivery requirements can have the unwanted side-effect that one consumer can adversely impact all the others. It didn’t immediately make sense then, but this seems like a succinct illustration of the point!

[1] https://NATS.io
We ran into this too, and I actually think it's a terrible Postgres default. Logical replication slots should have timeouts: if nothing has read from the slot in, say, 24 hours, it should be dropped. Make it configurable, set a sane default, problem solved.

You'd have to resync the followers/secondaries, but that's a small price; it's way better than the primary going down because its disk filled up. This failure mode is awful. On RDS it's relatively painless because you can snap your fingers and have more disk, but if you are running it yourself? Good luck.

In practice, MongoDB's oplog mechanism, for example, which acts as a circular buffer with a set size, is a much more tolerant implementation. If the oplog rolls over before you've read it, you just resync, but at most it's taken up 10% of your disk.
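In the spirit of that proposal, here is a rough sketch of a cron-able watchdog. Since retained WAL size is easy to query from pg_replication_slots, this version uses a size threshold rather than an age threshold; psycopg2 is assumed, and the connection string and 50 GB limit are placeholders.

    # Watchdog sketch: drop inactive logical replication slots that are
    # retaining too much WAL, rather than letting the primary's disk fill up.
    import psycopg2

    MAX_RETAINED_BYTES = 50 * 1024**3  # 50 GB, tune to your disk budget

    conn = psycopg2.connect("host=localhost dbname=postgres user=postgres")
    with conn, conn.cursor() as cur:
        cur.execute("""
            SELECT slot_name,
                   active,
                   pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes
            FROM pg_replication_slots
            WHERE slot_type = 'logical'
        """)
        for slot_name, active, retained in cur.fetchall():
            if not active and retained is not None and retained > MAX_RETAINED_BYTES:
                # Dropping the slot forces the consumer to resync, but the
                # primary stays up. That is the trade-off argued for above.
                cur.execute("SELECT pg_drop_replication_slot(%s)", (slot_name,))
                print(f"dropped {slot_name}, was retaining {retained} bytes")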
I thought it was going to be a feedback loop, where the write at the end would trigger another Kafka message. I was pleasantly surprised it wasn't.
Just wanted to say thanks for this article. I have been exploring Debezium for capturing Pg changes at work. I don’t know a whole lot about Pg replication and it’s nice to hear the potential gotchas before moving to anything production-like.