Debezium is a useful tool, but it requires a lot of babysitting. If the DB connection blips or DNS changes (say, because you just rebuilt your prod db), or any of several other hiccups happen, it'll die and present this exact problem. Fortunately, it's easy to enable a "heartbeat" topic to alert on, so the connector can be restarted before the db disk fills (of course, db size growth alerts are critical too).

We've found that for most use cases it's worth switching to a vanilla JDBC Kafka Connector with frequent polling. That also covers cases such as emitting joined data.

Debezium aside, Postgres + Kafka + Kafka Connect makes a pretty stable system for sending data around all our different dbs, apps, and data lakes.
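For anyone curious what the heartbeat setup looks like, here is a rough sketch of registering a Debezium Postgres connector with a heartbeat via the Kafka Connect REST API. The connector name, hostnames, credentials, and the Connect URL are placeholders, and you should verify the exact config keys against your Debezium version; heartbeat.interval.ms is the property that turns the heartbeat on.

    # Sketch: register a Debezium Postgres connector with a heartbeat enabled,
    # using the Kafka Connect REST API. All names/credentials are placeholders.
    import requests

    connector = {
        "name": "orders-cdc",  # hypothetical connector name
        "config": {
            "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
            "database.hostname": "prod-db.internal",
            "database.port": "5432",
            "database.user": "debezium",
            "database.password": "********",
            "database.dbname": "orders",
            "topic.prefix": "orders",
            # Emit a heartbeat event every 60s. Alert when the heartbeat topic
            # goes quiet, then restart the connector before WAL piles up.
            "heartbeat.interval.ms": "60000",
        },
    }

    resp = requests.post("http://kafka-connect:8083/connectors", json=connector)
    resp.raise_for_status()

The alert side is then just "no messages on the heartbeat topic for N minutes", which any Kafka liveness/lag monitor can express.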
The write activity every 5 minutes is standard Postgres checkpointing; the default value for checkpoint_timeout is 5 minutes. This is not limited to RDS.

Background processes like vacuum and analyze also write to WAL.
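If you want to confirm that on a given instance, a quick sketch (psycopg2 assumed; the connection string is a placeholder):

    # Check the checkpoint-related settings behind that 5-minute write pattern.
    import psycopg2

    conn = psycopg2.connect("host=localhost dbname=postgres user=postgres")
    with conn, conn.cursor() as cur:
        cur.execute("""
            SELECT name, setting, unit
            FROM pg_settings
            WHERE name IN ('checkpoint_timeout',
                           'max_wal_size',
                           'checkpoint_completion_target')
        """)
        for name, setting, unit in cur.fetchall():
            print(f"{name} = {setting} {unit or ''}")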
Had this exact thing happen in production when we turned off an audit DB replication slot. We got lucky and caught it before our entire app went down. It’s one of the many foot-guns we have found with Postgres.
I heard an interesting comment recently from Derek Collison (creator of NATS[1]) that durability and delivery requirements can have the unwanted side-effect that one consumer can adversely impact all the others. It didn’t immediately make sense then, but this seems like a succinct illustration of the point!

[1] https://NATS.io
We ran into this too, and I actually think it's a terrible Postgres default. Logical replication slots should have timeouts: if nothing has read from the slot in, say, 24 hours, it should be dropped. Make it configurable, set a sane default, problem solved.

You'd have to resync the followers/secondaries, but that's a small price; it's way better than the primary going down because its disk filled up. This failure mode is awful. On RDS it's relatively painless because you can snap your fingers and have more disk, but if you are running it yourself? Good luck.

In practice, MongoDB's oplog mechanism, for example, which acts as a circular buffer with a set size, is a much more tolerant implementation. If the oplog rolls over before you've read it, you just resync, but at most it's taken up 10% of your disk.
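In the spirit of that proposal, here is a rough sketch of a cron-able watchdog. Since retained WAL size is easy to query from pg_replication_slots, this version uses a size threshold rather than an age threshold; psycopg2 is assumed, and the connection string and 50 GB limit are placeholders.

    # Watchdog sketch: drop inactive logical replication slots that are
    # retaining too much WAL, rather than letting the primary's disk fill up.
    import psycopg2

    MAX_RETAINED_BYTES = 50 * 1024**3  # 50 GB, tune to your disk budget

    conn = psycopg2.connect("host=localhost dbname=postgres user=postgres")
    with conn, conn.cursor() as cur:
        cur.execute("""
            SELECT slot_name,
                   active,
                   pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes
            FROM pg_replication_slots
            WHERE slot_type = 'logical'
        """)
        for slot_name, active, retained in cur.fetchall():
            if not active and retained is not None and retained > MAX_RETAINED_BYTES:
                # Dropping the slot forces the consumer to resync, but the
                # primary stays up. That is the trade-off argued for above.
                cur.execute("SELECT pg_drop_replication_slot(%s)", (slot_name,))
                print(f"dropped {slot_name}, was retaining {retained} bytes")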
I thought it was going to be a feedback loop, where the write at the end would trigger another Kafka message. I was pleasantly surprised it wasn't.
Just wanted to say thanks for this article. I have been exploring Debezium for capturing Pg changes at work. I don’t know a whole lot about Pg replication and it’s nice to hear the potential gotchas before moving to anything production-like.