科技回声

5 comments

kgeist, over 2 years ago

Oh, it took us around 2 years to get somewhat reliable event dispatching with transactional outboxes. We've run into so many edge cases under high load:

- using the autoincrement ID as a pointer to the "last processed event", not knowing that MySQL reserves IDs non-atomically with inserts, so it's possible for the relay to "see" an event with a bigger ID before an event with a smaller ID becomes visible, thus skipping some events
- implementing event handlers with "only once" semantics instead of "at least once" (retries on error corrupting data)
- event queue processing completely halting for the entire application when an event handler throws an error due to a bug and so gets retried infinitely
- some other race conditions in the implementation where the order of events gets messed up
- too fine-grained events without batching overflowing the queue (it takes too much time for an event to get processed)
- the relay getting OOMs due to excessive batching
- once we had a funny bug where the code that updated the last processed ID of the current DB shard (each client has their own shard) wrote to the wrong shards, so our relay started replaying thousands of events from years ago
- some event handlers always sending mail as part of processing events, so when an event is retried on error or replayed (see the bug above) clients receive the same emails multiple times

And we still sometimes have weird bugs: about once a month, for some reason, we see a random event getting replayed in a deleted account. Still tracking it down.
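Two of the failure modes above (duplicate side effects on redelivery, and one buggy handler stalling the whole queue) can be illustrated with a small sketch. This is not kgeist's implementation; `Event`, `Dispatcher`, `processed_ids`, and `dead_letters` are hypothetical names, and the in-memory set stands in for what would need to be a durable record written together with the handler's effects.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Event:
    id: int
    payload: str


@dataclass
class Dispatcher:
    handler: Callable[[Event], None]
    max_attempts: int = 3
    # Assumption: in a real system this would be a durable table, not a set.
    processed_ids: set = field(default_factory=set)
    dead_letters: list = field(default_factory=list)

    def dispatch(self, event: Event) -> bool:
        """Return True only if the handler ran successfully for this delivery."""
        if event.id in self.processed_ids:
            # Duplicate delivery (retry or replay): skip the side effects.
            return False
        for _ in range(self.max_attempts):
            try:
                self.handler(event)
            except Exception:
                continue  # bounded retry instead of retrying forever
            self.processed_ids.add(event.id)
            return True
        # Park the poison event so later events are not blocked behind it.
        self.dead_letters.append(event)
        return False
```

With this shape, a replayed event whose ID is already recorded does not send a second email, and a handler that always throws ends up in `dead_letters` after `max_attempts` tries while the queue keeps moving. A handler that fails halfway through can still leave partial effects, so the handler itself has to be safe to retry.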

svieira, over 2 years ago

This is interesting, but you've not actually solved the problem, just moved it. You still need cross-service transactions to publish the event only once. Consider the case of "Publish the event to the queue. Fail to update / delete the entry in the event-buffer table." This is the "bad" pattern of push-then-store (with one exception: if you are fine with at-least-once message delivery instead of only-once). Likewise, the "good" pattern of store-then-push has the same failure mode: "Delete the entry from the buffer. Fail to publish the entry to the queue".

That said, this does decouple the two operations, which allows you to scale the publish side of the service separately from the produce side (which can help when your architecture can produce multiple messages per storage event).
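The at-least-once case can be made concrete with a minimal sketch using Python's built-in `sqlite3`. The table names (`orders`, `outbox`), the `sent` flag, and the `publish` callback are assumptions for illustration, not anything from the article. The relay publishes first and marks the row afterwards, so a failure between the two steps causes a duplicate publish rather than a lost event.

```python
import sqlite3


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE outbox ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "payload TEXT NOT NULL, "
        "sent INTEGER NOT NULL DEFAULT 0)"
    )
    return conn


def write_with_outbox(conn, order_id, payload):
    # The state change and the outbox row commit in one local transaction.
    with conn:
        conn.execute("INSERT INTO orders (id) VALUES (?)", (order_id,))
        conn.execute("INSERT INTO outbox (payload) VALUES (?)", (payload,))


def relay_once(conn, publish):
    """Publish every unsent row, then mark it sent. Returns rows seen."""
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE sent = 0 ORDER BY id"
    ).fetchall()
    for row_id, payload in rows:
        publish(payload)  # step 1: push to the queue
        # If the process dies here, sent stays 0 and the next run
        # publishes the same payload again (at-least-once).
        with conn:
            conn.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (row_id,))
    return len(rows)
```

If `publish` succeeds on the broker side but the relay fails before the `UPDATE`, the next `relay_once` call sends the same payload again. Consumers therefore need to tolerate duplicates, which is the exception svieira describes.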

chucke, over 2 years ago

On the topic, I created this Ruby gem called tobox, essentially a transactional outbox framework: https://gitlab.com/os85/tobox

It actually circumvents most of the limitations mentioned in the article. Been successfully using it at work as an SNS relay, for another app which is not even Ruby.

groodt, over 2 years ago

Any thoughts on how to detect direct DML on the state table? Presumably, allowing direct DML on the state table without the same on the outbox table would lead to silent data corruption or lost updates.

mkleczek, over 2 years ago

I must say, not mentioning two-phase commit and distributed transactions in this context seems strange.

It looks like nowadays people have forgotten about XA, and that an MQ and a DBMS can participate in a distributed transaction.
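For readers who have not seen it, the shape of two-phase commit can be sketched with a toy in-memory coordinator. This is not the XA API or any real transaction manager; `Participant` and `two_phase_commit` are hypothetical names, and the sketch ignores coordinator crashes, timeouts, and recovery logs, which are where most of the real difficulty lies.

```python
class Participant:
    """Stand-in for a resource such as a DBMS or a message queue."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "idle"

    def prepare(self):
        # Phase 1 vote: promise to commit, or refuse.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "aborted"


def two_phase_commit(participants):
    """Commit on all participants only if every one votes yes."""
    # Phase 1: collect votes (stops at the first refusal).
    if all(p.prepare() for p in participants):
        # Phase 2a: unanimous yes, so everyone commits.
        for p in participants:
            p.commit()
        return True
    # Phase 2b: at least one refusal, so everyone rolls back.
    for p in participants:
        p.rollback()
    return False
```

Called with a database and a queue that both vote yes, both end up `committed`. If either refuses, both end up `aborted`, which is the all-or-nothing property the outbox pattern approximates with a local transaction plus a relay.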

Reliable event dispatching using a transactional outbox