Distributed transactions in Go: Read before you try

98 points | by roblaszczak | 8 months ago

17 comments

ekidd, 8 months ago
Let's assume you're not a FAANG, and you don't have a billion customers.

If you're gluing microservices together using distributed transactions (or durable event queues plus eventual consistency, or whatever), the odds are good that you've gone far down the wrong path.

For many applications, it's easiest to start with a modular monolith talking to a shared database, one that natively supports transactions. When this becomes too expensive to scale, the next step may be sharding your backend. (It depends on whether you have a system where users mostly live in their silos, or where everyone talks to everyone. If your users are siloed, you can shard at almost any scale.)

Microservices make sense when they're "natural". A video encoder makes a great microservice. So does a map tile generator.

Distributed systems are *expensive* and complicated, and they kill your team's development velocity. I've built several of them. Sometimes, they turned out to be serious mistakes.

As a rule of thumb:

1. Design for 10x your current scale, not 1000x. 10x your scale allows for 3 consecutive years of 100% growth before you need to rebuild. Designing for 1000x your scale usually means you're sacrificing development velocity to cosplay as a FAANG.

2. You will want transactions in places that you didn't expect.

3. If you need transactions between two microservices, strongly consider merging them and having them talk to the same database.

Sometimes you'll have no better choice than to use distributed transactions or durable event queues. They are inherent in some problems. But they should be treated as a giant flashing "danger" sign.
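
A minimal Go sketch of rule 3, under the assumption that the two services have been merged and share one database (the users/orders tables and the points column are illustrative, not from the article): once both writes sit behind one local transaction, either both commit or neither does, and the whole saga/outbox/2PC question disappears.

    package orders

    import (
        "context"
        "database/sql"
        "fmt"
    )

    // placeOrder updates the user's points and records the order in ONE local
    // database transaction. With both tables in the same database there is no
    // distributed coordination to do.
    // (Postgres-style $n placeholders; the schema is made up for illustration.)
    func placeOrder(ctx context.Context, db *sql.DB, userID int64, amount int) error {
        tx, err := db.BeginTx(ctx, nil)
        if err != nil {
            return err
        }
        defer tx.Rollback() // no-op if Commit succeeds

        if _, err := tx.ExecContext(ctx,
            `UPDATE users SET points = points + $1 WHERE id = $2`, amount, userID); err != nil {
            return fmt.Errorf("update points: %w", err)
        }
        if _, err := tx.ExecContext(ctx,
            `INSERT INTO orders (user_id, amount) VALUES ($1, $2)`, userID, amount); err != nil {
            return fmt.Errorf("insert order: %w", err)
        }
        return tx.Commit()
    }
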
alphazard, 8 months ago
The best advice (organizationally) is to just do everything in a single transaction on top of Postgres or MySQL for as long as possible. This produces no cognitive overhead for the developers.

Sometimes that doesn't deliver enough performance and you need to involve another datastore (or the same datastore across multiple transactions). At that point eventual consistency is a good strategy, much less complicated than distributed transactions. This adds a significant tax to all of your work though. Now everyone has to think through all the states, and additionally design a background process to drive the eventual consistency. Do you have a process in place to ensure all your developers are getting this right for every feature? Is your answer code review? Are you sure there's always enough time to re-do the implementation, and you'll never be forced to merge inconsistent "good enough" behavior?

And the worst option (organizationally) is distributed transactions, which basically means a small group of talented engineers can't work on other things, needs to be consulted for every new service and most new features, and has to maintain the clients and server for the thumbs up/down system.

If you make it hard to do stuff, then people will either 1. do less stuff, or 2. do the same amount of stuff, but badly.
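
As a rough illustration of the "background process to drive the eventual consistency" mentioned above (none of this is from the article; the orders table, its state column, batch size, and interval are assumptions), the recurring tax looks something like this in Go:

    package worker

    import (
        "context"
        "database/sql"
        "log"
        "time"
    )

    // reconcile keeps retrying records stuck in an intermediate state until the
    // rest of the system catches up. Every feature that spans stores needs its
    // own intermediate states plus a loop like this to drain them.
    func reconcile(ctx context.Context, db *sql.DB, apply func(orderID int64) error) {
        ticker := time.NewTicker(30 * time.Second)
        defer ticker.Stop()
        for {
            select {
            case <-ctx.Done():
                return
            case <-ticker.C:
            }

            rows, err := db.QueryContext(ctx,
                `SELECT id FROM orders WHERE state = 'pending' LIMIT 100`)
            if err != nil {
                log.Printf("reconcile query: %v", err)
                continue
            }
            var ids []int64
            for rows.Next() {
                var id int64
                if err := rows.Scan(&id); err == nil {
                    ids = append(ids, id)
                }
            }
            rows.Close()

            for _, id := range ids {
                if err := apply(id); err != nil {
                    log.Printf("order %d still pending: %v", id, err) // retried next tick
                    continue
                }
                if _, err := db.ExecContext(ctx,
                    `UPDATE orders SET state = 'done' WHERE id = $1`, id); err != nil {
                    log.Printf("mark order %d done: %v", id, err)
                }
            }
        }
    }
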
Scubabear68, 8 months ago
If I had a nickel for all the clients I've seen with microservices everywhere, where 90% of the code is replicating an RDBMS with hand-coded in-memory joins.

What could have been a simple SQL query in a sane architecture becomes N REST calls (possibly nested with others downstream) and manually stitching together results.

And that is just the read-only case. As the author notes, updates add another couple of levels of horror.
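
For a sense of the contrast (hypothetical schema, not from the comment): the read path that turns into N REST calls and a hand-rolled in-memory join collapses to a single query when the data shares a database.

    package reads

    import (
        "context"
        "database/sql"
    )

    type orderLine struct {
        UserID, OrderID int64
        Email           string
        Amount          int
    }

    // userOrders does in one round trip what would otherwise be one users call
    // plus N order calls stitched together in memory.
    func userOrders(ctx context.Context, db *sql.DB, userID int64) ([]orderLine, error) {
        rows, err := db.QueryContext(ctx, `
            SELECT u.id, u.email, o.id, o.amount
            FROM users u
            JOIN orders o ON o.user_id = u.id
            WHERE u.id = $1`, userID)
        if err != nil {
            return nil, err
        }
        defer rows.Close()

        var out []orderLine
        for rows.Next() {
            var l orderLine
            if err := rows.Scan(&l.UserID, &l.Email, &l.OrderID, &l.Amount); err != nil {
                return nil, err
            }
            out = append(out, l)
        }
        return out, rows.Err()
    }
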
pjmlp, 8 months ago
As usual, don't try to use the network boundary to do what modules already offer in most languages.

Distributed systems spaghetti is much worse to deal with.
wwarner, 8 months ago
AGREE! The author's point is very well argued. Beginning a transaction is almost *never* a good idea. Design your data model so that if two pieces of data must be consistent, they are in the same row, and allow associated rows to be missing, handling nulls in the application. Inserts and updates should operate on a single table, because in the case of failure, nothing changed, and you have a simple error to deal with. In short, as explained in the article, embrace eventual consistency. There was a great post from the GitHub team about why they didn't allow transactions in their Rails app, from around 2013, but I can't find it for the life of me.

I realize that you're staring at me in disbelief right now, but this is gospel!
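
One way to read that data-modelling advice in Go terms (schema and names are assumptions, not from the comment): the must-be-consistent fields share one row, the associated row may be absent and its absence is handled in the application, and every write touches a single table.

    package model

    import (
        "context"
        "database/sql"
    )

    // Account keeps the fields that must stay consistent together in one row;
    // the associated profile row is allowed to be missing and is surfaced as a
    // nullable field instead of being required to exist.
    type Account struct {
        ID          int64
        Balance     int64          // same row as ID, so it updates atomically
        DisplayName sql.NullString // associated row may not exist yet
    }

    func loadAccount(ctx context.Context, db *sql.DB, id int64) (Account, error) {
        var a Account
        err := db.QueryRowContext(ctx, `
            SELECT a.id, a.balance, p.display_name
            FROM accounts a
            LEFT JOIN profiles p ON p.account_id = a.id
            WHERE a.id = $1`, id).Scan(&a.ID, &a.Balance, &a.DisplayName)
        return a, err
    }

    // Writes touch a single table, so a failure leaves nothing half-applied.
    func credit(ctx context.Context, db *sql.DB, id, amount int64) error {
        _, err := db.ExecContext(ctx,
            `UPDATE accounts SET balance = balance + $1 WHERE id = $2`, amount, id)
        return err
    }
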
relistan, 8 months ago
This is a good summary of building an evented system. Having built and run one that scaled up to 130 services and nearly 60 engineers, I can say this solves a lot of problems. Our implementation was a bit different but in a similar vein. When the company didn’t do well in the market and scaled down, 9 engineers were (are) able to operate almost all of that same system. The decoupling and lack of synchronous dependencies means failures are fairly contained and easy to rectify. Replays can fix almost anything after the fact. Scheduled daily replays prevent drift across the system, helping guarantee that things are consistent… eventually.
misiek08, 8 months ago
Looks like a crypto ad for the library, showing probably the worst, most over-engineered method for "solving" transactions in a more diverse environment. Eventual consistency is a big tradeoff, not acceptable in many payment- and stock-related areas. Working in a company where all the described problems exist and were solved in the worst way possible, I see this article as very misleading. You don't want events instead of transactions: if something has to be committed together, you need to rearchitect the system, and that's it. Of course the people who were building this monster for years will block anyone from doing that. Over-engineered AF, because most of the parts where transactions are required could be handled by a single database, even SQL, and are currently split between dozens of separate Mongo clusters.

"Event-based consistency" leaves us in a state where you can't restore the system to a stable, safe, consistent state. And of course you have a lot more fun debugging and developing, because you (we, here) can't test anything locally. Hundreds of mini-clones of the prod setup running, wasting resources and always out of sync, are ready to see the change and tell you little more than nothing. Great DevEx…
kgeist, 8 months ago
The main source of pain with eventual consistency is lots of customer calls/emails saying "we did X but nothing happened". I'd also add that you should make it clear to the user that the action may not be instantaneous.
cletus, 8 months ago
To paraphrase [1]:

> Some people, when confronted with a problem, think "I know, I'll use micro-services." Now they have two problems.

As soon as I read this example where there's users and orders microservices, you've already made an error (IMHO). What happens when the traffic becomes such an issue that you need to shard your microservices? Now you've got session and load-balancing issues. If you ignore them, you may break the read-your-write guarantee, and that's going to create a huge cost to development.

It goes like this: can you read uncommitted changes within your transaction or request? Generally the answer should be "yes". But imagine you need to speak to a sharded service; what happens when you hit an instance that didn't do the mutation, which isn't committed yet?

A sharded data backend will take you as far as you need to go. If it's good enough for Facebook, it's good enough for you.

When I worked at FB, I had a project where someone had come in from Netflix and fell into the trap many people do of trying to reinvent Netflix architecture at Facebook. Even if the Netflix microservices architecture is an objectively good idea (which I honestly have no opinion on, other than having personally never seen a good solution with microservices), that train has sailed. FB has embraced a different architecture, so even if it's objectively good, you're going against established practice and changing what any FB SWE is going to expect when they come across your system.

FB has a write-through in-memory graph database (called TAO) that writes to sharded MySQL backends. You almost never speak to MySQL directly. You don't even really talk to TAO directly most of the time. There's a data modelling framework on top of it (that enforces privacy and a lot of other things; talk to TAO directly and you'll have a lot of explaining to do). Anyway, TAO makes the read-your-write promise, and the proposed microservices broke that. This was pointed out from the very beginning, yet they barreled on through.

I can understand putting video encoding into a "service", but I tend to view those as "workers" more than a "service".

[1]: https://regex.info/blog/2006-09-15/247
liampulles, 8 months ago
This does a good job of encapsulating the considerations of an event-driven system.

I think the author is a little too easily dismissive of sagas, though. For starters, an event-driven system is also still going to need to deal with compensating actions; it's just going to result in a larger set of events that various systems potentially need to handle.

The virtue of a saga or some command-driven orchestration approach is that the story of what happens when a user does X is plainly visible. The ability to dictate that upfront, and to figure it out easily later when diagnosing issues, cannot be overstated.
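
To make that visibility argument concrete, here is a minimal orchestrated-saga sketch (the Step type and function names are assumptions, not the article's API): the sequence of steps and their compensations lives in one place, which both executes them and knows how to unwind them.

    package saga

    import (
        "context"
        "fmt"
    )

    // Step pairs an action with the compensating action that undoes it.
    type Step struct {
        Name       string
        Do         func(ctx context.Context) error
        Compensate func(ctx context.Context) error
    }

    // Run executes the steps in order; if one fails, it runs the compensations
    // of the steps that already succeeded, in reverse order. The whole story of
    // "what happens when a user does X" is the one slice passed in.
    func Run(ctx context.Context, steps []Step) error {
        var done []Step
        for _, s := range steps {
            if err := s.Do(ctx); err != nil {
                for i := len(done) - 1; i >= 0; i-- {
                    if cerr := done[i].Compensate(ctx); cerr != nil {
                        return fmt.Errorf("%s failed (%v); compensating %s also failed: %w",
                            s.Name, err, done[i].Name, cerr)
                    }
                }
                return fmt.Errorf("%s failed, earlier steps compensated: %w", s.Name, err)
            }
            done = append(done, s)
        }
        return nil
    }
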
latchkey, 8 months ago
Back in the early 2000s, I was working for the largest hardcore porn company in the world, serving tons of traffic. We built a cluster of 3 Dell 2950 servers with JBoss 4. We were using Hibernate and EJB2 entities, with a MySQL backend. This was all before "cloud" providers allowed porn on their systems, so we had to do it ourselves.

Once configured correctly and all the multicast networking was set up, distributed 2PC transactions via JGroups worked flawlessly for years. We actually only needed one server for all the traffic, but used 3 for redundancy and rolling updates.

¯\_(ツ)_/¯, kids these days
physicsguy, 8 months ago
> And if your product backlog is full and people who designed the microservices are still around, it's unlikely to happen.

Oh man, I feel this.
junto, 8 months ago
This is giving me bad memories of MSDTC and Microsoft SQL Server here.
revskill, 8 months ago
You can have your cake and eat it too by allowing replication.
kunley, 8 months ago
This is smart, but also: the overall design was so over-engineered in the first place...
atombender, 8 months ago
I find the "forwarder" system here a rather awkward way to bridge the database and the Pub/Sub system.

A better way to do this, I think, is to ignore the term "transaction," which is overloaded with too many concepts (such as transactional isolation), and instead to consider the desired behaviour, namely atomicity: you want two updates to happen together, and (1) if one or both fail you want to retry until they are both successful, and (2) if the two updates cannot both be successfully applied within a certain time limit, they should both be undone, or at least flagged for manual intervention.

A solution to both (1) and (2) is to bundle *both* updates into a single action that you retry. You can execute this with a queue-based system. You don't need an outbox for this, because you don't need to create a "bridge" between the database and the following update. Just use Pub/Sub or whatever to enqueue an "update user and apply discount" action. Using acks and nacks, the Pub/Sub worker system can ensure the action is repeatedly retried until both updates complete as a whole.

You can build this from basic components like Redis yourself, or you can use a system meant for this type of execution, such as Temporal.

To achieve (2), you extend the action's execution with knowledge about whether it should retry or undo its work. For such a simple action as described above, "undo" means taking away the discount and removing the user points, which are just the opposite of the normal action. A durable execution system such as Temporal can help you do that, too. You simply decide, on error, whether to return a "please retry" error, or roll back the previous steps and return a "permanent failure, don't retry" error.

To tie this together with an HTTP API that pretends to be synchronous, have the API handler enqueue the task, then wait for its completion. The completion can be a separate queue keyed by a unique ID, so each API request filters on just that completion event. If you're using Redis, you could create a separate Pub/Sub per request. With Temporal, it's simpler: the API handler just starts a workflow and asks for its result, which is a poll operation.

The outbox pattern is better in cases where you simply want to bridge between two data processing systems, but where the consumers aren't known. For example, you want all orders to create a Kafka message. The outbox ensures all database changes are eventually guaranteed to land in Kafka, but it doesn't know anything about what happens next in Kafka land, which could be stuff managed by a different team within the same company, or stuff related to a completely different part of the app, like billing or ops telemetry. But if your app already *knows* specifically what should happen (because it's a single app with a known data model), the outbox pattern is unnecessary, I think.
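
A hedged Go sketch of the bundled-action idea above (the Ops interface, names, and attempt counting are assumptions, not from the comment or the article; a durable execution system such as Temporal would own the retry/undo decision instead of the hand-rolled counter, and both updates must be idempotent for queue redelivery to be safe):

    package reward

    import (
        "context"
        "errors"
    )

    // Ops holds the two updates and the inverse needed for undo. In the
    // comment's example these would be "add points" and "apply discount";
    // the interface and method names are hypothetical.
    type Ops interface {
        AddPoints(ctx context.Context, userID int64, n int) error
        RemovePoints(ctx context.Context, userID int64, n int) error
        ApplyDiscount(ctx context.Context, userID int64) error
    }

    // ErrPermanent tells the worker to stop retrying after compensation.
    var ErrPermanent = errors.New("permanent failure, do not retry")

    // ApplyReward is the single enqueued "update user and apply discount"
    // action. The worker acks on nil, nacks (so the queue redelivers) on a
    // retryable error, and gives up once compensation has succeeded.
    func ApplyReward(ctx context.Context, ops Ops, userID int64, points, attempt, maxAttempts int) error {
        if err := ops.AddPoints(ctx, userID, points); err != nil {
            return err // retryable: nothing applied yet
        }
        if err := ops.ApplyDiscount(ctx, userID); err != nil {
            if attempt >= maxAttempts {
                // Undo the half-applied work, then stop retrying.
                if uerr := ops.RemovePoints(ctx, userID, points); uerr != nil {
                    return uerr // keep retrying until the undo itself succeeds
                }
                return ErrPermanent
            }
            return err // retryable: the next delivery re-runs both updates
        }
        return nil
    }
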
p10jkle, 8 months ago
This is also a good use case for durable execution; see e.g. https://restate.dev