This post goes into a lot of concepts but doesn't really touch on anything concrete.

Also, the justification for the Actor model is flimsy compared to the other concepts.

I'd love to see an actual diagram of the architecture that was built, and also an explanation of why sharded + replicated Postgres (perhaps with Kafka) wasn't enough to solve the data needs of this application.

Assuming they farm out the actual processing of payments (talking to credit card companies) to Stripe/PayPal/Braintree or whatever, all this "payment system" is doing is taking API requests to make payments, calling out to whatever processor they're using, saving the results, and probably maintaining a state machine for every transaction (making sure invalid state transitions don't happen). It sounds like they broke up a monolith, but it's completely unclear *how* that made it better.
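For what it's worth, the state-machine part really can be that small. Here's a toy sketch of the kind of transition guard involved; the states and transitions are my own guesses for illustration, not anything from the article:

```python
# Toy payment state machine: much of the "system" is just refusing invalid transitions.
# The states and allowed transitions below are illustrative guesses, not Uber's model.
ALLOWED_TRANSITIONS = {
    "created":    {"authorized", "failed"},
    "authorized": {"captured", "voided", "failed"},
    "captured":   {"refunded"},
    "voided":     set(),
    "refunded":   set(),
    "failed":     set(),
}

def transition(current_state: str, next_state: str) -> str:
    """Return the new state, or raise if the transition is not allowed."""
    if next_state not in ALLOWED_TRANSITIONS[current_state]:
        raise ValueError(f"illegal transition {current_state} -> {next_state}")
    return next_state

state = "created"
state = transition(state, "authorized")   # ok
state = transition(state, "captured")     # ok
# transition(state, "authorized")         # would raise: captured -> authorized
```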
Kyle Kingsbury/Aphyr's Jepsen distributed DB testing writeups are also a really, really wonderful base for comprehending some of the failings of distributed systems.

https://aphyr.com/tags/Jepsen
https://jepsen.io/analyses
> especially around resharding. Foursquare had a 17 hour downtime in 2010 due to hitting a sharding edge case

So I opened the link:

http://highscalability.com/blog/2010/10/15/troubles-with-sharding-what-can-we-learn-from-the-foursquare.html

> What Happened? The problem went something like: Foursquare uses MongoDB to store user data on two EC2 nodes, each of which has 66GB of RAM

There's your problem: using Mongo sharding in production in 2010.
FWIW, if your system scales horizontally, it also has to scale vertically. You cannot simply add nodes to a cluster into infinity. You often can't even go to 5x the number of nodes before performance and availability start to degrade. The larger the number of nodes, the harder the cluster is to manage, and the more chances there are for failure.

The author writes that they didn't think a mainframe could handle the load in the future. But even super old mainframes still handle their load 40+ years on. They just had better design parameters to make sure they were used appropriately, and they knew how to keep their requirements compact. You may only have 8 alphanumeric characters to store a customer's record name in, but you're never going to have more than 2.8 trillion customers. That hard limit also means you know how large the data will get, which makes capacity planning easier. (Even in the cloud you must do capacity planning, if for nothing else, cost projections.)
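To make the capacity-planning point concrete, here's the arithmetic for an 8-character alphanumeric key; the 1 KB record size is my own assumption for illustration:

```python
# Back-of-envelope capacity check for a fixed-width key scheme.
ALPHANUMERIC = 26 + 10            # A-Z plus 0-9
KEY_LENGTH = 8

max_customers = ALPHANUMERIC ** KEY_LENGTH    # 36^8 ~= 2.82 trillion possible keys
record_size_bytes = 1024                      # assumed average record size (my assumption)

worst_case_bytes = max_customers * record_size_bytes
print(f"max customers: {max_customers:,}")                      # 2,821,109,907,456
print(f"worst-case storage: {worst_case_bytes / 1e15:.1f} PB")  # ~2.9 PB
```

You'll never actually hit that ceiling, but having a hard upper bound at all is what makes the planning tractable.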
Great overview with lots of links. I came across quite a few concepts I barely know - or know them in a different context. I did trip hard over this sentence:

> Distributed systems often have to store a lot more data, than a single node can do so.
Out of curiosity, have you considered using a workflow management system such as AWS Step Functions [1] instead of a set of queues? If so, what made you decide to go with queues?

Workflow engines neatly solve a lot of the challenges you mention in the article essentially for free (see the sketch below for what a definition might look like):

* Model your problem as a directed graph of (small) dependent operations. Adding new steps does not require adding new queues, just new code.

* Let the workflow engine manage the state of your transactions (the important, strongly consistent part) and just focus on implementing the logic. Scale horizontally by adding more worker nodes.

* Get free error-handling tools, such as automatic retries on failed steps, support for special error-handling steps, and metrics you can alarm on (e.g. number of open workflows, number of errored workflows, etc.).

* Get a free dashboard where you can look up the state of a given transaction, look at error messages, and (mass) re-drive failed transactions (e.g. after an outage or after having fixed a bug).

[1] https://aws.amazon.com/step-functions
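As a rough sketch of what that graph looks like in practice, here's a minimal payment workflow written as an Amazon States Language definition held in a Python dict; the state names and Lambda ARNs are made-up placeholders, not anything from the article:

```python
import json

# Minimal Amazon States Language sketch of a payment workflow.
# State names and Lambda ARNs are hypothetical placeholders.
payment_workflow = {
    "StartAt": "AuthorizeCard",
    "States": {
        "AuthorizeCard": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:authorize-card",
            "Retry": [{
                "ErrorEquals": ["States.TaskFailed"],   # retry transient processor errors
                "IntervalSeconds": 5,
                "MaxAttempts": 3,
                "BackoffRate": 2.0,
            }],
            "Catch": [{
                "ErrorEquals": ["States.ALL"],          # anything else goes to a failure step
                "Next": "RecordFailure",
            }],
            "Next": "CapturePayment",
        },
        "CapturePayment": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:capture-payment",
            "Next": "RecordSuccess",
        },
        "RecordSuccess": {"Type": "Succeed"},
        "RecordFailure": {"Type": "Fail", "Error": "PaymentFailed"},
    },
}

print(json.dumps(payment_workflow, indent=2))
```

Adding a fraud check later would just be one more state and a `Next` pointer, rather than a new queue plus a new consumer fleet.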
I've built a payment system before. I didn't make it distributed, and if I were to build one again, I wouldn't make it distributed. I like my payment systems old-school batch style. Correctness is very important to me, and the ability to roll back and replay things when something goes wrong is essential.
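As a rough illustration of what I mean by replayable batch processing, here's a sketch assuming a settlements table keyed by (batch_id, payment_id); the schema and the processor call are made up, just to show the idempotent-replay idea:

```python
import sqlite3

# Toy replayable batch: each (batch_id, payment_id) is recorded exactly once,
# so re-running the same batch after a crash or a bad run only touches
# payments that were not yet settled.
db = sqlite3.connect("payments.db")
db.execute("""CREATE TABLE IF NOT EXISTS settlements (
                  batch_id   TEXT,
                  payment_id TEXT,
                  status     TEXT,
                  PRIMARY KEY (batch_id, payment_id))""")

def charge_processor(payment):
    # Placeholder for the call out to Stripe/PayPal/etc.
    return "settled"

def run_batch(batch_id, payments):
    for payment in payments:
        already_done = db.execute(
            "SELECT 1 FROM settlements WHERE batch_id=? AND payment_id=?",
            (batch_id, payment["id"])).fetchone()
        if already_done:
            continue                      # replay-safe: skip already-settled payments
        status = charge_processor(payment)
        db.execute("INSERT INTO settlements VALUES (?, ?, ?)",
                   (batch_id, payment["id"], status))
        db.commit()

run_batch("2024-06-01", [{"id": "p-1", "amount": 1200}, {"id": "p-2", "amount": 450}])
```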
I wish the author had distinguished *which* payment system. There are *Consumer* payments that all Riders are familiar with, and then there are *Driver* payments made at the end of the day.
This article is golden in more than a few respects, and especially handy if you are learning distributed systems this semester and the subject isn't properly covered.

Thank you!
I'll never understand why a payment system needs to scale horizontally.

Modern computer systems can scale to 500 or more processor cores. Each core runs billions of instructions per second.

A system for a billion accounts on the scale of Uber probably has a million active users at a time, with perhaps a quarter of them involving a payment.

Is Uber saying that they can't support 250k payment transactions per second on the largest system today? That's about 500 transactions per second per core on a 500-core machine, or a budget of roughly 2 ms per transaction. Why is that impossible for them?

Or, put another way, why can't one transaction be completed in less than a few million CPU instructions?

And that's for the very largest company like Uber. I can't even imagine a typical startup needing to scale horizontally for payment processing.
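Spelling out that arithmetic (all inputs are my own rough assumptions, not measurements):

```python
# Back-of-the-envelope check of the single-box argument.
cores = 500                       # a large modern multi-socket machine
instructions_per_sec = 3e9        # per core, order of magnitude
peak_payment_tps = 250_000        # assumed concurrent payment activity

tps_per_core = peak_payment_tps / cores            # 500 transactions/s per core
time_budget_ms = 1000 / tps_per_core               # 2 ms per transaction
instruction_budget = instructions_per_sec * (time_budget_ms / 1000)

print(f"{tps_per_core:.0f} tps/core, {time_budget_ms:.0f} ms budget, "
      f"~{instruction_budget / 1e6:.0f}M instructions per transaction")
# -> 500 tps/core, 2 ms budget, ~6M instructions per transaction
```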
Such a rudimentary take on distributed systems. Why not just create stateless REST services and horizontally scale to a million virtual machines? After all, Uber is a heavily funded company.