This post goes into a lot of concepts but doesn't really touch on anything concrete.

Also, the justification for the Actor model is flimsy compared to the other concepts.

I'd love to see an actual diagram of the architecture that was built, and also an explanation of why sharded + replicated Postgres (perhaps with Kafka) wasn't enough to solve the data needs of this application.

Assuming they farm out the actual processing of payments (talking to credit card companies) to Stripe/PayPal/Braintree or whatever, all this "payment system" is doing is taking API requests to make payments, calling out to whatever processor they're using, saving the results, and probably maintaining a state machine for every transaction (making sure invalid state transitions don't happen). It sounds like they broke up a monolith, but it's completely unclear *how* that made it better.
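For what it's worth, the state-machine part really can be that small. Here's a toy sketch of the kind of transition guard involved; the states and transitions are my own guesses for illustration, not anything from the article:

```python
# Toy payment state machine: much of the "system" is just refusing invalid transitions.
# The states and allowed transitions below are illustrative guesses, not Uber's model.
ALLOWED_TRANSITIONS = {
    "created":    {"authorized", "failed"},
    "authorized": {"captured", "voided", "failed"},
    "captured":   {"refunded"},
    "voided":     set(),
    "refunded":   set(),
    "failed":     set(),
}

def transition(current_state: str, next_state: str) -> str:
    """Return the new state, or raise if the transition is not allowed."""
    if next_state not in ALLOWED_TRANSITIONS[current_state]:
        raise ValueError(f"illegal transition {current_state} -> {next_state}")
    return next_state

state = "created"
state = transition(state, "authorized")   # ok
state = transition(state, "captured")     # ok
# transition(state, "authorized")         # would raise: captured -> authorized
```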
Kyle Kingsbury/Aphyr's Jepsen distributed DB testing writeups are also a really, really wonderful base for comprehending some of the failings of distributed systems.

https://aphyr.com/tags/Jepsen
https://jepsen.io/analyses
> especially around resharding. Foursquare had a 17 hour downtime in 2010 due to hitting a sharding edge case

So I opened the link:

http://highscalability.com/blog/2010/10/15/troubles-with-sharding-what-can-we-learn-from-the-foursquare.html

> What Happened? The problem went something like: Foursquare uses MongoDB to store user data on two EC2 nodes, each of which has 66GB of RAM

There's your problem: using Mongo sharding in production in 2010.
FWIW, if your system scales horizontally, it also has to scale vertically. You cannot simply add nodes to a cluster into infinity. You often can't even go to 5x the number of nodes before performance and availability start to degrade. The larger the number of nodes, the harder the cluster is to manage, and the more chances there are for failure.

The author writes that they didn't think a mainframe could handle the load in the future. But even super old mainframes still handle their load 40+ years on. They just had better design parameters to make sure they were used appropriately, and they knew how to keep their requirements compact. You may only have 8 alphanumeric characters to store a customer's record name in, but you're never going to have more than 2.8 trillion customers. That hard limit also means you know how large the data will get, which makes capacity planning easier. (Even in the cloud you must do capacity planning, if for nothing else, cost projections.)
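To make the capacity-planning point concrete, here's the arithmetic for an 8-character alphanumeric key; the 1 KB record size is my own assumption for illustration:

```python
# Back-of-envelope capacity check for a fixed-width key scheme.
ALPHANUMERIC = 26 + 10            # A-Z plus 0-9
KEY_LENGTH = 8

max_customers = ALPHANUMERIC ** KEY_LENGTH    # 36^8 ~= 2.82 trillion possible keys
record_size_bytes = 1024                      # assumed average record size (my assumption)

worst_case_bytes = max_customers * record_size_bytes
print(f"max customers: {max_customers:,}")                      # 2,821,109,907,456
print(f"worst-case storage: {worst_case_bytes / 1e15:.1f} PB")  # ~2.9 PB
```

You'll never actually hit that ceiling, but having a hard upper bound at all is what makes the planning tractable.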
Great overview with lots of links. I came across quite a few concepts I barely know - or know them in a different context. I did trip hard over this sentence:

> Distributed systems often have to store a lot more data, than a single node can do so.
Out of curiosity, have you considered using a workflow management system such as AWS Step Functions [1] instead of a set of queues? If so, what made you decide to go with queues?

Workflow engines neatly solve a lot of the challenges you mention in the article essentially for free (see the sketch below for what a definition might look like):

* Model your problem as a directed graph of (small) dependent operations. Adding new steps does not require adding new queues, just new code.

* Let the workflow engine manage the state of your transactions (the important, strongly consistent part) and just focus on implementing the logic. Scale horizontally by adding more worker nodes.

* Get free error-handling tools, such as automatic retries on failed steps, support for special error-handling steps, and metrics you can alarm on (e.g. number of open workflows, number of errored workflows, etc.).

* Get a free dashboard where you can look up the state of a given transaction, look at error messages, and (mass) re-drive failed transactions (e.g. after an outage or after having fixed a bug).

[1] https://aws.amazon.com/step-functions
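As a rough sketch of what that graph looks like in practice, here's a minimal payment workflow written as an Amazon States Language definition held in a Python dict; the state names and Lambda ARNs are made-up placeholders, not anything from the article:

```python
import json

# Minimal Amazon States Language sketch of a payment workflow.
# State names and Lambda ARNs are hypothetical placeholders.
payment_workflow = {
    "StartAt": "AuthorizeCard",
    "States": {
        "AuthorizeCard": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:authorize-card",
            "Retry": [{
                "ErrorEquals": ["States.TaskFailed"],   # retry transient processor errors
                "IntervalSeconds": 5,
                "MaxAttempts": 3,
                "BackoffRate": 2.0,
            }],
            "Catch": [{
                "ErrorEquals": ["States.ALL"],          # anything else goes to a failure step
                "Next": "RecordFailure",
            }],
            "Next": "CapturePayment",
        },
        "CapturePayment": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:capture-payment",
            "Next": "RecordSuccess",
        },
        "RecordSuccess": {"Type": "Succeed"},
        "RecordFailure": {"Type": "Fail", "Error": "PaymentFailed"},
    },
}

print(json.dumps(payment_workflow, indent=2))
```

Adding a fraud check later would just be one more state and a `Next` pointer, rather than a new queue plus a new consumer fleet.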
I've built a payment system before. I didn't make it distributed, and if I were to build one again, I wouldn't make it distributed. I like my payment systems old-school batch style. Correctness is very important to me, and the ability to roll back and replay things when something goes wrong is essential.
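As a rough illustration of what I mean by replayable batch processing, here's a sketch assuming a settlements table keyed by (batch_id, payment_id); the schema and the processor call are made up, just to show the idempotent-replay idea:

```python
import sqlite3

# Toy replayable batch: each (batch_id, payment_id) is recorded exactly once,
# so re-running the same batch after a crash or a bad run only touches
# payments that were not yet settled.
db = sqlite3.connect("payments.db")
db.execute("""CREATE TABLE IF NOT EXISTS settlements (
                  batch_id   TEXT,
                  payment_id TEXT,
                  status     TEXT,
                  PRIMARY KEY (batch_id, payment_id))""")

def charge_processor(payment):
    # Placeholder for the call out to Stripe/PayPal/etc.
    return "settled"

def run_batch(batch_id, payments):
    for payment in payments:
        already_done = db.execute(
            "SELECT 1 FROM settlements WHERE batch_id=? AND payment_id=?",
            (batch_id, payment["id"])).fetchone()
        if already_done:
            continue                      # replay-safe: skip already-settled payments
        status = charge_processor(payment)
        db.execute("INSERT INTO settlements VALUES (?, ?, ?)",
                   (batch_id, payment["id"], status))
        db.commit()

run_batch("2024-06-01", [{"id": "p-1", "amount": 1200}, {"id": "p-2", "amount": 450}])
```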
I wish the author had distinguished *which* payment system. There are *Consumer* payments that all Riders are familiar with, and then there are *Driver* payments made at the end of the day.
This article is golden in more than a few respects, and especially handy if you are learning distributed systems this semester and the subject isn't properly covered.

Thank you!
I'll never understand why a payment system needs to scale horizontally.

Modern computer systems can scale to 500 or more processor cores. Each core runs billions of instructions per second.

A system for a billion accounts on the scale of Uber probably has a million active users at a time, with perhaps a quarter of them involving a payment.

Is Uber saying that they can't support 250k payment transactions per second on the largest system today? That's about 500 transactions per second per core on a 500-core machine, or a budget of roughly 2 ms per transaction. Why is that impossible for them?

Or, put another way, why can't one transaction be completed in less than a few million CPU instructions?

And that's for the very largest company like Uber. I can't even imagine a typical startup needing to scale horizontally for payment processing.
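Spelling out that arithmetic (all inputs are my own rough assumptions, not measurements):

```python
# Back-of-the-envelope check of the single-box argument.
cores = 500                       # a large modern multi-socket machine
instructions_per_sec = 3e9        # per core, order of magnitude
peak_payment_tps = 250_000        # assumed concurrent payment activity

tps_per_core = peak_payment_tps / cores            # 500 transactions/s per core
time_budget_ms = 1000 / tps_per_core               # 2 ms per transaction
instruction_budget = instructions_per_sec * (time_budget_ms / 1000)

print(f"{tps_per_core:.0f} tps/core, {time_budget_ms:.0f} ms budget, "
      f"~{instruction_budget / 1e6:.0f}M instructions per transaction")
# -> 500 tps/core, 2 ms budget, ~6M instructions per transaction
```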
Such a rudimentary take on distributed systems. Why not just create stateless REST services and horizontally scale to a million virtual machines? After all, Uber is a heavily funded company.