AWS also say they do something interesting:<p>> When adding jitter to scheduled work, we do not select the jitter on each host randomly. Instead, we use a consistent method that produces the same number every time on the same host. This way, if there is a service being overloaded, or a race condition, it happens the same way in a pattern. We humans are good at identifying patterns, and we're more likely to determine the root cause. Using a random method ensures that if a resource is being overwhelmed, it only happens - well, at random. This makes troubleshooting much more difficult.<p><a href="https://aws.amazon.com/builders-library/timeouts-retries-and-backoff-with-jitter/" rel="nofollow">https://aws.amazon.com/builders-library/timeouts-retries-and...</a>
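A minimal sketch of what that per-host deterministic jitter could look like (the hostname hash and the 300-second window are my own assumptions, not anything AWS specifies):

    import hashlib
    import socket
    import time

    JITTER_WINDOW_SECONDS = 300  # assumed spread for the scheduled job

    def host_jitter(window: float = JITTER_WINDOW_SECONDS) -> float:
        """Derive a stable per-host offset: the same host gets the same jitter every run."""
        digest = hashlib.sha256(socket.gethostname().encode()).digest()
        fraction = int.from_bytes(digest[:8], "big") / 2 ** 64  # uniform in [0, 1)
        return fraction * window

    def run_scheduled_work(job):
        time.sleep(host_jitter())  # deterministic delay, so any overload pattern repeats identically
        job()

Because the offset is a pure function of the hostname, any overload it causes repeats the same way on the next run, which is exactly the debuggability property the quote is after.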
I never get this desire for microservices. Your IDE can help if there are 500 functions, but nothing will help you if you have 500 microservices. Almost no one fully understands such a system. It is hard to work out which parts of the code are unused, and large-scale refactoring is impossible.<p>The upside seems to be some mythical infinite scalability, which will collapse under positive feedback loops like these.
I just learned quite a bit about retries. I really liked this tour of one area of the domain in the form of a narrative; when it's written by someone who clearly knows the area and also writes well, it's a great way to pick up new techniques.<p>Would love to read more things like this in different areas.
To counter an avalanche of retries across different layers, I have also seen a custom header added to every request that is itself a retry. Upon receiving a request with this header, the microservice turns off its own retry logic for that request.
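A rough sketch of that idea using the requests library (the X-Is-Retry header name and the helper are made up for illustration, not a standard):

    import requests

    RETRY_HEADER = "X-Is-Retry"  # hypothetical header name

    def call_downstream(url, incoming_headers, max_attempts=3):
        """Retry an outgoing call, but only if the request we are serving was not itself a retry."""
        attempts = 1 if incoming_headers.get(RETRY_HEADER) == "1" else max_attempts
        last_error = None
        for attempt in range(attempts):
            try:
                headers = {RETRY_HEADER: "1"} if attempt > 0 else {}  # mark our own retries
                return requests.get(url, headers=headers, timeout=1.0)
            except requests.RequestException as err:
                last_error = err
        raise last_error

The effect is that only one layer in the call chain ever multiplies a request, which bounds the amplification factor.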
It's worth noting that the logic in the article only applies to idempotent requests. See this article (by the same author) for the non-idempotent counterpart: <a href="https://habr.com/ru/companies/yandex/articles/442762/" rel="nofollow">https://habr.com/ru/companies/yandex/articles/442762/</a> (unfortunately, in Russian). I am sure somebody posted a human-written English translation back then, but I cannot find it. So here is a Google-translated version (scroll past the internal error; the text is below):<p><a href="https://habr-com.translate.goog/ru/companies/yandex/articles/442762/?_x_tr_sl=ru&_x_tr_tl=en&_x_tr_hl=en" rel="nofollow">https://habr-com.translate.goog/ru/companies/yandex/articles...</a>
Ideally you only retry error codes for which it is guaranteed that no backend logic has executed yet (a sketch at the end of this comment illustrates the idea).
This prevents retry amplification.
It also has the benefit that you can retry all types of RPCs, including non-idempotent ones.
One example is when the server reports that it is overloaded and can't serve requests right now (load shedding).<p>Without retry amplification you can retry immediately, which gives much better latency; no exponential backoff is required.<p>Retrying deadline-exceeded errors seems dangerous: you are amplifying the most expensive requests, so even if you only retry 20% of all RPCs, you could still 10x the server load.
Ideally the server starts load shedding before it grinds to a halt, and those shed requests can be retried without risk of amplification.
Having longer RPC deadlines helps the server process the backlog without timeouts.
That said, deadline handling is a complex topic and YMMV depending on the service in question.
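As a sketch of the policy described above (the status names are assumptions; the important part is that only errors known to mean "no backend work was done" are retried, and deadline-exceeded is not):

    # Error codes the server returns *before* doing any work (e.g. when load shedding).
    RETRY_SAFE_STATUSES = {"OVERLOADED", "QUEUE_FULL"}  # assumed names

    def call_with_safe_retries(send_rpc, request, max_attempts=3):
        """Retry immediately, but only on errors that guarantee no backend logic ran.

        Such retries cannot amplify work already done, so they are safe even for
        non-idempotent RPCs and need no exponential backoff.
        """
        for attempt in range(max_attempts):
            status, response = send_rpc(request)
            if status == "OK":
                return response
            if status not in RETRY_SAFE_STATUSES or attempt == max_attempts - 1:
                # DEADLINE_EXCEEDED and similar errors land here: retrying them
                # would amplify the most expensive requests.
                raise RuntimeError(f"RPC failed with status {status}")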
This is probably the most detailed analysis of retry techniques that I've seen. I really appreciated the circuit breaker and deadline propagation sections.<p>But this is why I've pretty much abandoned all connection-oriented logic in favor of declarative programming:<p><a href="https://en.wikipedia.org/wiki/Declarative_programming" rel="nofollow">https://en.wikipedia.org/wiki/Declarative_programming</a><p>Loosely, that means that instead of thinking of communication as client-server or peer-to-peer remote procedure calls (RPC), I think of it as state transfer. Specifically, I've moved away from REST towards things like Firebase that encapsulate retry logic. Under this model, failure is never indicated; apps just hang until communication is reestablished.<p>I actually think that apps can never really achieve 100% reliability, because there's no way to ever guarantee communication:<p><a href="https://bravenewgeek.com/you-cannot-have-exactly-once-delivery/" rel="nofollow">https://bravenewgeek.com/you-cannot-have-exactly-once-delive...</a><p><a href="https://en.wikipedia.org/wiki/Byzantine_fault" rel="nofollow">https://en.wikipedia.org/wiki/Byzantine_fault</a><p><a href="https://en.wikipedia.org/wiki/Two_Generals%27_Problem" rel="nofollow">https://en.wikipedia.org/wiki/Two_Generals%27_Problem</a><p>Deadline propagation, though, neatly models the human experience of feeling connected to the internet or not.<p>This is also why I think that microservices without declarative programming are an evolutionary dead end, and I recommend against starting any new work with them in this era of imperative programming, where so many developer hours are lost to managing mutable state. A better way is to use append-only databases like CouchDB, which work similarly to Redux in providing the last-known state as the reduction of all previous states.
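As a toy illustration of that last idea (purely my own sketch, not CouchDB's or Redux's actual API): the current state is just a fold over an append-only log of changes.

    from functools import reduce

    def apply_change(state: dict, change: dict) -> dict:
        """Pure reducer: previous state plus one change yields the next state."""
        return {**state, **change}

    # Append-only log of document revisions, oldest first.
    log = [
        {"status": "draft"},
        {"status": "review", "reviewer": "alice"},
        {"status": "published"},
    ]

    last_known_state = reduce(apply_change, log, {})
    print(last_known_state)  # {'status': 'published', 'reviewer': 'alice'}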
I missed it on the first read-through, but there is a link to the code used to run the simulations in the first appendix.<p>Homegrown Python code (i.e. not a library), very nicely laid out, and it would form a good basis for more experiments for anyone interested. I think I'll have a play around later and try to train my intuition.
Good reading.<p>In my last job, the service mesh was responsible for doing retries. It was a startup and the system was changing every day.<p>After a while, we suspected that some services were not reliable enough and that retries were hiding this fact. Turning off retries confirmed it: quality did in fact go down.<p>In the end, we kept retries on for just some services.<p>I have never tested either retry budgets or deadline propagation; I will suggest them in the future.
I don't understand why load shedding wasn't considered here. My experience has been that it's so effective at making a system recover properly that everything else feels like it's handling an edge case.
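For comparison, a minimal load-shedding sketch (the concurrency limit and the 503 response are my own choices, not from the article): the server rejects work it cannot start promptly instead of queuing it.

    import threading

    MAX_IN_FLIGHT = 100  # assumed capacity limit

    _in_flight = 0
    _lock = threading.Lock()

    def handle_request(do_work):
        """Reject work we cannot start promptly instead of queuing it behind a backlog."""
        global _in_flight
        with _lock:
            if _in_flight >= MAX_IN_FLIGHT:
                return 503, "overloaded, try again later"  # fast failure the client can retry safely
            _in_flight += 1
        try:
            return 200, do_work()
        finally:
            with _lock:
                _in_flight -= 1

The fast, cheap rejection is what lets the rest of the system recover instead of piling up behind a saturated server.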
Very nice read with lots of interesting points, examples, and analysis; very thorough, in my opinion. I'm not a microservices guy, but it covers a lot of general concepts that are also applicable outside that domain. Very good, thanks!
Strange architecture. They clearly have a queue, but instead of checking the previous request, they create a new one. It's like they managed to get the worst of both pub/sub and a task queue.
Reading this excellent article got me wondering whether job interviews for developer positions include enough questions about queue management.<p>"Ben" developed retries without exponential backoff and only learned about that concept in code review. Exponential backoff should be part of any basic developer curriculum (except, perhaps, one that does not mention networks at all).
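For reference, the textbook form of that concept, capped exponential backoff with full jitter, looks roughly like this (the base, cap, and attempt count are arbitrary):

    import random
    import time

    def retry_with_backoff(call, max_attempts=5, base=0.1, cap=10.0):
        """Retry a callable with capped exponential backoff and full jitter."""
        for attempt in range(max_attempts):
            try:
                return call()
            except Exception:
                if attempt == max_attempts - 1:
                    raise
                # sleep a random amount in [0, min(cap, base * 2**attempt))
                time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))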
Interesting reading. I think the article somewhat misses the point: the problem was the queuing of requests for which nobody was waiting for the response anymore. The same problem would manifest on a monolith with this kind of queuing. If the time to generate the response plus the maximum queue time were shorter than the timeout on the client side, the request amplification would not have happened. The first thing I do on HTTP-based backends is massively decrease the queue size; that fixes most of these problems. An even better solution would be the ability to purge old requests from the queue, but most frameworks do not allow that, probably due to the Unix socket interface.
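Where the client communicates its deadline (the header name below is a hypothetical, not a standard), the purge can at least be approximated at the application level by dropping stale requests at dequeue time instead of processing them:

    import time

    DEADLINE_HEADER = "X-Request-Deadline"  # hypothetical: absolute unix timestamp set by the client

    def handle_if_still_wanted(headers, do_work):
        """Skip requests whose client has already given up waiting."""
        deadline = float(headers.get(DEADLINE_HEADER, "inf"))
        if time.time() >= deadline:
            return 504, "client deadline already passed, request dropped"
        return 200, do_work()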