AWS also say they do something interesting:<p>> When adding jitter to scheduled work, we do not select the jitter on each host randomly. Instead, we use a consistent method that produces the same number every time on the same host. This way, if there is a service being overloaded, or a race condition, it happens the same way in a pattern. We humans are good at identifying patterns, and we're more likely to determine the root cause. Using a random method ensures that if a resource is being overwhelmed, it only happens - well, at random. This makes troubleshooting much more difficult.<p><a href="https://aws.amazon.com/builders-library/timeouts-retries-and-backoff-with-jitter/" rel="nofollow">https://aws.amazon.com/builders-library/timeouts-retries-and...</a>
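A minimal sketch of what that per-host deterministic jitter could look like (the hostname hash and the 300-second window are my own assumptions, not anything AWS specifies):

    import hashlib
    import socket
    import time

    JITTER_WINDOW_SECONDS = 300  # assumed spread for the scheduled job

    def host_jitter(window: float = JITTER_WINDOW_SECONDS) -> float:
        """Derive a stable per-host offset: the same host gets the same jitter every run."""
        digest = hashlib.sha256(socket.gethostname().encode()).digest()
        fraction = int.from_bytes(digest[:8], "big") / 2 ** 64  # uniform in [0, 1)
        return fraction * window

    def run_scheduled_work(job):
        time.sleep(host_jitter())  # deterministic delay, so any overload pattern repeats identically
        job()

Because the offset is a pure function of the hostname, any overload it causes repeats the same way on the next run, which is exactly the debuggability property the quote is after.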
I never get this desire for microservices. Your IDE can help if there are 500 functions, but nothing will help you if you have 500 microservices. Almost no one fully understands such a system. It is hard to work out which parts of the code are unused, and large-scale refactoring is impossible.<p>The upside seems to be some mythical infinite scalability, which will collapse under positive feedback loops like these.
I just learned quite a bit about retries. I really liked this tour of one area of the domain in the form of a narrative; when it's written by someone who clearly knows the area and also writes well, it's a great way to pick up new techniques.<p>Would love to read more things like this in different areas.
To counter an avalanche of retries across different layers, I have also seen a custom header added to every request that is itself a retry. Upon receiving a request with this header, the microservice turns off its own retry logic for that request.
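A rough sketch of that idea using the requests library (the X-Is-Retry header name and the helper are made up for illustration, not a standard):

    import requests

    RETRY_HEADER = "X-Is-Retry"  # hypothetical header name

    def call_downstream(url, incoming_headers, max_attempts=3):
        """Retry an outgoing call, but only if the request we are serving was not itself a retry."""
        attempts = 1 if incoming_headers.get(RETRY_HEADER) == "1" else max_attempts
        last_error = None
        for attempt in range(attempts):
            try:
                headers = {RETRY_HEADER: "1"} if attempt > 0 else {}  # mark our own retries
                return requests.get(url, headers=headers, timeout=1.0)
            except requests.RequestException as err:
                last_error = err
        raise last_error

The effect is that only one layer in the call chain ever multiplies a request, which bounds the amplification factor.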
It's worth noting that the logic in the article only applies to idempotent requests. See this article (by the same author) for the non-idempotent counterpart: <a href="https://habr.com/ru/companies/yandex/articles/442762/" rel="nofollow">https://habr.com/ru/companies/yandex/articles/442762/</a> (unfortunately, in Russian). I am sure somebody posted a human-written English translation back then, but I cannot find it. So here is a Google-translated version (scroll past the internal error; the text is below):<p><a href="https://habr-com.translate.goog/ru/companies/yandex/articles/442762/?_x_tr_sl=ru&_x_tr_tl=en&_x_tr_hl=en" rel="nofollow">https://habr-com.translate.goog/ru/companies/yandex/articles...</a>
Ideally you only retry error codes for which it is guaranteed that no backend logic has executed yet (a sketch at the end of this comment illustrates the idea).
This prevents retry amplification.
It also has the benefit that you can retry all types of RPCs, including non-idempotent ones.
One example is when the server reports that it is overloaded and can't serve requests right now (load shedding).<p>Without retry amplification you can retry immediately, which gives much better latency; no exponential backoff is required.<p>Retrying deadline-exceeded errors seems dangerous: you are amplifying the most expensive requests, so even if you only retry 20% of all RPCs, you could still 10x the server load.
Ideally the server starts load shedding before it grinds to a halt, and those shed requests can be retried without risk of amplification.
Having longer RPC deadlines helps the server process the backlog without timeouts.
That said, deadline handling is a complex topic and YMMV depending on the service in question.
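As a sketch of the policy described above (the status names are assumptions; the important part is that only errors known to mean "no backend work was done" are retried, and deadline-exceeded is not):

    # Error codes the server returns *before* doing any work (e.g. when load shedding).
    RETRY_SAFE_STATUSES = {"OVERLOADED", "QUEUE_FULL"}  # assumed names

    def call_with_safe_retries(send_rpc, request, max_attempts=3):
        """Retry immediately, but only on errors that guarantee no backend logic ran.

        Such retries cannot amplify work already done, so they are safe even for
        non-idempotent RPCs and need no exponential backoff.
        """
        for attempt in range(max_attempts):
            status, response = send_rpc(request)
            if status == "OK":
                return response
            if status not in RETRY_SAFE_STATUSES or attempt == max_attempts - 1:
                # DEADLINE_EXCEEDED and similar errors land here: retrying them
                # would amplify the most expensive requests.
                raise RuntimeError(f"RPC failed with status {status}")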
This is probably the most detailed analysis of retry techniques that I've seen. I really appreciated the circuit breaker and deadline propagation sections.<p>But this is why I've pretty much abandoned all connection-oriented logic in favor of declarative programming:<p><a href="https://en.wikipedia.org/wiki/Declarative_programming" rel="nofollow">https://en.wikipedia.org/wiki/Declarative_programming</a><p>Loosely, that means that instead of thinking of communication as client-server or peer-to-peer remote procedure calls (RPC), I think of it as state transfer. Specifically, I've moved away from REST towards things like Firebase that encapsulate retry logic. Under this model, failure is never indicated; apps just hang until communication is reestablished.<p>I actually think that apps can never really achieve 100% reliability, because there's no way to ever guarantee communication:<p><a href="https://bravenewgeek.com/you-cannot-have-exactly-once-delivery/" rel="nofollow">https://bravenewgeek.com/you-cannot-have-exactly-once-delive...</a><p><a href="https://en.wikipedia.org/wiki/Byzantine_fault" rel="nofollow">https://en.wikipedia.org/wiki/Byzantine_fault</a><p><a href="https://en.wikipedia.org/wiki/Two_Generals%27_Problem" rel="nofollow">https://en.wikipedia.org/wiki/Two_Generals%27_Problem</a><p>Deadline propagation, though, neatly models the human experience of feeling connected to the internet or not.<p>This is also why I think that microservices without declarative programming are an evolutionary dead end, and I recommend against starting any new work with them in this era of imperative programming, where so many developer hours are lost to managing mutable state. A better way is to use append-only databases like CouchDB, which work similarly to Redux in providing the last-known state as the reduction of all previous states.
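As a toy illustration of that last idea (purely my own sketch, not CouchDB's or Redux's actual API): the current state is just a fold over an append-only log of changes.

    from functools import reduce

    def apply_change(state: dict, change: dict) -> dict:
        """Pure reducer: previous state plus one change yields the next state."""
        return {**state, **change}

    # Append-only log of document revisions, oldest first.
    log = [
        {"status": "draft"},
        {"status": "review", "reviewer": "alice"},
        {"status": "published"},
    ]

    last_known_state = reduce(apply_change, log, {})
    print(last_known_state)  # {'status': 'published', 'reviewer': 'alice'}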
I missed it on the first read-through, but there is a link to the code used to run the simulations in the first appendix.<p>Homegrown Python code (i.e. not a library), very nicely laid out, and it would form a good basis for more experiments for anyone interested. I think I'll have a play around later and try to train my intuition.
Good reading.<p>In my last job, the service mesh was responsible for doing retries. It was a startup and the system was changing every day.<p>After a while, we suspected that some services were not reliable enough and that retries were hiding this fact. Turning off retries confirmed it: quality did in fact go down.<p>In the end, we kept retries on for just some services.<p>I have never tested either retry budgets or deadline propagation; I will suggest them in the future.
I don't understand why load shedding wasn't considered here. My experience has been that it's so effective at making a system recover properly that everything else feels like it's handling an edge case.
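For comparison, a minimal load-shedding sketch (the concurrency limit and the 503 response are my own choices, not from the article): the server rejects work it cannot start promptly instead of queuing it.

    import threading

    MAX_IN_FLIGHT = 100  # assumed capacity limit

    _in_flight = 0
    _lock = threading.Lock()

    def handle_request(do_work):
        """Reject work we cannot start promptly instead of queuing it behind a backlog."""
        global _in_flight
        with _lock:
            if _in_flight >= MAX_IN_FLIGHT:
                return 503, "overloaded, try again later"  # fast failure the client can retry safely
            _in_flight += 1
        try:
            return 200, do_work()
        finally:
            with _lock:
                _in_flight -= 1

The fast, cheap rejection is what lets the rest of the system recover instead of piling up behind a saturated server.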
Very nice read with lots of interesting points, examples, and analysis; very thorough, in my opinion. I'm not a microservices guy, but it covers a lot of general concepts that are also applicable outside that domain. Very good, thanks!
Strange architecture. They clearly have a queue, but instead of checking the previous request, they create a new one. It's like they managed to get the worst of both pub/sub and a task queue.
Reading this excellent article got me wondering whether job interviews for developer positions include enough questions about queue management.<p>"Ben" developed retries without exponential backoff and only learned about that concept in code review. Exponential backoff should be part of any basic developer curriculum (except, perhaps, one that does not mention networks at all).
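For reference, the textbook form of that concept, capped exponential backoff with full jitter, looks roughly like this (the base, cap, and attempt count are arbitrary):

    import random
    import time

    def retry_with_backoff(call, max_attempts=5, base=0.1, cap=10.0):
        """Retry a callable with capped exponential backoff and full jitter."""
        for attempt in range(max_attempts):
            try:
                return call()
            except Exception:
                if attempt == max_attempts - 1:
                    raise
                # sleep a random amount in [0, min(cap, base * 2**attempt))
                time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))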
Interesting reading. I think the article somewhat misses the point: the problem was the queuing of requests for which nobody was waiting for the response anymore. The same problem would manifest on a monolith with this kind of queuing. If the time to generate the response plus the maximum queue time were shorter than the timeout on the client side, the request amplification would not have happened. The first thing I do on HTTP-based backends is massively decrease the queue size; that fixes most of these problems. An even better solution would be the ability to purge old requests from the queue, but most frameworks do not allow that, probably due to the Unix socket interface.
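Where the client communicates its deadline (the header name below is a hypothetical, not a standard), the purge can at least be approximated at the application level by dropping stale requests at dequeue time instead of processing them:

    import time

    DEADLINE_HEADER = "X-Request-Deadline"  # hypothetical: absolute unix timestamp set by the client

    def handle_if_still_wanted(headers, do_work):
        """Skip requests whose client has already given up waiting."""
        deadline = float(headers.get(DEADLINE_HEADER, "inf"))
        if time.time() >= deadline:
            return 504, "client deadline already passed, request dropped"
        return 200, do_work()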