"The percentile values we observed had quite unusual distribution"
It is pretty normal to have a long tail in the last few percentiles/tenths, and fortunately most latency monitoring tools do account for this now. Another gotcha is that 99.9 isn't the end of your tail either. Sometimes the "100th percentile" request is such an outlier that looking at it isn't useful, but you should know it exists.

(Regarding "speculative task") This is also called a "hedged request" in an article called "The Tail at Scale" [1].

[1] http://www-inst.eecs.berkeley.edu/~cs252/sp17/papers/TheTailAtScale.pdf
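To make the "know your whole tail" point concrete, here is a minimal nearest-rank percentile sketch in Haskell (the function is illustrative, not from any particular library): you would look at percentile 99 and 99.9 of the same sample alongside its plain maximum.

    import Data.List (sort)

    -- Nearest-rank percentile of a latency sample, for p in (0, 100].
    percentile :: Double -> [Double] -> Double
    percentile p xs = sorted !! idx
      where
        sorted = sort xs
        n      = length xs
        idx    = max 0 (min (n - 1) (ceiling (p / 100 * fromIntegral n) - 1))

    -- percentile 99.9 latencies shows the tail; maximum latencies is the
    -- "100th percentile" outlier you should at least know exists.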
I have found that in many distributed systems the latency of top-level requests follows a log-normal distribution, which sounds like what you are describing. By tuning when you launch backup requests, you can trade off overhead for reduced tail latencies. After experimenting with this at Twitter, I found a Jeff Dean paper from Google describing it as well. I haven't been able to find that paper again, though. Here is his presentation where he discusses backup requests:

https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/44875.pdf
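A rough sketch of that tradeoff, in the same Haskell style as the snippet below (the helper names are hypothetical): the fraction of requests slower than the hedge delay d is the extra load you pay, and a hedged request finishes at roughly min t1 (d + t2) instead of t1.

    -- Fraction of requests that would launch a backup if it is sent after
    -- waiting d; picking d near the observed p95 keeps the extra load at ~5%.
    backupOverhead :: Double -> [Double] -> Double
    backupOverhead d latencies =
      fromIntegral (length (filter (> d) latencies))
        / fromIntegral (length latencies)

    -- A request that would have taken t1 alone finishes at min t1 (d + t2)
    -- once an independent backup (taking t2) is started after delay d.
    hedgedLatency :: Double -> Double -> Double -> Double
    hedgedLatency d t1 t2 = min t1 (d + t2)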
I've seen this approach appear without a name very often in highly concurrent Haskell, Erlang, and Elixir code. Often a "safe" race combinator appears in good libraries that handles graceful termination. So much so that you can often find local libraries that look like this Haskell code (race is from Control.Concurrent.Async and cancels the losing copy; note threadDelay counts microseconds):

    import Control.Concurrent       (threadDelay)
    import Control.Concurrent.Async (race)

    -- Start a second copy of the action after the delay; whichever finishes first wins.
    speculatively :: Int -> IO a -> IO (Either a a)
    speculatively micros action = race action (threadDelay micros >> action)
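A hypothetical call site, assuming some fetchFromBackend :: String -> IO String lookup (the 30 ms hedge delay is made up):

    -- Launch a hedge after 30 ms (30000 microseconds) and take whichever
    -- copy of the lookup answers first.
    fetchHedged :: String -> IO String
    fetchHedged key = either id id <$> speculatively 30000 (fetchFromBackend key)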
If your backend is Redis, how is starting more speculative hits to Redis going to help, since it's single-threaded?

I mean, if this is indeed helping, it seems that it's not Redis that's the long pole, but maybe the request routing mesh or something.
But wouldn't this solution mean that once there is a small glitch in the system, say due to higher load than normal, there are 2x the requests and the system goes down completely?

I don't mean to be negative about the solution, I am just curious.
A week ago I might have thought this would be good for our microservices, as they were already using short timeouts with a retry but abandoning the original request. This is better. However, it has limitations and isn't the whole solution. It suffers from request amplification if the called service also depends on other services and applies the same retry policy, and it is still bound by request latency. The main takeaway I've learned for bounding microservice latency is that the inflow of dependent data should be asynchronous, so a request's latency depends only on the service's own data.
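A back-of-the-envelope sketch of the amplification concern (the numbers are illustrative): if every layer in a call chain hedges or retries a fraction h of its calls, traffic grows by a factor of (1 + h) per layer.

    -- Worst-case traffic multiplier when each of `depth` layers duplicates a
    -- fraction `h` of its outgoing calls (h = 1.0 means every call is retried).
    amplification :: Double -> Int -> Double
    amplification h depth = (1 + h) ^ depth

    -- amplification 1.0 3 == 8.0: blanket duplication three services deep
    -- already sends 8x the original requests downstream.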
This seems like a useful practice. Do you log how often this happens and which task is the winner? It seems like without that, you may not notice actual issues.
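One way to do that, sketched against the speculatively combinator above (putStrLn is just a stand-in for whatever logging or metrics your stack uses):

    import Control.Concurrent       (threadDelay)
    import Control.Concurrent.Async (race)

    -- Like `speculatively`, but records which copy won, so hedge rates and
    -- hedge wins can be watched over time instead of going unnoticed.
    speculativelyLogged :: Int -> IO a -> IO a
    speculativelyLogged micros action = do
      result <- race action (threadDelay micros >> action)
      case result of
        Left  a -> a <$ putStrLn "speculative: primary won"
        Right a -> a <$ putStrLn "speculative: hedge won"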