> a utilization of 0.90<p>> So we went from a latency of 1 second to 0.091 seconds which is an 11 times improvement.<p>There's your problem -- you should never allow unbounded queue growth at high utilization. Going from 80% to 90% utilization doubles your average wait times. We could similarly make this number arbitrarily large by pushing that utilization closer to 1, e.g. "We halved service time at 99.99% utilization and saw a 10000x improvement". But that's not interesting, as your users will complain that your service is unusable under heavy load long before you get to that point.<p>The typical "fix" is to add load shedding (e.g. based on queue length) combined with some intelligent backoff logic at the client (to reduce congestion), and call it a day. This has its own downsides, e.g. increased latency for everyone in cases of overload. Or, if your configured queue length is too large, you get bufferbloat.<p>(I have seen an argument for using LIFO instead of FIFO, which achieves much more predictable median performance at the expense of unbounded badness in the worst case. For this, your client needs to set deadlines, which it should probably be doing anyways.)
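To make the blow-up concrete, here is a minimal sketch of the response-time formula from TFA, T = S/(1 - ρ), using the article's 0.1 s service time (the other utilization values are just illustrative):<p><pre><code> # M/M/1 response time T = S / (1 - rho); S is the article's 0.1 s
S = 0.1

for rho in [0.5, 0.8, 0.9, 0.99, 0.9999]:
    T = S / (1 - rho)
    print(f"utilization {rho:.4f} -> response time {T:,.1f} s")

# utilization 0.5000 -> response time 0.2 s
# utilization 0.8000 -> response time 0.5 s
# utilization 0.9000 -> response time 1.0 s   (80% -> 90% doubles the wait)
# utilization 0.9900 -> response time 10.0 s
# utilization 0.9999 -> response time 1,000.0 s
</code></pre>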
Another of those somewhat counter-intuitive results is the answer to the question "how much do we need to scale up to avoid response time regression when the request rate is doubled?"<p>It is very easy to blurt out "well obviously we need twice the processing power!" but if we scale to twice the processing power, then start accepting twice the request rate – we will actually be serving each request in half the time we originally did.<p>To many people that sounds weird; it sounds like we got something for nothing. If I invite twice as many people to a party and buy twice as many cookies, it's not like each guest will get twice as many cookies – that just leaves the originally planned number of cookies for each guest.<p>But for response time it comes back to the first equation in TFA:<p><pre><code> T = 1/μ · 1/(1 - ρ)
</code></pre>
Doubling both arrival rate and maximum service rate leaves ρ – and the second factor with it – unchanged, but still halves the 1/μ factor, resulting in half the response time.<p>The appropriate amount to scale by is the k that solves the equation we get when we set the old response time T at request rate λ equal to the one at request rate 2λ and kμ processing power. This is something like<p><pre><code> T = 1/μ · 1/(1 - λ/μ) = 1/kμ · 1/(1 - 2λ/kμ)
</code></pre>
but cross-multiplying (both sides have the form 1/(service rate − arrival rate)) gives kμ − 2λ = μ − λ, i.e. kμ = μ + λ, which rearranges to the much simpler<p><pre><code> k = ρ + 1
</code></pre>
which a colleague of mine told me can be interpreted intuitively as "the processing power that needs to be added is exactly that which will handle an additional unit of the current utilisation on top of the current utilisation, i.e. twice as much."<p>This is mostly good news for people doing capacity planning in advance of events etc. If you run your systems at reasonable utilisation levels normally, you don't actually need that much additional capacity to handle peak loads.
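A quick numerical check of the k = ρ + 1 result (a sketch; the μ and λ values below are arbitrary):<p><pre><code> # Doubling arrivals with (rho + 1)x capacity keeps T = 1/(mu - lambda) fixed.
mu, lam = 100.0, 60.0             # service rate and arrival rate, illustrative
rho = lam / mu                    # utilisation = 0.6
k = rho + 1                       # = 1.6

T_before = 1 / (mu - lam)         # response time at (lambda, mu)
T_after = 1 / (k * mu - 2 * lam)  # response time at (2*lambda, k*mu)

print(T_before, T_after)          # both 0.025 s
print(1 / (2 * mu - 2 * lam))     # 0.0125 s: 2x capacity would halve T instead
</code></pre>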
<p><pre><code> In mathematical queueing theory, Little's law (also result, theorem, lemma, or formula[1][2]) is a theorem by John Little which states that the long-term average number L of customers in a stationary system is equal to the long-term average effective arrival rate λ multiplied by the average time W that a customer spends in the system. Expressed algebraically the law is
L = λW
The relationship is not influenced by the arrival process distribution, the service distribution, the service order, or practically anything else. In most queuing systems, service time is the bottleneck that creates the queue.
</code></pre>
<a href="https://en.m.wikipedia.org/wiki/Little%27s_law" rel="nofollow">https://en.m.wikipedia.org/wiki/Little%27s_law</a><p>An extremely useful law to remember. You’d be surprised how much bullshit it can detect!
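It is also easy to watch the law hold in a toy simulation (a sketch; the Poisson arrivals and lognormal service times are arbitrary choices - the law doesn't care):<p><pre><code> import random, bisect

# Single-server FIFO queue: compare time-average number in system (L)
# against arrival rate times average time in system (lambda * W).
random.seed(1)
lam, n = 9.0, 100_000

arr, t = [], 0.0
for _ in range(n):
    t += random.expovariate(lam)       # Poisson arrivals at 9 req/s
    arr.append(t)

dep, free_at = [], 0.0
for a in arr:
    free_at = max(free_at, a) + random.lognormvariate(-3.2, 0.5)
    dep.append(free_at)                # FIFO departure times

W = sum(d - a for a, d in zip(arr, dep)) / n   # average time in system

# Sample the number in system (arrived but not yet departed) at random times.
T_end = dep[-1]
counts = [bisect.bisect(arr, s) - bisect.bisect(dep, s)
          for s in (random.uniform(0, T_end) for _ in range(10_000))]
L = sum(counts) / len(counts)

print(L, lam * W)   # the two agree to within sampling noise
</code></pre>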
I am a bit worried about the overuse of Little's formula, especially in these catchy educational presentations. In reality queue sizes will dictate your consumer-observable latency profile, which in turn is dictated by the actual distribution of the service time - it is not a constant.<p>If you think about it, if you have an ideal system that serves users like clockwork, every X ms with no jitter, while your arrivals are also completely regular, every Y ms (Y > X, so the queue stays stable), then a queue length of 1 is basically sufficient. In reality, just like we all observe in real-life queues, service time is far from constant, and outliers result in queue buildup. This is why cutting the tail of the service-time distribution often results in better overall latency than simply reducing the average service time.<p>Little's formula of course also holds in the above scenario, but it deals in long-run averages and gives you no indication of what extreme behavior is lurking under the mask of those averages.
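A sketch of that effect (all numbers made up for illustration): two service-time distributions with the same 50 ms mean, fed by the same perfectly regular arrival stream, one of them with a heavy tail.<p><pre><code> import random

# Regular arrivals every 60 ms into a single FIFO server.
random.seed(2)

def avg_latency(service, n=100_000):
    t, free_at, total = 0.0, 0.0, 0.0
    for _ in range(n):
        t += 0.060                     # arrival every 60 ms
        start = max(t, free_at)        # wait if the server is busy
        free_at = start + service()
        total += free_at - t           # queueing delay + service
    return total / n

def constant():                        # exactly 50 ms, no jitter
    return 0.050

def heavy_tail():                      # same 50 ms mean, 5% outliers
    return 0.025 if random.random() < 0.95 else 0.525

print(avg_latency(constant))           # 0.05 s: the queue never builds
print(avg_latency(heavy_tail))         # noticeably worse, despite the same mean
</code></pre>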
Queueing theory refresher for software developers:<p><a href="https://github.com/joelparkerhenderson/queueing-theory">https://github.com/joelparkerhenderson/queueing-theory</a><p>This can help put the article in context. The author writes "wanted to freshen up my queuing theory a bit".
The math seems to be off in a few places:<p>1. When solving for the arrival rate λ, it should be λ = ρ/S instead of λ = S/ρ.<p>2. Calculating the arrival rate λ with the correct formula results in 9 requests/second instead of 10 requests/second.<p>3. Calculating the new utilization ρ should be 0.05 x 9 instead of 0.05 x 10.<p>The resulting utilization ρ is nevertheless correct (0.45), as is the calculated new latency.<p>edit: new lines
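For reference, the corrected chain of calculations (inferring from the quoted results that the article starts from S = 0.1 s and ρ = 0.9):<p><pre><code> S_old, rho_old = 0.1, 0.9

lam = rho_old / S_old            # = 9 req/s (lambda = rho/S, not S/rho)

S_new = 0.05                     # service time halved
rho_new = S_new * lam            # = 0.05 x 9 = 0.45

T_new = S_new / (1 - rho_new)    # = 0.05 / 0.55 = ~0.091 s
print(lam, rho_new, T_new)
</code></pre>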
What is service time?<p>Edit: service time is the time spent actually processing a request, as opposed to waiting in the queue. See 6:35 in <a href="https://youtu.be/03GsLxVdVzU" rel="nofollow">https://youtu.be/03GsLxVdVzU</a>
> a utilization of 0.90<p>Industry standard utilization is 50-60%, not 90%. At 90% utilization you are destined for failure, as 10% extra requests for even a few seconds would kill your service. And the target should stay around 60% whether service time increases or decreases, so halving the service time should ideally halve the resources.
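A sketch of the capacity arithmetic behind that (the 60% target and the arrival rate are illustrative):<p><pre><code> # servers needed = lam * S / rho_target, holding the target utilization fixed
lam, rho_target = 120.0, 0.6         # 120 req/s, 60% target

for S in [0.10, 0.05]:               # service time, then halved
    print(f"S = {S:.2f} s -> {lam * S / rho_target:.0f} servers")
# S = 0.10 s -> 20 servers
# S = 0.05 s -> 10 servers           # half the service time, half the capacity

print(0.9 * 1.10)                    # 10% extra traffic at 90% utilization
# -> rho = 0.99, and the 1/(1 - rho) factor grows tenfold
</code></pre>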