科技回声 (Tech Echo) — a tech news platform built with Next.js, serving global tech news and discussion.

Latency numbers

84 points, by Aeolus98, over 10 years ago

7 comments

nkurz, over 10 years ago
Knowing these latency numbers is essential for writing efficient code. But with modern out-of-order processors, it's often difficult to gauge how much the latency will hurt throughput without closer analysis. I'd love it if the math for this analysis and the associated hardware limits were also better known. Little's Law is a fundament of queuing theory. The original article is very readable: http://web.mit.edu/sgraves/www/papers/Little's%20Law-Published.pdf

It says that for a system in a stable state, "occupancy = latency x throughput". Let's apply this to one of the latencies in the table: main memory access. An obvious question might be "How many lookups from random locations in RAM can we do per second?" From the formula, it looks like we can calculate this (the 'throughput') if we know both the 'latency' and the 'occupancy'.

We see from the table that the latency is 100 ns. In reality, it's going to vary from ~50 ns to 200 ns depending on whether we are reading from an open row, on whether the TLB needs to be updated, and the offset of the desired data from the start of the cacheline. But 100 ns is a fine estimate for 1600 MHz DDR3.

But what about the occupancy? It's essentially a measure of concurrency, equal to the number of lookups that can be 'in flight' at a time. Knowing the limiting factor for this is essential to being able to calculate the throughput. But oddly, knowledge of what current CPUs are capable of in this department doesn't seem to be nearly as common as knowledge of the raw latency.

Happily, we don't need to know all the limits of concurrency for memory lookups, only the one that limits us first. This usually turns out to be the number of outstanding L1 misses, which in turn is limited by the number of Line Fill Buffers (LFBs) or Miss Handling Status Registers (MSHRs). (Could someone explain the difference between these two?)

Modern Intel chips have about 10 of these per core, which means that each core is limited to having about 10 requests for memory happening in parallel. Plugging that in to Little's Law:

    occupancy = latency x throughput
    10 lookups = 100 ns x throughput
    throughput = 10 lookups / 100 ns
    throughput = 100,000,000 lookups/second

At 3.5 GHz, this means you have a budget of about 35 cycles of CPU that you can spend on each lookup. Along with the raw latency, this throughput is a good maximum to keep in mind too.

It's often difficult to sustain this rate, though, since it depends on having the full number of memory lookups in flight at all times. If you have any failed branch predictions, the lookups in progress will be restarted, and your throughput will drop a lot. To achieve the full potential of 100,000,000 lookups per second per core, you either need to be branchless or perfectly predicted.
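The arithmetic above can be checked with a few lines of Python. This is a sketch using the comment's assumed figures (about 10 line fill buffers per core, ~100 ns DRAM latency, a 3.5 GHz clock), not measurements from any particular machine:

```python
# Little's Law for memory-level parallelism:
#   occupancy = latency x throughput
# Solve for throughput given the two assumed figures.
occupancy = 10        # concurrent lookups in flight (~LFBs per core)
latency_s = 100e-9    # ~100 ns per main-memory access

throughput = occupancy / latency_s
print(f"{throughput:,.0f} lookups/second")  # 100,000,000

# Cycle budget per lookup on an assumed 3.5 GHz core
clock_hz = 3.5e9
print(f"{clock_hz / throughput:.0f} cycles/lookup")  # 35
```

Changing `occupancy` shows why memory-level parallelism matters so much: halving the number of in-flight misses halves the achievable lookup rate even though the latency of each individual access is unchanged.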
Klinky, over 10 years ago
Any document relating benchmarks or performance numbers should include the exact make/model of the hardware involved. So often I see performance numbers reported by developers without much detail on the actual hardware being used. It should be painfully obvious that numbers will vary greatly depending on hardware/platform.
amelius, over 10 years ago
Related and also very interesting: [1]

[1] Ulrich Drepper, What Every Programmer Should Know About Memory, http://www.cs.bgu.ac.il/~os142/wiki.files/drepper-2007.pdf
thrownaway2424, over 10 years ago
The "TCP packet retransmit" one is interesting, because it's a parameter you can set in your socket library or kernel. On Linux the default minimum RTO is 200 ms, even if the RTT of the connection is < 1 ms. For local networking you really, really want to reduce the minimum RTO to a much smaller number. If you don't, random packet loss is going to dominate your tail latency.
vowelless, over 10 years ago
> Lets multiply all these durations by a billion:

This was great at helping me develop a better intuition for the numbers. Thanks!
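The "multiply by a billion" trick maps nanoseconds onto human timescales (1 ns becomes 1 s), and is easy to script. The figures below are the commonly cited order-of-magnitude estimates from latency tables like this one, not measurements of any specific machine:

```python
# Scale ns -> s (x 1e9): an L1 hit becomes half a second,
# a DRAM access becomes minutes, a disk seek becomes months.
latencies_ns = {
    "L1 cache reference":    0.5,
    "Main memory reference": 100.0,
    "Disk seek":             10_000_000.0,  # 10 ms
}
for name, ns in latencies_ns.items():
    scaled_s = ns  # after x1e9, each nanosecond reads as one second
    print(f"{name}: {ns:g} ns -> {scaled_s:g} s "
          f"(~{scaled_s / 86_400:.1f} days)")
```

At this scale a 10 ms disk seek stretches to roughly 116 days, which makes the gap between cache and spinning disk viscerally obvious.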
Oculus, over 10 years ago
Earlier in the summer I decided to create a phone background with these numbers so whenever I had free time I could work on memorizing them: https://twitter.com/EmilStolarsky/status/496298288325599233
adsche, over 10 years ago
    Send 2K bytes over 1 Gbps network ....... 20,000 ns = 20 µs

Any reason why (arbitrarily?) take 2K here and not 1K?
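Whatever the reason for 2K, the serialization time itself is easy to recompute. This sketch counts only the time to clock the bits onto the wire, ignoring propagation delay and protocol overhead:

```python
# Serialization time for the table's entry: bits / link rate.
payload_bytes = 2_000           # "2K bytes" as in the table
link_bps = 1_000_000_000        # 1 Gbps
seconds = payload_bytes * 8 / link_bps
print(f"{seconds * 1e9:,.0f} ns")  # 16,000 ns
```

The raw figure comes out to 16 µs (or ~16.4 µs if "2K" means 2048 bytes), so the table's 20,000 ns appears to be rounded up to one significant figure.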