What Metric to Use When Benchmarking?

30 points by dannas almost 3 years ago

8 comments

ajuc almost 3 years ago
In interactive programs (for example games) it's often more important to be consistent than to be fast.

For example, a game which runs at 120 fps but every 10 seconds has one frame that takes 1/30th of a second feels awful.

A game that runs at a constant 60 fps feels much better.

In this case it's better to just count the frames that took too long, and by how much.
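The counting approach described above can be sketched in a few lines. The 60 fps budget and the 120 fps-with-hitches frame pattern are taken from the comment; the function name is illustrative:

```python
# Count frames that exceeded the 60 fps budget (~16.67 ms) and by how much.
FRAME_BUDGET = 1 / 60  # seconds per frame at 60 fps

def slow_frame_stats(frame_times):
    """Return (number of over-budget frames, total overshoot in seconds)."""
    overshoots = [t - FRAME_BUDGET for t in frame_times if t > FRAME_BUDGET]
    return len(overshoots), sum(overshoots)

# 10 seconds at 120 fps with a single 1/30 s hitch, as in the example above
frames = [1 / 120] * 1199 + [1 / 30]
count, overshoot = slow_frame_stats(frames)
print(count, round(overshoot * 1000, 2))  # 1 slow frame, ~16.67 ms over budget
```

This metric stays flat for the steady 60 fps game and flags the 120 fps game's hitch, which is exactly the ordering the commenter wants.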
patrulek almost 3 years ago
CPU utilization. A modern CPU can execute something like 4-6 uops per cycle, IIRC. Multiply that by the clock frequency and the number of cores and you get a theoretical max. Then take your executed instructions per second and divide by this theoretical max. The better the ratio, the more efficient your program is.
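The arithmetic above is simple enough to sketch directly. The machine figures (5 uops/cycle, 3.5 GHz, 8 cores) and the 20 billion instructions/sec sample are illustrative assumptions, not measurements:

```python
# Rough CPU-utilization ratio as described above: instructions retired per
# second divided by the machine's theoretical peak throughput.
def utilization(insns_per_sec, uops_per_cycle=5, clock_hz=3.5e9, cores=8):
    peak = uops_per_cycle * clock_hz * cores  # theoretical max insns/sec
    return insns_per_sec / peak

# e.g. a program retiring 20 billion instructions per second on this machine
print(f"{utilization(20e9):.1%}")  # ≈ 14.3% of theoretical peak
```

In practice the real peak depends on the instruction mix (not every instruction decodes to one uop, and not every port is usable every cycle), so this ratio is a ceiling-relative estimate rather than an exact efficiency figure.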
carlmr almost 3 years ago
How can the 99% confidence interval for time in the first example be 7.391 ± 0.26? Most of the values listed lie outside of that.

I got a mean of 7.395 and a sigma of 0.533 (this is without the DoF adjustment, because these are guessed from the histogram). 2.576 * sigma gives the 99% confidence interval if we assume a normal distribution, i.e. 1.373.

In any case we'd also have to consider that we estimated the sigma from the distribution, so we'd have to apply an upward correction here: https://en.wikipedia.org/wiki/Degrees_of_freedom_(statistics).
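The commenter's arithmetic can be reproduced directly. The mean and sigma below are the commenter's own histogram-based guesses, and 2.576 is the standard two-sided z-score for 99% coverage under a normal distribution:

```python
# Reproduce the 99% normal-theory interval from the comment above.
mean, sigma = 7.395, 0.533   # guessed from the article's histogram
z99 = 2.576                  # two-sided 99% z-score, normal distribution

half_width = z99 * sigma
print(f"{mean:.3f} ± {half_width:.3f}")  # 7.395 ± 1.373
```

That half-width of ~1.373 s is over five times the ±0.26 quoted in the article, which is the discrepancy being pointed out. (A Student-t quantile rather than a z-score would widen it slightly further, which is the DoF correction the commenter mentions.)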
menaerus almost 3 years ago
> My rule of thumb is that when I'm looking at the performance of an individual function, CPU instructions executed are probably more appropriate

Yet a positive correlation between CPU time and wall-clock time in real-world programs isn't guaranteed. An improvement at the micro level (a function or some excerpt of a program) doesn't imply or necessitate an improvement at the macro level (end-to-end wall-clock performance of the whole program).

As a matter of fact, improving something at the micro level can negatively impact end-to-end performance, or not impact it at all, i.e. performance remains stable.

That said, this is nonetheless an interesting article because it talks about things that most benchmarketing blogs and engineers never mention. Getting performance figures that are both statistically significant _and_ reproducible is an amazingly difficult feat. Means and confidence intervals cannot help you, because they can remain stable and large; you can also use many other similar statistical metrics, but you could still be measuring consistently degraded performance of the system for some reason that is very difficult to identify. More often than not engineers will easily dismiss such benchmarks because they diverge so much from other measurements, but the thing is that you don't really know the reason behind such experiment results: it could be a measurement error, it could be a "glitch" in the system, whatever that might be (network, disk, kernel bug, etc.), but it could very well be an artifact of the software that you're actually benchmarking. Or it could be a combination of these things. Knowing which of these is the culprit for the results you are observing is a very difficult task. I haven't yet managed to find a robust approach which doesn't involve manual investigation.

Think NUMA-aware systems, where access to a non-local (remote) global variable can cost you a dozen cycles, so it turns out that the underlying problem actually stems from your code _and_ the way you're running the experiments! E.g. https://www.anandtech.com/show/16315/the-ampere-altra-review/3
dan-robertson almost 3 years ago
The whole blog is excellent.

One issue with looking at instructions retired for small functions is that the performance of small functions may be dominated by cache misses (and by not having branch-predictor data), so two versions may execute a similar number of instructions but have quite different perf due to fewer branches or better memory access patterns. But I guess if you're optimising that, then you'll know to look at that instead. I guess the moral of "get a measurement setup that is good enough to reliably measure the thing you actually care about" still holds.
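The access-pattern point above can be illustrated even in Python: reading the same elements in sequential versus shuffled order does essentially the same work per element, yet the shuffled walk is typically slower because it defeats the cache and the prefetcher. This is a rough sketch, not a rigorous benchmark (interpreter overhead dominates in CPython, so the gap is smaller than in native code):

```python
import random
import time

# Same number of element reads, same result; only the access order differs.
data = list(range(2_000_000))
order_seq = list(range(len(data)))
order_rnd = order_seq[:]
random.shuffle(order_rnd)

def walk(order):
    """Sum data[] in the given index order, returning (elapsed, checksum)."""
    t0 = time.perf_counter()
    total = 0
    for i in order:
        total += data[i]  # identical work; only the memory pattern changes
    return time.perf_counter() - t0, total

t_seq, s1 = walk(order_seq)
t_rnd, s2 = walk(order_rnd)
assert s1 == s2  # identical answer, potentially very different wall time
print(f"sequential {t_seq:.3f}s vs shuffled {t_rnd:.3f}s")
```

An instructions-retired counter would score both walks as near-identical, which is exactly why it can mislead when the bottleneck is memory rather than compute.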
ivankelly almost 3 years ago
The missing dimension here is the percentage of time the program is actually running on the core. Even if the benchmark takes 3 seconds to run 11 billion instructions, that doesn't tell you whether the core is still idle 90% of the time because the program is blocking on I/O. CPU-bound work should pin the core. This is especially true for server-side stuff, because if you are not maximizing time on core, you are paying for CPU that is not being used.
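One way to sketch this "time on core" ratio is to compare process CPU time against wall-clock time over the same interval; a ratio well below 1.0 means the program is mostly waiting, not computing. The workloads below are illustrative stand-ins:

```python
import time

def on_core_ratio(fn):
    """Run fn and return (CPU time) / (wall-clock time) for the call."""
    w0, c0 = time.perf_counter(), time.process_time()
    fn()
    wall = time.perf_counter() - w0
    cpu = time.process_time() - c0
    return cpu / wall

io_bound = lambda: time.sleep(0.2)           # blocks; burns almost no CPU
cpu_bound = lambda: sum(range(2_000_000))    # keeps the core busy throughout

print(f"I/O-bound ratio: {on_core_ratio(io_bound):.2f}")   # close to 0
print(f"CPU-bound ratio: {on_core_ratio(cpu_bound):.2f}")  # close to 1
```

On a multi-threaded program `process_time` aggregates across threads, so the ratio can exceed 1.0; tools like `time(1)` or `perf` report the same split as user/sys versus real time.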
jhoechtl almost 3 years ago
The SI-metric system please. Imperial units are so impractical.
Shadonototra almost 3 years ago
Watts