I prefer reporting the mean and the standard deviation - the paper advocates a confidence interval instead of standard deviation. Typically, I'm more concerned with the <i>spread</i> of obtained performance values than I am with how likely it is that our measured mean is within some interval. I generally don't think of that spread of obtained values as noise or random errors, but as systematic consequences of using real computing systems. The reason I don't consider that systematic <i>error</i> is that the sources of variation in real computer systems are often the result of things like memory hierarchies and system buffers that will exist in practice. Real systems will have these things, so I want my experiments to have them as well - so long as our benchmark has them in the same way a real production system will have them.<p>For example, see Table 2 in a recent paper I am a co-author on (page 8 of the pdf, page 73 using the proceedings numbering): <a href="http://www.scott-a-s.com/files/debs2017_daba.pdf" rel="nofollow">http://www.scott-a-s.com/files/debs2017_daba.pdf</a> In this paper, we care about latency, and we report the average latency along with the standard deviation. Here, a tight standard deviation is more important than confidence that the mean falls within a particular range. And the variation in latencies is caused by both software and hardware realities of the memory hierarchy.
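<p>To make the distinction concrete, here is a rough sketch in Python. The latency numbers are made up for illustration, and the helper name <code>summarize</code> and the normal-approximation critical value of 1.96 are my own assumptions, not anything taken from either paper. It shows that collecting more measurements shrinks the confidence interval on the mean, while the standard deviation (the spread I care about) stays roughly where it was.

```python
import math
import statistics


def summarize(latencies, z=1.96):
    """Return (mean, sample standard deviation, CI half-width).

    z=1.96 is the normal-approximation critical value for a ~95% interval
    (an assumption; Student's t is more appropriate for small samples).
    """
    n = len(latencies)
    mean = statistics.mean(latencies)
    stdev = statistics.stdev(latencies)  # sample standard deviation (n - 1)
    half_width = z * stdev / math.sqrt(n)  # half-width of the CI on the mean
    return mean, stdev, half_width


# Hypothetical latencies in microseconds: mostly fast, with occasional slow
# outliers of the kind a cache miss or a full buffer might cause.
sample = [10.0, 12.0, 11.0, 30.0, 10.0, 11.0, 12.0, 29.0]

small = summarize(sample)        # 8 measurements
large = summarize(sample * 100)  # same distribution, 800 measurements

for label, (mean, stdev, half_width) in (("n=8", small), ("n=800", large)):
    print(f"{label}: mean={mean:.2f} stdev={stdev:.2f} CI=+/-{half_width:.2f}")
```

The interval narrows by roughly the square root of the increase in sample size, but the slow outliers are still there in every run, and that is what the standard deviation keeps reporting.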
More recently:<p>"Quantifying performance changes with effect size confidence intervals"
Tomas Kalibera and Richard Jones
Technical Report 4-12, University of Kent, June 2012.<p><a href="https://www.cs.kent.ac.uk/pubs/2012/3233/" rel="nofollow">https://www.cs.kent.ac.uk/pubs/2012/3233/</a><p>Kalibera, Tomas and Jones, Richard E. (2013) "Rigorous Benchmarking in Reasonable Time"<p><a href="https://kar.kent.ac.uk/33611/" rel="nofollow">https://kar.kent.ac.uk/33611/</a>
SPECjvm98 is an outdated measure of both system and JVM performance. The benchmark to look at is SPECjbb2015, which very aggressively taxes JVM subsystems like the GC and the JIT.