My old team at Google created a tool for better browser benchmarking called Tachometer: https://github.com/google/tachometer

It tries to deal with the uncertainties of different browsers, JITs, GCs, CPU throttling, varying hardware, etc., in several ways:

- Runs benchmarks round-robin, so that varying CPU load and thermal conditions affect each implementation roughly evenly.

- Reports a confidence interval for each implementation, not just a mean, and doesn't throw out outlier samples.

- For multiple implementations, compares the distributions of samples, de-emphasizing the means.

- For comparisons, reports an NxM difference table showing how each implementation compares to every other.

- Can auto-run until the confidence intervals for different implementations no longer overlap, giving high confidence that there is an actual difference (see the sketch below).

- Uses WebDriver to run benchmarks in multiple browsers, also round-robin, and compares the results.

- Can manage npm dependencies, so you can run the same benchmark with different dependency versions and see how they change the result (rough config example below).

Lit and Preact use Tachometer to tease out the performance impact of PRs, even on unreliable GitHub Actions hardware. We needed the advanced statistical comparisons precisely because a change could be faster or slower in different JIT tiers, different browsers, or different code paths.

We wanted to be able to test changes with a small but consistent overall perf impact, in the context of a non-micro-benchmark, and still get reliable results.

Tachometer is browser-focused, but we made it before there were so many server runtimes. It'd be really interesting to make it run benchmarks against Node, Bun, Deno, etc., too.
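
To make the auto-run idea concrete, here is a minimal TypeScript sketch of a "sample until the intervals separate" loop. This is my illustration, not Tachometer's actual code: `measureA`/`measureB` are hypothetical one-iteration timers, and it uses a normal-approximation interval where a real tool would use a t-distribution at small sample counts.

```ts
// Minimal sketch of the "auto-run until intervals separate" idea.
// NOT Tachometer's actual implementation.

interface Interval {
  low: number;
  high: number;
}

// 95% confidence interval for the mean, using a normal approximation
// (z = 1.96). A real tool would use a t-distribution for small n.
function confidenceInterval(samples: number[]): Interval {
  const n = samples.length;
  const mean = samples.reduce((sum, x) => sum + x, 0) / n;
  const variance =
    samples.reduce((sum, x) => sum + (x - mean) ** 2, 0) / (n - 1);
  const halfWidth = 1.96 * Math.sqrt(variance / n);
  return { low: mean - halfWidth, high: mean + halfWidth };
}

function overlaps(a: Interval, b: Interval): boolean {
  return a.low <= b.high && b.low <= a.high;
}

// measureA/measureB are hypothetical: each runs one timed iteration of
// an implementation and resolves to a duration in milliseconds.
async function runUntilResolved(
  measureA: () => Promise<number>,
  measureB: () => Promise<number>,
  minSamples = 50,
  maxSamples = 500,
): Promise<void> {
  const a: number[] = [];
  const b: number[] = [];
  while (a.length < maxSamples) {
    // Round-robin: alternate implementations so CPU load and thermal
    // state affect both roughly equally.
    a.push(await measureA());
    b.push(await measureB());
    if (
      a.length >= minSamples &&
      !overlaps(confidenceInterval(a), confidenceInterval(b))
    ) {
      console.log(`Intervals separated after ${a.length} samples each`);
      return;
    }
  }
  console.log('Intervals still overlap: no reliable difference detected');
}
```

Note that checking the confidence interval on the *difference* of the two means is a tighter test than checking for overlap (non-overlapping 95% intervals imply significance at stricter than the 5% level, so the overlap check is conservative), but the loop structure is the same either way.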
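
For the dependency-swapping feature, benchmark variants are declared in a JSON config. From memory it looks roughly like the following, where each variant installs a different version of the dependency under test; the field names are approximate and the `lit-html` versions are placeholders, so check the config schema in the repo for the authoritative format.

```json
{
  "sampleSize": 50,
  "benchmarks": [
    {
      "name": "render-v1",
      "url": "benchmarks/render.html",
      "packageVersions": {
        "label": "v1",
        "dependencies": { "lit-html": "^1.0.0" }
      }
    },
    {
      "name": "render-v2",
      "url": "benchmarks/render.html",
      "packageVersions": {
        "label": "v2",
        "dependencies": { "lit-html": "^2.0.0" }
      }
    }
  ]
}
```

Tachometer then runs the variants round-robin and reports the difference table across them, so a version-to-version regression shows up as a non-overlapping interval rather than a noisy mean.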