My old team at Google created a tool for better browser benchmarking called Tachometer: https://github.com/google/tachometer

It tries to deal with the uncertainties of different browsers, JITs, GCs, CPU throttling, varying hardware, etc., in several ways:

- Runs benchmarks round-robin, so that varying CPU load and thermal conditions affect each implementation roughly evenly.

- Reports a confidence interval for each implementation, not just a mean, and doesn't throw out outlier samples.

- For multiple implementations, compares the distributions of samples, de-emphasizing the means.

- For comparisons, reports an NxM difference table showing how each implementation compares to every other.

- Can auto-run until the confidence intervals for different implementations no longer overlap, giving high confidence that there is an actual difference (see the sketch below).

- Uses WebDriver to run benchmarks in multiple browsers, also round-robin, and compares the results.

- Can manage npm dependencies, so you can run the same benchmark with different dependency versions and see how they change the result (rough config example below).

Lit and Preact use Tachometer to tease out the performance impact of PRs, even on unreliable GitHub Actions hardware. We needed the advanced statistical comparisons precisely because a change could be faster or slower in different JIT tiers, different browsers, or different code paths.

We wanted to be able to test changes with a small but consistent overall perf impact, in the context of a non-micro-benchmark, and still get reliable results.

Tachometer is browser-focused, but we made it before there were so many server runtimes. It'd be really interesting to make it run benchmarks against Node, Bun, Deno, etc., too.
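
To make the auto-run idea concrete, here is a minimal TypeScript sketch of a "sample until the intervals separate" loop. This is my illustration, not Tachometer's actual code: `measureA`/`measureB` are hypothetical one-iteration timers, and it uses a normal-approximation interval where a real tool would use a t-distribution at small sample counts.

```ts
// Minimal sketch of the "auto-run until intervals separate" idea.
// NOT Tachometer's actual implementation.

interface Interval {
  low: number;
  high: number;
}

// 95% confidence interval for the mean, using a normal approximation
// (z = 1.96). A real tool would use a t-distribution for small n.
function confidenceInterval(samples: number[]): Interval {
  const n = samples.length;
  const mean = samples.reduce((sum, x) => sum + x, 0) / n;
  const variance =
    samples.reduce((sum, x) => sum + (x - mean) ** 2, 0) / (n - 1);
  const halfWidth = 1.96 * Math.sqrt(variance / n);
  return { low: mean - halfWidth, high: mean + halfWidth };
}

function overlaps(a: Interval, b: Interval): boolean {
  return a.low <= b.high && b.low <= a.high;
}

// measureA/measureB are hypothetical: each runs one timed iteration of
// an implementation and resolves to a duration in milliseconds.
async function runUntilResolved(
  measureA: () => Promise<number>,
  measureB: () => Promise<number>,
  minSamples = 50,
  maxSamples = 500,
): Promise<void> {
  const a: number[] = [];
  const b: number[] = [];
  while (a.length < maxSamples) {
    // Round-robin: alternate implementations so CPU load and thermal
    // state affect both roughly equally.
    a.push(await measureA());
    b.push(await measureB());
    if (
      a.length >= minSamples &&
      !overlaps(confidenceInterval(a), confidenceInterval(b))
    ) {
      console.log(`Intervals separated after ${a.length} samples each`);
      return;
    }
  }
  console.log('Intervals still overlap: no reliable difference detected');
}
```

Note that checking the confidence interval on the *difference* of the two means is a tighter test than checking for overlap (non-overlapping 95% intervals imply significance at stricter than the 5% level, so the overlap check is conservative), but the loop structure is the same either way.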
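
For the dependency-swapping feature, benchmark variants are declared in a JSON config. From memory it looks roughly like the following, where each variant installs a different version of the dependency under test; the field names are approximate and the `lit-html` versions are placeholders, so check the config schema in the repo for the authoritative format.

```json
{
  "sampleSize": 50,
  "benchmarks": [
    {
      "name": "render-v1",
      "url": "benchmarks/render.html",
      "packageVersions": {
        "label": "v1",
        "dependencies": { "lit-html": "^1.0.0" }
      }
    },
    {
      "name": "render-v2",
      "url": "benchmarks/render.html",
      "packageVersions": {
        "label": "v2",
        "dependencies": { "lit-html": "^2.0.0" }
      }
    }
  ]
}
```

Tachometer then runs the variants round-robin and reports the difference table across them, so a version-to-version regression shows up as a non-overlapping interval rather than a noisy mean.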