My gut reaction is that this[1] looks like a NUMA issue. The flip-flop performance profile immediately reminds me of a very gnarly prod issue I led the investigation of. I’d want to eliminate NUMA first from my investigation.<p>You can use “numactl --cpunodebind=0 --membind=0 <your original java benchmark command>” to pin the JVM to a particular NUMA node and immediately discount my notion as bunkum.<p>[1] All the reasons mentioned that make me think NUMA:<p><pre><code> 1. The machine affected has multiple NUMA domains
2. The M1 is a single NUMA domain chip and does not exhibit the behaviour
3. The real thing making me think this - the consistent performance flip-flop is close to the ~40% overhead of accessing remote memory vs local that I’ve seen previously. You’re seeing higher overheads, but your code is a much tighter loop of memory access than my previous experience, so that could explain the difference, I think.</code></pre>
Have you tried setting up a JMH benchmark? This should allow you to see if the JIT is the cause of your slowdowns.
Also, running it under a profiler (I recommend async-profiler[1]) should give you a good idea of <i>where</i> the slowdown occurs which might help you pin it down further.<p>[1] <a href="https://github.com/jvm-profiling-tools/async-profiler" rel="nofollow">https://github.com/jvm-profiling-tools/async-profiler</a>
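If it helps, a minimal JMH harness for a sieve like this could look something like the sketch below. The class name, parameter and iteration counts are mine, not the author's; the point is that JMH handles warmup, forking and dead-code elimination for you, so you can compare JIT behaviour across environments more fairly.<p><pre><code>import java.util.BitSet;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

// Sketch only: names and iteration counts are placeholders, not the author's code.
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Warmup(iterations = 5)
@Measurement(iterations = 10)
@Fork(3)
@State(Scope.Benchmark)
public class SieveBenchmark {

    @Param({"1000000"})
    public int limit;

    @Benchmark
    public BitSet sieve() {
        BitSet composite = new BitSet(limit + 1);
        for (int p = 2; (long) p * p <= limit; p++) {
            if (!composite.get(p)) {
                for (int i = p * p; i <= limit; i += p) {
                    composite.set(i);
                }
            }
        }
        return composite; // returning the result keeps the JIT from eliding the work
    }
}
</code></pre>
<p>I believe recent JMH versions can also drive async-profiler directly via -prof async, which pairs nicely with the profiling suggestion above.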
I'm stumbling onto a strange Java performance irregularity, especially easy to hit with Docker on Linux. A super straightforward implementation of the sieve of Eratosthenes using a Java BitSet for storage sometimes drops in performance by 50%, even more with JDK 8. Any Java/JDK/JVM experts here who can shed some light on the mystery for me? I've been at it a while, but it is quite a bit outside my field of knowledge, and I am out of ideas for how to proceed. The linked blog article + the Git repository are an attempt to summarize/minimize things.
First make sure there are no user or kernel tasks that may sometimes be hogging resources. Maybe even try disabling some peripherals and similar, at the hardware or kernel level.<p>Lots of stuff you could try then:<p>* disabling ASLR<p>* disabling Turbo Boost and similar CPU settings<p>* changing the CPU performance scaling governor (from "powersave" to "performance"): printf performance | tee /sys/devices/system/cpu/cpufreq/policy*/scaling_governor<p>* running the code under a real-time scheduling policy (like SCHED_FIFO) with the highest priority. If you do try this, you also need to enable actual real-time scheduling by writing "-1" to /proc/sys/kernel/sched_rt_runtime_us.<p>But modern CPUs are not predictable in their performance; that's why microcontrollers are usually used for hard real-time requirements. So I doubt you'll ever be able to get absolutely consistent performance across all benchmark runs.<p>I've played similar benchmarking games myself before, and it turns out that, although I did most of the stuff described above and my code was C++ (no JIT), big slowdowns still happened with some inevitable but predictable frequency.
My first thought was that it's a bug in the deoptimizer, i.e. the JIT compiler dynamically switches back to a deoptimized form to throw away optimizations it now considers invalid, so that it can apply new optimizations more relevant to the current load/usage patterns. [0]<p>I think I've seen a bug report once about this deoptimization/optimization process happening in an infinite loop, but why would it happen only under Docker on Linux, and not on the Mac? Perhaps the deoptimizer relies on heuristics which depend on the current environment.<p>[0] <a href="https://stackoverflow.com/a/20542675" rel="nofollow">https://stackoverflow.com/a/20542675</a>
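To make the deopt/reopt cycle concrete, here's a toy example of my own (not the OP's code) that forces one: warm up a call site with a single receiver type so the compiler speculates and inlines, then introduce a second type so the speculation is invalidated. Running it with -XX:+PrintCompilation should show the compiled method being marked "made not entrant" and later recompiled.<p><pre><code>// Toy illustration of a deoptimization trigger (my example, not the OP's code).
public class DeoptDemo {
    interface Op { long apply(long x); }
    static final class Inc implements Op { public long apply(long x) { return x + 1; } }
    static final class Dec implements Op { public long apply(long x) { return x - 1; } }

    static long run(Op op, long n) {
        long acc = 0;
        for (long i = 0; i < n; i++) {
            acc += op.apply(i);   // monomorphic while only Inc has been seen
        }
        return acc;
    }

    public static void main(String[] args) {
        // Warm up with a single receiver type; the JIT speculates and inlines Inc.apply.
        System.out.println(run(new Inc(), 50_000_000L));
        // A second receiver type invalidates that speculation: the compiled
        // run() is thrown away (deoptimized) and later recompiled.
        System.out.println(run(new Dec(), 50_000_000L));
        System.out.println(run(new Inc(), 50_000_000L));
    }
}
</code></pre>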
So far my attempts to reproduce the alleged performance degradation have not been fruitful. I've written up a fairly detailed gist [1] on how to get CPU performance metrics; the appendix also has a dump of C1- and C2-compiled methods (useful for comparison). I also ran on a 2-node NUMA machine; binding cpu and memory to different nodes didn't yield a repro either.<p>[1] <a href="https://gist.github.com/srosenberg/41611d5f40cfcbad51aa27eb0dba1af0" rel="nofollow">https://gist.github.com/srosenberg/41611d5f40cfcbad51aa27eb0...</a>
Seeing that the benchmarks are running inside Docker, could it be Docker related? Does Docker throttle CPU differently on different machines?<p>Check the temperature of the CPU. Modern CPUs will slow down when they run too hot. Also, does an anti-virus scan run from time to time? Does an expensive backup or telemetry job run during the benchmarks?<p>Reading the blog again and seeing that the results drop step-wise to the exact same level, it really looks like the CPU is being throttled or some quota limit is being triggered.
This issue isn't directly related to BitSet. I have observed the same thing in my programming-language interpreter that runs on the JVM (well, it's written in Kotlin Multiplatform, so it runs on JS and natively as well).<p>I start the interpreter and measure the time it takes to add all the numbers below 1000000000.<p>The first time I run it after starting the interpreter, it always takes 1.4 seconds (within 0.1 second precision). The second time I measure it takes 1.7 seconds, and every invocation after that takes 2 seconds.<p>If I stop the interpreter and try again, I get exactly the same result.<p>I have not been able to explain this behaviour. This is on OpenJDK 11, by the way.<p>If anyone wants to test this, just run the interpreter from here: <a href="https://github.com/lokedhs/array" rel="nofollow">https://github.com/lokedhs/array</a><p>To run the benchmark, type the following command in the UI:<p><pre><code> time:measureTime { +/⍳1000000000 }
</code></pre>
My current best guess is that the optimiser decides to recompile the bytecode to improve performance, but the recompiled version actually ends up being slower.
I echo the earlier comments to use JMH for benchmarking. There are lots of subtle issues that JMH solves.<p>One thing I notice is that your sieve run doesn't return any values. Java can optimise out code that produces no effect in some circumstances.<p>EDIT: Although in that case, you'd expect to see it run insanely fast. Anyway, the point stands: there are lots of non-obvious issues in benchmarking.
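To illustrate the dead-code point, here's a hedged sketch (names are mine, workload is a stand-in) of the shapes a JMH benchmark method can take; only the first two guarantee the work survives the optimizer.<p><pre><code>import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.infra.Blackhole;

// Hypothetical names; the point is only that the benchmark must produce a
// value the JIT cannot prove is unused.
public class DeadCodeExample {

    @Benchmark
    public long returnTheResult() {
        long acc = 0;
        for (int i = 0; i < 1_000; i++) acc += (long) i * i;
        return acc;                  // consumed by JMH, so the loop survives
    }

    @Benchmark
    public void consumeExplicitly(Blackhole bh) {
        long acc = 0;
        for (int i = 0; i < 1_000; i++) acc += (long) i * i;
        bh.consume(acc);             // same effect, made explicit
    }

    @Benchmark
    public void looksBusyButIsnt() {
        long acc = 0;
        for (int i = 0; i < 1_000; i++) acc += (long) i * i;
        // acc is never used: the JIT is free to delete the whole loop,
        // which is the "runs insanely fast" case mentioned above
    }
}
</code></pre>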
A lot of your time is spent on branching and on accessing memory, so at the ASM level, memory access patterns, caching and branch prediction will affect your performance.<p>My bet is on the branch predictor, since IIRC AMD has a novel branch predictor that's pretty different from Intel's (not sure about the M1). In C-land you should try loop unrolling (in fact a decent compiler will do that for you). If the JVM has a control for that, force it; otherwise do it manually and pray the JIT does the right thing.<p>My first intuition was the BitSet's cache and memory behaviour, which might also be the case for some ranges of `p`: internally the BitSet is probably something like a chunk of memory with bitwise indexing. So to change a bit, the machine has to load a bunch of bytes into a register, set the bit, and then write that register back to memory. This is bad(*) if you want to set e.g. bits 0 to 31 in your BitSet, because you now get 64 memory accesses instead of two. This might affect you for small p, but with p >= 64 the access stride will be larger than 64. Thinking about it, in fact, this could be something that might throw off a naive JIT optimizer.
I would have to think a bit about how to improve this if your code were written in C; with the JVM I have no idea how you could optimize it. Maybe do two loops, one for p<=64 and the other for p>64. See the sketch below for what I mean about the access pattern.<p>Regarding cache misses, hm, 1M bits is only about 125 kB. On any modern machine that should fit into the cache; I would still monitor for cache misses though.<p>(*) Short story: I have a light case of PTSD since I once had to reimplement the storage backend for a Java SQL engine university project because a friend was trying to be too clever and did exactly that.<p>Anyway, interesting question and I wish you the best of luck figuring out what's going on there :)<p>//edited a few times; sorry, I'm bad at writing a comment in one go...
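Roughly what I mean, as a sketch of my own (assuming OpenJDK's long[]-backed BitSet; this is an illustration, not your code):<p><pre><code>import java.util.BitSet;

// Illustration of the access pattern described above. OpenJDK's BitSet is
// backed by a long[] of 64-bit words, so each set(i) is a read-modify-write
// of one word.
public class BitSetStride {

    // Small stride / dense range: consecutive writes keep hitting the same
    // word, so each iteration re-loads and re-stores a word the previous
    // iteration just wrote.
    static void markDense(BitSet bits, int from, int to) {
        for (int i = from; i < to; i++) {
            bits.set(i);
        }
    }

    // Same effect done a word at a time: BitSet.set(fromIndex, toIndex)
    // ORs whole 64-bit words, avoiding the per-bit read-modify-write.
    static void markDenseRange(BitSet bits, int from, int to) {
        bits.set(from, to);
    }

    // Stride >= 64: every write lands in a different word, so there is no
    // same-word dependency between iterations, only the usual cache-line
    // traffic.
    static void markStrided(BitSet bits, int stride, int limit) {
        for (int i = 0; i < limit; i += stride) {
            bits.set(i);
        }
    }
}
</code></pre>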
It could be that something else is contending for L1/L2 cache.<p>As others have mentioned, take JMH for a spin. Benchmark after a few hundred thousand iterations of warmup.<p>Also, as mentioned here, thermal throttling could have a big impact. Maybe you have access to a desktop Xeon or similar?
There have been issues[1] with seccomp. Maybe try with seccomp disabled for that container?<p><pre><code> --security-opt seccomp:unconfined
</code></pre>
More info here[2].<p>[1]: <a href="https://github.com/docker/for-linux/issues/738" rel="nofollow">https://github.com/docker/for-linux/issues/738</a><p>[2]: <a href="http://mamememo.blogspot.com/2020/05/cpu-intensive-rubypython-code-runs.html" rel="nofollow">http://mamememo.blogspot.com/2020/05/cpu-intensive-rubypytho...</a>
I ran into a similar problem with Docker while working on the same problem in Zig; Podman and bare metal had no performance regression, but Docker did. I kind of gave the fuck up on the Plummer sieve shootout at that point because, between this and the other performance regressions I was finding (like CPU throttling), I felt like I was fighting the stupid rules of the contest more than I was discovering things about performance.<p>Anyway, for the authors: try running it in Podman and see if it eliminates the perf regression.
What happens if you invert the test condition? That is, run it 10k times and see how long that took, rather than running for X time and seeing how many times you could do it?<p>You’re using System.currentTimeMillis(), which should be fast. My first thought was whether you were using Instant, where sometimes there’s one call to VM.getNanoTimeAdjustment and sometimes two.<p>Man, this is a tough one. I’ll try when I’m home.
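Something like this is what I mean by inverting it (my sketch, with a placeholder workload standing in for the sieve):<p><pre><code>// Sketch: time a fixed number of runs instead of counting how many runs
// fit into a fixed time window.
public class FixedIterations {
    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        for (int i = 0; i < 10_000; i++) {
            workload();   // stand-in for one sieve run
        }
        long elapsed = System.currentTimeMillis() - start;
        System.out.printf("10000 runs: %d ms total, %.4f ms/run%n",
                elapsed, elapsed / 10_000.0);
    }

    // Placeholder workload so the sketch compiles on its own.
    static long sink;
    static void workload() {
        long acc = 0;
        for (int i = 0; i < 100_000; i++) acc += i;
        sink = acc;   // keep the loop from being optimized away
    }
}
</code></pre>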
Every other theory listed here is far more likely, but I would try changing your loop from using System.currentTimeMillis() to using System.nanoTime(). It's a higher-resolution time source that has no relation to wall clock time but is more appropriate for timings. Classes like Stopwatch from Google's Guava use it.
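A minimal sketch of that swap (my example, not the author's loop): nanoTime() is a monotonic interval clock, while currentTimeMillis() follows the wall clock and can jump around.<p><pre><code>// Minimal sketch of timing with System.nanoTime().
public class NanoTiming {
    public static void main(String[] args) {
        long start = System.nanoTime();
        long acc = 0;
        for (int i = 0; i < 1_000_000; i++) acc += i;   // code under test
        long elapsed = System.nanoTime() - start;
        System.out.printf("result=%d, took %.3f ms%n", acc, elapsed / 1_000_000.0);
    }
}
</code></pre>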
How do you calculate the time?<p>Windows famously had low timer resolution on Java for a while. What happens if you run each round for increasingly longer periods?