3,200% CPU Utilization

493 pointsby atomlib3 months ago

36 comments

Ooof. The core collections in Java are well understood to not be thread-safe by design, and this should have been noticed.OP should go through the rest of the code and see if there are other places where collections are potentially operated by multiple threads.>The easiest way to fix this was to wrap the TreeMap with Collections.synchronizedMap or switch to ConcurrentHashMap and sort on demand.That will make the individual map operations thread-safe, but given that nobody thought of concurrency, are you sure series of operations are thread-safe? That is, are you sure the object that owns the tree-map is thread-safe. I wouldn't bet on it.>Controversial Fix: Track visited nodesDon't do that! The collection will still not be thread-safe and you'll just fail in some other more subtle way .. either now, or in the future (if/when the implementation changes in the next Java release).>Sometimes, a detail oriented developer will notice the combination of threads and TreeMap, or even suggest to not use a TreeMap if ordered elements are not needed. Unfortunately, that didn’t happen in this case.That's not a good take-away! OP, the problem is you violating the contract of the collection, which is clear that it isn't thread-safe. The problem ISN'T the side-effect of what happens when you violate the contract. If you change TreeMap to HashMap, it's still wrong (!!!), even though you may not get the side-effect of a high CPU utilization.---------When working with code that is operated on by multiple threads, the only surefire strategy I found was to to make every possible object immutable and limiting any object that could not be made immutable to small, self-contained and highly controlled sections. We rewrote one of the core modules following these principles, and it went from being a constant source of issues, to one of the most resilient sections of our codebase. Having these guidelines in place, also made code reviews much easier.

评论 #43212326 未加载

评论 #43212692 未加载

评论 #43218950 未加载

评论 #43229232 未加载

Someone3 months ago

FTA: The code can be reduced to simply:<pre><code> public void someFunction(SomeType relatedObject, List<SomeOtherType> unrelatedObjects) { ... treeMap.put(relatedObject.a(), relatedObject.b()); ... // unrelatedObjects is used later on in the function so the // parameter cannot be removed } </code></pre> That’s not true. The original code only does the treeMap.put if unrelatedObjects is nonempty. That may or may not be a bug.You also would have to check that a and b return the same value every time, and that treeMap behaves like a map. If, for example, it logs updates, you’d have to check that changing that to log only once is acceptable.

评论 #43209362 未加载

layer83 months ago

Another way to get infinite loops is using a Comparator or Comparable implementation that doesn’t implement a consistent total order: <a href="https://stackoverflow.com/questions/62994606/concurrentskipset-compareto-instant-infinite-loop" rel="nofollow">https://stackoverflow.com/questions/62994606/concurrentskips...</a> (This is unrelated to concurrency.)Whether it occurs or not can depend on the specific data being processed, and the order in which it is being processed. So this can happen in production after seemingly working fine for a long time.

评论 #43208537 未加载

lucianbr3 months ago

> I always thought of race conditions as corrupting the data or deadlocking. I never though it could cause performance issues. But it makes sense, you could corrupt the data in a way that creates an infinite loop.Food for thought. I often think to myself that any error or strange behavior or even warnings in a project should be fixed as a matter of principle, as they could cause seemingly unrelated problems. Rarely is this accepted by whoever chooses what we should work on.

评论 #43208153 未加载

评论 #43208198 未加载

评论 #43208349 未加载

评论 #43209268 未加载

评论 #43212186 未加载

评论 #43208116 未加载

评论 #43208427 未加载

评论 #43210677 未加载

评论 #43208624 未加载

评论 #43218051 未加载

评论 #43209769 未加载

评论 #43227818 未加载

评论 #43219873 未加载

评论 #43216184 未加载

评论 #43220091 未加载

hn_acc13 months ago

The mention of "could barely ssh in" reminds me of a situation in grad school where our group had a Sun UltraSparc 170 (IIRC) with 1GB HD and 128 or 256 MB of RAM, shared by maybe 8 people in a small research group relating to parallel and distributed processing. Keep in mind, Sun machines were rarely rebooted, ever.So I guess the new user / student was trying to do things in parallel to speed things up when they chopped up their large text file into N (32 or 64) sections based on line number (not separate files), and then ran N copies of perl in parallel, each processing its own set of lines from that one file.Not only did you have a large amount (for back then) of RAM used by N copies of the perl interpreter (separate processes, not threads, mind you!) processing its data, as well, any attempt to swap was interleaved with frantic seeking to a different section of the same file to read a few more lines for one of N processes stalled on IO. Also, probably the Jth process had to read J/N of the entire file to get to its section. So the first section of the file was read N times, the next N-1, then N-2, etc.We (me and the fellow-student-trusted-as-sysadmin who had the root password) couldn't even get a login prompt on the console. Luckily, I had a logged-in session (ssh from an actual x-terminal - a large-screen "dumb" terminal), and "su" prompted for a password after 20-30 minutes of running it. After another 5-10 minutes, we had a root session and were able to run top and figure out what was going on. Killing the offending processes (after trying to contact the user) restored the system back to normal.Edit: forgot to say: had the right idea, but totally didn't understand the system's limitations. It was SEVERELY I/O limited with that hard drive and relatively low RAM, so just processing the data linearly would have easily been the best approach unless the amount of data to be kept would have gotten too large.

kachapopopow3 months ago

Yep, ran into this way too many times. Performing concurrent operations on non thread-safe objects in java or generally in any language produces the most interesting bugs in the world.

评论 #43208027 未加载

评论 #43208480 未加载

评论 #43208879 未加载

评论 #43213152 未加载

评论 #43209542 未加载

bmm6o3 months ago

I've seen this in production C#. Same symptoms, process dump showed a corrupt dictionary. We thought we were using ConcurrentDictionary everywhere but this one came in from a library. We were using .net framework, IIRC .net core has code to detect concurrent modification. I don't know how they implement it but it could be as simple as a version counter.It's odd he gets so hung up on npe being a critical ingredient when that doesn't appear to be present in the original manifestation. There's no reason to think you can't have a bug like this in C just because it doesn't support exceptions.To me it's all about class invariants. In general, they don't hold while a mutator is executing and will only be restored at the end. Executing another mutator before the invariants are reestablished is how you corrupt your data structure. If you're not in a valid state when you begin you probably won't be in a valid state when you finish.

评论 #43220486 未加载

scottlamb3 months ago

> Could an unguarded TreeMap cause 3,200% utilization?I've seen the same thing with an undersynchronized java.util.HashMap. This would have been in like 2009, but afaik it can still happen today. iirc HashMap uses chaining to resolve collisions; my guess was it introduced a cycle in the chain somehow, but I just got to work nuking the bad code from orbit [1] rather than digging in to verify.I often interview folks on concurrency knowledge. If they think a data race is only slightly bad, I'm unimpressed, and this is an example of why.[1] This undersynchronization was far from the only problem with that codebase.

评论 #43213132 未加载

jonathanlydall3 months ago

What about spotting a cycle by using an incrementing counter and then throwing an exception if it goes above the tree depth or collection size (presuming one of these is tracked)?Unlike the author’s hash set proposal it would require almost no memory or CPU overhead and may be more likely to be accepted.That being said, in the decade plus I’ve used C# I’ve never found that I failed to consider concurrent access on data structures in concurrent situations.

评论 #43213055 未加载

评论 #43217912 未加载

评论 #43208618 未加载

deskr3 months ago

Exceptions in threads are an absolute killer.Here's a story of a horror bughunt where the main characters are C++, select() and thread brandishing an exception: <a href="https://news.ycombinator.com/item?id=42532979">https://news.ycombinator.com/item?id=42532979</a>

评论 #43208795 未加载

hinkley3 months ago

The author has discovered a flavor of the Poison Pill. More common in event sourcing systems, it’s a message that kills anything it touches, and then is “eaten” again by the next creature that encounters it which also dies a horrible death. Only in this case it’s live-locked.Once the data structure gets into the illegal state, every subsequent thread gets trapped in the same logic bomb, instead of erroring on an NPE which is the more likely illegal state.

thinkingemote3 months ago

"I could barely ssh onto it"Is there a way to ensure that whatever happens (CPU, network overloaded etc) one can always ssh in? Like reserve a tiny bit of stuff to the ssh daemon?

评论 #43212804 未加载

评论 #43209756 未加载

评论 #43209328 未加载

bsder3 months ago

Does a ConcurrentSkipListMap not give the correct O(log N) guarantees on top of being concurrency friendly?java.util.concurrent is one of the best libraries ever. If you do something related to concurrency and don't reach for it to start, you're gonna have a bad time.

评论 #43208100 未加载

评论 #43208345 未加载

johnklos3 months ago

Does it not strike anyone else as odd that if someone said they had a single CPU, and that CPU were running a normal priority task at 100%, and that caused the machine to barely allow ssh, we'd say there's a much bigger problem than that someone is letting on?No 32 core (thread, likely) machine should ever normally be in a state where someone can "barely ssh onto it". Is Java really that janky? Or is "barely ssh onto it" a bit hyperbolic?

评论 #43213272 未加载

评论 #43217054 未加载

lolc3 months ago

What I like here is the discovery of the extra loop, and then still digging down to discover the root cause of the competing threads. I think I would have removed the loop and called it done.

loeg3 months ago

Just to probe the code review angle a little bit: shared mutable state should be a red/yellow flag in general. Whether or not it is protected from unsafe modification. That should jump out to you as a reviewer that something weird is happening.

MadVikingGod3 months ago

I was excited to see that not only does this article cover other languages, but that this error happens in go. I was a bit surprised because map access is generally protected by a race detector, but the RedBlack tree used doesn't store anything in a map anywhere.I wonder if the full race detector, go run -race, would catch it or not. I also want to explore if the RB tree used a slice instead of two different struct members if that would trigger the runtime race detector. So many things to try when I get back to a computer with go on it.

w10-13 months ago

Very well said, and very nice to see references to others on point.As a sidebar: I'm almost distracted by the clarity. The well-formed structure of this article is a perfect template for an AI evaluation of a problem.It'd be interesting to generate a bunch of these articles, by scanning API docs for usage constraints, then searching the blog-sphere for articles on point, then summarizing issues and solutions, and even generating demo code.Then presto! You have online karma! (and interviews...)But then if only there were a way to credit the authors, or for them to trace some of the good they put out into the world.So, a new metric: PageRank is about (sharing) links-in karma, but AuthorRank is partly about the links out, and in particular the degree to which they seem to be complete in coverage and correct and fair in their characterization of those links. Then a complementary page-quality metric identifies whether the page identifies the proper relationships between the issues, as reflected elsewhere, as briefly as possible in topological order.Then given a set of ordered relations for each page, you can assess similarity with other pages, detect copying (er, learning), etc.

moffkalast3 months ago

Anyone else mildly peeved by how CPU load is just load per core summed up to an arbitrary percentage all too often?Why not just divide 100% by number of cores and make that the max, so you don't need to know the number of cores to know the actual utilization? Or better yet, have those microcontrollers that Intel tries to pass off as E and L cores take up a much smaller percentage to fit their general uselessness.

评论 #43209658 未加载

评论 #43209839 未加载

评论 #43213373 未加载

svilen_dobrev3 months ago

i found this article very/deeply informative , memory-wise , concurency vs optimizations and troubles thereof:Programming Language Memory Models<a href="https://research.swtch.com/plmm" rel="nofollow">https://research.swtch.com/plmm</a><a href="https://news.ycombinator.com/item?id=27750610">https://news.ycombinator.com/item?id=27750610</a>

jchw3 months ago

Somewhat relatedly, I've always appreciated Go adding some runtime checks to detect race conditions, like concurrent map writes. It's not as good as having safe concurrency from the ground up, but on the other hand it's a lot better than not having detection, as everyone makes mistakes from time to time, and imperfect as it may be, the detection usually does catch it quickly. Especially nice since it is on in your production builds; a big obstacle with a lot of debugging tools is they're hard to get them where you need them...

hyperhello3 months ago

Should I read this as the Java TreeMap itself is thread unsafe, and the JVM is in a loop, or that the business logic itself was thread unsafe, and just needed to lock around its transactions?

评论 #43208074 未加载

评论 #43208499 未加载

评论 #43208326 未加载

neonsunset3 months ago

In practice, it's rarely an issue in C# because it offers excellent concurrent collections out of box, together with channels, tasks and other more specialized primitives. Worst case someone just writes a lock(thing) { ... } and calls it a day. Perhaps not great but not the end of the world either.I did have to hand-roll something that partially replicates Rust's RWLock<T> recently, but the resulting semantics turned out to be decent, even if not providing the exact same level of assurance.

fulafel3 months ago

> A while back my machine was so messed up that I could barely ssh onto it. 3,200% CPU utilization - all 32 cores on the host were fully utilized!You wouldn't notice the load from 32 cpu pegging threads or processes on a 32-core host when ssh'ing in. Sounds like the OP is maybe leaving out what the load on the machine was, maybe it was more like thousands of spinning threads?

doctor_phil3 months ago

Why does the fix need to remember all the nodes we have visited? Can't we just keep track of what span we are in? That way we just need to keep track of 2 nodes.In the graphic from the example we would keep track like this:<pre><code> low: - high - low: 11 high: - low: 23 high: - low: 23 high: 26 Error: now we see item 13, but that is not inside our span!</code></pre>

kristianp3 months ago

Has anyone noticed an inability to horizontally scroll the code samples on chrome Android phones? I've noticed it on a few different blogs. The window has to be dragged at lower point on the screen to scroll the code section.

评论 #43234805 未加载

评论 #43221098 未加载

memoryfault3 months ago

This is a fairly common bug in multithreaded use of .NET Dictionary, too.

xandrius3 months ago

Does CPU utilisation calculations work this way?- 1 CPU at 100% = 100%- 10 CPUs at 100% = 100%- 1 CPU at 100% + 9 at 0% = 10%Is that not right? Or is CPU utilisation not usually normalised?

评论 #43221076 未加载

xyst3 months ago

At least it was using all of the cores. The CPU running this application was cooking.

yburkov3 months ago

junior level error, dont understand whats to talk here about... Use ConcurrentSkipListMap and never write our own concurrent data structure unless you Doug Lea

评论 #43233368 未加载

mvc3 months ago

Haha, I was recently running a backfill was quite pleased when I managed to get it humming along at 6400% CPU on a 64vcpu machine. Fortunately ssh was still receptive.

krick3 months ago

Wow, that's pretty high-effort blog post (taking time to compare across 12 mainstream languages).

aqueueaqueue3 months ago

Serious though. What 100% means on cpu dashboards is as inconsistent as what time zone dates are in.

procaryote3 months ago

TL;DR; don't use thread unsafe data structures from multiple threads at once

opentokix3 months ago

click Java close

评论 #43212614 未加载

lcfcjs63 months ago

Java is usually shockingly inefficient.