
Hunting down a C memory leak in a Go program

129 points by xerxes901 over 3 years ago

13 comments

tialaramex over 3 years ago
This goes on a very exciting journey. But the leak has a notable property that should make you reach for a particular tool quite early, just in case: the leak is *enormous*. The program's leak is much larger than the program itself and is in fact triggering the OOM killer. So my first thought (on Linux) would be to reach for my:

https://github.com/tialaramex/leakdice (or there's a Rust rewrite, https://github.com/tialaramex/leakdice-rust, because I was learning Rust)

leakdice is not a clever, sophisticated tool like Valgrind or eBPF programming, but that's fine, because this isn't a subtle problem - it's very blatant - and running leakdice takes seconds, so if it isn't helpful you've lost very little time.

Here's what leakdice does: it picks a random heap page of a running process which you suspect is leaking, and it displays that page as hex + ASCII.

That's all, and it might seem completely useless, unless you either read Raymond Chen's "The Old New Thing" or paid attention in statistics class.

Because your program is leaking so badly, *the vast majority* of heap pages (leakdice counts any pages which are writable and anonymous) are leaked. Any random heap page, therefore, is probably leaked. Now, if that page is full of zero bytes you don't learn very much - it's just leaking blank pages, which are hard to diagnose. But most often you're leaking (as was happening here) something with structure, and very often the sort of engineer assigned to investigate a leak can look at a 4 KB page of structure and go "Oh, I know what that is" from staring at the output in hex + ASCII.

This isn't a silver bullet, but it's very easy, and you can try it in an hour (not days or a week), including writing up something like "Alas, the leaked pages are empty" - which isn't a solution but certainly clarifies future results.
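A minimal sketch of the leakdice idea, for readers who want to see the mechanism without pulling the repo: parse /proc/<pid>/maps for writable anonymous mappings, pick one page at random, and hex-dump it from /proc/<pid>/mem. This is an illustration of the approach rather than the actual leakdice code; it assumes read access to the target's /proc/<pid>/mem (root, or a relaxed ptrace_scope) and skips pseudo-named regions such as [heap] and [stack] for brevity.

```c
/* Sketch of the leakdice approach: dump one random writable,
 * anonymous page of a target PID as hex + ASCII.
 * Build: gcc -O2 -o pagedump pagedump.c */
#include <ctype.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define PAGE 4096
#define MAXMAPS 4096

int main(int argc, char **argv)
{
    if (argc != 2) { fprintf(stderr, "usage: %s <pid>\n", argv[0]); return 1; }

    char path[64];
    snprintf(path, sizeof path, "/proc/%s/maps", argv[1]);
    FILE *maps = fopen(path, "r");
    if (!maps) { perror("open maps"); return 1; }

    /* Collect writable mappings that have no name (anonymous memory). */
    unsigned long starts[MAXMAPS], ends[MAXMAPS];
    size_t n = 0;
    char line[512];
    while (fgets(line, sizeof line, maps) && n < MAXMAPS) {
        unsigned long s, e;
        char perms[8], rest[256] = "";
        if (sscanf(line, "%lx-%lx %7s %*s %*s %*s %255[^\n]",
                   &s, &e, perms, rest) < 3)
            continue;
        int has_name = 0;
        for (char *p = rest; *p; p++)
            if (!isspace((unsigned char)*p)) { has_name = 1; break; }
        if (perms[1] == 'w' && !has_name) { starts[n] = s; ends[n] = e; n++; }
    }
    fclose(maps);
    if (n == 0) { fprintf(stderr, "no anonymous writable mappings found\n"); return 1; }

    /* Pick a random page, weighted by mapping size. */
    srand((unsigned)time(NULL) ^ (unsigned)getpid());
    unsigned long total = 0;
    for (size_t i = 0; i < n; i++) total += (ends[i] - starts[i]) / PAGE;
    unsigned long pick = (unsigned long)rand() % total, addr = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned long pages = (ends[i] - starts[i]) / PAGE;
        if (pick < pages) { addr = starts[i] + pick * PAGE; break; }
        pick -= pages;
    }

    /* Read that page out of the target's address space. */
    snprintf(path, sizeof path, "/proc/%s/mem", argv[1]);
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open mem"); return 1; }
    unsigned char buf[PAGE];
    if (pread(fd, buf, PAGE, (off_t)addr) != PAGE) { perror("pread"); return 1; }
    close(fd);

    /* Classic hex + ASCII dump, 16 bytes per row. */
    for (int off = 0; off < PAGE; off += 16) {
        printf("%012lx  ", addr + off);
        for (int i = 0; i < 16; i++) printf("%02x ", buf[off + i]);
        printf(" |");
        for (int i = 0; i < 16; i++)
            putchar(isprint(buf[off + i]) ? buf[off + i] : '.');
        printf("|\n");
    }
    return 0;
}
```

Run it a few times against the leaking process (e.g. ./pagedump 12345); if most runs show the same recognizable structure, that structure is probably what is leaking.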
WalterBright over 3 years ago
I once had to intercept every call to malloc/free/realloc and log it to find a leak. I wound up turning that into an immensely useful tool.
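One common way to build that kind of logger on Linux is an LD_PRELOAD shim that wraps the allocator via dlsym(RTLD_NEXT, ...). The sketch below shows the shape of such a tool - it is not WalterBright's tool, the file names are made up, and a production version needs more care (dlsym can itself allocate, and logging with fprintf from inside an allocator hook is only tolerable here because of the recursion guard).

```c
/* alloc_log.c - sketch of a malloc/free/realloc logging shim.
 * Build:  gcc -shared -fPIC -o alloc_log.so alloc_log.c -ldl
 * Use:    LD_PRELOAD=./alloc_log.so ./your_program 2> alloc.log */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>

static void *(*real_malloc)(size_t);
static void  (*real_free)(void *);
static void *(*real_realloc)(void *, size_t);
static __thread int in_hook;   /* stop fprintf's own allocations from re-logging */

static void init(void)
{
    real_malloc  = (void *(*)(size_t))dlsym(RTLD_NEXT, "malloc");
    real_free    = (void (*)(void *))dlsym(RTLD_NEXT, "free");
    real_realloc = (void *(*)(void *, size_t))dlsym(RTLD_NEXT, "realloc");
}

void *malloc(size_t size)
{
    if (!real_malloc) init();
    void *p = real_malloc(size);
    if (!in_hook) { in_hook = 1; fprintf(stderr, "malloc(%zu) = %p\n", size, p); in_hook = 0; }
    return p;
}

void free(void *p)
{
    if (!real_free) init();
    if (!in_hook) { in_hook = 1; fprintf(stderr, "free(%p)\n", p); in_hook = 0; }
    real_free(p);
}

void *realloc(void *old, size_t size)
{
    if (!real_realloc) init();
    void *p = real_realloc(old, size);
    if (!in_hook) { in_hook = 1; fprintf(stderr, "realloc(%p, %zu) = %p\n", old, size, p); in_hook = 0; }
    return p;
}
```

Pairing the logged addresses (every malloc/realloc without a matching free) with stack traces is what turns this from a log into a leak finder.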
cranekam over 3 years ago
Nice write-up! Using BPF to trace malloc/free is a good example of the tool's power. Unfortunately, IME, this approach doesn't scale to very high-load services. Once you're calling malloc/free hundreds of thousands of times a second, the overhead of jumping into the kernel every time cripples performance.

It would be great if one could configure the uprobes for malloc/free to trigger one in N times, but when I last looked they were unconditional. It didn't help to have the BPF probe just return early, either - the cost is in getting into the kernel to start with.

However, jemalloc itself has great support for producing heap profiles with low overhead. Allocations are sampled and the stacks leading to them are recorded in much the same way as the linked BPF approach:

https://github.com/jemalloc/jemalloc/wiki/Use-Case:-Heap-Profiling
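For reference, jemalloc's sampling profiler is driven through the MALLOC_CONF environment variable (sampling rate via lg_prof_sample) and can also be dumped on demand from inside the process with mallctl. A sketch is below, assuming a jemalloc built with --enable-prof; exact symbol and knob names can differ between builds and versions.

```c
/* Sketch: triggering a jemalloc heap-profile dump from inside a process.
 * Build: gcc demo.c -o demo -ljemalloc
 * Run:   MALLOC_CONF="prof:true,prof_prefix:jeprof.out,lg_prof_sample:19" ./demo
 * Inspect the dump with the jeprof tool that ships with jemalloc. */
#include <jemalloc/jemalloc.h>
#include <stdio.h>
#include <stdlib.h>

static void dump_heap_profile(void)
{
    /* "prof.dump" writes a profile file named from prof_prefix. */
    if (mallctl("prof.dump", NULL, NULL, NULL, 0) != 0)
        fprintf(stderr, "jemalloc profiling not enabled?\n");
}

int main(void)
{
    /* Allocate as usual; sampled allocation stacks are recorded cheaply. */
    for (int i = 0; i < 100000; i++) {
        void *p = malloc(512);
        (void)p;               /* deliberately leaked for the demo */
    }
    dump_heap_profile();
    return 0;
}
```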
G3rn0ti over 3 years ago
> our application was not actually handling events from that queue, so the size of that queue grew without bound

While new tools are great and I appreciate this nice write-up of how you can use BPF to find memory leaks, I wonder if they could have guessed the above issue within a minute of realizing that Valgrind did not report anything relevant. The program simply kept creating objects that were never consumed. With more context about the offending program, such design issues could be found by the responsible programmers just by "thinking it through". What I mean is: sometimes complicated tooling distracts you so much from the actual problem that you miss the obvious.
otterley over 3 years ago
Segment learned quite some time ago that confluent-kafka-go has problems like these (and doesn't support Contexts either), so they wrote a pure Go replacement instead: https://github.com/segmentio/kafka-go
nikanj over 3 years ago
The author uses jemalloc to confirm that malloc allocations go unfreed, then later speculates that the allocations might be something not visible to Valgrind, e.g. mmap.

I've often made similar mistakes, where the data from step 1 already rules out a hypothesis for step 2 - but I'm too sleep-deprived and desperate to realize it. Debugging production issues is the worst.
kubb over 3 years ago
It's insane how much ad hoc engineering and how many random details (like compiler flags) were needed to get at the location where the unfreed memory was allocated. It's likely an experienced team was on this for several days (unless they already had experience with all the tools used).

It's also crazy how the bug could be tied back to an unbounded queue that was backing up. It seems like the wrapper library should be designed so that not handling the queue events is hard to do; meanwhile, the experts walked right into that.
mmoll over 3 years ago
I suspect Valgrind's massif would have helped (massifly). It shows memory usage over time, and also what fraction of memory was allocated where.
jjluoma over 3 years ago
I wonder whether the statistics provided by librdkafka (also available with confluent-kafka-go) could have been used to solve the issue with less effort:

https://github.com/edenhill/librdkafka/blob/master/STATISTICS.md
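For context, librdkafka delivers those statistics as a JSON blob to a callback every statistics.interval.ms (confluent-kafka-go surfaces the same JSON as a Stats event, if I recall correctly), and the blob includes counters such as internal queue depths that would show a consumer queue growing without bound. A rough C sketch of wiring it up directly against librdkafka:

```c
/* Sketch: enabling librdkafka's periodic statistics callback.
 * Build: gcc kstats.c -o kstats -lrdkafka */
#include <librdkafka/rdkafka.h>
#include <stdio.h>

/* Called by librdkafka every statistics.interval.ms with a JSON blob. */
static int stats_cb(rd_kafka_t *rk, char *json, size_t json_len, void *opaque)
{
    (void)rk; (void)opaque;
    fprintf(stderr, "librdkafka stats (%zu bytes): %.*s\n",
            json_len, (int)json_len, json);
    return 0;   /* returning 0 lets librdkafka free the json buffer */
}

int main(void)
{
    char errstr[512];
    rd_kafka_conf_t *conf = rd_kafka_conf_new();

    /* Emit statistics once a minute. */
    if (rd_kafka_conf_set(conf, "statistics.interval.ms", "60000",
                          errstr, sizeof errstr) != RD_KAFKA_CONF_OK) {
        fprintf(stderr, "%s\n", errstr);
        return 1;
    }
    rd_kafka_conf_set_stats_cb(conf, stats_cb);

    rd_kafka_t *rk = rd_kafka_new(RD_KAFKA_CONSUMER, conf, errstr, sizeof errstr);
    if (!rk) { fprintf(stderr, "%s\n", errstr); return 1; }

    /* ... subscribe and poll as usual; stats_cb fires during polling ... */

    rd_kafka_destroy(rk);
    return 0;
}
```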
GnarfGnarf over 3 years ago
On an unrelated note, I am a Zendesk customer and absolutely love the app. Zendesk makes customer support fun!
matt123456789 over 3 years ago
The author describes using eBPF to trace malloc/free calls as a way around the program properly freeing all of its heap objects before exiting, which was enlightening to me. Would it have been possible to issue a kill -9 to the program in the middle of execution while running it under Valgrind to see this info as well? Or is it more that eBPF is cleaner and lets you see many more snapshots of memory allocations while the program is still running?
sam0x17 over 3 years ago
Fun side note: I once had to debug a GC stuttering issue in Crystal, and was delighted to find that the language was so damn open that I could just monkey-patch the actual allocator to print debug information whenever an allocation was made.
richardfey over 3 years ago
So the root cause was... not reading the librdkafka documentation?