TechEcho

10 comments

andrewvc over 12 years ago

I found a bug in eventmachine while writing em-zeromq, the eventmachine binding for zeromq. The important thing to understand here is that ZMQ is, in essence, a userland socket. Normal sockets are efficiently monitored using the epoll system call (or one of its older variants, say select or poll). However, as a userland library, ZMQ's 'sockets' aren't compatible with those calls; they use a userland equivalent of those kernel-level edge-triggered pollers. Integrating ZMQ with a traditional event library (like eventmachine) presents a problem at this point, as software like EM or Node typically requires IO to be across real file descriptors from real sockets, something a userland library can't provide. The ZMQ devs, however, realized this was a hotly requested feature and so devised a way around this limitation.

The compatibility layer in ZMQ takes the form of performing some internal communication across traditional unix IPC, in this case using a pipe IIRC. In other words, for some of its internal messaging, rather than simply using a function call, ZMQ will push data across a pipe. This pipe can then be exposed as a proxy for a ZMQ socket.

The downside of this strategy is that exposing FDs across software requires extreme care. Generally, it is assumed that one piece of software will have responsibility for an FD.

The actual issue in my case was that any ruby exception would cause the entire process to crash with an error about closing an already closed FD. What was happening was that, given an exception, both ZMQ and EM were trying to shut down all the FDs they knew about. Closing an FD that's already closed causes ZMQ to assert and crash instantly. It sounds simple once you're in the right frame of mind, but it took a good number of evenings to track down to that cause. It turned out that the EM option to not shut down FDs was non-functional in the end. A one-character patch provided the fix.
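The double-close failure described above can be reproduced in miniature. What follows is only a sketch in Python, not the actual ZMQ or EventMachine code: `Library` and `EventLoop` are hypothetical stand-ins invented for illustration, and the `owns_fd` flag is an assumed analogue of the "don't shut down FDs" option the comment mentions. Both objects know about the same pipe FD; when both try to close it, the second close fails with EBADF.

```python
import errno
import os


class Library:
    """Hypothetical stand-in for a library that creates and owns a pipe."""

    def __init__(self):
        self.read_fd, self.write_fd = os.pipe()

    def shutdown(self):
        # The library assumes it is the only owner of both ends.
        os.close(self.read_fd)
        os.close(self.write_fd)


class EventLoop:
    """Hypothetical stand-in for an event loop watching a borrowed FD."""

    def __init__(self, fd, owns_fd=True):
        self.fd = fd
        self.owns_fd = owns_fd

    def shutdown(self):
        # Only close the FD if the loop was told it is responsible for it.
        if self.owns_fd:
            os.close(self.fd)


def shutdown_both(owns_fd):
    """Shut down the loop, then the library; return the errno hit, or None."""
    lib = Library()
    loop = EventLoop(lib.read_fd, owns_fd=owns_fd)
    loop.shutdown()
    try:
        lib.shutdown()
    except OSError as exc:
        # Closing read_fd failed, so the write end is still open; release it.
        os.close(lib.write_fd)
        return exc.errno
    return None
```

Calling `shutdown_both(True)` hits the double close, while `shutdown_both(False)` leaves the FD to its single owner. Python reports the second close as an exception; a library that asserts on the same condition would abort the process instead.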

twoodfin over 12 years ago

Like danso, I admire the detective work here. I would like to point out, though, that Xcode's Instruments utility has a fantastically useful "Leaks" mode that will identify leaked allocations, including a stack trace. It can attach to a running process and has a non-disastrous impact on performance, though like most such tools it's voracious for memory.

Other platforms likely have similar tools, though I have yet to stumble across one as easy to use.

gingerlime over 12 years ago

"If you’ve stared at too many Linux coredumps, as I have, that number looks suspicious. Interpreted in little-endian, that is 0x00007f1b5358a800, which points near the top of the userspace portion of the address space on an amd64 Linux machine.

In fewer words: It’s most likely a pointer."

As someone who has not stared at any core dump for more than about 2 seconds, I admire this level of skill.

pc over 12 years ago

I wish there were more posts like this.

lkrubner over 12 years ago

This is slightly off-topic, but I worked on a Ruby project where we did something just like this:

"It was easy enough to work around the leak by adding monitoring and restarting the process whenever memory usage grew too large"

I was surprised, because I cannot think of any other language and/or framework where "just restart the process" is done so often. I mean, this is not a common attitude among Java programmers, I don't think it is common among C programmers, and I don't think it is common among Python programmers. But it does seem to be fairly standard in the Ruby community. David Heinemeier Hansson admitted this used to happen with Basecamp:

http://david.heinemeierhansson.com/posts/31-myth-2-rails-is-expected-to-crash-400-timesday

Can anyone else tell me of a community where this is done so commonly?

danso over 12 years ago

There are two things I really like about this walkthrough:

1. It shows how bugs can be something quite conceptually simple.

2. It shows the value of logical, detective-like thinking in tracking these bugs.

Even as a programmer, I still think of bug-hunting as something requiring an encyclopedic knowledge of the trivial and arcane. Obviously, it looks easier in hindsight, but the OP does a great job of demonstrating how you can discover a much-overlooked flaw with the right deductive thinking (and experience with profiling tools).

gravitronic over 12 years ago

THANK YOU - after looking at the heap perspective, and valgrind perspective, I've exhausted options looking at a memory leak in our production environment. This is another avenue. AWESOME.

pimeys over 12 years ago

I noticed the same kind of leak in one of our apps, and this post was worth more than gold as a guide to hunting it down.

rurounijones over 12 years ago

Now if there were an online course where you could learn this kind of stuff, I would be signed up so fast physicists would be re-evaluating general relativity.

We need some online courses dedicated to not-beginners :)

vskr over 12 years ago

Amazing post. How much time did it take to debug and identify the issue?

Tracking down a memory leak in Ruby's EventMachine