Very interesting writeup.
Distributed systems problem solving is always an interesting process. It very frequently uncovers areas ripe for instrumentation improvement.<p>The Erlang ecosystem seemed very mature and well iterated. It almost seemed like the "Rails of distributed systems", with things like Mnesia.<p>The one downside seemed to be that while I was working on grokking the system, the limits and observability of some of these built-in solutions were not so clear. What happens when a mailbox exceeds its limit? Does the data get dropped? Or how do you recover from a network partition? These proved somewhat challenging to reproduce and troubleshoot (as distributed problems can be).<p>There are answers for all of these interesting scenarios, but in some cases it almost would have been simpler to use an external technology (Redis, etc.) with established scalability/observability.<p>I say this knowing there was plenty of the ecosystem I did not get time to learn in the depth I wanted, but I am curious how more experienced Erlang engineers view the problem.
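For the observability side, a minimal sketch of the sort of inspection that answers "how big is that mailbox?" from a remote shell. Pid is a placeholder for whatever process you suspect, and the last call assumes the recon library happens to be on the node:

    %% Mailbox length and total memory of one suspect process
    {message_queue_len, Len} = erlang:process_info(Pid, message_queue_len),
    {memory, Bytes} = erlang:process_info(Pid, memory),
    io:format("queue=~p memory=~p bytes~n", [Len, Bytes]),

    %% Top 5 processes on the node by mailbox length (assumes recon is available)
    recon:proc_count(message_queue_len, 5).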
This is a great write-up. I love reading stuff like this, and Erlang/OTP/Kafka is definitely on my list of tech to investigate.<p>Slightly tangential, but what's the market like for Erlang developers? I know that it was originally developed for telecoms and phone switches, and WhatsApp uses (used?) it in their back-end. Are there particular business sectors that tend to use it now? Geographical spread, perm/contract, salaries, etc?
That was really fun to read! Nice work digging into the root cause.<p>The issue where boxing State#state.partition copies the entire state object is very counter-intuitive and would have got me as well. I would expect it to only store the partition value.
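If it is the usual closure-capture gotcha, the shape is roughly this (a sketch with a made-up record, not the article's actual code): a fun whose body reads State#state.partition has State as its only free variable, so the closure keeps the whole record alive, while binding the field to a variable first captures just the small value.

    -record(state, {partition, big_buffer}).

    %% Holds the entire #state{} record (big_buffer included) for as long as
    %% the fun is referenced, because State is the fun's free variable.
    bad(State = #state{}) ->
        fun() -> State#state.partition end.

    %% Captures only the partition value.
    good(#state{partition = Partition}) ->
        fun() -> Partition end.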
I felt like a missing conclusion was "Kafka is a critical dependency". They'd started out with the assumption that Kafka is a soft dependency and found this library bug that made it a hard dependency (which they then patched).<p>But isn't going metrics-blind whenever Kafka goes down bad enough that you should push more effort into keeping Kafka alive?
> So our initial 1 GB binary data pretty printed as a string will take about 1 GB × 3.57 characters/byte × 2 words/character × 8 bytes/word = 57.12 GB memory.<p>Yeah, I saw that one in an Erlang system too. It was pretty ugly.
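You can reproduce the blow-up at small scale; a rough sketch (illustrative only, the 1 GB case is just this multiplied up): the pretty-printed form is a list of characters, and every list cell costs two words on top of the ~3.57 characters per byte.

    Bin = crypto:strong_rand_bytes(1024),               %% 1 KB of binary data
    Chars = lists:flatten(io_lib:format("~p", [Bin])),  %% pretty-printed as a char list
    WordSize = erlang:system_info(wordsize),            %% 8 bytes on a 64-bit VM
    Bytes = erts_debug:flat_size(Chars) * WordSize,     %% words used by the list, as bytes
    io:format("~p binary bytes -> ~p bytes printed (~.1fx)~n",
              [byte_size(Bin), Bytes, Bytes / byte_size(Bin)]).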
Ok I've looked at this article and it is pretty good. It sounds like there were various Erlang antipatterns in the program, but the actual bug was a user-level memory leak in an Erlang process that locked the scheduler, which isn't good. Also, the memory leak was amplified because it involved serializing an object to memory that contained a lot of repeated references to other objects. So the object itself, while fairly large, still used only a manageable amount of memory. But the serialized version's size (because of the repeated content) grew exponentially with the recursion depth. That in turn was due to an Erlang "optimization" that didn't preserve the shared references during serialization, so each repeated sub-term got expanded in full. Also of interest was using gdb on the Erlang node to debug this, since the usual Erlang interactive shell was hosed.
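A quick way to see that sharing/flattening effect (a sketch, not the article's actual data structure): build a term that refers to itself-so-far twice per level. In place it grows linearly, but the flattened size, which is what a copy or serialization pays when sharing is not preserved, grows as 2^N.

    %% 20 levels of {Acc, Acc}: a small term with heavily shared sub-terms
    Nested = lists:foldl(fun(_, Acc) -> {Acc, Acc} end, base, lists:seq(1, 20)),
    io:format("in place: ~p words, flattened: ~p words~n",
              [erts_debug:size(Nested), erts_debug:flat_size(Nested)]).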