TechEcho
The hunt for a cluster-killer Erlang bug (2021)

291 points by eproxus almost 3 years ago

9 comments

banashark almost 3 years ago
Very interesting writeup. Solving problems in distributed systems is always an interesting process, and it very frequently uncovers areas ripe for better instrumentation.

The Erlang ecosystem seemed very mature and well iterated. It almost seemed like the "Rails of distributed systems", with things like Mnesia.

The one downside was that, while I was working on grokking the system, the limits and observability of some of these built-in solutions were not so clear. What happens when a mailbox exceeds its limit? Does the data get dropped? How do you recover from a network segmentation? These proved somewhat challenging to reproduce and troubleshoot (as distributed problems can be).

There are answers for all of these scenarios, but in some cases it would almost have been simpler to use an external technology (Redis, etc.) with established scalability and observability.

I say this knowing that there was plenty about the ecosystem I did not get time to learn in the depth I wanted, but I am curious how more experienced Erlang engineers view the problem.
andyjohnson0 almost 3 years ago
This is a great write-up. I love reading stuff like this, and Erlang/OTP/Kafka is definitely on my list of tech to investigate.

Slightly tangential, but what's the market like for Erlang developers? I know that it was originally developed for telecoms and phone switches, and that WhatsApp uses (used?) it in their back end. Are there particular business sectors that tend to use it now? Geographical spread, perm/contract, salaries, etc.?
waynesonfire almost 3 years ago
That was really fun to read! Nice work digging into the root cause.

The issue where boxing State#state.partition copies the entire state object is very counter-intuitive and would have got me as well. I would expect it to store only the partition value.
tiffanyh almost 3 years ago
Fantastic, detailed write-up. I wish there were more articles in this style on HN.
waisbrot almost 3 years ago
I felt like a missing conclusion was "Kafka is a critical dependency". They'd started out with the assumption that Kafka is a soft dependency and found this library bug that made it a hard dependency (which they then patched).

But isn't going metrics-blind whenever Kafka goes down bad enough that you should push more effort into keeping Kafka alive?
rramadass almost 3 years ago
Relevant: https://www.erlang-in-anger.com/
davidw almost 3 years ago
> So our initial 1 GB binary data pretty printed as a string will take about 1 GB × 3.57 characters/byte × 2 words/character × 8 bytes/word = 57.12 GB memory.

Yeah, I saw that one in an Erlang system too. It was pretty ugly.
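
The arithmetic in the quoted sentence can be checked with a short sketch. The factors (3.57 characters per byte, 2 words per character, 8 bytes per word) are taken directly from the quote; the function name and parameter names are made up for illustration. The 2 words per character presumably reflects that an Erlang string is a list with one two-word cons cell per character, and 8 bytes per word corresponds to a 64-bit system.

```python
def pretty_printed_size_gb(
    binary_gb: float,
    chars_per_byte: float = 3.57,  # figure from the quoted article
    words_per_char: int = 2,       # one list cell per character (assumed reading)
    bytes_per_word: int = 8,       # 64-bit machine word
) -> float:
    """Estimate the memory needed to hold a binary pretty printed as a string."""
    return binary_gb * chars_per_byte * words_per_char * bytes_per_word


if __name__ == "__main__":
    estimate = pretty_printed_size_gb(1.0)
    print(f"1 GB of binary data needs about {estimate:.2f} GB as a string")
```

The three factors multiply to 57.12, so every byte of binary data costs roughly 57 bytes once it is rendered as a character list.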
throwaway81523 almost 3 years ago
OK, I've looked at this article and it is pretty good. It sounds like there were various Erlang antipatterns in the program, but the actual bug was a user-level memory leak in an Erlang process that locked the scheduler, which isn't good. Also, the memory leak was amplified because it involved serializing an object to memory that contained a lot of repeated references to other objects. So the object itself, while fairly large, still used only a manageable amount of memory. But the serialized version's size (because of the repeated content) grew exponentially with the recursion depth. That in turn was due to an Erlang "optimization" that didn't try to indicate the repeated references in the object during serialization. Also of interest was using gdb on the Erlang node to debug this, since the usual Erlang interactive shell was hosed.
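
The exponential growth described in this comment can be reproduced in miniature. The following is a Python analogy only, not the Erlang mechanism from the article: it builds a structure in which every level refers to the previous level twice, then flattens it with `json.dumps`, which writes out each reference in full because JSON has no way to mark shared subterms. All function names here are hypothetical.

```python
import json


def build_shared(depth: int) -> list:
    """Build a nested list where each level holds two references to the level below."""
    node = ["leaf"]
    for _ in range(depth):
        node = [node, node]  # same object twice: shared in memory, not copied
    return node


def distinct_lists(node, seen=None) -> int:
    """Count the list objects that actually exist in memory."""
    seen = set() if seen is None else seen
    if isinstance(node, list) and id(node) not in seen:
        seen.add(id(node))
        for child in node:
            distinct_lists(child, seen)
    return len(seen)


def flattened_length(node) -> int:
    """Length of the serialized form, with every shared reference expanded."""
    return len(json.dumps(node))


if __name__ == "__main__":
    for depth in (5, 10, 15):
        tree = build_shared(depth)
        print(depth, distinct_lists(tree), flattened_length(tree))
```

The in-memory object count grows by one per level, while the flattened length roughly doubles per level, which is the same shape of blow-up the comment describes.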
tpmx almost 3 years ago
I thought Klarna had moved away from Erlang, mostly towards Java. I guess not.