The hunt for a cluster-killer Erlang bug (2021)

291 points by eproxus, almost 3 years ago

9 comments

banashark, almost 3 years ago
Very interesting writeup. Distributed systems problem solving is always a very interesting process. It very frequently uncovers areas ripe for instrumentation improvement.

The Erlang ecosystem seemed very mature and iterated. It almost seemed like the "Rails of distributed systems" with things like Mnesia.

The one downside seemed to be that while I was working on grokking the system, the limits and observability of some of these built-in solutions were not so clear. What happens when a mailbox exceeds its limit? Does the data get dropped? Or, how do you recover from a network segmentation? These proved somewhat challenging to reproduce and troubleshoot (as distributed problems can be).

There are answers for all of these interesting scenarios, but in some cases it almost would have been simpler to use an external technology (Redis, etc.) with established scalability/observability.

I say this knowing that there was plenty I did not get time to learn about the ecosystem in the depth that I desired, but I am curious how more experienced Erlang engineers view the problem.
andyjohnson0, almost 3 years ago
This is a great write-up. I love reading stuff like this, and Erlang/OTP/Kafka is definitely on my list of tech to investigate.

Slightly tangential, but what's the market like for Erlang developers? I know that it was originally developed for telecoms and phone switches, and WhatsApp uses (used?) it in their back end. Are there particular business sectors that tend to use it now? Geographical spread, perm/contract, salaries, etc.?
waynesonfire, almost 3 years ago
That was really fun to read! Nice work digging into the root cause.

The issue where boxing State#state.partition copies the entire state object is very counter-intuitive and would have gotten me as well. I would expect it to store only the partition value.
tiffanyh, almost 3 years ago
Fantastic, detailed write-up. Wish there were more articles in this style on HN.
waisbrot, almost 3 years ago
I felt like a missing conclusion was "Kafka is a critical dependency". They'd started out with the assumption that Kafka is a soft dependency and found this library bug that made it a hard dependency (which they then patched).

But isn't going metrics-blind whenever Kafka goes down bad enough that you should put more effort into keeping Kafka alive?
rramadass, almost 3 years ago
Relevant: https://www.erlang-in-anger.com/
davidw, almost 3 years ago
> So our initial 1 GB binary data pretty printed as a string will take about 1 GB × 3.57 characters/byte × 2 words/character × 8 bytes/word = 57.12 GB memory.

Yeah, I saw that one in an Erlang system too. It was pretty ugly.
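The arithmetic in the quoted sentence can be checked with a short sketch. It is written in Python rather than Erlang, and the function names are made up for illustration. One assumption is labeled in the code: that the 3.57 characters/byte figure corresponds to printing each byte as a decimal number followed by a comma, averaged over all 256 byte values. The 2 words/character and 8 bytes/word factors are taken directly from the quote.

```python
# Hypothetical helper names; a sanity check of the quoted estimate, not code from the article.

WORDS_PER_CHAR = 2  # from the quote: each character of the string costs 2 words
BYTES_PER_WORD = 8  # from the quote: 8-byte words (64-bit system)


def avg_chars_per_byte() -> float:
    """Average printed width of one byte value.

    Assumption: each byte is printed as a decimal number (1 to 3 digits)
    plus one separating comma, and byte values are uniformly distributed.
    """
    total_digits = sum(len(str(b)) for b in range(256))
    return total_digits / 256 + 1  # +1 for the comma


def pretty_printed_size_gb(binary_gb: float, chars_per_byte: float) -> float:
    """Memory needed for the pretty-printed string, in GB."""
    return binary_gb * chars_per_byte * WORDS_PER_CHAR * BYTES_PER_WORD


if __name__ == "__main__":
    print(round(avg_chars_per_byte(), 2))
    print(round(pretty_printed_size_gb(1, 3.57), 2))
```

Under that assumption the average works out to roughly 3.57 characters per byte, which matches the factor in the quote, and the product reproduces the 57.12 GB estimate.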
throwaway81523, almost 3 years ago
OK, I've looked at this article and it is pretty good. It sounds like there were various Erlang antipatterns in the program, but the actual bug was a user-level memory leak in an Erlang process that locked the scheduler, which isn't good. Also, the memory leak was amplified because it involved serializing an object to memory that contained a lot of repeated references to other objects. So the object itself, while fairly large, still used only a manageable amount of memory. But the serialized version's size (because of the repeated content) grew exponentially with the recursion depth. That in turn was due to an Erlang "optimization" that didn't try to indicate the repeated references in the object during serialization. Also of interest was using gdb on the Erlang node to debug this, since the usual Erlang interactive shell was hosed.
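The blow-up described in this comment, where a structure with many repeated references stays small in memory but grows exponentially once flattened, can be illustrated with a minimal sketch. This is a Python analogy, not Erlang and not the code from the article; `build_shared` and `count_unique_nodes` are hypothetical names, and `repr` stands in for any serializer that does not record shared references.

```python
# Illustrative analogy only: shared references in memory vs. a flattened text form.


def build_shared(depth: int) -> tuple:
    """Build a nested pair where both slots at each level point to the SAME object."""
    node = ("leaf",)
    for _ in range(depth):
        node = (node, node)  # two references, one underlying object
    return node


def count_unique_nodes(node: tuple) -> int:
    """Count distinct tuple objects reachable from node (in-memory footprint)."""
    seen = set()
    stack = [node]
    while stack:
        cur = stack.pop()
        if id(cur) in seen:
            continue
        seen.add(id(cur))
        stack.extend(child for child in cur if isinstance(child, tuple))
    return len(seen)


if __name__ == "__main__":
    for depth in (5, 10, 15):
        obj = build_shared(depth)
        # In-memory objects grow linearly; leaves in the flattened text double per level.
        print(depth, count_unique_nodes(obj), repr(obj).count("leaf"))
```

In memory the structure holds only `depth + 1` distinct tuples, while the flattened text repeats the leaf `2 ** depth` times, because the flattening step expands every shared reference into a full copy.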
tpmx, almost 3 years ago
I thought Klarna had moved away from Erlang, mostly towards Java. I guess not.