Crash-only software: More than meets the eye (2006)

59 points by hui-zheng, about 1 year ago

8 comments

mjb, about 1 year ago
> Crash-only software is actually more reliable because it takes into account from the beginning an unavoidable fact of computing - unexpected crashes.

This is a critical point for reliable single-machine systems, and for reliable distributed systems. Distributed systems avoid many classes of crashes through redundancy, allowing the overall system to recover (often with no impact) from the failure or crash of a single node. This provides an additional path to crash recovery: recovering from peers or replicas rather than from local state. In turn, this can simplify the tracking of local state (especially the kind of per-replica WAL or redo log accounting that database systems have to do), leading to improved performance and avoiding bugs.

But, as with single-system crashes, distributed systems need to deal with their own reality: correlated failures. These can be caused by correlated infrastructure failures (power, cooling, etc.), by operations (e.g. deploying buggy software), or by the very data they're processing (e.g. a "poison pill" that crashes all the redundant nodes at once). And so, like the crash-only case with single-system software, reliable distributed systems need to be designed to recover from these correlated failure cases.

The constants are interestingly different, though. Single-system annual interrupt rates (AIR) are typically in the 1-10% range, while systems spread over multiple datacenters can feasibly see correlated failure rates several orders of magnitude lower. This could argue that having a "bad day" recovery path that's more expensive than regular node recovery is OK. Or, it could argue that the only feasible way of making sure that "bad day" recovery works is to exercise it often (which goes back to the crash-only argument).

fl0ki, about 1 year ago
Intentional crashing can be fine. Unintentional crashing with telemetry can be fine because you're going to fix it.

Unintentional crashing *without* telemetry is terrible. I've seen too many systems built to "just panic because it'll restart and retry" that never converge because the retry hits the same conditions and no thought was put into how to monitor what is going wrong.

As you all know, such systems also tend to neglect jitter and backoff, so the retrying clients hot-loop, slamming every dependency, even ones that weren't erroring prior to the crash.

I've seen people shell into k8s pods and poke around at files manually for an all-nighter because they didn't invest even one hour in telemetry beforehand. Even that was a second penance for the first crime: finding out about an outage because of a user escalation rather than an automated alert.

Ironically, at times, some attempt at monitoring was made but undermined by the crash, e.g. Prometheus metrics were exported but lost before they could be scraped.

We have a long way to go educating most developers about production maturity before it's safe to endorse crashing without accounting for the downsides.

This was written in 2006 when monitoring was barely on anyone's radar. It's understandable in that context. People reading it in a modern context have to BYO production maturity.
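
For readers who have not seen "jitter and backoff" spelled out, here is a minimal Python sketch of one common variant (capped exponential backoff with full jitter). It is an illustration, not code from the article or from any particular library; the names `backoff_delays` and `call_with_retries` and the default values are assumptions made for this example.

```python
import random
import time


def backoff_delays(base=0.1, cap=30.0, attempts=5, rng=random.random):
    """Yield one delay per retry: exponential growth, capped, with full jitter."""
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        # Full jitter: pick uniformly in [0, ceiling) so that clients which
        # restarted at the same moment do not retry in lockstep.
        yield rng() * ceiling


def call_with_retries(operation, attempts=5, sleep=time.sleep, **backoff_kwargs):
    """Call operation() up to `attempts` times, sleeping between failures."""
    delays = backoff_delays(attempts=attempts - 1, **backoff_kwargs)
    while True:
        try:
            return operation()
        except Exception:  # a real client would catch narrower error types
            delay = next(delays, None)
            if delay is None:
                raise  # out of retries: surface the error so alerting can see it
            sleep(delay)
```

The retry count is bounded on purpose: once the budget is spent the error propagates, which is the point where telemetry and alerting have to take over rather than another silent restart.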

WhyNotHugo, about 1 year ago
The time it takes for some systems to shut down doesn't make sense to me.

My Alpine laptop takes about 3 seconds to shut down (which, honestly, seems like a lot of time). A systemd-based system will give daemons 90000 ms to shut down, which is an absurdly high amount of time (what kind of service can't exit in a few seconds?).

Honestly, I think that mostly the kernel needs to flush its caches, SIGTERM all processes and then halt. There's no reason for this to take more than 1s on a modern system, and if something takes too long to handle SIGTERM, then it'll go through recovery next time.
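
The "SIGTERM, wait briefly, then let it recover next time" policy described above could be sketched in Python roughly as follows. This is a hedged illustration for POSIX systems, not how systemd or any init system actually implements shutdown; `stop_process`, `spawn_demo_child`, and the 1-second grace period are assumptions made for the example.

```python
import subprocess
import sys


def stop_process(proc, grace_seconds=1.0):
    """Send SIGTERM; if proc outlives the grace period, SIGKILL it."""
    proc.terminate()  # SIGTERM on POSIX
    try:
        proc.wait(timeout=grace_seconds)
        return "terminated"
    except subprocess.TimeoutExpired:
        # Too slow: kill it and rely on the service's crash recovery at next start.
        proc.kill()  # SIGKILL on POSIX
        proc.wait()
        return "killed"


def spawn_demo_child(ignore_sigterm=False):
    """Hypothetical demo helper: start a child that sleeps for a minute."""
    setup = "signal.signal(signal.SIGTERM, signal.SIG_IGN);" if ignore_sigterm else ""
    code = f"import signal, time; {setup} print('ready', flush=True); time.sleep(60)"
    child = subprocess.Popen([sys.executable, "-c", code], stdout=subprocess.PIPE)
    child.stdout.readline()  # block until the child has finished its setup
    return child
```

Calling `stop_process(spawn_demo_child(ignore_sigterm=True), grace_seconds=0.2)` takes the kill path, while a child with the default SIGTERM behaviour exits within the grace period.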

Liftyee, about 1 year ago
In embedded systems, watchdog timers are often used as a crash mechanism outside of the software itself: the watchdog will crash the program if the timer is not reset in time. I found this concept of crash-only software pretty neat - time to see if I can apply it.
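
As a rough software analogue of the watchdog idea above, here is a small Python sketch. On a real microcontroller the timer is a hardware peripheral that resets the chip; in this sketch the class `Watchdog`, its `feed` and `check` methods, and the `on_expire` callback are all hypothetical names chosen for illustration.

```python
import time


class Watchdog:
    """Software analogue of a hardware watchdog: feed it in time or it fires."""

    def __init__(self, timeout, on_expire, clock=time.monotonic):
        self.timeout = timeout
        self.on_expire = on_expire  # stand-in for the hardware reset
        self.clock = clock
        self.deadline = clock() + timeout

    def feed(self):
        """Called by healthy application code to push the deadline back."""
        self.deadline = self.clock() + self.timeout

    def check(self):
        """Called from an independent tick; fires on_expire once the deadline passes."""
        if self.clock() >= self.deadline:
            self.on_expire()
            return True
        return False
```

The clock is injectable so the behaviour can be exercised without real waiting. In a crash-only design, `on_expire` would terminate the process abruptly and leave the rest to the normal recovery path.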

kayodelycaon, about 1 year ago
Crash-only is really hard to implement if another system is involved that isn't crash-only. If you crash in the middle of a network request, you may not know what state the other system is in.

I've had to deal with buggy mainframe software whose error messages had no relation to how much of an operation had succeeded. (And no way to ask it after the fact...) Welcome to the special hell.

mpweiher, about 1 year ago
macOS implemented this concept with "sudden termination".

Applications that opt in and announce themselves as "clean" (via a flag in a shared page, last I checked) can be killed at any time by the system via kill -9.

ashleyn, about 1 year ago
You should probably consider crash recovery a second line of defense against lost data, not the primary one. What are the stats on how often crash recovery fails?

dinvlad, about 1 year ago
Man, I love Elixir even though I'm still trying to learn it.