科技回声 (Tech Echo)

A tech news platform built with Next.js, providing global tech news and discussion.

© 2025 科技回声 (Tech Echo). All rights reserved.

The Tyranny of the Flake Equation

78 points · by brilee · 11 months ago

8 comments

JohnMakin · 11 months ago

When I was still a student I worked mostly with Java and had to solve a simple word-frequency problem for a series of assignments. For the first several assignments everything worked as expected, and then one case in the test suite kept crashing the JVM or freezing it entirely.

It was then that I learned that String is immutable in Java, and that by ingesting a novel into a string one token at a time with str += token; I was destroying and recreating the entire string on every append. That was fine on the small cases, but it had a hidden quadratic cost.

That's when I learned about StringBuilder and laughed that such a seemingly innocent line could hide such a large footgun. As a student, I had absolutely no way of learning about language semantics like that except through that mistake.
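The blow-up described above can be sketched in Python by counting the characters copied under an immutable-string model, where every append rewrites the whole buffer (a simplified model for illustration, not Java itself):

```python
def append_cost(n_tokens, token_len=5):
    """Total characters copied when building a string from n_tokens
    tokens one += at a time, if each append copies the entire
    string so far plus the new token (immutable-string model)."""
    copied = 0
    length = 0
    for _ in range(n_tokens):
        length += token_len
        copied += length  # the whole new string is written out
    return copied

# Doubling the input roughly quadruples the work: O(n^2), not O(n).
print(append_cost(20_000) / append_cost(10_000))  # ~4.0
```

A buffer that appends in place (like StringBuilder, or ''.join in Python) copies each character a constant number of times instead, bringing the cost back to O(n).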
hansvm · 11 months ago

When flakes are caused by bugs, the asymptotics are much worse in large software projects. Ignoring context and individualized distributions for each test for simplicity (in the spirit of the blog post), any given unit of buggy code has some probability p of causing a flake. When that code is used n times (or, more abhorrently, when there are n levels of p-flaky code) in the process, you have a 1-(1-p)^n chance of a given test failing. Apply the flake equation, and you get a runtime of:

O(f exp((1-(1-p)^n) f))

Your worst-case asymptote is still O(f exp(f)), but you hit it incredibly quickly because of the additional exponential in p. A single additional layer of abstraction (calling the same buggy subroutines a few more times) is all it takes to tip the scales from a manageable testing environment to a flaky time-suck.

It's not just in testing. If your software has those sorts of problems and isn't designed from the ground up to reduce the impact of errors (like how the shuttle program was built hierarchically in terms of redundant subsystems, so the flakiness of individual parts didn't have an exponential impact on the flakiness of the composite system), the same equations dictate that you'll have a critical threshold beyond which you have customer-facing bugs you can no longer ignore.

The author argues a bit against reducing the baseline flakiness (requiring expensive engineers for an unbounded set of potential issues). That's probably a reasonable approach for cosmic rays, rowhammer, and whatnot for most software applications. Compared to writing the code in the first place, though, it's usually very cheap to also (1) name the function well, (2) make the function total (behaving reasonably for all inputs allowed by the type system), and (3) add a couple of tests for that function. The extra development velocity you get from not building on shaky foundations is usually substantial in comparison.

The second you have to worry about a transitive bug in the stack, you have an exponential explosion in code paths to examine whenever the feature you're building has any sort of reliability requirement, and those exponential costs add up quickly compared to a flat +10%, +30%, +100%, ... in time to commit each solid unit of code.
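The compounding 1-(1-p)^n term in the comment above can be checked numerically (a sketch with illustrative values for p and n):

```python
def composite_flake_rate(p, n):
    """Probability that at least one of n independent uses of
    p-flaky code flakes: 1 - (1 - p)^n."""
    return 1 - (1 - p) ** n

# A 0.1% per-call flake rate looks negligible in isolation...
print(composite_flake_rate(0.001, 1))     # 0.001
# ...but compounds quickly as the number of uses grows.
print(composite_flake_rate(0.001, 100))   # ~0.095
print(composite_flake_rate(0.001, 1000))  # ~0.632
```

Since this composite rate feeds into an exponential in the flake equation, even a modest increase in n can push the runtime past the point where retries dominate.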
082349872349872 · 11 months ago

Just as a basic part of knife safety is "don't grab the pointy end", a basic part of computing is "don't grab the exponential end".

(Recognising when you might have hold of an exponential by the wrong end can be done by experience, experiment, or exercise of theory, so there isn't much of an excuse for continuing to do it.)
MarkusQ · 11 months ago

One trick that works surprisingly often when you don't understand the root cause: scale the number of batches rather than the batch size (think merge sort or map/reduce).

If 10% of your batches flake out, you rerun them and expect 1% to need a third run, and so on, giving you an O((batch run time) * log(number of batches)) expected wait until you can merge/reduce. Making the batches smaller means a shorter wait to merge/reduce (but also possibly a worse merge time, so it doesn't always work; YMMV, etc.).
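The logarithmic wait can be seen in a small simulation (a sketch, assuming each batch run flakes independently with a fixed 10% probability):

```python
import random

def rounds_until_all_pass(n_batches, flake_rate=0.1, seed=0):
    """Rerun only the failed batches each round; count the rounds
    until every batch has passed at least once."""
    rng = random.Random(seed)
    pending = n_batches
    rounds = 0
    while pending:
        rounds += 1
        # Each still-pending batch independently flakes this round.
        pending = sum(rng.random() < flake_rate for _ in range(pending))
    return rounds

# With a 10% flake rate, the expected number of rounds grows roughly
# like log10(n_batches) plus a constant, so 100x more batches only
# adds about two extra rounds of waiting.
for n in (10, 100, 1000):
    avg = sum(rounds_until_all_pass(n, seed=s) for s in range(200)) / 200
    print(n, round(avg, 2))
```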
shiandow · 11 months ago

Not sure about the equation at the end. It looks like the author incorrectly assumed that e^(x/y) = e^x / e^y.

I get wanting a sense of the runtime compared to the time spent checkpointing, but the equation as stated is wrong. A better one is

N e^(λ/f) / (C f) = f/λ

which basically means the ratio between checkpointing and runtime (*including retries*) should be inversely proportional to the ratio of checkpoints to flakes.

Also, really — you're calling the flakiness parameter lambda and the checkpointing parameter f?
hackandthink · 11 months ago

Is this really a Poisson process?

Assuming that every job succeeds with the same probability -> Bernoulli process.

The probability of overall success would still be exponential (p^n).
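The two views converge for small per-job failure rates: a Bernoulli process with failure probability p per job gives overall success probability (1-p)^n, while a Poisson model with rate pn gives e^(-pn) for zero failures. A quick numeric check with illustrative values:

```python
import math

def bernoulli_success(p, n):
    """All n jobs succeed, each failing independently with probability p."""
    return (1 - p) ** n

def poisson_success(p, n):
    """Poisson approximation: probability of zero failures at rate p*n."""
    return math.exp(-p * n)

p, n = 0.0001, 10_000
print(bernoulli_success(p, n))  # ~0.3679
print(poisson_success(p, n))    # ~0.3679
```

So either way the success probability decays exponentially in n, which is what drives the flake equation; the Poisson form is just the small-p limit of the Bernoulli one.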
thyrsus · 11 months ago

This is why Erlang/OTP is so effective: it pushes you toward error (flake) recovery and idempotency.
awestroke · 11 months ago

Coincidentally, Nix flakes are a good way to reduce the flakiness of development environments.