TechEcho

A tech news platform built with Next.js, providing global tech news and discussions.


The Tyranny of the Flake Equation

78 points by brilee · 11 months ago

8 comments

JohnMakin · 11 months ago

When I was still a student I worked mostly with Java and had to solve a simple word frequency problem for a series of assignments. For the first several assignments everything worked as expected, and then one case in the test suite kept crashing the JVM or freezing it entirely.

It was then that I learned that String is immutable in Java, and that when I tried to ingest a novel into a string one token at a time with str += token; I was destroying/recreating the entire string each time. That was fine in the small cases, but had a hidden quadratic blowup.

It was then I learned about StringBuilder, and laughed that such a seemingly innocent line could contain such a large footgun. As a student, I had absolutely no way of learning about language semantics like that except through that mistake.
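The same footgun exists in Python, whose str is immutable too. A minimal sketch (function names are mine, not from the comment) contrasting repeated concatenation with the StringBuilder-style fix:

```python
# Repeated += conceptually allocates a new string of len(s) + len(t) every
# iteration, copying the whole prefix each time: 1 + 2 + ... + n = O(n^2)
# characters copied. (CPython sometimes optimizes this in place; Java's
# String never does, hence the JVM freezing on the large test case.)
def build_naive(tokens):
    s = ""
    for t in tokens:
        s += t
    return s

# The StringBuilder-style fix: accumulate the pieces, copy once at the
# end -> O(n) total work.
def build_joined(tokens):
    parts = []
    for t in tokens:
        parts.append(t)
    return "".join(parts)

tokens = ["word "] * 10_000
assert build_naive(tokens) == build_joined(tokens)
```

Both produce identical output; only the amount of copying differs, which is exactly why the bug hid in the small test cases.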
hansvm · 11 months ago

When flakes are caused by bugs, the asymptotics are much worse in large software projects. Ignoring context and individualized distributions for each test for simplicity (in the spirit of the blog post), any given unit of buggy code has some probability p of causing a flake. When that code is used n times (or, more abhorrently, when there are n levels of p-flaky code) in the process, you have a 1-(1-p)^n chance of a given test failing. Apply the flake equation, and you get a runtime of:

O(f exp((1-(1-p)^n) f))

Your worst-case asymptote is still O(f exp(f)), but you hit it incredibly quickly because of the additional exponential in p. A single additional layer of abstraction (calling the same buggy subroutines a few more times) is all it takes to tip the scales from a manageable testing environment to a flaky time-suck.

It's not just in testing. If your software has those sorts of problems and isn't designed from the ground up to reduce the impact of errors (like how the shuttle program was built hierarchically in terms of redundant subsystems, and the flakiness of individual parts didn't have an exponential impact on the flakiness of the composite system), the same equations dictate that you'll have a critical threshold beyond which you have customer-facing bugs you can no longer ignore.

The author argues a bit against reducing the baseline flakiness (requiring expensive engineers for an unbounded set of potential issues). That's probably a reasonable approach for cosmic rays, rowhammer, and whatnot for most software applications. Compared to writing the code in the first place, though, it's usually very cheap to also (1) name the function well, (2) make the function total (behaving reasonably for all inputs allowed by the type system), and (3) add a couple of tests for that function. The extra development velocity you get from not building on shaky foundations is usually substantial in comparison.

The second you have to worry about a transitive bug in the stack, you have an exponential explosion in code paths to examine when the feature you're building has any sort of reliability requirements, and those exponential costs add up quickly compared to a flat +10%, +30%, +100%, ... in time to commit each solid unit of code.
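The blow-up the comment describes can be checked numerically; a small sketch (the function names are mine, and the runtime bound is the comment's O(f exp((1-(1-p)^n) f)) with constants dropped):

```python
import math

def composite_flake(p, n):
    """Probability a test flakes when it exercises n independent p-flaky units:
    1 - (1 - p)^n, i.e. the chance at least one unit misbehaves."""
    return 1 - (1 - p) ** n

def runtime_factor(p, n, f):
    """Runtime scaling f * exp(flake_rate * f) from the comment's bound,
    treating f as the amount of work between checkpoints."""
    return f * math.exp(composite_flake(p, n) * f)

# A 0.1%-flaky unit called 100 times already flakes ~9.5% of the time:
print(round(composite_flake(0.001, 100), 3))
```

Even before the exponential kicks in, the composite flake rate grows roughly linearly in n for small p, which is why one extra layer of abstraction over flaky code tips the balance so quickly.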
082349872349872 · 11 months ago

Just like a basic part of knife safety is "don't grab the pointy end", a basic part of computing is "don't grab the exponential end".

(Recognising when you might have hold of an exponential by the wrong end can be done by experience, experiment, or exercise of theory, so there isn't much of an excuse for continuing to do it.)
MarkusQ · 11 months ago

One trick that works surprisingly often when you don't understand the root cause: scale the number of batches, rather than the batch size (think merge sort or map/reduce).

If 10% of your batches flake out, you rerun them and expect 1% to have to be run a third time, etc., giving you an O((batch run time) * log(number of batches)) expected wait until you can merge/reduce. Making the batches smaller means you have a shorter wait to merge/reduce (but also possibly worse merge time, so it doesn't always work; YMMV, etc.).
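The retry rounds shrink geometrically, which is where the log factor comes from; a rough back-of-envelope sketch (the function name is mine, and this ignores that the last straggler batch dominates the true expectation):

```python
import math

def expected_rounds(num_batches, flake_rate):
    """Approximate number of rerun rounds until every batch has passed once.
    Each round reruns only the batches that flaked in the previous round, so
    expected failures shrink geometrically: n, n*p, n*p^2, ... reaching ~1
    after log(n) / log(1/p) rounds, i.e. O(log(number of batches))."""
    return math.log(num_batches) / math.log(1 / flake_rate)

# 1000 batches at a 10% flake rate: about 3 rounds of reruns.
print(round(expected_rounds(1000, 0.1)))
```

The total wait is then roughly (batch run time) times that round count, matching the comment's O((batch run time) * log(number of batches)) estimate.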
shiandow · 11 months ago

Not sure about the equation at the end. Looks like you may have incorrectly thought that e^(x/y) = e^x / e^y.

I get wanting to get a sense of the runtime compared to the time spent checkpointing, but the equation as stated is wrong. A better one is

N e^(l/f) / (C f) = f/l

which basically means the ratio between checkpointing and runtime (*including retries*) should be inversely proportional to the ratio of checkpoints and flakes.

Also, really, you're calling the flakiness parameter lambda and the checkpointing parameter f?
hackandthink · 11 months ago

Is this really a Poisson process?

Assuming that every job succeeds with the same probability -> Bernoulli process.

The probability of overall success would still be exponential (p^n).
thyrsus · 11 months ago

This is why Erlang/OTP is so effective: it pushes you toward error (flake) recovery and idempotency.
awestroke · 11 months ago

Coincidentally, Nix flakes are a good way to reduce the flakiness of development environments.