TechEcho

A tech news platform built with Next.js, providing global tech news and discussions.


The Tyranny of the Flake Equation

78 points by brilee · 11 months ago

8 comments

JohnMakin · 11 months ago

When I was still a student I worked mostly with Java and had to solve a simple word frequency problem for a series of assignments. For the first several assignments everything worked as expected, and then one case in the test suite kept crashing the JVM or freezing it entirely.

It was then that I learned that String is immutable in Java, and that when I tried to ingest a novel into a string one token at a time with str += token; I was destroying/recreating the entire string each time. That was fine in the small cases, but had a hidden quadratic blowup.

It was then I learned about StringBuilder, and laughed that such a seemingly innocent line could contain such a large footgun. As a student, I had absolutely no way of learning about language semantics like that except through that mistake.
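The same footgun exists in Python, whose str is immutable too. A minimal sketch (function names are mine, not from the comment) contrasting repeated concatenation with the StringBuilder-style fix:

```python
# Repeated += conceptually allocates a new string of len(s) + len(t) every
# iteration, copying the whole prefix each time: 1 + 2 + ... + n = O(n^2)
# characters copied. (CPython sometimes optimizes this in place; Java's
# String never does, hence the JVM freezing on the large test case.)
def build_naive(tokens):
    s = ""
    for t in tokens:
        s += t
    return s

# The StringBuilder-style fix: accumulate the pieces, copy once at the
# end -> O(n) total work.
def build_joined(tokens):
    parts = []
    for t in tokens:
        parts.append(t)
    return "".join(parts)

tokens = ["word "] * 10_000
assert build_naive(tokens) == build_joined(tokens)
```

Both produce identical output; only the amount of copying differs, which is exactly why the bug hid in the small test cases.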
hansvm · 11 months ago

When flakes are caused by bugs, the asymptotics are much worse in large software projects. Ignoring context and individualized distributions for each test for simplicity (in the spirit of the blog post), any given unit of buggy code has some probability p of causing a flake. When that code is used n times (or, more abhorrently, when there are n levels of p-flaky code) in the process, you have a 1-(1-p)^n chance of a given test failing. Apply the flake equation, and you get a runtime of:

O(f exp((1-(1-p)^n) f))

Your worst-case asymptote is still O(f exp(f)), but you hit it incredibly quickly because of the additional exponential in p. A single additional layer of abstraction (calling the same buggy subroutines a few more times) is all it takes to tip the scales from a manageable testing environment to a flaky time-suck.

It's not just in testing. If your software has those sorts of problems and isn't designed from the ground up to reduce the impact of errors (like how the shuttle program was built hierarchically in terms of redundant subsystems, and the flakiness of individual parts didn't have an exponential impact on the flakiness of the composite system), the same equations dictate that you'll have a critical threshold beyond which you have customer-facing bugs you can no longer ignore.

The author argues a bit against reducing the baseline flakiness (requiring expensive engineers for an unbounded set of potential issues). That's probably a reasonable approach for cosmic rays, rowhammer, and whatnot for most software applications. Compared to writing the code in the first place, though, it's usually very cheap to also (1) name the function well, (2) make the function total (behaving reasonably for all inputs allowed by the type system), and (3) add a couple of tests for that function. The extra development velocity you get from not building on shaky foundations is usually substantial in comparison.

The second you have to worry about a transitive bug in the stack, you have an exponential explosion in code paths to examine when the feature you're building has any sort of reliability requirements, and those exponential costs add up quickly compared to a flat +10%, +30%, +100%, ... in time to commit each solid unit of code.
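The blow-up the comment describes can be checked numerically; a small sketch (the function names are mine, and the runtime bound is the comment's O(f exp((1-(1-p)^n) f)) with constants dropped):

```python
import math

def composite_flake(p, n):
    """Probability a test flakes when it exercises n independent p-flaky units:
    1 - (1 - p)^n, i.e. the chance at least one unit misbehaves."""
    return 1 - (1 - p) ** n

def runtime_factor(p, n, f):
    """Runtime scaling f * exp(flake_rate * f) from the comment's bound,
    treating f as the amount of work between checkpoints."""
    return f * math.exp(composite_flake(p, n) * f)

# A 0.1%-flaky unit called 100 times already flakes ~9.5% of the time:
print(round(composite_flake(0.001, 100), 3))
```

Even before the exponential kicks in, the composite flake rate grows roughly linearly in n for small p, which is why one extra layer of abstraction over flaky code tips the balance so quickly.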
082349872349872 · 11 months ago

Just like a basic part of knife safety is "don't grab the pointy end", a basic part of computing is "don't grab the exponential end".

(Recognising when you might have hold of an exponential by the wrong end can be done by experience, experiment, or exercise of theory, so there isn't much of an excuse for continuing to do it.)
MarkusQ · 11 months ago

One trick that works surprisingly often when you don't understand the root cause: scale the number of batches, rather than the batch size (think merge sort or map/reduce).

If 10% of your batches flake out, you rerun them and expect 1% to have to be run a third time, etc., giving you an O((batch run time) * log(number of batches)) expected wait until you can merge/reduce. Making the batches smaller means you have a shorter wait to merge/reduce (but also possibly worse merge time, so it doesn't always work; YMMV, etc.).
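The retry rounds shrink geometrically, which is where the log factor comes from; a rough back-of-envelope sketch (the function name is mine, and this ignores that the last straggler batch dominates the true expectation):

```python
import math

def expected_rounds(num_batches, flake_rate):
    """Approximate number of rerun rounds until every batch has passed once.
    Each round reruns only the batches that flaked in the previous round, so
    expected failures shrink geometrically: n, n*p, n*p^2, ... reaching ~1
    after log(n) / log(1/p) rounds, i.e. O(log(number of batches))."""
    return math.log(num_batches) / math.log(1 / flake_rate)

# 1000 batches at a 10% flake rate: about 3 rounds of reruns.
print(round(expected_rounds(1000, 0.1)))
```

The total wait is then roughly (batch run time) times that round count, matching the comment's O((batch run time) * log(number of batches)) estimate.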
shiandow · 11 months ago

Not sure about the equation at the end. Looks like you may have incorrectly thought that e^(x/y) = e^x / e^y.

I get wanting to get a sense of the runtime compared to the time spent checkpointing, but the equation as stated is wrong. A better one is

N e^(l/f) / (C f) = f/l

which basically means the ratio between checkpointing and runtime (*including retries*) should be inversely proportional to the ratio of checkpoints and flakes.

Also, really, you're calling the flakiness parameter lambda and the checkpointing parameter f?
hackandthink · 11 months ago

Is this really a Poisson process?

Assuming that every job succeeds with the same probability -> Bernoulli process.

The probability of overall success would still be exponential (p^n).
thyrsus · 11 months ago

This is why Erlang/OTP is so effective: it pushes you toward error (flake) recovery and idempotency.
awestroke · 11 months ago

Coincidentally, Nix flakes are a good way to reduce the flakiness of development environments.