> "A Single Line of Code Brought Down a Half-Billion Euro Rocket Launch"<p>Blaming a system failure on a single point like this dooms that system to repeat similar failures (albeit in another element) in the future.<p>There are numerous testing, quality and risk controls that could've been in place. There are probably even a few people who didn't do their job (besides the one person a decade ago who wrote the 'single line'). The point isn't to pin blame on any one point, but to look at the system (people, processes, technology) and try to understand why the system is fragile enough that a single person's error is able to escalate into a half-billion euro error.<p>By focusing in on the point of failure, you end up falling victim to survivorship bias [0]. It is how you end up with developer teams swamped with unit-testing requirements and test coverage metrics, but still somehow end up with errors that impact the end-user anyway. It is how you get company surveys that always seem to miss the point, saying that the measures they implemented to improve company culture worked, yet everyone is burning out and miserable.<p>[0] - <a href="https://en.wikipedia.org/wiki/Survivorship_bias" rel="nofollow">https://en.wikipedia.org/wiki/Survivorship_bias</a>
This is like saying a single little match blew up a building, neglecting to mention the garage full of oily rags and gasoline cans.<p>The one line of code was the spark, yes, but the catastrophic consequences were due to a series of poorly designed failsafes and insufficient testing.
> The system is designed to have a backup, standby system, which unfortunately, runs the exact same code.<p>At Boeing, the backup system runs on a different CPU architecture, with a different program design, a different programming language, and a different team that isn't allowed to talk with the team on the other path.
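For the curious, here's what that cross-checking principle looks like in miniature: a C sketch of my own (the channel functions are made-up stubs; real dissimilar redundancy runs them on separate hardware, in different languages, by separate teams):<p><pre><code>#include &lt;math.h&gt;
#include &lt;stdbool.h&gt;

/* Hypothetical channels: in a real dissimilar-redundancy design these are
   independent implementations on different CPUs, written by teams that
   never talk to each other. */
static double velocity_channel_a(void) { return 1234.5; } /* stub */
static double velocity_channel_b(void) { return 1234.6; } /* stub */

static bool cross_checked_velocity(double tolerance, double *out)
{
    double a = velocity_channel_a();
    double b = velocity_channel_b();
    if (fabs(a - b) <= tolerance) {
        *out = 0.5 * (a + b); /* channels agree: accept the value */
        return true;
    }
    return false;             /* disagreement: escalate, don't guess */
}

int main(void)
{
    double v;
    return cross_checked_velocity(0.5, &v) ? 0 : 1;
}</code></pre>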
>> The cause? A simple, and very much avoidable coding bug, from a piece of dead code, left over from the previous Ariane 4 mission, which started nearly a decade before.<p>>> The worst part? The code wasn’t necessary after takeoff, it was only part of the launch pad alignment process. But sometimes a trivial glitch might delay a launch by a few seconds and, in trying to save having to reset the whole system, the original software engineers decided that the sequence of code should run for an extra… 40 seconds after the scheduled liftoff.<p>The author appears to be using a different definition of "dead code" than I'm used to. To me, dead code is code that is no longer called by anything else, and has no chance of running. Maybe a more accurate term is "legacy code"?
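A quick illustration of the difference (my sketch, not the actual Ariane code, which was Ada):<p><pre><code>/* Dead code: nothing calls this, so it can never execute. */
void unused_alignment_routine(void) { }

/* Legacy code: obsolete on Ariane 5, but still scheduled to run for
   ~40 s after liftoff, so it can still fail in flight. */
void alignment_task(double seconds_since_liftoff)
{
    if (seconds_since_liftoff < 40.0) {
        /* Ariane 4-era alignment logic, pointless once airborne */
    }
}

int main(void)
{
    alignment_task(36.7); /* H0 + 36.7 s: the legacy path is still live */
    return 0;
}</code></pre>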
> With 16-bit unsigned integers, you can store anything from 0 to 65,535. If you use the first bit to store a sign (positive/negative) and your 16-bit signed integer now covers everything from -32,768 to +32,767 (only 15 bits left for the actual number). Anything bigger than these values and you’ve run out of bits.<p>That's, oh man, that's not how they're stored or how you should think of it. Don't think of it that way, because if you think "oh, 1 bit for sign," that implies the number representation has both a +0 and a -0 (which is the case for IEEE 754 floats) that are bitwise different in at least the sign bit, which isn't the case for signed ints. Plus, if you have that double zero that comes from dedicating a bit to sign, the largest magnitude you can represent in either direction is 2^15 - 1, so you can't represent 2^15 or -2^15. Except you can represent -2^15, or -32,768, by their own prose. So either there's more than just 15 bits for negative numbers, or there isn't actually a "sign bit."<p>Like, ok, sure, you don't want to explain the intricacies of 2's complement for this, but don't say there's a sign bit. Explain signed ints as shifting the range of possible values to include both negative and positive values. Something like<p>> With 16-bit unsigned integers, you can store anything from 0 to 65,535. If you shift that range down so that 0 is in the middle instead of at the minimum, your 16-bit signed integer now covers everything from -32,768 to +32,767. Anything outside that range and you’ve run out of bits.
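A quick C demo of why the "sign bit" mental model falls apart: two's complement has exactly one zero and an asymmetric range, and the 0x8000 pattern is -32,768, not "-0" (my sketch, nothing platform-specific):<p><pre><code>#include &lt;stdint.h&gt;
#include &lt;stdio.h&gt;
#include &lt;string.h&gt;

int main(void)
{
    int16_t x = 0;
    printf("%d\n", -x);                      /* 0: there is no negative zero */
    printf("%d %d\n", INT16_MIN, INT16_MAX); /* -32768 32767: asymmetric range */

    uint16_t bits = 0x8000;                  /* "sign bit set, magnitude 0" */
    int16_t y;
    memcpy(&y, &bits, sizeof y);
    printf("%d\n", y);                       /* -32768, not -0 */
    return 0;
}</code></pre>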
I have found that among software engineers, it is surprisingly not common knowledge that floating point operations have all these sharp edges and gotchas.<p>The most common situation in which it crops up is when dealing with quantities that require fractional units/arithmetic of some commonly discrete unit of measure. For example, you implement some complex logic to do request sampling, and in your binary you convert the total number of active requests to a float, add some stuff, divide some stuff, add some more stuff, multiply it again, then convert back to an int to get something like “the number of requests that should be sampled.” Because floating point operations are non-associative, non-distributive, and commonly introduce remainder artifacts, you can end up with results like sampling 1 more request than there are total requests active, even when the arithmetic itself seems like that should be impossible.<p>This is also common when dealing with time, although typically the outcome is not that bad. Despite time having a simple workaround of just changing the unit of measure (e.g. using milliseconds instead of seconds) and using int operations on that, because people don’t know <i>why</i> they shouldn’t use floating point operations in this case, they don’t always reach for it.<p>The worst is when some complicated operation is done to report a float (or an int converted from a float) as a metric. In the request sampling example, that would likely be noticed quickly and fixed. But when the float value looks reasonable enough and doesn’t violate some kind of system invariant, it can feed you bad data for a very long time before someone catches it.
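Both failure modes are easy to reproduce. A small C example (the numbers are contrived by me, but the mechanics are the real thing):<p><pre><code>#include &lt;math.h&gt;
#include &lt;stdio.h&gt;

int main(void)
{
    /* Non-associativity: same three terms, different grouping. */
    printf("%g\n", (1e16 + 1.0) - 1e16); /* 0: the 1.0 is absorbed */
    printf("%g\n", (1e16 - 1e16) + 1.0); /* 1 */

    /* Remainder artifacts: a rate that "should" be exactly 1.0
       drifts just above it, so we sample 11 of 10 requests. */
    int total = 10;
    double rate = (0.1 + 0.2) / 0.3;     /* 1.0000000000000002 */
    int sampled = (int)ceil(total * rate);
    printf("sample %d of %d\n", sampled, total); /* sample 11 of 10 */
    return 0;
}</code></pre>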
September 12-13.<p>An oceanographic research ship is doing gravity and magnetic surveys off the coast of Brazil.<p>Suddenly, the data acquisition software crashes!<p><pre><code> uint8_t day_of_year; /* overflows on day 256 */</code></pre>
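(For anyone who wants to watch it happen, a minimal reproduction; day 256 is September 13, or September 12 in a leap year, hence the date:)<p><pre><code>#include &lt;stdint.h&gt;
#include &lt;stdio.h&gt;

int main(void)
{
    uint8_t day_of_year = 255;
    day_of_year++;               /* day 256 wraps around to 0 */
    printf("%d\n", day_of_year); /* prints 0: an invalid day, hence the crash */
    return 0;
}</code></pre>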
I’m very happy that the flight software codebase I’m currently working on doesn’t use any floating point. We don’t even have FPUs enabled. Then again, it’s not GN&C so the stakes are not as high.
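For those who haven't seen it done: the usual alternative is fixed-point, i.e. pick a unit small enough that everything fits in integers. A minimal sketch (my own made-up units, not the commenter's codebase):<p><pre><code>#include &lt;stdint.h&gt;
#include &lt;stdio.h&gt;

/* Velocity kept in micrometres per second: all integer math, no FPU. */
typedef int64_t um_per_s;

static um_per_s integrate_velocity(um_per_s v, int64_t accel_um_per_s2,
                                   int64_t dt_ms)
{
    return v + accel_um_per_s2 * dt_ms / 1000; /* v += a * dt */
}

int main(void)
{
    um_per_s v = 0;
    v = integrate_velocity(v, 9810000 /* ~1 g */, 20 /* ms */);
    printf("v = %lld um/s\n", (long long)v);   /* v = 196200 um/s */
    return 0;
}</code></pre>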
I hate these “single line of code did X” type headlines.<p>It will <i>always</i> be a single line of code. The nature of most programs is to execute commands in a sequence. Eventually you hit one that fails.<p>Hell, you could reduce it to even be less than a line of code. It could be a single variable. A single instruction. It could be a couple bits. A couple bad 1’s and 0’s in memory blew up a multibillion dollar rocket launch.
> "To achieve this, the guidance system converts the velocity readings, from 64 bit floating point to 16 bit signed integer".<p>Oh, excellent possible interview question? "Write some code that reliably converts the full range of possible 64 bit floating point values to a 16 bit signed integer. What are the issues you'll have to deal with and what edge cases might arise?"
Why would the program react like that to a SINGLE wrong signal that disagrees with everything else and produce a signal that cannot do anything good under any circumstances? This just smells like a truly naive piece of implementation.<p>There should be layers upon layers of safeties to prevent this dumb thing from happening. The computer should know the position, orientation and velocity of the rocket at any point in time, and new signals should be interpreted in the context of what the computer already knows and in the context of what the other sensors are saying. It is not like the rocket can turn itself around in 1 ms, and if it does there probably isn't much it can do anyway.<p>This suggests to me that the problem is not the bug, it is the overall quality of the development.
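The kind of sanity layer being described could be as simple as a plausibility check against the last accepted state. A sketch of the idea (hypothetical, not how Ariane actually worked):<p><pre><code>#include &lt;math.h&gt;
#include &lt;stdbool.h&gt;

typedef struct {
    double velocity;   /* last accepted estimate, m/s */
    double max_accel;  /* physical limit of the vehicle, m/s^2 */
} state_t;

/* Reject readings implying a physically impossible change since the
   last update, instead of forwarding them straight to guidance. */
static bool accept_reading(const state_t *s, double reading, double dt_s)
{
    double max_delta = s->max_accel * dt_s; /* largest believable change */
    return fabs(reading - s->velocity) <= max_delta;
}

int main(void)
{
    state_t s = { 100.0, 50.0 };
    return accept_reading(&s, 5000.0, 0.001) ? 1 : 0; /* implausible: rejected */
}</code></pre>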
Even very earthbound machines suffer similar issues. I worked on what was known as a "hot leveller" computer, a PDP-11/73 at the steelworks where I was employed. It had something like 9 rolls (maybe 200mm in diameter) that would be applied to a very hot steel plate (maybe 10mm to 150mm thick) after it had been rolled from a roughly 300mm thick slab.<p>The leveller's job was to smooth out any waves the plate might have acquired during the rolling process - almost like a clothes iron. The gap between the rolls was adjusted by hydraulically positioning backup rolls that could even bend the work rolls across their width (maybe 3000mm). As you always intend to apply a huge amount of force anyway to achieve the desired results, the "setup" was a mix of metallurgically driven algorithms and hard limits.<p>While an operator always had to accept the setup before the run, there was always the risk of hitting the machine's surfaces too hard, straining components, and maybe causing a prolonged and expensive outage. Obviously the biggest risks were when there were changes or even experiments by both engineers and metallurgists. It was fun times as a quite junior engineer, and I think there were a few times when overzealous setups resulted in some big noises. But I don't think I broke anything, fortunately.
How can you write a whole article about "a single line of code" and not have that line appear anywhere in the article?<p>Even worse, why was I completely unsurprised, nay expecting this to be the case when I clicked?<p>(to the article's credit, it didn't quite start in the typical "George was walking his dog home when he noticed something wrong" fashion...)
I think James Gleick had a much better write up about this: <a href="http://www.maths.mic.ul.ie/posullivan/A%20Bug%20and%20a%20Crash%20by%20James%20Gleick.htm" rel="nofollow">http://www.maths.mic.ul.ie/posullivan/A%20Bug%20and%20a%20Cr...</a>
It'd probably be more accurate to say that a technology environment which allowed any single line of code to cause catastrophic failure is what brought down the launch. Or a failure of sufficiently accurate testing brought down the launch.
The world is not ready for the "Epochalypse", let's see what will collapse first, our civilization or our computers<p><a href="https://en.wikipedia.org/wiki/Year_2038_problem" rel="nofollow">https://en.wikipedia.org/wiki/Year_2038_problem</a>
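The mechanics in a few lines of C (assuming the usual Unix epoch, and a platform with 64-bit time_t to print the date):<p><pre><code>#include &lt;stdint.h&gt;
#include &lt;stdio.h&gt;
#include &lt;time.h&gt;

int main(void)
{
    int64_t t = INT32_MAX;              /* last second a signed 32-bit time_t holds */
    time_t ok = (time_t)t;
    printf("%s", asctime(gmtime(&ok))); /* Tue Jan 19 03:14:07 2038 */

    /* One second later, a 32-bit time_t wraps (implementation-defined,
       but on common hardware): suddenly it's December 1901. */
    int32_t wrapped = (int32_t)(t + 1);
    printf("%d\n", wrapped);            /* -2147483648 */
    return 0;
}</code></pre>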
>> However, the reading is larger than the biggest possible 16 bit integer, a conversion is tried and fails. Usually, a well-designed system would have a procedure built-in to handle an overflow error and send a sensible message to the main computer. This, however, wasn’t one of those cases.<p>This is so unbelievably untrue. I've never seen code anywhere that waits to fail before doing the right thing.<p>This is exactly why I think exceptions are mostly useless, someone has to anticipate the problem, so why not write something that works right the first time. There are cases where exceptions can happen, but I don't think floating point arithmetic should be considered one of those cases.
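In C terms, the "anticipate it up front" style looks something like this (my sketch; the reporting channel and variable name are hypothetical): check the range before converting, and send the "sensible message" yourself rather than hoping a handler catches the failure.<p><pre><code>#include &lt;stdbool.h&gt;
#include &lt;stdint.h&gt;
#include &lt;stdio.h&gt;

static void report_overflow(const char *channel, double value)
{
    /* stand-in for "send a sensible message to the main computer" */
    fprintf(stderr, "overflow on %s: %f\n", channel, value);
}

static bool convert_checked(double v, int16_t *out)
{
    if (!(v >= INT16_MIN && v <= INT16_MAX)) { /* also rejects NaN */
        report_overflow("horizontal_bias", v);
        return false;
    }
    *out = (int16_t)v; /* guaranteed in range: no undefined behaviour */
    return true;
}

int main(void)
{
    int16_t out;
    return convert_checked(1e9, &out) ? 0 : 1; /* reports and returns 1 */
}</code></pre>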