Great stuff, and I love the concrete example of the ZK failure due to error logging -- a classic cascading failure mode. While it's true that I'm an inveterate disaster porn addict[1] and would therefore love this regardless, I think that Nathan's piece serves as a model in that it speaks to learning from failure rather than gloating about nascent success -- we collectively need much more of this! I also like that Nathan doesn't romanticize other engineering domains, as naive software engineers are wont to do; other engineering domains also struggle with failure -- it's just that their failures are so much more public (and so much more likely to involve loss of property and/or life) that they cannot evade collective introspection the way software engineering so frequently seems to. Very much looking forward to Part 2!<p>[1] <a href="http://www.infoq.com/presentations/Debugging-Production-Systems" rel="nofollow">http://www.infoq.com/presentations/Debugging-Production-Syst...</a>
I think this quote is magic: "Software engineering is a constant battle against uncertainty – uncertainty about your specs, uncertainty about your implementation, uncertainty about your dependencies, and uncertainty about your inputs."<p>Engineering is about handling what goes wrong, not what goes right. It's about handling the errors, changes, misuse, etc. It isn't about the techniques per se so much as the mindset of living in an imperfect world.<p>[Edit: Fixed a typo.]
Super interesting post!
I'd have mentioned unit tests as another measure to tackle uncertainty. Simple, boring unit tests (reminds me of this post[1]).
Maybe he just assumes those will exist when professional engineers write code. [2]<p>[1] <a href="http://robertheaton.com/2013/04/01/check-youre-wearing-trousers-first/" rel="nofollow">http://robertheaton.com/2013/04/01/check-youre-wearing-trous...</a>
[2] <a href="http://www.amazon.com/Clean-Coder-Conduct-Professional-Programmers/dp/0137081073" rel="nofollow">http://www.amazon.com/Clean-Coder-Conduct-Professional-Progr...</a>
One question this raised (and I don't mean this as a gotcha): why could a flood to the error-reporting servers take down <i>all</i> of the applications? I expected the primary fix to be to decouple the work so it could continue with no error-reporting server. (But I'm not familiar with ZooKeeper or any of the other work the author's doing, beyond reading some post on Storm.)
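To make the decoupling idea concrete, here is a minimal sketch of what I had in mind. It is not how the author's system or ZooKeeper works; `BestEffortReporter`, `send`, and `max_pending` are names made up for illustration. The assumption is that error reports can be dropped when the reporting side is slow or down, so the application never blocks on them.

```python
import queue
import threading


class BestEffortReporter:
    """Hypothetical error reporter that never blocks or fails the caller."""

    def __init__(self, send, max_pending=100):
        self._send = send  # callable that ships one report; may raise
        self._pending = queue.Queue(maxsize=max_pending)
        self.dropped = 0   # reports discarded because the buffer was full
        self.failed = 0    # reports the sender could not deliver
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def report(self, message):
        """Queue a report; drop it instead of blocking when the buffer is full."""
        try:
            self._pending.put_nowait(message)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def _drain(self):
        # Runs on a background thread, so a slow or dead reporting
        # server only stalls this loop, not the application.
        while True:
            message = self._pending.get()
            try:
                self._send(message)
            except Exception:
                self.failed += 1  # reporting is down; the application keeps going
            finally:
                self._pending.task_done()

    def flush(self):
        """Wait until everything currently queued has been attempted."""
        self._pending.join()
```

The design choice is the bounded queue plus `put_nowait`: a flood of errors fills the buffer and the excess is counted and discarded, rather than piling up or stalling the caller. Whether dropping reports is acceptable depends on the system, so treat this as one option and not the fix.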
There is a fine line between industry (cost center vs. revenue generating) and startups when it comes to discussing the term software engineering.<p>I see a large amount of legacy maintenance in cost-center-based programming. Revenue-generating industry channels seem to favor the enterprise aspect of software engineering. Startups attempt to just build, and fix as necessary (cowboy). Yet each has its own facet of software engineering.<p>I am still trying to draw the line between too-enterprisy, too-maintenancy, and too-cowboy. At my current job, we assume everything is certain. The uncertainties are not coded for, because everything is internal. This bothers me to a large extent. I love coding for the uncertain. Giving more control to the user and automating a whole department is right up my alley. Sadly, it is hard to convert people. Only the 'RU' in CRUD is in the user's hands most of the time. It is pure legacy fear.<p>The part about removing cascading failures needs more emphasis. Remove portions from your cycle/automation/jobs. What happens? I also agree with the measure and monitor portion. Waiting to create analyzers and look at metrics until the program starts breaking in production is too late.<p>Looking forward to the next posts.
This may be semantics, but I think of software engineering as the slightly larger scope of building real-world solutions with software and hardware. Civil engineering is not (just) about mixing the right cement and letting it cure at the right temperature for the right length of time, nor is it strictly about building a bridge; it's about building a bridge for the right price in the right amount of time that will last a given number of years, all parameters determined through a careful process of making decisions with stakeholders, while applying scientific principles (geology, materials science, etc.) and good people management skills. Oh, and the successful bridge project leaves behind the documentation of the bridge as built and a structure to assure its proper maintenance.<p>However, I do agree that handling the huge and complex range of inputs, not only the expected ones, is a great beginning to the process, one that is often overlooked. The same goes for internal monitoring, to make sure your system is still functioning as designed.