tl;dr<p>1. On February 29, 2012, new certificates were created with a one-year expiration date by adding 1 to the year. Since February 29, 2013 is an invalid date, certificate creation failed and VMs wouldn't start.<p>2. After multiple attempts to restart failed VMs, physical hosts were marked as failed and their VMs were migrated to other physical machines, which propagated the problem.<p>3. Management services were disabled to prevent customers from starting more VMs, compounding the problems.<p>4. After the leap-day bug was fixed, secondary failures were caused by mixing up incompatible versions of a networking plugin, so VMs had no network access.<p>5. Total duration of outages: about 16 hours.<p>6. 33% of a month's service to be credited to all customers, regardless of who was affected.
<i>cough</i> <a href="http://thedailywtf.com/Articles/DATE_NOT_FOUND.aspx" rel="nofollow">http://thedailywtf.com/Articles/DATE_NOT_FOUND.aspx</a><p>And this is why you always use your framework's or language's date arithmetic library and never try to hack up a solution on your own. Date calculations are hard enough with just the basic irregularities of month lengths. Add leap years and it becomes even harder.<p>And don't get me started on times, especially once time zones and daylight saving time come into play.<p>Your particular hacked-together solution will likely fail at some point. And if it doesn't: was it worth all the effort you put into making it perfect, especially considering that somebody has already done it for your framework?<p>NIH at its finest.
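To make the failure mode concrete, here is a minimal Python sketch. It is not the code from the incident; the function names are made up for illustration, and clamping to February 28 is just one possible policy (rolling forward to March 1 is equally defensible).

```python
from datetime import date


def add_one_year_naive(d: date) -> date:
    # The hand-rolled approach: bump the year, keep month and day as they are.
    # Raises ValueError when d is Feb 29 and the next year is not a leap year.
    return d.replace(year=d.year + 1)


def add_one_year_clamped(d: date) -> date:
    # Hypothetical policy (an assumption, not a standard): fall back to
    # Feb 28 when the target year has no Feb 29.
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        return d.replace(year=d.year + 1, day=28)


if __name__ == "__main__":
    leap_day = date(2012, 2, 29)
    try:
        add_one_year_naive(leap_day)
    except ValueError as exc:
        print("naive version failed:", exc)
    print("clamped version:", add_one_year_clamped(leap_day))
```

The point is less the specific helper than the fact that the edge case needs an explicit decision, which a well-tested date library has already made and documented.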
How do you all generally handle leap days when doing time math? If you're selling a service for one year, are you selling 365 days (02/28/12 - 02/26/13) or do you just give away the leap day for free (02/28/12 - 02/27/13)? Do you pay your salaried employees one day extra in a leap year?<p>What other leap year bugs have people run into? Generally the libraries I work with (e.g. Python's timedelta) don't let you add months or years because of their ambiguity.
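For what it's worth, the two date ranges in the question can be checked with Python's `timedelta`, treating the end date as the last day of service (inclusive). This is only a sketch; `last_day_of_term` is a made-up helper name, not a library function.

```python
from datetime import date, timedelta


def last_day_of_term(start: date, days: int) -> date:
    # Inclusive range: the first day of service counts as day 1,
    # so the last day is start + (days - 1).
    return start + timedelta(days=days - 1)


if __name__ == "__main__":
    start = date(2012, 2, 28)
    print(last_day_of_term(start, 365))  # 365 days of service
    print(last_day_of_term(start, 366))  # leap day given away for free

    # timedelta only accepts fixed-length units (weeks, days, hours, ...),
    # so asking for a year is rejected outright.
    try:
        timedelta(years=1)
    except TypeError as exc:
        print("timedelta has no 'years':", exc)
```

Counting in days sidesteps the ambiguity entirely, at the cost of the renewal date drifting by one day across a leap year.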
I work at Microsoft (in Windows Azure), and this was the first outage since I joined the org, so I did not know what to expect from the company in terms of transparency. However, given how open other public presentations and papers on the Windows Azure technology have been, I expected a good job here.<p>Bill Laing's post confirmed how transparent Microsoft wants to be with its customers, which is really nice. And I appreciate how seriously Microsoft is attempting to learn from these incidents and put measures in place.
The article really is worth a read if you build complex systems. My takeaway from this is that you shouldn't schedule maintenance work during "weird" times.<p>Had they not been deploying new code on leap day (UTC), the outage would have been substantially less severe. Code that uses dates and times will have bugs, because dates and times are hard. Don't complicate things further.<p>So from now on, no more leap day, daylight saving time, or New Year's maintenance. It's worth postponing a day just in case.