TechEcho

Summary of the December 24, 2012 Amazon ELB Service Event

2 points by rbc over 12 years ago

1 comment

rbc over 12 years ago
Here's my summary of their article. The whole event occurred because a developer ran an ELB purge job, thinking they were purging a non-production ELB metadata database. It turned out that the configuration they were using purged production ELB metadata instead. As they say, no good deed goes unpunished. This caused strange errors that confounded the ELB technical team, delaying rapid recovery.

The rest of the Amazon article details the recovery process and the after-action items. Apparently, they had to recover at least twice, because the first recovery attempt failed. They had to figure out how to recover the ELB metadata; apparently this database didn't have a working manual procedure or automation for recovery.

One key after-action item deals with production access and is worth noting. Privileged access to production was being provided to a small team of developers working on ELB automation projects. The privileged access was "persistent" and didn't require per-access approval. On Christmas Eve, that was a problem. Amazon is promising that each privileged access will now require permission. They also claim that future recoveries will be faster, because they understand them better. As a final note, they apologized for the outage. Live and learn, I suppose.
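The per-access approval fix described above can be sketched in a few lines. This is a hypothetical illustration, not Amazon's actual tooling: a destructive maintenance job that runs freely against non-production environments but refuses to touch production unless an explicit per-run approval token is supplied.

```python
class ApprovalRequired(Exception):
    """Raised when a production purge is attempted without approval."""


def purge_elb_metadata(target_env, approval_token=None):
    """Purge ELB metadata for target_env, guarding production runs.

    Hypothetical safeguard: non-production purges run immediately,
    while production purges require a per-run approval token instead
    of relying on persistent privileged access.
    """
    if target_env == "production" and approval_token is None:
        raise ApprovalRequired(
            "production purge requires an explicit per-run approval token"
        )
    # ... the actual purge logic would go here ...
    return f"purged {target_env} ELB metadata"
```

Under this scheme, `purge_elb_metadata("staging")` succeeds, while `purge_elb_metadata("production")` raises `ApprovalRequired` unless a token is passed, which is the kind of friction that would have forced a second look at the misconfigured job.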