It baffles me that AWS, a leader in cloud computing, can make such a rudimentary mistake. Seriously, I interviewed there, they asked me to write a B+ tree, and I failed. And then you see fundamental errors like this, which cannot possibly be made by people who had the smarts to write B+ trees in 15 minutes...<p>I want to take this opportunity to complain about the interview system. Hire people who care about the product and the company. Such mistakes cannot be made by people who care.
So now you know that a dead man's switch is the better way to report availability. The logic was backwards for this signal. The default condition is failed. Not failed requires proof.<p>It's interesting how easy it is to accidentally invert logical operations. I see it in code all the time. A condition will test that A is true when what the author really needs to know is whether B and C are both false. It's like some kind of cognitive tic.
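For anyone who wants the "default is failed" idea spelled out, here is a minimal sketch of a dead man's switch. The class name, the timeout value, and the injectable clock are all my own assumptions for illustration, not anything AWS actually runs.

```python
import time


class DeadMansSwitch:
    """Reports "failed" unless a recent heartbeat proves the service is alive."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self.clock = clock          # injectable so the logic can be tested
        self.last_heartbeat = None  # no proof yet, so the status starts as failed

    def heartbeat(self):
        # Called by the monitored service each time it proves it is healthy.
        self.last_heartbeat = self.clock()

    def status(self):
        # Failed is the default; "ok" has to be earned by a fresh heartbeat.
        if self.last_heartbeat is None:
            return "failed"
        age = self.clock() - self.last_heartbeat
        return "ok" if age <= self.timeout_seconds else "failed"
```

The point of the design is that silence reads as failure: if the service, the network, or the reporting path dies, the heartbeat stops and the status flips without anyone having to successfully send a "we are down" message.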
This is a problem of "monoculture" dependencies and a failure to implement HA across multiple services. All GitHub releases are down, Atom downloads are down, and so on. Companies, including Amazon, should be using other CDNs for HA purposes, even if NIH.<p>It's similar to the mistake of making DNS a dependency of your monitoring/control infrastructure, which then fails exactly when DNS is down.
> The dashboard not changing color is related to S3 issue.<p>I don't understand this. The icon URL is in the HTML. Both icons, <a href="https://status.aws.amazon.com/images/status0.gif" rel="nofollow">https://status.aws.amazon.com/images/status0.gif</a> and <a href="https://status.aws.amazon.com/images/status3.gif" rel="nofollow">https://status.aws.amazon.com/images/status3.gif</a>, have been working for us all along. Plus, they are clearly able to update the status page contents, because they added the "increased error rates" message there too. I don't want to believe it, but is it fair to assume they did not want to replace status0.gif with status3.gif in the HTML? Please correct me if I'm not getting this straight.<p>In any case, it's a bad day for AWS folks, and I'm feeling their pain too. Being a cloud provider is a tough business to be in and the pressure is really high.
So the obvious answer would be to host it on something like Azure or Google Cloud Storage, but I can just imagine the institutional pushback that trying to do that would get.
Just to be clear, best practices for designing status pages:<p>1. Ensure it does not depend on your own infra (if your API server goes down, it should not take your status API down with it).<p>2. Make sure your service reports to your status page, instead of your status page polling the service.<p>3. Redundancy for your status page?<p>Anything anyone else wants to add?
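A rough sketch of points 2 and 3 together, in case it helps: the service pushes its own report, and it pushes to more than one status backend so a single outage can't silence it. Everything here (`push_status`, the "sink" callables, the report fields) is hypothetical and only meant to show the shape of the idea; real sinks would be HTTP calls to status hosts outside your own infra.

```python
from typing import Callable, Dict, Iterable

# A sink is any callable that delivers a report to one status backend.
Sink = Callable[[Dict[str, str]], None]


def push_status(report: Dict[str, str], sinks: Iterable[Sink]) -> int:
    """Push a report to every sink and return how many accepted it."""
    delivered = 0
    for sink in sinks:
        try:
            sink(report)
            delivered += 1
        except Exception:
            # One status backend being down must not block the others.
            continue
    return delivered
```

If `push_status` returns 0, the service could not reach any status backend at all. Combined with a timeout on the receiving side, that silence should itself show up as "failed" rather than leaving a stale green icon.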