My Friday Night With AWS

30 points by BenjaminCoe almost 13 years ago
Brain dump of my thoughts about how we handled/recovered from the AWS outage last night.

5 comments

inopinatus almost 13 years ago
In most contexts, Disaster Recovery is not the same as High Availability, which is not the same as Fault Tolerance.

So, in this context, if your devops crew is on the ball, then the first warning in this article:

"The only way to ensure close to 100% up time is replicating your entire infrastructure. Infrastructure costs will more than double ..."

is mercifully untrue in the majority of cases.

Why? Because unless the major component of your infrastructure cost is storage, or your Recovery Point Objective (RPO) is zero, database log shipping and bulk data sync to another region aren't all that expensive.

The author may be assuming that you'd need to have the VMs ready to go at the standby region. This isn't true, not when you can boot a large application cluster and promote/upgrade a database replica in minutes. For the majority of businesses, a realistic Recovery Time Objective (RTO) is on the order of minutes to hours, so this is fine.

I built this recently. A booking system for an airline. Works as intended. Failover time is under five minutes. What enables this is repeatability of deployment, which is an outcome of careful tooling. The application itself was developed by an agile & TDD-centric team, which made for an easily transplanted app.
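As a rough illustration of the RPO/RTO reasoning in the comment above, here is a minimal Python sketch of checking a standby-region plan against recovery targets. The class name, field names, and all timings are hypothetical and chosen only for illustration; they are not taken from the commenter's airline system or from any AWS API.

```python
from dataclasses import dataclass


@dataclass
class StandbyPlan:
    """Hypothetical timings (in seconds) for a standby region that keeps
    data replicated but boots application VMs only at failover time."""

    log_ship_interval_s: float  # how often database logs reach the standby
    replica_promote_s: float    # time to promote the database replica
    cluster_boot_s: float       # time to boot the application cluster
    dns_cutover_s: float        # time for traffic to move to the new region

    def worst_case_rpo_s(self) -> float:
        # Anything written since the last shipped log segment may be lost.
        return self.log_ship_interval_s

    def estimated_rto_s(self, parallel: bool = True) -> float:
        # With repeatable deployment tooling, replica promotion and cluster
        # boot can overlap; otherwise they run one after the other.
        if parallel:
            bring_up = max(self.replica_promote_s, self.cluster_boot_s)
        else:
            bring_up = self.replica_promote_s + self.cluster_boot_s
        return bring_up + self.dns_cutover_s

    def meets(self, rpo_s: float, rto_s: float) -> bool:
        # True when both recovery objectives are satisfied.
        return (
            self.worst_case_rpo_s() <= rpo_s
            and self.estimated_rto_s() <= rto_s
        )


# Illustrative numbers only (assumptions, not measurements).
plan = StandbyPlan(
    log_ship_interval_s=60,
    replica_promote_s=90,
    cluster_boot_s=180,
    dns_cutover_s=60,
)
print(plan.estimated_rto_s())             # 240
print(plan.meets(rpo_s=300, rto_s=300))   # True
```

With these made-up numbers, the plan fits a five-minute RTO only when promotion and boot overlap; run sequentially, the estimate is 330 seconds. A zero RPO can never be met by periodic log shipping, which matches the exception the comment carves out.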
alanh almost 13 years ago
While I appreciate anyone taking the time to share their thoughts, I also find it very distracting that nearly every sentence contains some sort of grammatical, orthographic, or structural error.

Does this make me the grammar police, or do I have a valid complaint?

Update. Putting my time where my mouth is: Next time someone has a real time crunch (as Coe notes at the end) but wants to publish a helpful post in a timely manner, contact me with a draft or CMS credentials and I'll take at least a quick look. Expect no miracles, but I will catch obvious errors.

I also keep wishing I could send pull requests to bloggers with suggested edits.
nothacker almost 13 years ago
Redundancy wasn't the problem I saw last night. What I saw, at least with Heroku, is that when I checked, *the main Heroku site was down and displaying things like nginx errors*. That to me is unacceptable for an operation such as theirs. Even if all hell is breaking loose, you don't *only* keep your status page up for all to see, you have a pretty damn good message up that the main page resolves to. I'm not saying they screwed the pooch entirely as I'm sure they were busy, but, damn it, even Amazon is going to go down sometimes. Screw redundancy if you can't even serve a webpage to inspire confidence that you are working on it. I'm sorry I'm picking on Heroku specifically, because I'd be really f'n surprised if a lot of you weren't in the same boat. You *need* to have the main page served when that happens, even just a static page that inspires confidence or directs to the blog and provides updates there.
BenjaminCoe almost 13 years ago
The first indicator that it was going to be a long Friday night was our EC2 hosted Minecraft server tipping over. Nagios alerts followed. This is a brain dump of some of my thoughts about AWS, and a discussion of how we got back on-line quickly.
talonx almost 13 years ago
A sober post with good advice compared to most of the rants about the outage that are now on HN.

It's also telling about our need for sensationalism that those rants have more comments than this article!