A rough guide to keeping your website up through catastrophic events

120 points by fredsters_s almost 13 years ago

17 comments

birken, almost 13 years ago

I think one factor people don't consider enough is the tradeoffs you need to make in order to give your app incredible reliability. This article and others describe a bunch of work you can do to help ensure your application stays up in rare events. However, maybe for your particular product, having 99.5% uptime and offering a ton of features is going to make you more successful than 99.9% uptime.

When you are a Google or a Twitter or an Amazon, you lose lots of money per minute of downtime, so economically speaking it makes sense for them to invest in this. However, for an average startup, I don't think having a couple hours of downtime per month is actually going to be that big of a deal. Of course you need to ensure your data is always safe, and that your machine configuration is easy to deploy (via AMIs or something like Puppet) so you can get back up and running in catastrophic cases. But at the end of the day, having a good "We're currently having technical issues" screen could very well be a better investment than shooting for a technical setup that can stay up through catastrophic events.
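The 99.5% vs. 99.9% comparison above is easy to make concrete: each extra nine cuts the monthly downtime budget roughly fivefold here. A quick sketch of the arithmetic:

```python
# Downtime budget implied by an availability target, assuming a 30-day month.
def monthly_downtime_minutes(availability: float, days: int = 30) -> float:
    total_minutes = days * 24 * 60
    return total_minutes * (1.0 - availability)

for target in (0.995, 0.999):
    print(f"{target:.1%} -> {monthly_downtime_minutes(target):.0f} minutes/month allowed down")
```

At 99.5% that is about 3.6 hours a month of allowed downtime; at 99.9%, about 43 minutes. The gap between those two budgets is exactly what birken argues may not be worth the engineering cost for an early-stage product.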
WALoeIII, almost 13 years ago

You can't just put nodes in different regions, even with a database like MongoDB. It will work in theory; in practice you'll have all kinds of latency problems.

WAN replication is a hard problem, and glossing over it by waving your hands is a disservice to readers.

"Real" solutions are to run a database that is tolerant of partitioning, and to have application-level code to resolve the inevitable conflicts. Riak, Cassandra, and other Dynamo-inspired projects offer this. On the other hand, you can use a more consistent store and hide the latency with write-through caching (this is how Facebook does it with memcached + MySQL), but now you have application code that deals with managing this cache.

Either way you have to have very specific application code to handle these scenarios, and you may even run a combination of solutions for different types of data you need to store. There is no silver bullet; there is no framework or product that does it for you.
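The "application-level code to resolve the inevitable conflicts" mentioned above can be as simple as last-write-wins. Dynamo-style stores actually surface sibling versions via vector clocks; the timestamp tie-break and the `Version` type below are deliberate simplifications for illustration:

```python
from dataclasses import dataclass

@dataclass
class Version:
    value: str
    timestamp: float  # wall-clock write time; assumes loosely synchronized clocks

def resolve(local: Version, remote: Version) -> Version:
    """Last-write-wins merge: keep the newer write, favoring local on ties."""
    return remote if remote.timestamp > local.timestamp else local

# Two replicas accepted conflicting writes during a network partition:
a = Version("cart=[book]", timestamp=100.0)
b = Version("cart=[book,pen]", timestamp=105.0)
print(resolve(a, b).value)  # the later write survives; the earlier one is dropped
```

Note that last-write-wins silently discards concurrent updates (here the lone "book" cart), which is exactly why real systems often push the merge decision up into the application instead.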
po, almost 13 years ago

Reading all of these post-mortems and guides to keeping your servers up and running, it strikes me how much AWS jargon is in there.

The fact that so many developers have invested so much time into learning Amazon-specific technologies means that developers are left to deal with the problem within that worldview. Going multi-datacenter means learning two of every technology layer.

You *could* solve all of these problems using standard non-Amazon unix tools, technologies, and products; however, Amazon has enabled a whole class of development that makes it easier to just work within their system. It's easier to wait for Amazon to figure it out for the general case and trust them than to figure it out and implement it yourself.

There are other risks to being the lone wolf, but for a lot of people, being in the herd has a certain kind of safety, despite the limitations.

Not making a judgement call on it, but it is something I have noticed with these outages.
mnutt, almost 13 years ago
It seems like you can make a good tradeoff between Chef and AMIs by nightly rebuilding the AMIs off a fully configured system, and then when the machine comes up you run Chef to make up the incremental difference.
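mnutt's hybrid (a nightly-baked AMI plus a small Chef run at launch) reduces to a boot-time decision: a fresh image only needs a cheap incremental converge, while an image from a stale or failing bake job should fall back to a full converge. A sketch of that decision, with the 48-hour threshold as an illustrative assumption:

```python
def boot_plan(baked_at_epoch: float, now_epoch: float,
              stale_after_hours: float = 48.0) -> str:
    """Decide how much configuration work a freshly launched instance needs."""
    image_age_hours = (now_epoch - baked_at_epoch) / 3600.0
    if image_age_hours > stale_after_hours:
        return "full-converge"       # bake is stale; run the whole Chef catalog
    return "incremental-converge"    # only make up the delta since last night's bake

# An image baked 6 hours ago only needs the incremental run:
print(boot_plan(baked_at_epoch=0.0, now_epoch=6 * 3600.0))  # incremental-converge
```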
bifrost, almost 13 years ago

There's this really great thing called a CDN that can be used to keep your "Web Site" up at all times, even if your origin servers are down. It doesn't help your web app, but it's better than looking like you've disappeared from the planet.
ccaum, almost 13 years ago

You can use Puppet to pre-bake your AMIs for you, so you can scale very rapidly and still use configuration management to maintain your instances: http://puppetlabs.com/blog/rapid-scaling-with-auto-generated-amis-using-puppet/
tillk, almost 13 years ago

This blog post is hilarious.

A region in AWS-speak is already multiple supposedly independent data centers (in AWS terms: AZs, availability zones).

So if an entire region fails, that's four or so data centers which all go down at the same time.

So how many companies on bare metal have four data centers and experience this kind of catastrophic downtime? Add to that: how many of these companies operate completely in the dark about which data center is actually which?

These blog posts are annoying because it seems like these people have never done anything they suggest themselves.

Yes, the cloud lets you set up a fully configured instance within minutes. But at what expense? Mostly opacity about the entire stack and what is going on.

Food for thought.
rdl, almost 13 years ago

https://status.heroku.com/incidents/151

I assumed this was a response to the recent hella-long outage:

"There are three major lessons about IaaS we've learned from this experience:

1) Spreading across multiple availability zones in single region does not provide as much partitioning as we thought. Therefore, we'll be taking a hard look at spreading to multiple regions. We've explored this option many times in the past - not for availability reasons, but for customers wishing to have their infrastructure more physically nearby for latency or legal reasons. We've always chosen to prioritize it below other ways we could spend our time. It's a big project, and it will inescapably require pushing more configuration options out to users (for example, pointing your DNS at a router chosen by geographic homing) and to add-on providers (latency-sensitive services will need to run in all the regions we support, and find some way to propagate region information between the app and the services). These are non-trivial concerns, but now that we have such dramatic evidence of multi-region's impact on availability, we'll be considering it a much higher priority.

2) Block storage is not a cloud-friendly technology. EC2, S3, and other AWS services have grown much more stable, reliable, and performant over the four years we've been using them. EBS, unfortunately, has not improved much, and in fact has possibly gotten worse. Amazon employs some of the best infrastructure engineers in the world: if they can't make it work, then probably no one can. Block storage has physical locality that can't easily be transferred. That makes it not a cloud-friendly technology. With this information in hand, we'll be taking a hard look on how to reduce our dependence on EBS.

3) Continuous database backups for all. One reason why we were able to fix the dedicated databases quicker has to do with the way that we do backups on them. In the new Heroku PostgreSQL service, we have a continuous backup mechanism that allows for automated recovery of databases. Once we were able to provision new instances, we were able to take advantage of this to quickly recover the dedicated databases that were down with EBS problems."

Then I checked the date. It's actually Heroku's response to their super-long April 2011 outage. Yet it appears the "we should go across regions" lesson wasn't learned.
ukd1, almost 13 years ago

Love AWS. We're using Heroku and it's been pretty painful over the last month or so. However, it's super easy. At some point they should have an SLA, as their underlying hosting (aka AWS) provides one, with credits for outages. The number one feature for Heroku that would help is being able to specify multiple zones or regions when creating new apps.
talonx, almost 13 years ago

This article takes too simplistic a view of real-world deployments on AWS, and attempts to sum it up in 5 bullet points. Yes, I know the title says it's a "rough" guide, but why not go into more depth and acknowledge that there's more diversity in deployment models out there? The other option would have been to keep it very high-level and not talk about specific tools.

E.g., use Route 53? Isn't that hosted on Amazon itself? Why create another point of failure?

MongoDB - how many big sites on the cloud use it as their primary database?

The only takeaway for me was the last paragraph: "In conclusion, you probably used a single zone because it's easy (hey - so do we for now!). There will come a point where the pain of getting shouted at by your boss, client or customers outweighs learning how to get your app setup properly yourself."
mitchellh, almost 13 years ago

Regarding doing un-encrypted cross-datacenter replication with MongoDB, I recommend the author of this blog post read this: http://en.wikipedia.org/wiki/Fallacies_of_Distributed_Computing
hoop, almost 13 years ago

Am I the only one who feels like the author missed the mark?

The blog post is supposed to be about "keeping your website up through catastrophic events", and the dominant theme seems to be "invest *more* heavily in Amazon." IMHO, the exact opposite needs to happen.

Sure, I understand that being in multiple regions means you supposedly have very autonomous deployments (including Amazon's API endpoints), but nobody can prove to us that each zone, or even each region, is totally separate.

I'm not saying that Amazon is being dishonest about their engineering - I simply believe that by being 100% reliant on a single vendor, you fail to mitigate any systemic risk that is present. That risk can be technical or business risk, as engineering at this level isn't strictly a technical profession.
ehsanu1, almost 13 years ago
The article mentions using custom origins with CloudFront, and I don't understand why setting up origin.mydomain.com was required. At work, we use mydomain.com directly as the custom origin, and setup was super simple (just tell CloudFront the domain). Is there anything wrong with doing it this way?
fboule, almost 13 years ago

What about enhancing the autoscaling, automated configuration & deployment, and/or AMIs parts of the article with virtualization (ESXi, with or without vMotion)? No need for configuration and deployment - just duplicate your VM to have it on the other site, or let vMotion move it for you.
timothy2012, almost 13 years ago
Okay, so before all that, I guess the first crucial step is to have the site/system monitored by some third party such as Monitive or Pingdom. Then you can take action based on information (facts).<p>Tim.
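An external check in the spirit of the services timothy2012 names boils down to fetching the site from a box you don't host yourself and classifying the response. A minimal sketch (the health rule and timeout are placeholder assumptions, not anything those services specify):

```python
import urllib.request

def is_healthy(status: int) -> bool:
    """Treat 2xx/3xx responses as up; anything else counts as down."""
    return 200 <= status < 400

def check(url: str, timeout: float = 10.0) -> bool:
    """One probe: connection errors and timeouts are failures too."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return is_healthy(resp.status)
    except Exception:
        return False
```

Real monitoring services probe from several locations and alert only after consecutive failures, to avoid paging you over a single flaky network path.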
fredsters_s, almost 13 years ago
So before anyone else points this out - yes our app is currently hosted in a single zone, and no we do not plan on keeping it this way! (we're currently in early Alpha)
mbs348, almost 13 years ago
sounds hard