How to sleep at night having a cloud service: common architecture do's

382 points by dshacker over 5 years ago

17 comments

One thing missing here is to avoid synchronous communication. Sync comms tie client state to server state; if the server fails, the client is responsible for handling it.

If you use queue-based services your clients can 'fire and forget', and then your error handling logic can be encapsulated by the queue/consumers.

This means that if you deploy broken code, rather than a cascading failure across all of your systems you just have a queue backup. Queue backups are also really easy to monitor, and make a great smoke-signal alert.

The other way to go, for sync comms, would be circuit breakers.

My current project uses queue-based communications exclusively and it's great. I have retry queues, which use over-provisioned compute, and a dead-letter queue for manually investigating messages that caused persistent failures.

Isolation of state is probably the #1 suggestion I have for building scalable, resilient, self-healing services.

100% agree with and would echo the content in the article, otherwise.

edit: Also, idempotency. It's worth taking the time to write idempotent services.
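A minimal sketch of the retry-queue / dead-letter / idempotency pattern described above. Everything here is an assumption for illustration: `Broker`, `Message`, and `MAX_ATTEMPTS` are hypothetical names, and the in-memory deques stand in for a real queue service.

```python
from collections import deque
from dataclasses import dataclass, field

MAX_ATTEMPTS = 3  # assumed retry budget before a message is dead-lettered


@dataclass
class Message:
    id: str
    body: dict
    attempts: int = 0


@dataclass
class Broker:
    """In-memory stand-in for a real queue service (hypothetical)."""
    main: deque = field(default_factory=deque)
    retry: deque = field(default_factory=deque)
    dead_letter: list = field(default_factory=list)
    processed_ids: set = field(default_factory=set)  # idempotency record

    def publish(self, msg):
        # Fire and forget: the producer never waits on the consumer.
        self.main.append(msg)

    def consume(self, queue, handler):
        # One pass over the messages currently in the queue.
        for _ in range(len(queue)):
            msg = queue.popleft()
            if msg.id in self.processed_ids:
                continue  # duplicate delivery, already handled
            try:
                handler(msg.body)
                self.processed_ids.add(msg.id)
            except Exception:
                msg.attempts += 1
                if msg.attempts >= MAX_ATTEMPTS:
                    self.dead_letter.append(msg)  # park for manual inspection
                else:
                    self.retry.append(msg)


handled = []


def handler(body):
    if body.get("poison"):
        raise ValueError("cannot process")
    handled.append(body)


broker = Broker()
broker.publish(Message("a", {"n": 1}))
broker.publish(Message("a", {"n": 1}))        # duplicate of "a"
broker.publish(Message("b", {"poison": True}))

broker.consume(broker.main, handler)
while broker.retry:                            # drain the retry queue
    broker.consume(broker.retry, handler)
```

The failing message ends up in `dead_letter` after its retry budget is spent, while the duplicate is dropped by the ID check instead of being processed twice. In a real system the processed-ID record would need to live in durable storage.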

encoderer over 5 years ago

I like guides like this that can help beginners bridge the gap between hobby and professional quality development.

I’ll add one more tip, the one I think has saved me more sleep and prevented more headaches than any other as I’ve developed a SaaS app over the last 5 years.

It’s simple: Handle failure cases in your code, and write software that has some ability to heal itself.

Here are a few things I’ve developed that have saved my butt over the years:

1) An application that is deployed alongside the primary application, tails error logs and replays failed requests. (Idempotent requests make this possible.)

2) Many built-in health checks, like checking back pressure on queues and auto-throttling event emitters when queues get backed up.

3) Local event buffering to deal with latency spikes in things like SQS.

I hope to eventually write more about these systems on our blog but I never seem to find the time.
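Items 2 and 3 above could look roughly like this. It is only a sketch under stated assumptions: `throttle_delay`, `BufferedEmitter`, and the limit values are hypothetical, and `send` stands in for whatever call publishes to the real queue.

```python
from collections import deque


def throttle_delay(queue_depth, soft_limit=1000, hard_limit=5000, max_delay=2.0):
    """Seconds an emitter should pause before its next event.

    No delay below soft_limit, max_delay at or above hard_limit,
    linear in between. All thresholds are assumed values.
    """
    if queue_depth <= soft_limit:
        return 0.0
    if queue_depth >= hard_limit:
        return max_delay
    fraction = (queue_depth - soft_limit) / (hard_limit - soft_limit)
    return max_delay * fraction


class BufferedEmitter:
    """Buffers events locally when the downstream send call fails."""

    def __init__(self, send, max_buffer=10_000):
        self._send = send                        # may raise on timeouts
        self._buffer = deque(maxlen=max_buffer)  # oldest events drop when full

    def emit(self, event):
        self._buffer.append(event)
        self.flush()

    def flush(self):
        # Send in order; stop at the first failure and keep the rest buffered.
        while self._buffer:
            try:
                self._send(self._buffer[0])
            except Exception:
                return
            self._buffer.popleft()
```

Because an event is removed from the buffer only after `send` returns, a failure partway through can lead to a resend later, so the consumer still needs to be idempotent.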

rubyn00bie over 5 years ago

You know the one thing that has helped me out the most? An error reporting service AND then addressing _every_ error.

That is to say, my service should emit zero 500 errors.

Then my reporting is easy to interpret and consistently meaningful. I don't have to worry about bullshit noise like "oh that's just X, it does that sometimes."

Sleeping at night is a lot easier when you have less keeping you awake.

savrajsingh over 5 years ago

Just about everything mentioned here is well-handled by Google App Engine. I still think it’s the way to go for most projects, but I don’t think they’ve marketed themselves well lately. I’m sure there are other good providers too; I don’t see the downside to using PAAS.

bcrosby95 over 5 years ago

> A 4 9’s means you can only have 6 minutes down a year.

4 9's is 52 minutes of downtime a year. Keep in mind that the single-region EC2 SLA is only 99.99%. And if you rely on a host of services that each have an SLA of 99.99%, yours is actually worse than 99.99%. So if you want to actually get to 99.99%, your components have to be better than this, meaning you will have to go multi-region. So achieving this is actually way harder than this simple step.
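The arithmetic behind these numbers, as a small sketch. The function names are made up for illustration, and the redundancy formula assumes failures are independent, which real regions and services only approximate.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def downtime_minutes_per_year(availability):
    """Allowed downtime per year for a given availability (0..1)."""
    return (1 - availability) * MINUTES_PER_YEAR


def serial_availability(*components):
    """Availability of a service that needs ALL components up."""
    result = 1.0
    for a in components:
        result *= a
    return result


def redundant_availability(availability, copies):
    """Availability when ANY one of `copies` independent replicas suffices."""
    return 1 - (1 - availability) ** copies


four_nines = downtime_minutes_per_year(0.9999)     # about 52.56 minutes
five_nines = downtime_minutes_per_year(0.99999)    # about 5.26 minutes
chained = serial_availability(0.9999, 0.9999, 0.9999)  # about 0.9997
two_regions = redundant_availability(0.9999, 2)    # about 0.99999999
```

Roughly 5 to 6 minutes a year is the five-nines budget, not four. Chaining three 99.99% dependencies drops the result to about 99.97%, which is why redundancy across regions is needed to get back above 99.99%.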

jto1218 over 5 years ago

I'd recommend using an off-the-shelf APM product to get a lot of the functionality mentioned in the article (Monitoring, Tracing, Anomaly Detection). I would definitely _not_ recommend trying to roll all that yourself, unless you have a ton of time and resources.

There are a few good ones out there; we use Instana and it's working really well.

synack over 5 years ago

This is all good advice for the app tier, but in my experience the most painful outages relate to the data store. Understand your read/write volume, have a plan for scaling up/out, implement caching wherever practical, and have backups.

jturpin over 5 years ago

Good article. I would add one thing to this - pick a database that scales horizontally and is distributed. CockroachDB, Elasticsearch, Mongo, Cassandra/Scylla are all good choices. If you lose one node, you don't have to be afraid of your cluster going down, meaning you can do maintenance and reconfiguration without downtime. If your load is low or bursty you can even get away with running these on some small servers such as t3 (probably minimally t3.larges). Running a cloud managed database is also a good option.

z3t4 over 5 years ago

Having only one mirror is scary. If one goes down, it's like Murphy's law kicks in. So you want it to take at least 3 things going wrong to bring down your system; 2 is not enough. Also have redundancy everywhere, for example in case your checker agent stops working. You want 2 of everything and at least 3 of those that should never fail.

haolez over 5 years ago

As a solo founder, I have almost everything mentioned in this article set up, except CI/CD. I can certainly see its value, but being able to easily take down parts of my production system and replace them with instrumented variants is very useful to me when things go wrong. I find that CI usually gets in the way of this. Maybe it's just a bad habit that I need to ditch :)

corentin88 over 5 years ago

Haven’t seen anything related to the third-party services that your cloud service relies on. I’m talking mostly about APIs you use that might crash at some point. Any recommendations on that part?

luord over 5 years ago

This is a great list. I feel a little happy with myself that I knew about most of these.

Except for identifying each request, I had never heard about that. It's so simple yet so brilliant, gotta start doing it.
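Identifying each request can be as small as the sketch below. The `X-Request-ID` header name is a common convention rather than a standard, and `ensure_request_id` and `log_line` are hypothetical helpers, not part of any framework.

```python
import uuid

REQUEST_ID_HEADER = "X-Request-ID"  # assumed header name (common convention)


def ensure_request_id(headers):
    """Return a copy of `headers` that is guaranteed to carry a request ID.

    An ID supplied by the caller is kept so the same ID follows the request
    across services; otherwise a new random one is generated.
    """
    out = dict(headers)
    if not out.get(REQUEST_ID_HEADER):
        out[REQUEST_ID_HEADER] = uuid.uuid4().hex
    return out


def log_line(headers, message):
    """Prefix a log message with the request ID so lines can be correlated."""
    return f"request_id={headers[REQUEST_ID_HEADER]} {message}"


incoming = ensure_request_id({"Accept": "application/json"})
downstream = ensure_request_id(incoming)  # same ID is passed along unchanged
```

Searching the logs of every service for a single ID then gives the full path of one request.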

swader999 over 5 years ago

Good article. Fire drills are worthy of mention. Simulate parts going down, practice recovery.

fcvarela over 5 years ago

Nice article, may I ask what tools you used to produce the illustrations?

jupp0r over 5 years ago

Pretty funny that HN traffic seems to have killed the site.

vishaalk over 5 years ago

Great article Sada :). Hope OneNote is treating you well!

peterwwillis over 5 years ago

I applaud the author for sharing their notes. But also, this is why HN (and general upvote-anything-that-looks-interesting forums) sucks. If you are actually defining architecture, you should not be reading these kinds of blog posts. I get that they are interesting to the layman, but so is The Anarchist Cookbook. Don't make whatever you read in The Anarchist Cookbook.

And I'm crabbing about this because I am easily susceptible to Anarchist Cookbooks. I have had to implement X tech before, and googled for "How do I X", and some blog post came up saying "For X, Use Y". I'm too lazy to read 5 books on the general concept, so I just dive in and immediately download Y and run through the quick-start guide. After spending a while getting it going and getting past the "quick start", I wonder, "Ok, where's the long-start? What's next?" And that doesn't exist. And later, after a lot of digging, it turns out Y actually really sucks. But the blog post didn't go into that. I wasted my time (my own fault) because I read a short blog post.

A lot of people live by Infrastructure as Code, and so they will reach for literally anything which has that phrase in its description. But you don't need it to throw together an MVP, and a lot of the IaC "solutions" out there are annoying pieces of crap. I guarantee you that if you pick any of them up, you are in for months of occasionally painful edge cases where the answer to your problem is "You just weren't using it the right way."

In reality, if you want to be DevOps (yes, I'm using DevOps as an adjective, ugh) you should probably develop your entire development and deployment workflows by hand, and only when you've accomplished all of the basic requirements of a production service by hand (bootstrapping, configuration, provisioning, testing, deployment, security, metrics, logging, alerts, backup/restore, networking, scalability, load testing, continuous integration, immutable infrastructure & deployments, version-controlled configuration, documentation, etc.) should you start automating it all. If you've done all of these things before, automating it all from the start may be a breeze. If you haven't, you may spend a ton of time on automation, only to learn later that the above need to be changed, requiring rework of the automation.
