I see this kind of thinking all the time in hardware engineering as well, and it all boils down to premature optimization. Cost is almost always the driver.

One example: on a recent project, a very cost-sensitive machine reused a small heater copied over from another product, but no one actually verified that it met the required limits (only the default use case). It turned out not to be quite powerful enough, and by the time we found out it was far too late and expensive to fix. All the engineering time spent figuring that out was wasted too (though management often doesn't seem to count engineering time the way it counts parts cost).

I've since learned that at the beginning of a project it's critical to identify the riskiest parts of the design, isolate them into a module, and over-spec that module, ideally with a path to reducing cost later. But the most important thing I've learned is: don't try to solve tomorrow's problems today!
I've spent quite a bit of time on a problem very similar to this. It's surprisingly challenging. Imagine this scenario:

Some service has three units of capacity available (e.g. VMs). This is the minimum amount allowed, on the theory that things won't break too badly if one of them happens to crash. You target 66% CPU utilization. Suddenly, one goes down, and the software sees 100% CPU utilization on the other two. What should the software do?

Well, the obvious thing is to add one more instance, assuming that one of them crashed and its load shifted to the other two. However, what if what actually happened is that demand doubled, and the load caused the crash? Then you should probably add six more instances (assuming that the two remaining live ones are going to go down while those six are coming up).

If you look only at CPU utilization, it's impossible to tell the difference between these two situations.
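
To make that concrete, here's a toy Python sketch (the numbers are illustrative, not from any real system) of why the CPU signal alone is ambiguous:

    from fractions import Fraction
    from math import ceil

    TARGET = Fraction(2, 3)   # the ~66% CPU utilization target

    def desired_instances(demand):
        """Instances needed, where demand is measured in instance-equivalents of work."""
        return ceil(Fraction(demand) / TARGET)

    # Both cases below look identical to the autoscaler: two live instances at 100% CPU.

    # Case A: one of three instances crashed, demand unchanged (~2 instances' worth).
    print(desired_instances(2))   # 3 -> add one instance

    # Case B: demand doubled (~4 instances' worth) and the overload caused the crash.
    print(desired_instances(4))   # 6 -> you need six, and the two overloaded
                                  #      survivors may not last until they arrive

Both cases present exactly the same observation to the autoscaler, so any policy keyed only to CPU has to guess.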
This is even scarier in the physical world. Just-in-time logistics means companies aren't warehousing inventories as large as they used to. In the case of major events (natural disasters, terrorist attacks, etc.), there isn't enough reserve supply to go around.
It’s important that systems have some design margin (buffers of one kind or another) so that a disruption / transient event in one part of the system is absorbed locally and not passed on to the rest of the system.
It seems like this problem is solved by simply setting a sensible minimum in an autoscaling group, and not an "everyone on Earth was abducted by aliens and stopped using the service" level of minimum.

Say I'm an e-commerce site, and on Black Friday I can see historically (or just make an educated guess, if it's my first holiday sale) that I get "n" requests per second to my service.

I'll set my autoscaling group the day before to be able to handle those "n" requests per second, with the ability to grow if my expectations are exceeded. If my expectations are not met, my autoscaling group simply won't shrink below the minimum. Then, the day after the holiday sale, I can configure my autoscaling group to have a different minimum.

This solves the problem of balancing capacity planning against saving money by not leaving idle resources running.

If you're the type of person who hates human intervention in running your operation, then fine: put in a scheduled config change every year before the sale to change your autoscaling group size.

It's pretty rare to have enormous spikes in application usage without good reason, such as video-game releases, holiday sales, startup launches, or viral social media campaigns.
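
For what it's worth, here's a rough sketch of that scheduled change using boto3, against a hypothetical auto scaling group called "checkout-service" (sizes and dates are made up; derive your own "n" from historical traffic):

    from datetime import datetime, timezone
    import boto3

    autoscaling = boto3.client("autoscaling")

    # Raise the floor the day before the sale...
    autoscaling.put_scheduled_update_group_action(
        AutoScalingGroupName="checkout-service",
        ScheduledActionName="black-friday-floor",
        StartTime=datetime(2019, 11, 28, tzinfo=timezone.utc),
        MinSize=30,           # enough capacity for the expected "n" requests/sec
        MaxSize=100,          # room to grow if expectations are exceeded
        DesiredCapacity=30,
    )

    # ...and drop it back down once the sale is over.
    autoscaling.put_scheduled_update_group_action(
        AutoScalingGroupName="checkout-service",
        ScheduledActionName="post-sale-floor",
        StartTime=datetime(2019, 12, 2, tzinfo=timezone.utc),
        MinSize=5,
        MaxSize=100,
        DesiredCapacity=5,
    )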
I recently gave a talk at SRECon [1] about a partial solution: using a PID controller. It won't solve all instances of this problem, but properly tuned, it will dampen the effect of these sudden events and shorten the response time to them.

[1] https://www.usenix.org/conference/srecon19emea/presentation/hahn
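
For anyone who doesn't want to sit through the talk, this is the general shape of the idea (not the exact implementation from the presentation): a textbook PID loop in Python with placeholder gains, where tuning kp/ki/kd is the actual hard part.

    class PIDController:
        """Plain proportional-integral-derivative controller."""

        def __init__(self, kp, ki, kd, setpoint):
            self.kp, self.ki, self.kd = kp, ki, kd
            self.setpoint = setpoint
            self.integral = 0.0
            self.prev_error = None

        def update(self, measurement, dt):
            """Return a control output for the latest measurement and time step (seconds)."""
            error = measurement - self.setpoint
            self.integral += error * dt
            derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
            self.prev_error = error
            return self.kp * error + self.ki * self.integral + self.kd * derivative

    # Hypothetical usage: every metrics interval, turn the output into a replica delta.
    pid = PIDController(kp=10.0, ki=0.02, kd=1.0, setpoint=0.66)   # target 66% CPU
    replicas = 3
    for observed_cpu in (0.66, 0.70, 0.95, 1.00, 0.80):
        delta = pid.update(observed_cpu, dt=30.0)
        replicas = max(3, replicas + round(delta))

The derivative term is what damps overshoot after a sudden spike, and the integral term keeps pushing until the steady-state error actually goes away.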
> Of course, at some point, [...] the local service gets restarted by the ops team (because it can't self-heal, naturally)

Maybe off-topic, but what are some good strategies for the kind of "self-healing" being talked about here? If a service needs to be restarted, how could you automate the detection and restart process?
There's something related called the bullwhip effect. I *think* that throwing away requests under load, rather than putting them in some overflow queue, prevents it: the effect isn't magnified down the chain of services as each one scales up, because each service only ever sees live incoming traffic rather than an accumulated backlog.
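
A toy illustration of the difference, in Python with made-up numbers: a bounded queue sheds the excess at the edge, while an unbounded one accumulates a backlog that later gets replayed at full force against everything downstream.

    from collections import deque

    class SheddingQueue:
        """Bounded queue: reject work beyond capacity instead of buffering it."""

        def __init__(self, max_depth):
            self.max_depth = max_depth
            self.items = deque()
            self.rejected = 0

        def offer(self, item):
            if len(self.items) >= self.max_depth:
                self.rejected += 1       # caller gets an error (e.g. a 429) and backs off
                return False
            self.items.append(item)
            return True

    q = SheddingQueue(max_depth=100)
    for request in range(1000):          # a burst far above capacity
        q.offer(request)
    print(len(q.items), q.rejected)      # 100 buffered, 900 shed immediately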
Dynamically scaling down based on CPU consumption is the wrong way to do it, IMO. If your site is decently sized, you have a pretty typical diurnal pattern with weekly cyclical variation; that's your baseline.
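
Roughly something like this sketch (Python, with a hypothetical `history` mapping of (weekday, hour) to observed requests/sec from previous weeks): set the capacity floor from the time-of-week baseline instead of reacting to instantaneous CPU.

    import math
    from statistics import mean

    def baseline_rps(history, weekday, hour, headroom=1.3):
        """Expected load for this time of week, padded with some headroom."""
        return mean(history[(weekday, hour)]) * headroom

    def min_instances(history, weekday, hour, rps_per_instance=100.0, floor=3):
        """Capacity floor derived from the diurnal/weekly baseline."""
        return max(floor, math.ceil(baseline_rps(history, weekday, hour) / rps_per_instance))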
But if your service was down for longer than it takes to downscale to the minimum, scaling back up is not that big of an issue; it was down anyway. Also, 24/7 instances exist for a reason: autoscaling is there to handle spikes, not normal traffic.
That just means you should scale based on the work to be done rather than poor proxies such as CPU utilization. Also, set a reasonable minimum and maximum based on observed load in production, and review them as part of regular operational reviews.
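
As a sketch of what "scale on the work to be done" can look like in practice (illustrative numbers; the real signal might be queue depth, request rate, or both):

    import math

    MIN_INSTANCES = 3     # reviewed against observed baseline load
    MAX_INSTANCES = 50    # reviewed against budget and downstream capacity

    def desired_instances(backlog, arrival_rate, per_instance_rate, drain_target_s=60.0):
        """Enough instances to keep up with arrivals and drain the backlog in about a minute."""
        needed = (arrival_rate + backlog / drain_target_s) / per_instance_rate
        return max(MIN_INSTANCES, min(MAX_INSTANCES, math.ceil(needed)))

    # e.g. 12,000 queued messages, 400 msg/s arriving, each instance handles 50 msg/s
    print(desired_instances(12_000, 400, 50))   # 12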
A good edge case to consider when designing an autoscaling service, but now that I'm aware of it, I think I can design around the problem with some combination of the suggested solutions and still get the autoscaling that the article seems to be trying to talk me out of...
If scaling up is painful, there is something wrong with the architecture. Aside from this scenario, what if you just get a spike in traffic? If your scaling solution can't handle that, get a better one; otherwise, what's the point?