SRE Fundamentals: SLIs, SLAs and SLOs

287 points by nealmueller almost 7 years ago

12 comments

Animats almost 7 years ago

Google: "An SLA normally involves a promise to someone using your service that its availability should meet a certain level over a certain period, and if it fails to do so then some kind of penalty will be paid. This might be a partial refund of the service subscription fee paid by customers for that period, or additional subscription time added for free."

"Partial refund". That's a very low standard for a service level agreement, but typical of Google. Your whole business is down, it's their fault, and all you get is a partial refund on the service.

A service level agreement is really a service packaged with an insurance product. The insurance part should be evaluated as such: does it cover enough risk, and is the coverage amount high enough? You can buy business interruption insurance from insurance companies, and should price that out in comparison with the cost and benefits of an SLA. If this is crucial to your core business, as with an entire retail chain going down because a cloud-based point of sale system goes down, it needs to be priced accordingly.

See: [1]

[1] https://www.researchgate.net/publication/226123605_Managing_Violations_in_Service_Level_Agreements

pspeter3 almost 7 years ago

This is a great article for defining terms. For some reason though, this quote made me laugh out loud:

"Excessive availability can become a problem because now it’s the expectation. Don’t make your system overly reliable if you don’t intend to commit to it to being that reliable."

peterwwillis almost 7 years ago

If you're building a system from scratch, keep in mind that this way of designing your service may not be flexible enough. You don't want just service level objectives, agreements and indicators; you want customer-level ones.

Your service may end up providing for multiple customers with different requirements. Maybe 1% of your customers will end up using 99% of your resources, creating uncomfortable situations that affect the other 99% of customers. To get away from this you have to start spinning off multiple identical services just for groups of customers, which is really annoying to maintain. You may find you need to add hard resource limits to control customer behavior, and those are hard to add after the fact.

Instead, if you design your new system from scratch with customer-specific isolation and service levels, you can run one giant service and still prevent customer-specific load from hampering the rest of the service. You can also just run duplicate services at different levels of availability based on customer requirements, but that's not going to work forever.

As an aside, I'm looking forward to reading ITIL 2019 to see what new processes they've adopted. I think everyone who's getting into SRE stuff should have a solid foundation in the basics of IT operations management first.
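
To make the "hard resource limits" idea concrete, here is a minimal sketch of a per-customer token bucket in Python. It is only one possible way to isolate customers, not something the comment above prescribes; the class name `PerCustomerLimiter` and its parameters (`rate_per_s`, `burst`, `clock`) are made up for illustration.

```python
import time
from collections import defaultdict


class PerCustomerLimiter:
    """Hypothetical per-customer token bucket (illustrative names only).

    Each customer gets an independent bucket, so one heavy customer
    exhausting its own budget does not affect anyone else.
    """

    def __init__(self, rate_per_s, burst, clock=time.monotonic):
        self.rate = rate_per_s    # tokens refilled per second, per customer
        self.burst = burst        # maximum tokens a customer can accumulate
        self.clock = clock        # injectable clock, handy for testing
        self._tokens = defaultdict(lambda: burst)  # new customers start full
        self._last = {}           # last time each bucket was refilled

    def allow(self, customer_id, cost=1.0):
        """Return True and spend `cost` tokens if the customer has budget."""
        now = self.clock()
        last = self._last.get(customer_id, now)
        # Refill according to elapsed time, capped at the burst size.
        tokens = min(self.burst,
                     self._tokens[customer_id] + (now - last) * self.rate)
        self._last[customer_id] = now
        if tokens >= cost:
            self._tokens[customer_id] = tokens - cost
            return True
        self._tokens[customer_id] = tokens
        return False
```

Different service levels could then be expressed as different `rate_per_s` and `burst` values per customer tier, though that mapping is an assumption here and not part of the sketch.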

asn1parse almost 7 years ago

In ops, we often have other internal groups that we either work with or support. It's often useful to view these groups as customers and use the same policies, perhaps with a few exceptions in some cases, to manage the relationship. Typically we call this the OLA, the operating level agreement. I can only speak for my own experience, but operations groups I've been part of that don't have this concept of the operating level agreement typically suffer various types of damage to reputation. This is because there are no rules around how internal groups assess accountability. With the terms of an OLA in place, you can defend your position as long as you stayed within those terms. For example, when we started building VAData data centers all over the world for Amazon, having an OLA let us push back on groups that claimed we were not holding up our end of the agreement.

bpchaps almost 7 years ago

When reading these articles, never forget that your company is NOT Google! If your company doesn't have the management/infrastructure/communication/skill structure that Google has, then it will be very difficult to implement these fundamentals.

In many cases, an SRE is a job to save costs. If your company doesn't get its shit together and doesn't give your SREs the support they need, then they'll hate their jobs and the company.

strmpnk almost 7 years ago

These distinctions started making more sense when I realized they map to OKRs, which is generally how Google is said to track individual and team performance.

In general, it's good to be precise about how you measure and when something is a hard or soft boundary. Otherwise, firefighting gets out of control. It's hard to determine when to stop something and put out a fire if you can't prioritize issues based on the boundaries you've set for your system.

zzzcpan almost 7 years ago

So, how do you choose that service level objective? How do you know which solutions to implement to not make things "overly reliable"? Isn't that the more important question? Doing this without some sort of methodology will almost always result in useless solutions and overpaying cloud and other hosting providers, like implementing rather expensive failover within the datacenter while ignoring how unreliable datacenters are and how cheaply you can implement failover between datacenters via DNS.

I like the idea of modelling availability/reliability for this. Even if you don't have the right numbers and do it on a napkin, not in code, it can still highlight the solutions with the best cost/benefit ratios.
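
For anyone who does want the napkin in code, a minimal sketch follows. It assumes independent failures and ignores failover switch time (DNS TTLs, health-check delay), and every availability number in it is invented purely for illustration; plug in your own.

```python
from math import prod


def serial(*availabilities):
    """Every component must be up: availabilities multiply."""
    return prod(availabilities)


def redundant(*availabilities):
    """At least one replica must be up (assumes independent failures)."""
    return 1 - prod(1 - a for a in availabilities)


def downtime_hours_per_year(availability):
    """Expected unavailable hours in a 365-day year."""
    return (1 - availability) * 365 * 24


# Made-up numbers, for illustration only.
SERVER = 0.99        # a single server
DATACENTER = 0.999   # the datacenter itself (power, network, ...)

# Option A: two servers with failover inside one datacenter.
within_dc = serial(redundant(SERVER, SERVER), DATACENTER)

# Option B: one server in each of two datacenters, failover between them.
across_dc = redundant(serial(SERVER, DATACENTER), serial(SERVER, DATACENTER))

print(f"within one DC : {within_dc:.5f}, "
      f"{downtime_hours_per_year(within_dc):.1f} h/year down")
print(f"across two DCs: {across_dc:.5f}, "
      f"{downtime_hours_per_year(across_dc):.1f} h/year down")
```

With these particular made-up inputs, the single datacenter caps option A no matter how much redundancy is added inside it, which is the point being made above. Correlated failures would change the picture, so treat the output as a rough comparison only.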

erikb almost 7 years ago

So there is one obscure metric "service is available, i.e. can do its job", and this metric has different attributes: there are actual metric values (SLIs), there are internal goals (SLOs) and there are legally binding promises (SLAs) to users/customers. I would argue that there is not much content here.

Content, imo, would be something like this: We define "available" as "processor_load<99% and disk_load<99% and ram_load<99% and server responds with http 200 on port xyz", because reason_a, reason_b, reason_c. But other people could argue that it is not as much about the node but about how service_x is experienced, so one could track the speed of http responses to user requests, and they should be under 0.1sec over 95% of the time. etc...

That you should track metrics, that you should set goals, and that you should define SLAs with your customers/users is standard business practice, not new knowledge.
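
As a rough illustration of the user-experience variant above (responses under 0.1 sec, 95% of the time), the SLI and the SLO check could be sketched like this. The `Request` record and both function names are hypothetical, and counting non-200 responses as "bad" is an assumption of this sketch, not something the article specifies.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code returned to the user
    latency_s: float   # time taken to respond, in seconds


def latency_sli(requests, threshold_s=0.1):
    """Fraction of all requests that returned HTTP 200 within threshold_s.

    Returns None when there was no traffic, since the SLI is undefined then.
    """
    if not requests:
        return None
    good = sum(1 for r in requests
               if r.status == 200 and r.latency_s <= threshold_s)
    return good / len(requests)


def meets_slo(sli, slo=0.95):
    """True if the measured SLI is at or above the objective."""
    return sli is not None and sli >= slo
```

The SLI is the measured number, the SLO is the `slo` threshold it is compared against, and an SLA would attach a penalty to missing some (usually looser) threshold.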

alttab almost 7 years ago

"Within Google, we implement periodic downtime in some services to prevent a service from being overly available."

Uh..... what?

ProAm almost 7 years ago

This is an interesting article from a company that has almost nil customer support.

insiderinsider almost 7 years ago

Getting the definitions right

saywatnow almost 7 years ago

Does Site Reliability include using assets from no fewer than 7 domains and requiring JavaScript to present a few paragraphs of text?