Google: <i>"An SLA normally involves a promise to someone using your service that its availability should meet a certain level over a certain period, and if it fails to do so then some kind of penalty will be paid. This might be a partial refund of the service subscription fee paid by customers for that period, or additional subscription time added for free."</i><p>"Partial refund". That's a very low standard for a service level agreement, but typical of Google. Your whole business is down, it's their fault, and all you get is a partial refund on the service.<p>A service level agreement is really a service packaged with an insurance product. The insurance part should be evaluated as such: does it cover enough risk, and is the coverage amount high enough? You can buy business interruption insurance from insurance companies, and you should price that out against the costs and benefits of an SLA. If this is crucial to your core business, as with an entire retail chain going down because a cloud-based point of sale system goes down, it needs to be priced accordingly.<p>See: [1]<p>[1] <a href="https://www.researchgate.net/publication/226123605_Managing_Violations_in_Service_Level_Agreements" rel="nofollow">https://www.researchgate.net/publication/226123605_Managing_...</a>
This is a great article for defining terms. For some reason though, this quote made me laugh out loud:<p>"Excessive availability can become a problem because now it’s the expectation. Don’t make your system overly reliable if you don’t intend to commit to it to being that reliable."
If you're building a system from scratch, keep in mind that this way of designing your service may not be flexible enough. You don't want just service level objectives, agreements and indicators, you want <i>customer level</i> ones.<p>Your service may end up serving multiple customers with different requirements. Maybe 1% of your customers will end up using 99% of your resources, creating uncomfortable situations that affect the other 99% of customers. To get away from this you have to start spinning off multiple identical services just for groups of customers, which is really annoying to maintain. You may find you need hard resource limits to control customer behavior, and those are hard to add after the fact.<p>Instead, if you design your new system from scratch with customer-specific isolation and service levels, you can run one giant service and still prevent customer-specific load from hampering the rest of the service. You can also just run duplicate services at different levels of availability based on customer requirements, but that's not going to work forever.<p>As an aside, I'm looking forward to reading ITIL 2019 to see what new processes they've adopted. I think everyone who's getting into SRE stuff should have a solid foundation in the basics of IT Operations management first.
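To make the "hard resource limits per customer" idea concrete, here is a minimal sketch of per-customer rate limiting with token buckets inside one shared service. The class names, tier names and rates are all made up for illustration, and a real system would also need isolation for CPU, memory and storage, not just request rate.

```python
import time


class TokenBucket:
    """Classic token bucket: refills at `rate_per_s`, holds at most `burst` tokens."""

    def __init__(self, rate_per_s, burst, now=None):
        self.rate = rate_per_s
        self.capacity = float(burst)
        self.tokens = float(burst)
        self.updated = time.monotonic() if now is None else now

    def allow(self, now=None, cost=1.0):
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        # Refill for the time that passed, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class CustomerLimiter:
    """One bucket per customer, sized by that customer's (hypothetical) service tier."""

    def __init__(self, tiers, default_tier):
        self.tiers = tiers                # tier name -> (rate_per_s, burst)
        self.default_tier = default_tier
        self.customer_tier = {}           # customer id -> tier name
        self.buckets = {}                 # customer id -> TokenBucket

    def set_tier(self, customer_id, tier):
        self.customer_tier[customer_id] = tier
        self.buckets.pop(customer_id, None)  # rebuild the bucket on next request

    def allow(self, customer_id, now=None):
        if customer_id not in self.buckets:
            tier = self.customer_tier.get(customer_id, self.default_tier)
            rate, burst = self.tiers[tier]
            self.buckets[customer_id] = TokenBucket(rate, burst, now)
        return self.buckets[customer_id].allow(now)


# Hypothetical tiers: (requests per second, burst size).
limiter = CustomerLimiter({"standard": (1.0, 2), "premium": (10.0, 20)}, "standard")
limiter.set_tier("big-retailer", "premium")
print(limiter.allow("big-retailer"), limiter.allow("small-shop"))
```

The point of the design is that a heavy customer only drains its own bucket, so everyone else keeps their service level without you having to fork the service per customer.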
In ops, we often have other internal groups that we either work with or support. It's often useful to view these groups as customers and use the same policies, perhaps with a few exceptions, to manage the relationship. Typically we call this the OLA, the operating level agreement. I can only speak from my own experience, but operations groups I've been part of that didn't have an OLA typically suffered damage to their reputation. Without one, there are no rules for how internal groups assess accountability; with one, you can defend your position as long as you stayed within its terms. For example, when we started building VAData data centers all over the world for Amazon, having an OLA let us push back on groups that claimed we were not holding up our end of the agreement.
When reading these articles, never forget that your company is NOT Google! If your company doesn't have the management/infrastructure/communication/skill structure that Google has, then it will be very difficult to implement these fundamentals.<p>In many cases, an SRE is a job to save costs. If your company doesn't get its shit together and doesn't give your SREs the support they need, then they'll hate their jobs and the company.
These distinctions started making more sense when I realized they map to OKRs, which are generally how Google is said to track individual and team performance.<p>In general, it's good to be precise about how you measure and whether something is a hard or soft boundary. Otherwise, firefighting gets out of control. It's hard to decide when to drop what you're doing and put out a fire if you can't prioritize issues based on the boundaries you've set for your system.
So, how do you choose that service level objective? How do you know which solutions to implement so as not to make things "overly reliable"? Isn't that the more important question? Doing this without some sort of methodology will almost always result in useless solutions and overpaying cloud and other hosting providers. For example, implementing rather expensive failover within the datacenter, while ignoring how unreliable datacenters are and how cheaply you can implement failover between datacenters via DNS.<p>I like the idea of modelling availability/reliability for this. Even if you don't have the right numbers and do it on a napkin, not in code, it can still highlight the solutions with the best cost/benefit ratios.
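For what it's worth, the napkin model also fits in a few lines of Python. This is only a sketch: every availability figure below is a made-up placeholder, and it assumes failures are independent, which real datacenters often violate.

```python
HOURS_PER_YEAR = 24 * 365


def series(*avail):
    """Availability of a chain where every component must be up."""
    result = 1.0
    for a in avail:
        result *= a
    return result


def parallel(*avail):
    """Availability of redundant components where one being up is enough.

    Assumes independent failures, which is optimistic.
    """
    all_down = 1.0
    for a in avail:
        all_down *= 1.0 - a
    return 1.0 - all_down


def downtime_hours(a):
    """Expected downtime per year for a given availability."""
    return (1.0 - a) * HOURS_PER_YEAR


# Hypothetical inputs, not measurements.
server = 0.99
datacenter = 0.999
dns_failover = 0.9999

# Option A: two redundant servers inside a single datacenter.
option_a = series(datacenter, parallel(server, server))

# Option B: one server in each of two datacenters, switched via DNS.
site = series(datacenter, server)
option_b = series(dns_failover, parallel(site, site))

for name, a in (("in-DC failover", option_a), ("cross-DC via DNS", option_b)):
    print(f"{name}: {a:.5f} available, ~{downtime_hours(a):.1f} h/year down")
```

With these placeholder numbers the single datacenter caps option A no matter how many servers you add, which is the comparison the napkin is supposed to surface. Put a cost next to each option and you have a rough cost/benefit table.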
So there is one obscure metric "service is available, i.e. can do its job", and this metric has different attributes: there are actual metric values (SLIs), there are internal goals (SLOs) and there are legally binding promises (SLAs) to users/customers. I would argue that there is not much content here.<p>Content, imo, would be something like this: We define "available" as "processor_load<99% and disk_load<99% and ram_load<99% and server responds with http 200 on port xyz", because reason_a, reason_b, reason_c. But other people could argue that it is not so much about the node as about how service_x is experienced, so one could track the speed of http responses to user requests, which should be under 0.1sec 95% of the time. etc...<p>That you should track metrics, that you should set goals, and that you should define SLAs with your customers/users is standard business practice, not new knowledge.
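As a rough illustration of the request-based definition above, here is a sketch that computes two SLIs from a request log and checks them against SLO targets. The `Request` record, the sample data and the 99.9% availability target are assumptions for the example; only the "http 200" and "under 0.1sec, 95% of the time" criteria come from the comment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    status: int        # HTTP status code returned to the user
    latency_s: float   # time to respond, in seconds


def availability_sli(requests):
    """Fraction of requests answered with HTTP 200."""
    if not requests:
        raise ValueError("no requests in window")
    return sum(1 for r in requests if r.status == 200) / len(requests)


def latency_sli(requests, threshold_s=0.1):
    """Fraction of requests answered faster than `threshold_s`."""
    if not requests:
        raise ValueError("no requests in window")
    return sum(1 for r in requests if r.latency_s < threshold_s) / len(requests)


def meets_slo(sli, target):
    """An SLO is just a target the measured SLI has to reach."""
    return sli >= target


# Made-up window of 100 requests: 96 fast successes, 3 slow successes, 1 error.
window = (
    [Request(200, 0.05)] * 96
    + [Request(200, 0.25)] * 3
    + [Request(500, 0.02)]
)

print("availability SLI:", availability_sli(window))
print("latency SLI:", latency_sli(window))
print("latency SLO (95% under 0.1s) met:", meets_slo(latency_sli(window), 0.95))
print("availability SLO (hypothetical 99.9%) met:", meets_slo(availability_sli(window), 0.999))
```

Whether a fast error should count as a "good" latency event is itself a definitional choice; here it does, which is exactly the kind of detail worth arguing about when picking indicators.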