Client asks for 100% uptime

164 points by splattne, over 13 years ago

31 comments

oconnore, over 13 years ago

I don't understand what the issue is. The client wants you to plan for disaster, and they aren't math oriented, so asking for 100% probability sounds reasonable. The engineer, as engineers are prone to do, remembered his first day of prob&stat 101, without considering that the client might not.

When they say this, they aren't thinking about nuclear winter, they are thinking about Fred dumping his coffee on the office server, a disk crashing, or an ISP going down.

Furthermore, you can accomplish this. With geographically distinct, independent, self-monitoring servers, you will basically have no downtime. With 3 servers operating at an independent[1] three-nines reliability, with good failover modes, your expected downtime is under a second per year [2]. Even if this happens all at once, you are still within a reasonable SLA for web connections, and therefore the downtime practically does not exist.

The client still has to deal with doomsday scenarios, but Godzilla excluded, he will have a service that is "always" up.

[1] A server in LA is reasonably independent from the server in Boston, but yes, I understand that there is some intersection involving nuclear war, Chinese hackers crashing the power grid, etc. I don't think your client will be upset by this.

[2] DNS failover may add a few seconds. You are still in a scenario where the client has to retry a request once a year, which is, again, within a reasonable SLA, and not typically considered in the same vein as "downtime". With an application that automatically reroutes to an available node on failure, this can be unnoticeable.
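The "under a second per year" figure can be checked with a few lines of arithmetic. This is only a sketch: the function name is made up for illustration, and it assumes fully independent failures and instantaneous failover, which is the idealized case described above.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000, ignoring leap years


def expected_downtime_seconds(availability: float, replicas: int) -> float:
    """Expected seconds per year during which every replica is down at once.

    Assumes failures are fully independent and failover is instantaneous.
    """
    unavailability = 1.0 - availability
    return (unavailability ** replicas) * SECONDS_PER_YEAR


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, "replica(s):", expected_downtime_seconds(0.999, n), "s/year")
```

With three replicas at 99.9% each, the result is roughly 0.03 seconds per year, while a single server at 99.9% is down for about 8.76 hours. The estimate is only as good as the independence assumption.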

sunchild, over 13 years ago

100% uptime is not an operational requirement – it's contractual. A client that demands 100% uptime isn't being unreasonable; they're looking for a contract remedy (most likely a termination right) if/when the site goes down.

1. "Uptime" is defined in many, many ways. In the OP's article, it's the definition of uptime that seems unreasonable. Normally, the demarc points for the network segments and equipment being measured for uptime are entirely within the provider's control. In the OP article's update, the client clarified that 100% uptime only applies when hosting is cut over to the provider's site – something they are (theoretically) capable of controlling.

2. Remedies for failing the uptime requirement are different for nearly every agreement. Often SLA credits are the exclusive remedy. Sometimes the customer has a termination right (either express, or through the termination for cause provision). The remedy is probably more important than the uptime percentage.

You'd be surprised how many big name web apps offer 100% uptime as a matter of contract, knowing that it's a near-impossible operational goal. It's a matter of taking on the risk of your customer leaving you or claiming SLA credits, or whatever remedies you agree upon.

msy, over 13 years ago

It's always helpful when clients give unambiguous signs of unreasonable insanity upfront instead of hiding it until you're halfway through the project. It makes running away as far and fast as humanly possible so much easier.

runako, over 13 years ago

All the posters are stuck on the fact that 100% availability is impossible. But why not instead try to learn from others who offer 100% availability, like Rackspace and SoftLayer? These (legitimate) providers know 100% availability is not possible, but they guarantee it anyway. How can they get away with this? Easy, they have a contractual SLA that indicates what their clients are entitled to when their network fails for any period of time. Further, neither is a low-cost provider, which allows them to engineer their systems to reduce the number of incidents in which clients invoke the SLA.

Note that this doesn't mean that Rackspace is shady because they promise 100% knowing they can't deliver it. After all, they put their money where their mouth is! They have an incentive to actually achieve 100% uptime. I'm sure there are other applications where the target is 100% (not 5 nines) availability, especially in finance, medicine, and militaries.

My recommendation would be to take your engineering hat off, replace it with a business hat, and provide them with a series of price quotes for various uptime SLAs. And then make sure you're pricing high enough that when something goes down for any period of time, you can make good on your obligations under the SLA without losing too much sleep. Then let the client choose the SLA that matches their business needs and budget.

joelhaasnoot, over 13 years ago

Another option not mentioned in the thread is to accept it, and pay any of the fines associated with not meeting it. This happens all the time in public tenders and contracts, where the fines are calculated into the business risk. It does mean that the organization needs to set the right fines to make that unfeasible.

Duff, over 13 years ago

Their craziness doesn't matter. Usually crazy customers aren't rich. So if you build to their craziness, you'll lose the customer.

You need to build an appropriate infrastructure that will win the bid, figure out what you can achieve (99.9%/99.99% uptime) and build in enough overhead to cover your SLA penalties. Or negotiate a monitoring methodology that is in your favor (i.e., exclude planned maintenance windows, use a monitoring threshold/interval to allow you to address issues before triggering contract "downtime", exclude external provider issues, etc.).
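To make the effect of such a monitoring methodology concrete, here is a minimal sketch of how contract "downtime" might be counted. Everything in it is hypothetical: the `Interval` type, the `contract_downtime` function and the rule that an outage only counts once it exceeds a grace threshold are illustrative assumptions, not terms from any real SLA. It also assumes maintenance windows do not overlap each other.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Interval:
    start: float  # seconds since the start of the billing period
    end: float


def overlap(a: Interval, b: Interval) -> float:
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a.end, b.end) - max(a.start, b.start))


def contract_downtime(outages: Iterable[Interval],
                      maintenance: Iterable[Interval],
                      grace_seconds: float) -> float:
    """Downtime that counts against the SLA under the assumed rules.

    Time inside a planned maintenance window is excluded, and an outage
    only counts if what remains is longer than the grace threshold.
    """
    windows = list(maintenance)
    total = 0.0
    for outage in outages:
        excluded = sum(overlap(outage, w) for w in windows)
        counted = (outage.end - outage.start) - excluded
        if counted > grace_seconds:
            total += counted
    return total
```

For example, a 2-minute blip plus a 60-minute outage of which 50 minutes fall inside a maintenance window yields 600 seconds of contract downtime with a 5-minute grace threshold, even though the service was actually unavailable for 62 minutes.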

rwmj, over 13 years ago

Come on, this is possible.

First we're going to have to get the governments of the world together to agree to remove all nuclear weapons. Second would be getting that asteroid tracking and deflection system working. Quantum physics does unfortunately predict that the earth might flick out of existence with some small probability, but by distributing the website across the universe we can reduce this probability arbitrarily (and numbers approaching p=0.9999.. are the same as p=1). The client is going to need to budget for this.

hmottestad, over 13 years ago

So 100% uptime is really difficult to achieve, hardware wise. Software wise you'll have to prove that there are no bugs in the system that might bring it down. That is much, much harder.

You can have 100 servers in 100 different countries and have the client automatically change to another server if the one they are connected to goes down. But that won't help if there is a software bug that crashes all your clients on start-up, or worse, crashes all your servers (think what happened to Skype not long ago).

Also, never underestimate bugs in hardware (Pentium 1). You'll need multiple locations, multiple hardware, multiple operating systems, multiple compilers, multiple versions of the software.... Standardizing on one of these components may bring down your entire system!

illumin8, over 13 years ago

Look at F5 Networks' Global Traffic Manager. It's really just a fancy DNS server. You set your TTL (time to live) down to just a few seconds and it monitors your main and standby sites. If one of the sites goes down it changes your A records to point to the new site. It can even do load balancing across sites based on response time or number of connections.

They are expensive, but this is how large companies like Yahoo keep close to 100% uptime.

Explain to the customer that even with a hot site, the failover can take a few minutes. Also, some ISPs don't honor TTL and cache DNS queries for longer than they should. The Internet isn't perfect, and usually each extra 9 you add is around 4x more cost.
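The "each extra 9 is around 4x more cost" rule of thumb is easy to tabulate next to the downtime each level permits. The sketch below takes the 4x factor from the comment above as given; the baseline of two nines and the function names are assumptions chosen purely for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def allowed_downtime_minutes(nines: int) -> float:
    """Minutes of downtime per year permitted at the given number of nines."""
    return MINUTES_PER_YEAR * 10.0 ** (-nines)


def relative_cost(nines: int, baseline_nines: int = 2, factor: float = 4.0) -> float:
    """Cost relative to the baseline, multiplying by `factor` per extra nine."""
    return factor ** (nines - baseline_nines)


if __name__ == "__main__":
    for n in range(2, 6):
        print(f"{n} nines: {allowed_downtime_minutes(n):10.2f} min/year, "
              f"{relative_cost(n):5.0f}x cost")
```

Under these assumptions, going from 99% to 99.999% cuts the yearly downtime budget from about 5,256 minutes to about 5 minutes while multiplying the cost by 64.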

_corbett, over 13 years ago

I was at an Akamai presentation the other day in which the salesperson claimed "100% reliability" of their services "not 99%, not 99.9% but 100%".

JoachimSchipper, over 13 years ago

Seems reasonable, actually. "100%" is obviously not going to be achievable, but "external users should be ok if our office network fails" is not necessarily a bad requirement. There are lots of things that may make this client happy: a VPN to an "internal" server in an external data center, synchronous replication, etc.

patrickgzill, over 13 years ago

Level3 offers 100% uptime in their SLA. All that means is that if the network goes down you get some money back.

dmbaggett, over 13 years ago

Look at it from a business, rather than engineering perspective. Forget the achievability of the 100% target for a moment -- what target can you realistically achieve? Then, what does the contract say the remedy is for breach? As long as the remedy is not huge and -- this is very important -- is clearly quantifiable, it may be just fine to enter into such an agreement.

You need the remedy to be clearly quantifiable (X dollars per Y minutes, for example) because otherwise you create an opportunity for dispute when the inevitable occurs and you breach. Resolving such a dispute could very well cost more than the remedy itself, even in the worst case.

From an ethical standpoint, I would only enter into such an agreement with an understanding that "while we agree that it makes sense for you to request that target, we think realistically that we'll be closer to 99.9% (or whatever you truly believe)". Entering into an agreement with a 100% uptime clause is different from setting an expectation that uptime will actually be 100%.
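A remedy of the "X dollars per Y minutes" kind is simple enough to write down as a function, which is part of what makes it hard to dispute. The sketch below is illustrative only: the function name, the choice to round partial blocks up and the optional cap are assumptions that a real contract would have to spell out.

```python
import math
from typing import Optional


def sla_penalty(downtime_minutes: float,
                dollars_per_block: float,
                block_minutes: float,
                cap: Optional[float] = None) -> float:
    """Penalty owed for a period: X dollars per started block of Y minutes.

    Partial blocks are rounded up (an assumption), and the total is
    limited to `cap` when one is given.
    """
    if downtime_minutes <= 0:
        return 0.0
    blocks = math.ceil(downtime_minutes / block_minutes)
    penalty = blocks * dollars_per_block
    return min(penalty, cap) if cap is not None else penalty
```

For instance, at $100 per 15 minutes, a 16-minute outage costs $200, and a 10-hour outage costs $4,000 unless a cap applies.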

DanBC, over 13 years ago

Helping your customers understand what they actually want to buy is part of selling, surely? Things are made trickier by PHBs in the client company claiming that everything is mission critical and that they can never ever have any downtime ever for any reason. Educating these people about, for example, just how flaky email and DNS can be is important for your sanity.

See, for example, these couple of posts from a Microsoft public newsgroup ten years ago:

https://groups.google.com/group/microsoft.public.backoffice.smallbiz2000/msg/4bb5462eaa89b3b8?hl=en&dmode=source

https://groups.google.com/group/microsoft.public.backoffice.smallbiz2000/browse_thread/thread/90b8b85319f9b626/3e47ac766d184e4c?hl=en&lnk=gst&q=mission+critical#3e47ac766d184e4c

Some customers are clueless, but at least they care about the data.

blrgeek, over 13 years ago

Looks like the client is asking for off-site failover, not really 100% uptime, and the OP doesn't know how to achieve it over a WAN. Especially if this is a real enterprise customer, they want Disaster Recovery (DR).

This is a solved problem, albeit not a commonly known solution. Any of F5, Radware, and other expensive boxes can do this. This can also be done through DNS or with HAProxy etc.

smoyer, over 13 years ago

Offer them a 100% up-time guarantee for a year if they also promise to avoid being sick for the entire next year. If they can't avoid succumbing to a virus, why would they expect your service to avoid it (or any other sort of bug)?

nodata, over 13 years ago

99.5% uptime is 100% uptime to 0 decimal places.

Joakal, over 13 years ago

100% uptime is unrealistic for big companies because at scale, it costs a lot. For example, replicating is expensive with transmission, storage and maintenance costs.

When the Amazon incident happened, I did an analysis and found that the cost about triples if the data is stored in an external data centre. It is almost 6x the cost if hosted overseas, even with the same company.

I then understood why companies like Reddit do not aim for 100% uptime beyond a single data centre. How much the customer (or client) is willing to pay determines the uptime aim (I think Reddit's aim, for example, is 90% at least).

yuliyp, over 13 years ago

Let's say I wanted 100% reliable music listening. To do this, I buy a million of the original 30GB Zune media players and create a perfect failover system, so that if the sound from one of those stops for whatever reason (hardware, software, cosmic rays, etc.), it'll switch to another one. I even move these Zunes all across the world, with AC provided, and multiple network links linking all of them, with satellite link backups between them.

Then December 31, 2008 rolls around, and a tiny firmware bug knocks out all of them simultaneously for 24 hours. Oops.

Not all failures are independent events.
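The point about independence can be put in numbers with a deliberately crude model: each replica fails on its own with some probability, and separately there is a common-mode failure (a shared firmware bug, say) that takes out every replica at once. The function name and the model are illustrative assumptions, not a standard reliability formula from any particular library.

```python
def prob_all_down(p_independent: float, p_common: float, n: int) -> float:
    """Probability that all n replicas are down at a given moment.

    Either the common-mode failure is active (probability p_common), or it
    is not and all n replicas happen to have failed independently.
    """
    return p_common + (1.0 - p_common) * p_independent ** n


if __name__ == "__main__":
    one_day_per_year = 1.0 / 365.0
    print("3 replicas, no shared bug:      ", prob_all_down(0.001, 0.0, 3))
    print("1,000,000 replicas, shared bug: ", prob_all_down(0.001, one_day_per_year, 10**6))
```

Once a common-mode failure exists, adding replicas stops helping: the total unavailability can never drop below `p_common`, no matter how large `n` gets.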

babebridou, over 13 years ago

Would a heavyweight client with nothing but static data and no network at all reach 100% uptime, from a contractual point of view? Even a wrist watch does not exactly guarantee 100% uptime.

pstoneman, over 13 years ago

I can't post on Server Fault, since the question's been locked, so I'll put useful things to consider here:

* 100% SLA doesn't always mean 'It has to be up all the time'. Depending on the customer or the supplier, it can mean 'We'll aim to have it up all the time, but if it's not, we'll pay you compensation according to a predefined scale'. Clearly, in this case, you need to define quite firmly what 'up' and 'down' mean, how you measure them, how you time them, and how you decide what compensation to pay.

* DNS failover or load balancing is often nearly good enough. It won't get you instantaneous failover, since you'll need to have a finite (albeit small) TTL, and some client stub resolver libraries cache stuff anyway in violation of the TTL. But it's an easy step on the way.

* If you want true 100% uptime, ultimately, you need a single IP (or range of IPs) which will be permanently reachable. That pretty much means the IPs need to come from one AS number - in other words, one ISP or one company.

* You can choose an ISP or company which has multiple internet connections, peers with a lot of people in multiple locations, and has a well-designed network such that you feel confident they won't go offline. Amazon may be a good example, but they've had several recent high profile failures!

* You could do it yourself - in which case, you'd need to become an ISP, get your own AS number, and set up peering arrangements with multiple suppliers in multiple locations. This can be very costly, and you still have to run a network and servers yourself in a reliable way.

* You might be able to find a supplier who peers in multiple locations, and anycasts their protected IPs within their AS. That way, the same IP comes from multiple locations and should be reliable. Akamai might do something similar to this, I think.

* Ultimately, however you do it, you'll have a very difficult time making it impossible for it to fail. You're into the game of making it exponentially less and less likely that it'll fail, but you can't eliminate all risk. At the end of the day, your contract with your customer needs to define what happens if you should fail to reach 100% uptime. Is it breach of contract? Or do you need to pay a penalty fee? In either case, however you host it, you ideally need to make sure your suppliers' compensation to you if they have a failure will cover the losses you incur.

mathattack, over 13 years ago

The issue is one of expectations and education.

Many clients ask for the following without knowing better:

- 100 pct uptime
- Zero defects
- Zero scope changes
- Zero perceptible latency in all cases

It is up to the professional specialist to educate the client in terms they understand. Only after explaining in terms they understand can you call them unreasonable. If the client understands and is still unreasonable, the true professional has the obligation to walk away.

ck2, over 13 years ago

DNS round-robin with mirror servers that run 24/7.

droithomme, over 13 years ago

You can do this, but I suspect they might not like the cost quote of $10 trillion per year, and the 10 years of lead time it will take you to build a worldwide network of secure underground bunkers with their own air, water, food and energy supplies, and a hardened shadow internet that duplicates the function of the existing internet.

samuel, over 13 years ago

I would answer that price is a function of uptime given by the following formula:

Price(uptime) = K * 1/(1 - uptime)

so I would ask for infinite dollars...

gte910h, over 13 years ago

This is not a technical issue, this is a contractual issue. You want to sign a contract that promises reasonable (far less than 100%) reliability and pays reasonable amounts for every minute of downtime.

Then you have the impetus to minimize (but not eliminate) those minutes to a reasonable level.

bsiemon, over 13 years ago

It seems as though someone read half an article about 99.999% uptime and decided to be clever.

Toenex, over 13 years ago

I would agree to this but charge them infinite money.

rjurney, over 13 years ago

Start looking at Erlang-driven telephone systems in Europe with 100% up-time, and what it took to build them.

capdiz, over 13 years ago

I liked the dude who said "I wish I could downvote your client".

kahawe, over 13 years ago

I see several possible approaches, if you really want to have that client.

The easiest would be to just talk to them, try to find out what that "100%" is actually REALLY all about and make them understand that from a technical point of view, 100% will add a lot of things to the project budget. A "100%" demand in a smaller project for a typical small-to-medium business will likely mean something different than "100%" in a project for the NYSE. So, talk to the customer and find out what it is actually all about and then plan and quote according to their actual needs. This makes it more a requirements-engineering type of problem, not necessarily a hacker problem.

Or you just say "yes, of course" and tell them how super reliable the system is and then let the guys in legal work it out in the fine print and cover your ass.. but don't expect much happiness and continued business from that client once they find out what's going on.

But, in a more honest approach, maybe this is actually all they really want and need? Maybe it actually is enough for them to have someone to blame and pay some penalties for violating SLAs. Again, you need to find that out.

Not a typical hacker-hacker-problem but surely an issue a hacker would typically encounter, even on a daily basis, and should learn to deal with.