
AWS vs. GCP reliability is wildly different

545 points by icyfox over 2 years ago

43 comments

rwiggins over 2 years ago
There were 84 errors for GCP, but the breakdown says 74 409s and 5 timeouts. Maybe it was 79 409s? Or 10 timeouts?

I suspect the 409 conflicts are probably from the instance name not being unique in the test. It looks like the instance name used was:

    instance_name = f"gpu-test-{int(time())}"

which has 1-second precision. The test harness appears to do a `sleep(1)` between test creations, but this sort of thing can have weird boundary cases, particularly because (1) it does cleanup after creation, which will have variable latency, (2) `int()` will truncate the fractional part of the second from `time()`, and (3) `time.time()` is not monotonic.

I would not ask the author to spend money to test it again, but I think the 409s would probably disappear if you replaced `int(time())` with `uuid.uuid4()`.

Disclosure: I work at Google - on Google Compute Engine. :-)
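The collision the comment describes is easy to reproduce. A small sketch (the `gpu-test-` prefix comes from the comment; the fixed timestamp is invented for determinism): two creations within the same wall-clock second truncate to the same name, while UUID-based names are effectively collision-free.

```python
import uuid

# Two instance creations 400 ms apart, landing in the same wall-clock second.
# (Fixed timestamp chosen for determinism; the harness would use time.time().)
t = 1663900000.2
name_a = f"gpu-test-{int(t)}"
name_b = f"gpu-test-{int(t + 0.4)}"  # int() truncates the fractional second
assert name_a == name_b  # duplicate name -> the create call gets a 409 Conflict

# uuid4() draws a fresh 122-bit random identifier per call, so collisions
# are negligible and the 409s should disappear.
name_c = f"gpu-test-{uuid.uuid4()}"
name_d = f"gpu-test-{uuid.uuid4()}"
assert name_c != name_d
```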
dark-star over 2 years ago
I wonder why someone would equate "instance launch time" with "reliability"... I won't go as far as calling it "clickbait", but wouldn't some other noun ("startup performance is wildly different") have made more sense?
lacker over 2 years ago
Anecdotally I tend to agree with the author. But this really isn't a great way of comparing cloud services.

The fundamental problem with cloud reliability is that it depends on a lot of stuff that's out of your control and that you have no visibility into. I have had services running happily on AWS with no errors, and the next month, without changing anything, they fail all the time.

Why? Well, we look into it and it turns out AWS changed something behind the scenes. There's different underlying hardware behind the instance, or some resource started being in high demand because of other customers.

So, I completely believe that at the time of this test, this particular API was performing a lot better on AWS than on GCP. But I wouldn't count on it still performing this way a month later. Cloud services aren't like a piece of dedicated hardware, where you test it one month and the next month it behaves roughly the same. They are changing a lot of stuff that you can't see.
remus over 2 years ago
> The offerings between the two cloud vendors are also not the same, which might relate to their differing response times. GCP allows you to attach a GPU to an arbitrary VM as a hardware accelerator - you can separately configure the quantity of CPUs as needed. AWS only provisions defined VMs that have GPUs attached - the g4dn.x series of hardware here. Each of these instances is fixed in its CPU allocation, so if you want one particular varietal of GPU you are stuck with the associated CPU configuration.

At a surface level, the above (from the article) seems like a pretty straightforward explanation? GCP gives you more flexibility in configuring GPU instances at the cost of increased startup-time variability.
politelemon over 2 years ago
A few weeks ago I needed to change the volume type on an EC2 instance to gp3. Following the instructions, the change happened while the instance was running. I didn't need to reboot or stop the instance; it just changed the type. While the instance was running.

I didn't understand how they were able to do this; I had thought volume types mapped to hardware clusters of some kind. And since I didn't understand it, I wasn't able to distinguish it from magic.
lomkju over 2 years ago
Having been a high-scale AWS user with a bill of $1M+/month, and having now worked for two years at a company that uses GCP, I would say AWS is superior and way ahead.

** NOTE: If you're a low-scale company this won't matter to you **

1. GKE

When you cross a certain scale, certain GKE components won't scale with you, and the SLOs on those components are crazy: it takes 15+ minutes for us to update a GKE-ingress-controller-backed Ingress.

Cloud Logging hasn't been able to keep up with our scale; it has been disabled for two years now. Last quarter we got an email from them asking us to enable it and try it again on our clusters. We still have to verify their claims, as our scale is even higher now.

The Konnectivity agent release was really bad for us; it affected some components internally, and we lost more than 3 months of dev time debugging the issue. They had to disable the Konnectivity agent on our clusters. I had to collect TCP dumps and other evidence just to prove nothing was wrong on our end, and fight with our TAM to get a meeting with the product team. After 4 months they agreed and reverted our clusters to SSH tunnels. Initially GCP support said they couldn't do this. Next quarter I'll be updating the clusters; hopefully they have fixed this by then.

2. Support

I think AWS support was always more proactive in debugging with us; GCP support agents most of the time lack the expertise or proactiveness to debug and solve things even in simple cases. We pay for enterprise support and don't see ourselves getting much from it. At AWS we had reviews every two quarters of how we could improve our infra; we got new suggestions, and it was also when we shared what we would like to see on their roadmap.

3. Enterprisyness is missing from the design

Something as simple as Cloud Build doesn't have access to static IPs. We have to maintain a forward proxy just because of this.

L4 LBs were a mess: for a TCP-proxy-based load balancer, only a specified set of ports was allowed - [25, 43, 110, 143, 195, 443, 465, 587, 700, 993, 995, 1883, 3389, 5222, 5432, 5671, 5672, 5900, 5901, 6379, 8085, 8099, 9092, 9200, and 9300]. Today I see they have removed these restrictions. I don't know who came up with the idea of allowing only a few ports on an L4 LB. I think such design decisions make it less enterprisy.
outworlder over 2 years ago
Unclear what the article has to do with reliability. Yes, spinning up machines on GCP is incredibly fast and always has been. AWS is decent. Azure feels like I'm starting a Boeing 747 instead of a VM.

However, there's one aspect where GCP is a *clear* winner on the reliability front. They auto-migrate instances transparently and with close to zero impact on workloads - I want to say zero impact, but it's not technically zero.

In comparison, in AWS you need to stop/start your instance yourself so that it will move to another hypervisor (depending on the actual issue, AWS may do it for you). That definitely has an impact on your workloads. We can sometimes architect around it, but there's still something to worry about. Given the number of instances we run, we have multiple machines to deal with weekly. We get all these 'scheduled maintenance' events (which sometimes aren't really all that scheduled), with some instance IDs (they don't even bother sending the name tag), and we have to deal with that.

I already thought stop/start was an improvement on the tech of the time (OpenStack, for example, or even VMware), just because we don't have to think about hypervisors; we don't have to know, we don't care. We don't have to ask for migrations to be performed; hypervisors are pretty much stateless.

On GCP, however? We have had to stop/start instances exactly zero times, out of the thousands we run and have been running for years. We can see auto-migration events when we bother checking the logs. Otherwise, we don't even notice the migration happened.

It's pretty old tech too:

https://cloudplatform.googleblog.com/2015/03/Google-Compute-Engine-uses-Live-Migration-technology-to-service-infrastructure-without-application-downtime.html
user- over 2 years ago
I wouldn't call this reliability, which already has a loaded definition in the cloud world; I'd call it something along the lines of time-to-start, or latency.
0xbadcafebee over 2 years ago
Reliability in general is measured on the basic principle of: *does it function within our defined expectations?* As long as it's launching, and it eventually responds within SLA/SLO limits, and on failure comes back within SLA/SLO limits, it is reliable. Even with GCP's multiple failures to launch, that may still be considered "reliable" within their SLA.

If both AWS and GCP had the same SLA, and one did better than the other at starting up, you could say one is *more performant* than the other, but you couldn't say it's *more reliable* if they are both meeting the SLA. It's easy to look at something that never goes down and say "that is more reliable", but it might have been pure chance that it never went down. Always read the fine print, and don't expect anything better than what they guarantee.
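A toy illustration of this point (the SLA threshold and all counts below are invented, not taken from the article): two providers with very different failure counts can both satisfy the same launch-success SLA, and by that yardstick both are equally "reliable".

```python
def meets_sla(successes: int, attempts: int, sla: float) -> bool:
    """True if the observed success rate satisfies the promised SLA."""
    return successes / attempts >= sla

SLA = 0.95  # hypothetical launch-success guarantee

aws_ok = meets_sla(successes=1000, attempts=1000, sla=SLA)  # zero failures
gcp_ok = meets_sla(successes=960, attempts=1000, sla=SLA)   # 40 failed launches

# One provider launches more successfully, i.e. is more *performant* -
# but both meet the contract, so neither is more *reliable* by SLA terms.
assert aws_ok and gcp_ok
```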
zmmmmm over 2 years ago
> In total it scaled up about 3,000 T4 GPUs per platform

> why I burned $150 on GPUs

How do you rent 3,000 GPUs over a period of weeks for $150? Were they literally requisitioning each one and releasing it immediately? This seems like quite an unrealistic usage pattern, and the results would depend a lot on whether the cloud provider optimizes to hand you back the same warm instance you just relinquished.

> GCP allows you to attach a GPU to an arbitrary VM as a hardware accelerator

It's quite fascinating that GCP can do this. GPUs are physical things (!) - do they provision every single instance type in the data center with GPUs? That would seem very expensive.
orf over 2 years ago
AWS has different pools of EC2 instances depending on the customer, the size of the account, and any reservations you may have.

Spawning a single GPU at varying times is nothing. Try spawning more than one, or using spot instances, and you'll get a very different picture. We often run into capacity issues with GPU and even the new m6i instances at all times of the day.

Very few realistic company-size workloads need a single GPU. I would willingly wait 30 minutes for my instances to become available if it meant *all* of them were available at the same time.
playingalong over 2 years ago
This is great.

I have always felt there is so little independent content benchmarking the IaaS providers. There is so much you could measure about how they behave.
kccqzy over 2 years ago
Heard from a Googler that the internal infrastructure (Borg) is simply not optimized for quick startup. Launching a new Borg job often takes multiple minutes before the job runs. Not surprising at all.
devxpy over 2 years ago
Is this testing for spot instances?

In my limited experience, persistent (on-demand) GCP instances always boot up much faster than AWS EC2 instances.
ajross over 2 years ago
Worth pointing out that the article is measuring provisioning latency and success rates (how quickly you can get a GPU box running, and whether or not you get an error back from the API when you try), not "reliability" as most readers would understand it (how likely the machines are to do what you want them to do without failure).

Definitely seems like interesting info, though.
jupp0r over 2 years ago
That's interesting, but not what I expected when I read "reliability". I would have expected SLO metrics like uptime of the network, or similar metrics that users care about more. Usually, when scaling a well-built system, you don't have hard constraints on how quickly an instance needs to be spun up. Being unable to spin up any instances can be problematic, of course. Ideally this is all automated, so nobody would care much whether it takes a retry or 30 seconds longer to create an instance. If this is important to you, you have other problems.
vienarr over 2 years ago
The article only talks about GPU start time, but the title is "CloudA vs. CloudB reliability".

Bit of a stretch, right?
runeks over 2 years ago
> These differences are so extreme they made me double check the process. Are the "states" of completion different between the two clouds? Is an AWS "Ready" premature compared to GCP? It anecdotally appears not; I was able to ssh into an instance right after AWS became ready, and it took as long as GCP indicated before I was able to login to one of theirs.

This is a good point and should be part of the test: after launching, SSH into the machine and run a trivial task to confirm that the hardware works.
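A sketch of what that verification step could look like: instead of trusting the provider's "Ready" state, poll until the instance actually accepts connections. The `probe` hook is purely illustrative; in a real harness it would SSH in and run a trivial command (e.g. `nvidia-smi`) to confirm the hardware works.

```python
import socket
import time

def wait_for_ready(host: str, port: int = 22, timeout: float = 300.0,
                   probe=None) -> bool:
    """Poll until the instance is genuinely reachable, or give up.

    By default this only checks that the SSH port accepts TCP connections;
    pass a `probe` callable for a stronger "the hardware actually works"
    check. Returns True on success, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if probe is not None:
                if probe():
                    return True
            else:
                with socket.create_connection((host, port), timeout=5):
                    return True
        except OSError:
            pass  # not reachable yet; retry below
        time.sleep(2)
    return False
```

Timing `wait_for_ready()` from the moment each API reports "Ready" would quantify how premature (or not) each cloud's readiness signal is.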
Animats over 2 years ago
> GCP allows you to attach a GPU to an arbitrary VM as a hardware accelerator - you can separately configure quantity of the CPUs as needed.

That would seem to indicate that asking for a VM on GCP gets you a minimally configured VM on basic hardware, and then it gets migrated to something bigger if you ask for more resources. Is that correct?

That could make sense if, much of the time, users get a VM and spend a lot of time loading and initializing stuff, then migrate to bigger hardware to crunch.
humanfromearth over 2 years ago
We have constant autoscaling issues because of this in GCP - glad someone plotted it. I hope people at GCP will pay a bit more attention to this. Thanks to the OP!
1-6 over 2 years ago
This is all about cloud GPUs, I was expecting something totally different from the title.
TheMagicHorsey over 2 years ago
This is not reliability. This is a measure of how much spare capacity AWS leaves idle for you to snatch on demand.

This is going to vary a lot based on the time of year. Try the same experiment around a period of heavy retail sales activity (Black Friday) and watch AWS suddenly have much less capacity to dole out on demand.

To me, reliability is a measure of what a cloud does compared to what it says it will do. GCP is not promising you on-demand instances instantaneously, is it? If you want that... reserve capacity.
londons_explore over 2 years ago
AWS normally has machines sitting idle, just waiting for you to use them. That's why they can get you going in a couple of seconds.

GCP, on the other hand, fills all machines with background jobs. When you want a machine, they need to terminate a background job to make room for you. That background job has a shutdown grace time; usually that's 30 seconds.

Sometimes, to prevent fragmentation, they actually need to shuffle many other users around to give you the perfect slot - and some of those jobs have start-new-before-stop-old semantics - which is why the delay is sometimes far higher.
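This description suggests a simple mental model. The sketch below is a toy simulation, not either cloud's actual scheduler; the 30-second grace period is taken from the comment, and the base latency and eviction counts are invented:

```python
import random

GRACE_SECONDS = 30  # assumed shutdown grace time for an evicted background job

def launch_latency(warm_pool: bool) -> float:
    """Toy model of VM launch latency under the two allocation strategies."""
    base = random.uniform(5, 15)  # boot, image load, network setup, ...
    if warm_pool:
        return base  # an idle machine is already waiting
    # Otherwise one or more background jobs must be evicted first, each
    # granted a grace period; occasionally several jobs get shuffled
    # around to defragment capacity, stacking up more delay.
    evictions = random.choice([1, 1, 1, 1, 2, 3])
    return base + evictions * GRACE_SECONDS
```

Under this model the warm-pool provider always lands in the 5-15 second range, while the bin-packing provider starts around 35 seconds with a long tail - roughly the latency shapes the comment describes.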
mnutt over 2 years ago
It may or may not matter for various use cases, but the EC2 instances in the test use EBS, and the AMIs are lazily loaded from S3 on boot. So it's possible that the boot process touches few files and quickly reaches the 'ready' state, but you may have crummy performance for a while in some cases.

I haven't used GCP much, but maybe they load the image onto the node prior to launch, accounting for some of the launch-time difference?
PigiVinci83 over 2 years ago
Thank you for this article; it confirms my direct experience. I've never run a benchmarking test, but I can see this every day.
lmeyerov over 2 years ago
I'd say that's a weak test of capacity. Would love to see this on Azure - T4s (or an equivalent) aren't even really offered anymore!

We find reliability to be a different story. E.g., our main source of downtime on Azure is that they restart (live migrate?) our reserved T4s every few weeks, causing 2-10 minute outages per GPU per month.
cmcconomy over 2 years ago
I wish Azure was here to round it out!
DonHopkins over 2 years ago
Does anybody know if, on GCP, the cheaper ephemeral spot instances are available with managed instance groups and Cloud Run, where more instances are spun up according to demand? If so, how well does it deal with replacing spot instances that drop dead? How about AWS?
danielmarkbruce over 2 years ago
It's meant to say "ephemeral"... right? It's hard to read after that.
herpderperator over 2 years ago
The author is using 'Quantile', which I hadn't heard of before; when I looked it up, it seems it should actually be 'Percentile'. Percentiles are the percentages, which is what the author is referring to.
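The two terms are actually interchangeable up to scale: the q-th quantile (q in [0, 1]) is just the (100·q)-th percentile. A minimal linear-interpolation implementation over pre-sorted data (the `launch_times` sample is invented) makes the equivalence concrete:

```python
def quantile(sorted_vals, q):
    """q-th quantile (0 <= q <= 1) of pre-sorted data, linear interpolation."""
    idx = q * (len(sorted_vals) - 1)   # fractional position in the sample
    lo = int(idx)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = idx - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac

def percentile(sorted_vals, p):
    """p-th percentile (0 <= p <= 100): the same quantile, rescaled."""
    return quantile(sorted_vals, p / 100.0)

launch_times = sorted(range(1, 101))  # stand-in data
assert quantile(launch_times, 0.95) == percentile(launch_times, 95)
```

So "95th-percentile latency" and "0.95 quantile" name the same number; the author's usage is unusual in monitoring circles but not wrong.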
MonkeyMalarky over 2 years ago
I would love to see the same for deploying things like a cloud/lambda function.
johndfsgdgdfg over 2 years ago
It's not surprising. Amazon is an amazingly customer-focused company. Google is a spyware company that only wants to make more money by invading our privacy. Of course Amazon's products will be better than Google's.
kazinator over 2 years ago
> *This is particularly true for GPUs, which are uniquely squeezed by COVID shutdowns, POW mining, and growing deep learning models*

Is the POW mining part still true? Hasn't mining moved to dedicated hardware?
AtNightWeCode over 2 years ago
This benchmark (too) is probably incorrect. It produces 409s, so there are errors in there that I doubt are caused by GCP.
s-xyz over 2 years ago
Would be interested to see a comparison of Lambda functions vs. Google 2nd-gen functions. I think GCP is more serverless-focused.
endisneigh over 2 years ago
This doesn't really seem like a fair comparison, nor is it a measure of "reliability".
Jamie9912 over 2 years ago
Should probably change the title to "AWS vs GCP on-demand GPU launch time consistency"
charbull over 2 years ago
Can you put this in the context of the problem / use case / need you are solving for?
rwalle over 2 years ago
Looks like the author has never heard of the word "histogram".

That graph is a pain to look at.
dekhn over 2 years ago
What would you expect? AWS is an org dedicated to giving customers what they want and charging them for it, while GCP is an org dedicated to telling customers what they want and using the revenue to get slightly better cost margins on Intel servers.
amaks over 2 years ago
The link is broken?
jqpabc123 over 2 years ago
Thanks for the report. It only confirms my judgment.

The word "Google" attached to anything is a strong indicator that you should look for an alternative.
duskwuff over 2 years ago
... why does the first graph show some instances as having a negative launch time? Is that meant to indicate errors, or has GCP started preemptively launching instances to anticipate requests?