
Cloud Run GPUs, now GA, makes running AI workloads easier for everyone

315 points by mariuz, 4 days ago

23 comments

ashishb 4 days ago

I love Google Cloud Run and highly recommend it as the best option[1]. Cloud Run GPU, however, is not something I can recommend. It is not cost effective (instance-based billing is expensive as opposed to request-based billing), GPU choices are limited, and the general loading/unloading of models (gigabytes) from GPU memory makes it slow to use as serverless.

Once you compare the numbers, it is better to use a VM + GPU if your service is utilized for even 30% of the day.

1 - https://ashishb.net/programming/free-deployment-of-side-projects/

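A rough sanity check of that 30% break-even claim (a minimal sketch; both hourly rates below are placeholder assumptions, not quoted prices):

```python
# Back-of-the-envelope: dedicated VM+GPU vs. serverless GPU billing.
VM_HOURLY = 0.70          # assumed $/hour for an always-on VM with one GPU
SERVERLESS_HOURLY = 2.80  # assumed effective $/hour while handling requests

def monthly_vm() -> float:
    # A dedicated VM bills around the clock, busy or not.
    return VM_HOURLY * 24 * 30

def monthly_serverless(busy_fraction: float) -> float:
    # Serverless bills (roughly) only for the busy fraction of the day.
    return SERVERLESS_HOURLY * busy_fraction * 24 * 30

for util in (0.10, 0.30, 0.50):
    print(f"{util:.0%} busy: vm=${monthly_vm():.0f}/mo, "
          f"serverless=${monthly_serverless(util):.0f}/mo")
# With these placeholder rates the break-even point is 0.70/2.80 = 25% utilization,
# so at 30% the always-on VM already comes out cheaper.
```
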
isoprophlex 4 days ago

All the cruft of a big cloud provider, AND the joy of uncapped yolo billing that has the potential to drain your credit card overnight. No thanks, I'll personally stick with Modal and vast.ai.

mythz 4 days ago

The pricing doesn't look that compelling. Here are the hourly rate comparisons vs runpod.io and vast.ai:

    1x L4 24GB:    google: $0.71;  runpod.io: $0.43,  spot: $0.22
    4x L4 24GB:    google: $4.00;  runpod.io: $1.72,  spot: $0.88
    1x A100 80GB:  google: $5.07;  runpod.io: $1.64,  spot: $0.82;  vast.ai: $0.880, spot: $0.501
    1x H100 80GB:  google: $11.06; runpod.io: $2.79,  spot: $1.65;  vast.ai: $1.535, spot: $0.473
    8x H200 141GB: google: $88.08; runpod.io: $31.92;               vast.ai: $15.470, spot: $14.563

Google's pricing also assumes you're running it 24/7 for an entire month, whereas this is just the hourly price for runpod.io or vast.ai, which both bill per second. I wasn't able to find Google's spot pricing for GPUs.

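To put those hourly gaps in monthly terms (a quick sketch using a few of the on-demand rates quoted above, with ~730 hours in a month):

```python
# Monthly cost at the on-demand hourly rates quoted in the comparison above.
HOURS_PER_MONTH = 730

rates = {  # $/hour, from the table above
    "1x L4 24GB":   {"google": 0.71,  "runpod.io": 0.43},
    "1x A100 80GB": {"google": 5.07,  "runpod.io": 1.64, "vast.ai": 0.880},
    "1x H100 80GB": {"google": 11.06, "runpod.io": 2.79, "vast.ai": 1.535},
}

for gpu, providers in rates.items():
    line = ", ".join(f"{p}: ${r * HOURS_PER_MONTH:,.0f}/mo"
                     for p, r in providers.items())
    print(f"{gpu}: {line}")
```
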
jbarrow 4 days ago

I'm personally a huge fan of Modal, and have been using their serverless scale-to-zero GPUs for a while. We've seen some nice cost reductions from using them, while also being able to scale WAY UP when needed. All with minimal development effort.

Interesting to see a big provider entering this space. We originally swapped to Modal because the big providers weren't offering this (e.g. AWS Lambdas can't run on GPU instances). I'm assuming all providers are going to start moving towards offering this?

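For context, the Modal pattern being described looks roughly like this (a minimal sketch; the app name and GPU type are illustrative, and it assumes the current modal Python SDK):

```python
# Sketch of a scale-to-zero GPU function on Modal (illustrative only).
import modal

app = modal.App("gpu-inference-demo")  # hypothetical app name

@app.function(gpu="L4", timeout=600)
def infer(prompt: str) -> str:
    # A container with a GPU spins up on demand and scales back to zero when
    # idle. A real deployment would load a model here (ideally baked into the
    # image at build time so cold starts stay short).
    return f"echo: {prompt}"

@app.local_entrypoint()
def main():
    # .remote() ships the call to Modal's infrastructure.
    print(infer.remote("hello"))
```
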
montebicyclelo 4 days ago

The reason Cloud Run is so nice compared to other providers is that it has autoscaling, with scaling to 0, meaning it can cost basically nothing when it's not being used. You can also set a cap on the scaling, e.g. 5 instances max, which caps the max cost of the service too. (Note: I only have experience with the CPU version of Cloud Run, which is very reliable / easy.)

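The cost ceiling that cap creates is easy to sanity-check (a sketch; the per-instance rate is a placeholder):

```python
# Worst-case monthly bill under an autoscaler cap (placeholder rate).
INSTANCE_HOURLY = 0.67  # assumed $/hour per GPU-enabled instance
MAX_INSTANCES = 5       # the scaling cap from the comment above
HOURS_PER_MONTH = 730

ceiling = INSTANCE_HOURLY * MAX_INSTANCES * HOURS_PER_MONTH
print(f"cap of {MAX_INSTANCES} instances => at most ${ceiling:,.0f}/month")
# Scale-to-zero puts the floor at $0; the instance cap bounds the other side.
```
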
huksley 4 days ago

A small, independent EU GPU cloud provider, DataCrunch (I am not affiliated), offers VMs with Nvidia GPUs even cheaper than RunPod, etc.:

1x A100 80GB: €1.37/hour
1x H100 80GB: €2.19/hour

gabe_monroy 4 days ago

I'm the VP/GM responsible for Cloud Run and GKE. Great to see the interest in this! Happy to answer questions on this thread.

albeebe1 4 days ago

Oh, this is great news. After a $1,000 bill from running a model on Vertex AI continuously for a little test I forgot to shut down, this will be my go-to now. I've been using Cloud Run for years, running production microservices and little hobby projects, and I've found it simple and cost effective.

lemming 4 days ago

If I understand this correctly, I should be able to stand up an API running arbitrary models (e.g. from Hugging Face), and while it's not quite charged by the token, it should be very cheap if my usage is sporadic. Is that correct? It seems pretty huge if so; most of the providers I looked at required a monthly fee to run a custom model.

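What's being described amounts to a container like the following behind Cloud Run's scale-to-zero billing (a minimal sketch; the model is illustrative, and a real image would bake the weights in at build time to keep cold starts down):

```python
# Minimal HTTP wrapper around a Hugging Face model, Cloud Run style.
# Assumes: pip install fastapi uvicorn transformers torch
import os

import uvicorn
from fastapi import FastAPI
from transformers import pipeline

app = FastAPI()
# Illustrative checkpoint; any text-generation model from the Hub works the same way.
generator = pipeline("text-generation", model="distilgpt2")

@app.post("/generate")
def generate(body: dict) -> dict:
    out = generator(body["prompt"], max_new_tokens=64)
    return {"text": out[0]["generated_text"]}

if __name__ == "__main__":
    # Cloud Run injects the port to listen on via $PORT.
    uvicorn.run(app, host="0.0.0.0", port=int(os.environ.get("PORT", 8080)))
```
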
ninetyninenine 4 days ago

I'm tired of using AI in cloud services. I want user-friendly, locally owned AI hardware.

Right now nothing is consumer friendly. I can't get a packaged deal of some locally running ChatGPT-quality UI or voice command system in an all-in-one package. Like what Macs did for PCs, I want the same for AI.

felix_tech 3 days ago

I've been using this for daily/weekly ETL tasks, which saves quite a lot of money vs having an instance on all the time, but it's been clunky.

The main issue is that despite there being a 60-minute timeout available, the API will just straight up not return a response code if your request takes > ~5 minutes in most cases, so you've got to make sure you can poll wherever the data is being stored and let the client time out.

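The workaround described, kicking off the job and then polling the storage location instead of holding the HTTP connection open, looks roughly like this (a sketch; the bucket and object names are hypothetical, and it assumes the google-cloud-storage client):

```python
# Poll Cloud Storage for the job's output instead of waiting on the response.
# Assumes: pip install google-cloud-storage
import time

from google.cloud import storage

def wait_for_result(bucket_name: str, blob_name: str,
                    poll_seconds: int = 30, max_wait: int = 3600) -> bytes:
    bucket = storage.Client().bucket(bucket_name)
    deadline = time.monotonic() + max_wait
    while time.monotonic() < deadline:
        blob = bucket.blob(blob_name)
        if blob.exists():  # the long-running task writes this object when done
            return blob.download_as_bytes()
        time.sleep(poll_seconds)
    raise TimeoutError(f"no result at gs://{bucket_name}/{blob_name} after {max_wait}s")

# Hypothetical usage: fire the ETL request with a short client timeout, ignore
# the dropped connection, then:
# data = wait_for_result("my-etl-bucket", "runs/2025-06-05/output.parquet")
```
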
jjuliano 4 days ago

I'm the developer of kdeps.com, and I really like Google Cloud Run; I've been using it since the beta. Kdeps outputs Dockerized full-stack AI agent apps that run open-source LLMs locally, and my project works very well with GCR.

m1 4 days ago

Love Cloud Run, and this looks like a great addition. The only things I wish for from Cloud Run: being able to run self-hosted GitHub runners on it (last time I checked this wasn't possible, as it requires root), and while the new worker pool feature seems great, it looks like you have to write the scaler yourself rather than it being built in.

Aeolun 4 days ago

That's 67ct/hour for a GPU-enabled instance. That's pretty good, but I have no idea how T4 GPUs compare against others.

ivape 4 days ago

Does anyone actually run a modest-sized app who can share numbers on what one GPU gets you? Assuming something like vLLM for concurrent requests, what kind of throughput are you seeing? Serving an LLM just feels like a nightmare.

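One way to get those numbers for a specific workload is vLLM's offline batch API (a sketch; the model choice is illustrative and has to fit the GPU):

```python
# Rough throughput measurement with vLLM's offline batch interface.
# Assumes: pip install vllm
import time

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")  # illustrative small model
params = SamplingParams(max_tokens=128)
prompts = ["Summarize what Cloud Run is."] * 64  # simulate 64 concurrent requests

start = time.monotonic()
outputs = llm.generate(prompts, params)
elapsed = time.monotonic() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated} tokens in {elapsed:.1f}s => {generated / elapsed:.0f} tok/s")
```
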
holografix 4 days ago

The value in this is really in running small custom models or the absolute latest open-weight models.

Why bother when you can get pay-as-you-go API access to popular open-weight models like Llama on the Vertex AI Model Garden, or at the edge on Cloudflare?

gardnr 4 days ago

The Nvidia L4 has 24GB of VRAM and consumes 72 watts, which is relatively low compared to other datacenter cards. It's not a monster GPU, but it should work OK for inference.

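As a rough rule of thumb for what fits in those 24 GB (a sketch; the 1.2x overhead factor for KV cache and runtime buffers is a loose assumption):

```python
# Back-of-the-envelope: will a model fit in the L4's 24 GB?
def vram_needed_gb(params_billion: float, bytes_per_param: float,
                   overhead: float = 1.2) -> float:
    # overhead loosely covers KV cache, activations, and runtime buffers
    return params_billion * bytes_per_param * overhead

L4_GB = 24
for name, params, bytes_pp in [("7B fp16", 7, 2.0), ("13B fp16", 13, 2.0),
                               ("13B int4", 13, 0.5), ("70B int4", 70, 0.5)]:
    need = vram_needed_gb(params, bytes_pp)
    print(f"{name}: ~{need:.0f} GB -> {'fits' if need <= L4_GB else 'too big'}")
```
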
pier25 4 days ago

How does this compare to Fly GPUs in terms of pricing?
ringeryless 4 days ago

I wonder what all this hype-driven overcapacity will be used for by future generations.

Once this bubble pops, we are going to have some serious, albeit high-latency, hardware.

treksis 4 days ago

Everything is good except the price.

moeadham 4 days ago

If only they had some decent GPUs. L4s are pretty limited these days.

einpoklum 4 days ago

Why is commercial advertising published as a content article here?
omneity 4 days ago

> Time-to-First-Token of approximately 19 seconds for a gemma3:4b model (this includes startup time, model loading time, and running the inference)

This is my biggest pet peeve with serverless GPU. 19 seconds is horrible latency from the user's perspective, and that's a best-case scenario.

If this is the best one of the most experienced teams in the world can do, with a small 4B model, then it feels like serverless is really restricted to non-interactive use cases.

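Measuring that time-to-first-token yourself is straightforward with a streaming request (a sketch against an Ollama-style endpoint; the URL is a hypothetical Cloud Run service, and the model tag matches the one quoted above):

```python
# Time-to-first-token against a streaming generate endpoint.
# Assumes: pip install requests; an Ollama-style API behind the URL below
import time

import requests

BASE_URL = "https://my-service.run.app"  # hypothetical Cloud Run URL

start = time.monotonic()
resp = requests.post(
    f"{BASE_URL}/api/generate",
    json={"model": "gemma3:4b", "prompt": "Hello", "stream": True},
    stream=True,
    timeout=120,
)
for line in resp.iter_lines():
    if line:  # first streamed chunk => first token(s) back
        print(f"time to first token: {time.monotonic() - start:.1f}s")
        break
```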