
Replicate vs. Fly GPU cold-start latency

90 points by venkii over 1 year ago

13 comments

bfirsh over 1 year ago
Founder of Replicate here. Yeah, our cold boots suck.

Here's what we're doing:

- Fine-tuned models now boot fast: https://replicate.com/blog/fine-tune-cold-boots
- You can keep models switched on to avoid cold boots: https://replicate.com/docs/deployments
- We've optimized how weights are loaded into GPU memory for some of the models we maintain, and we're going to open this up to all custom models soon.
- We're going to be distributing images as individual files rather than as image layers, which makes pulling images much more efficient.

Although our cold boots do suck, this blog post is comparing apples to oranges, because Fly machines are much lower level than Replicate models. It is more like a warm boot.

It seems to be using a stopped Fly machine, which has already pulled the Docker image onto a node. When it starts, all it's doing is starting the Docker container. Creating the Fly machine or scaling it up would take much longer.

On Replicate, models auto-scale on a cluster. A model could be running anywhere in our cluster, so we have to pull the image to that node when it starts.

Something funny seems to be going on with the latency too. Our round-trip latency is about 200ms for a similar model. I would be curious to see the methodology, or maybe something was broken on our end.

But we do acknowledge the problem. It's going to get better soon.
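On the methodology question raised above, end-to-end latency against a Replicate model can be measured by timing a single prediction through the official Python client. This is a minimal sketch, not the benchmark used in the blog post; the model identifier and input are placeholders.

```python
# Minimal latency probe against a Replicate model (placeholder model id and input).
# Requires the official client (pip install replicate) and REPLICATE_API_TOKEN set.
import time

import replicate

MODEL = "owner/some-model:version-hash"  # placeholder, not a real model id

start = time.perf_counter()
output = replicate.run(MODEL, input={"prompt": "hello"})  # blocks until the prediction finishes
elapsed = time.perf_counter() - start

# A cold boot shows up as a large gap between this wall-clock time and the
# predict time Replicate reports for the prediction itself.
print(f"round-trip: {elapsed:.2f}s")
```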
treesciencebot over 1 year ago
Just as a top-level disclaimer, I'm working at one of the companies in "this" space (serverless GPU compute), so take anything I say with a grain of salt.

This is one of the things we (at https://fal.ai) are working very hard to solve. Because of ML workloads and their multi-GB environments (torch, all those cuda/cudnn libraries, and anything else they pull in), it is a real challenge just to get the container to start in a reasonable time frame. We had to write our own shared Python virtual environment runtime using SquashFS, distributed through a peer-to-peer caching system, to bring it down to the sub-second mark.

After the container boots, there is the aspect of storing model weights, which is IMHO less challenging since they are just big blobs of data (compared to Python environments, where there are thousands of smaller files, each of which might be read sequentially and incur a really major latency penalty). Distributing them once we had the system above was super easy, since just like the squashfs'd virtual environments, they are immutable data blobs.

We are also starting to play with GPUDirect on some of our bare metal clusters and are hopefully planning to expose it to our customers, which is especially important if your model is 40GB or larger. At that point you are operating at PCIe/SXM speeds, which is ~2-3 seconds for a model of that size.
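For readers unfamiliar with the SquashFS approach described above, the basic mechanics look roughly like the sketch below: pack a fully built virtual environment into a single compressed, immutable image and loop-mount it read-only wherever it is needed. This is a generic illustration, not fal.ai's actual runtime; the paths are assumptions and the mount step needs root privileges.

```python
# Generic sketch: package a Python venv as a SquashFS image and mount it read-only.
# Not fal.ai's implementation; paths are placeholders and mounting requires root.
import subprocess

VENV_DIR = "/opt/venvs/my-model-env"          # a fully built virtualenv (placeholder path)
IMAGE = "/var/cache/venvs/my-model-env.sqfs"
MOUNT_POINT = "/mnt/venvs/my-model-env"

# One compressed file instead of thousands of small ones makes distribution
# (and peer-to-peer caching) much simpler.
subprocess.run(["mksquashfs", VENV_DIR, IMAGE, "-comp", "zstd", "-noappend"], check=True)

# On a worker node, mount the image read-only; the environment is then usable via
# MOUNT_POINT/bin/python without ever unpacking the individual files.
subprocess.run(["mkdir", "-p", MOUNT_POINT], check=True)
subprocess.run(["mount", "-t", "squashfs", "-o", "loop,ro", IMAGE, MOUNT_POINT], check=True)
```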
harrisonjackson over 1 year ago
I spent a couple months hacking on a dreambooth product that let users train a model on their own photos and then generate new images w/ presets or their own prompts.

The main costs were:

- GPU time for training
- GPU time for inference
- storage costs for the users' models
- egress fees to download the model

I ended up using banana.dev and runpod.io for the serverless GPUs. Both were great, easy to hook into, and highly customizable.

I spent a bunch of time trying to optimize download speed, egress fees, GPU spot pricing, GPU location, etc.

R2 is cheaper than S3 - free egress! But the download speeds were MUCH worse than S3 - enough that it ended up not even being competitive.

It was frequently cheaper to use more expensive GPUs w/ better location and network speeds. That factored more into the pricing than how long the actual inference took on each instance.

Likewise, if your most important metric is time from boot to starting inference, then network access might be the limiting factor.
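Since download speed mattered more than price here, a quick way to compare object stores is to time pulling the same weights file from each. The sketch below uses boto3 against S3 and an S3-compatible endpoint such as R2; the bucket names, object key, and endpoint URL are placeholders.

```python
# Rough download-speed comparison between S3 and an S3-compatible store (e.g. R2).
# Bucket/key/endpoint values are placeholders; credentials come from the environment.
import time

import boto3

KEY = "models/user-123/dreambooth.safetensors"  # placeholder object key

def timed_download(client, bucket: str, dest: str) -> float:
    start = time.perf_counter()
    client.download_file(bucket, KEY, dest)
    return time.perf_counter() - start

s3 = boto3.client("s3")
r2 = boto3.client("s3", endpoint_url="https://<account-id>.r2.cloudflarestorage.com")

print(f"S3: {timed_download(s3, 'my-s3-bucket', '/tmp/model-s3.bin'):.1f}s")
print(f"R2: {timed_download(r2, 'my-r2-bucket', '/tmp/model-r2.bin'):.1f}s")
```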
moscicky over 1 year ago
Replicate has really long boot times for custom models - 2-3 minutes if you are lucky and up to 30 minutes if they are having problems.

While we loved the dev experience, we just couldn't make it work with frequently switching models / LoRA weights.

We switched to Beam (https://www.beam.cloud) and it's so much better. Their cold start times are consistently small, and they provide a caching layer for model files, i.e. volumes, which makes switching between models a breeze.

Beam also has a much better pricing policy. For custom models on Replicate you pay for boot times (which are very long!), so you end up paying a lot of $ for a single request.

With Beam you only pay for inference and idle time.
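The volume-based caching pattern mentioned above boils down to keeping weights on a persistent mount and only downloading on a miss, so switching models costs a file read rather than a network pull. This is a generic sketch, not Beam's actual SDK; the mount path and download URL are assumptions.

```python
# Generic weight-caching pattern on a persistent volume (not Beam's actual SDK).
# VOLUME_PATH and the download URL are placeholders.
import os
import urllib.request

VOLUME_PATH = "/volumes/model-cache"  # persistent mount shared across cold starts

def fetch_weights(name: str, url: str) -> str:
    """Return a local path to the weights, downloading only on a cache miss."""
    local_path = os.path.join(VOLUME_PATH, name)
    if not os.path.exists(local_path):
        os.makedirs(VOLUME_PATH, exist_ok=True)
        urllib.request.urlretrieve(url, local_path)  # slow path: first request only
    return local_path  # fast path: later cold starts just read the volume

weights = fetch_weights("lora-style-a.safetensors",
                        "https://example.com/lora-style-a.safetensors")
```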
mardifoufs over 1 year ago
Cold start is super bad on Azure Machine Learning endpoints, at least it was when we tried to use it a few months ago, even before it gets to the environment-loading step. It seems like even these results are better than what we got on AML, so it's impressive imo!
jonnycoder over 1 year ago
I wrote a review of Replicate last week, and the cog I was using, insanely-fast-whisper, had boot times exceeding 4 minutes. I wish there were more we could observe to find the cause of the slow startup times. I suspected it was dependencies.

https://open.substack.com/pub/jonolson/p/replicatecom-review-transcribing?r=84lpf&utm_campaign=post&utm_medium=web
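One low-effort way to test the dependency hypothesis is to measure how long the heavy imports themselves take inside the container, before blaming image pulls or weight loading. A sketch, assuming torch and transformers are the big imports in this particular cog:

```python
# Crude check of how much of a slow boot is just importing heavy dependencies.
# Run inside the container; the module list is an assumption about this cog.
import importlib
import time

for module in ("torch", "transformers"):
    start = time.perf_counter()
    importlib.import_module(module)
    print(f"import {module}: {time.perf_counter() - start:.1f}s")

# For a per-module breakdown of the whole startup, CPython's built-in import
# profiler also works, e.g.:  python -X importtime predict.py
```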
Aeolun over 1 year ago
I've never had as much fun with inference models as I have since finding out that Fly GPU servers start and stop at the drop of a hat.

I can literally boot the server for the 10-20s it takes to run a bunch of generations, and have it shut down automatically afterwards. It feels like magic.

Sure, creating the image after a new deployment takes up to two minutes, but once it's there it's incredibly fast.
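Fly can stop machines automatically, but the same start-run-stop pattern can also be driven explicitly with flyctl so the GPU machine only bills while a batch is running. A rough sketch; the app name and machine ID are placeholders, and flyctl is assumed to be installed and authenticated.

```python
# Start a Fly GPU machine, run a batch of generations against it, then stop it.
# APP and MACHINE_ID are placeholders.
import subprocess

APP = "my-inference-app"
MACHINE_ID = "1234567890abcd"

subprocess.run(["flyctl", "machine", "start", MACHINE_ID, "--app", APP], check=True)
try:
    # ... call the model endpoint for the 10-20s batch of generations here ...
    pass
finally:
    # Stop the machine so billing ends as soon as the batch is done.
    subprocess.run(["flyctl", "machine", "stop", MACHINE_ID, "--app", APP], check=True)
```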
timenova over 1 year ago
Is the 100 MB model being downloaded from HuggingFace on Fly too?

I ask this because Fly has immutable Docker containers which wouldn't store any data unless you use Fly Volumes. So it could be that Fly is downloading the 100 MB model each time it cold-boots.

If that's the case, a multi-stage Dockerfile could help in bundling the model in, and perhaps reducing cold-boot time even further.
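If the 100 MB of weights really are re-downloaded on every cold boot, baking them into the image at build time removes that step. Below is a minimal sketch of the build-time download script such a Dockerfile stage could RUN; the model ID and target directory are placeholders.

```python
# download_weights.py - run during `docker build` (e.g. RUN python download_weights.py)
# so the weights live in an image layer instead of being fetched at cold boot.
# The repo id and target directory are placeholders.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="some-org/some-100mb-model",  # placeholder Hugging Face repo
    local_dir="/app/weights",             # baked into the image layer
)
```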
hantusk over 1 year ago
Not affiliated, but I'm a happy modal.com user; it has very fast cold starts for the few demos I run with them.
dcsan over 1 year ago
Replicate is also very hard to predict costs on. I've found their salespeople are reluctant to make any predictions since things are changing so quickly. So it might take 5 minutes to cold-boot a model for a 2s prediction run, but it's not clear how much you pay for that run.

Replicate created the cog spec and is a fantastic resource for browsing and playing with new models. They are a social destination too.

But Fly.io is nice and simple for Docker side projects; I hope their cog deployments are as smooth, since there are a few extra pieces involved.
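The cost concern is easy to see with back-of-the-envelope arithmetic: if boot time is billed at the same per-second GPU rate as the prediction, a long cold boot dominates the bill for a short run. The rate below is a made-up placeholder, not Replicate's actual pricing.

```python
# Back-of-the-envelope cost of one cold request (rate is a placeholder, not real pricing).
GPU_RATE_PER_SEC = 0.000975   # hypothetical $/s for some GPU

boot_s = 5 * 60       # 5 minute cold boot
predict_s = 2         # 2 second prediction run

billed = (boot_s + predict_s) * GPU_RATE_PER_SEC
useful = predict_s * GPU_RATE_PER_SEC
print(f"billed: ${billed:.4f}, of which useful work: ${useful:.4f}")
# Almost all of the spend is boot time, which is why per-request costs are hard to predict.
```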
iAkashPaul over 1 year ago
I tend to have a preflight script which is run prior to the final Docker command. This lets the layers have the cached weights and avoids dealing with downloading, or making changes to the codebase for loading downloaded weights. That would shave off 10s from both providers.
iambateman over 1 year ago
To make sure I understand… this would provide a private API endpoint for a developer to call an LLM in a serverless way?

They could call it and just pay for the time spent, not for a persistent server.
brianjking over 1 year ago
Would be curious to see how this compares to Modal.com too.