Ask HN: ML Practitioners, how do you serve “large” models in production?

4 points by dankle almost 2 years ago
We are about to start serving a large model in a production setting. We have a long history of serving smaller ML models in torch/tf/sklearn, and in those cases we typically bundle the model in a docker image along with a fastapi backend to serve it in k8s (GKE in our case). It's been working well for us over the years.

Now, when a model is 10+ GB, or some LLMs even 100+ GB, we can't package them in a docker image anymore. How are those of you running these models in production serving them? Some options that we're looking at include:

1. Model in a storage bucket and custom fastapi backend; read the model from the bucket at pod startup
2. Model on a persistent disk that we mount with a PVC, custom fastapi backend; read the model from disk on pod startup (faster than reading from a bucket)
3. Install KServe in our k8s cluster and commit to their best practices
4. Vertex AI Endpoints
5. HF Inference Endpoints
6. idk, bento? Other tools we haven't considered?

So how do you folks do it? What has worked well, and what are the pitfalls when going from small ≈2 GB models to 10+ GB models?
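For options 1 and 2, a common pattern is to fetch or locate the weights inside the app's startup hook and only start accepting traffic once the model is loaded. Below is a minimal sketch of that pattern with FastAPI and torch, assuming a GCS bucket; the bucket name, object prefix, file names, environment variables, and the model's call signature are hypothetical placeholders rather than a reference to any particular setup.

    # Sketch of options 1/2 (illustrative only): pull weights from a bucket at pod
    # startup, or skip the download when a pre-mounted PVC already holds them,
    # then serve with FastAPI. All names below are hypothetical placeholders.
    import os
    from contextlib import asynccontextmanager

    import torch
    from fastapi import FastAPI
    from google.cloud import storage  # only needed for the bucket variant (option 1)
    from pydantic import BaseModel

    MODEL_DIR = os.getenv("MODEL_DIR", "/models/my-large-model")  # PVC mount or local cache
    GCS_BUCKET = os.getenv("GCS_BUCKET", "my-model-bucket")       # placeholder
    GCS_PREFIX = os.getenv("GCS_PREFIX", "my-large-model/")       # placeholder
    WEIGHTS_FILE = "weights.pt"                                   # placeholder

    state: dict = {}

    def download_from_gcs(bucket: str, prefix: str, dest_dir: str) -> None:
        """Option 1: copy every object under `prefix` into `dest_dir`."""
        client = storage.Client()
        os.makedirs(dest_dir, exist_ok=True)
        for blob in client.list_blobs(bucket, prefix=prefix):
            if blob.name.endswith("/"):  # skip "directory" placeholder objects
                continue
            target = os.path.join(dest_dir, os.path.relpath(blob.name, prefix))
            os.makedirs(os.path.dirname(target), exist_ok=True)
            blob.download_to_filename(target)

    @asynccontextmanager
    async def lifespan(app: FastAPI):
        # Option 2: if MODEL_DIR is a PVC that already contains the weights,
        # the download is skipped and startup is just the load from disk.
        weights_path = os.path.join(MODEL_DIR, WEIGHTS_FILE)
        if not os.path.exists(weights_path):
            download_from_gcs(GCS_BUCKET, GCS_PREFIX, MODEL_DIR)
        device = "cuda" if torch.cuda.is_available() else "cpu"
        model = torch.load(weights_path, map_location=device)
        model.eval()
        state["model"], state["device"] = model, device
        yield           # serve requests
        state.clear()   # release on shutdown

    app = FastAPI(lifespan=lifespan)

    class PredictRequest(BaseModel):
        inputs: list[float]

    @app.post("/predict")
    def predict(req: PredictRequest):
        # The call signature here is a stand-in for whatever the model expects.
        x = torch.tensor(req.inputs, device=state["device"]).unsqueeze(0)
        with torch.no_grad():
            out = state["model"](x)
        return {"outputs": out.squeeze(0).tolist()}

Whichever variant you pick, downloading and loading 10+ GB of weights can take minutes, so the Deployment typically needs a generous startup/readiness probe (or an init container that stages the weights) so the pod isn't restarted or sent traffic before the model is in memory.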

no comments