
Ask HN: ML Practitioners, how do you serve "large" models in production?

4 points | by dankle | over 1 year ago
We are about to start serving a large model in a production setting. We have a long history of serving smaller ML models in torch/tf/sklearn, and in those cases we typically bundle the model into a Docker image along with a FastAPI backend and serve it on k8s (GKE in our case). It's been working well for us over the years.

Now, when a model is 10+ GB, or for some LLMs even 100+ GB, we can't package it in a Docker image anymore. For those of you running these models in production, how are you serving them? Some options we're looking at include:

1. Model in a storage bucket plus a custom FastAPI backend; read the model from the bucket at pod startup (sketched below)
2. Model on a persistent disk mounted via a PVC, plus a custom FastAPI backend; read the model from disk at pod startup (faster than reading from a bucket)
3. Install KServe in our k8s cluster and commit to its best practices
4. Vertex AI Endpoints
5. HF Inference Endpoints
6. idk, Bento? Other tools we haven't considered?

So how do you folks do it? What has worked well, and what are the pitfalls when going from small ≈2 GB models to 10+ GB models?
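To make option 1 concrete, here is a minimal sketch of a FastAPI service that pulls the weights from a GCS bucket at pod startup, before traffic is admitted. The bucket name, object path, local directory, and `/predict` payload shape are hypothetical placeholders, and the loading code assumes a plain PyTorch model saved with `torch.save`; it is not a drop-in implementation of the setup described above.

```python
# Sketch of option 1: bucket + custom FastAPI backend, model fetched at startup.
# Bucket name, object path, and payload shape are hypothetical placeholders.
from contextlib import asynccontextmanager
from pathlib import Path

import torch
from fastapi import FastAPI
from google.cloud import storage

MODEL_BUCKET = "example-model-bucket"   # hypothetical bucket name
MODEL_BLOB = "llm/model-v1.pt"          # hypothetical object path
LOCAL_PATH = Path("/models/model.pt")   # emptyDir or local SSD on the pod

state: dict = {}


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Download once per pod before serving; for 10+ GB files this is the
    # slow part, so readiness/startup probes need a generous timeout.
    if not LOCAL_PATH.exists():
        LOCAL_PATH.parent.mkdir(parents=True, exist_ok=True)
        storage.Client().bucket(MODEL_BUCKET).blob(MODEL_BLOB).download_to_filename(str(LOCAL_PATH))
    device = "cuda" if torch.cuda.is_available() else "cpu"
    # On torch >= 2.6, pass weights_only=False if the file is a pickled nn.Module.
    state["model"] = torch.load(LOCAL_PATH, map_location=device)
    state["model"].eval()
    yield
    state.clear()


app = FastAPI(lifespan=lifespan)


@app.post("/predict")
async def predict(payload: dict):
    # Placeholder inference path; real pre/post-processing is model-specific.
    inputs = torch.tensor(payload["inputs"])
    with torch.no_grad():
        outputs = state["model"](inputs)
    return {"outputs": outputs.tolist()}
```

For weights in the tens of gigabytes the download alone can take minutes, so each pod needs a startup probe with a long enough timeout; pre-staging the file onto a persistent disk with an initContainer trades that startup cost for option 2's PVC mount.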

No comments yet.
