We are about to start serving a large model in a production setting. We have a long history of serving smaller ML models in torch/tf/sklearn, and in those cases we typically bundle the model into a docker image along with a fastapi backend and serve it on k8s (GKE in our case). That has worked well for us over the years.

Now that a model is 10+ GB, or for some LLMs even 100+ GB, we can't package it in a docker image anymore. For those of you running these models in production, how are you serving them? Some options we're looking at include:

1. Model in a storage bucket and a custom fastapi backend; read the model from the bucket at pod startup (rough sketch after the list)
2. Model on a persistent disk that we mount with a PVC, custom fastapi backend, read the model from disk at pod startup (faster than reading from a bucket)
3. Install KServe in our k8s cluster and commit to their best practices
4. Vertex AI Endpoints
5. HF Inference Endpoints
6. idk, bento? Other tools we haven't considered?

So how do you folks do it? What has worked well, and what are the pitfalls when going from small ≈2 GB models to 10+ GB models?
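For context, here's roughly what we have in mind for options 1 and 2: a minimal sketch only, assuming a torch checkpoint in a GCS bucket, and the env vars (GCS_BUCKET, GCS_BLOB, MODEL_PATH) and paths are all placeholders we made up for illustration:

    # Sketch of options 1/2: load the model at pod startup instead of baking
    # it into the image. With a PVC (option 2), MODEL_PATH already exists on
    # the mounted volume and the download step is skipped.
    import os
    import pathlib
    from contextlib import asynccontextmanager

    import torch
    from fastapi import FastAPI
    from google.cloud import storage  # pip install google-cloud-storage

    MODEL_PATH = pathlib.Path(os.environ.get("MODEL_PATH", "/models/model.pt"))
    GCS_BUCKET = os.environ.get("GCS_BUCKET")           # placeholder bucket name
    GCS_BLOB = os.environ.get("GCS_BLOB", "model.pt")   # object key in the bucket

    model = None

    @asynccontextmanager
    async def lifespan(app: FastAPI):
        global model
        if not MODEL_PATH.exists() and GCS_BUCKET:
            # Option 1: pull the weights from the bucket on startup.
            MODEL_PATH.parent.mkdir(parents=True, exist_ok=True)
            blob = storage.Client().bucket(GCS_BUCKET).blob(GCS_BLOB)
            blob.download_to_filename(str(MODEL_PATH))
        # Load however your framework needs; a plain torch checkpoint here.
        model = torch.load(MODEL_PATH, map_location="cpu")
        yield

    app = FastAPI(lifespan=lifespan)

    @app.get("/healthz")
    def healthz():
        # Readiness only flips once the lifespan finished loading the model.
        return {"loaded": model is not None}

The obvious worry is startup time: pulling tens of GB on every pod start makes autoscaling and rollouts slow, which is partly why we're also looking at the PVC and managed options.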