I have a custom ML (PyTorch) model that I would like to set up as a service / API: it should be able to receive an input at any time and promptly return an output, and it should scale up automatically to thousands of requests per second. The model takes around a minute to load; an inference step takes around 100ms. The model is called only from my product's backend, so I have some control over request volume.

I've been searching around and haven't found a clear standard/best way to do this.

Here are some of the options I've considered:

- Algorithmia (came across this yesterday; unsure how good it is, and I have some questions about the licensing)

- Something fancy with Kubernetes

- Write a load balancer and manually spin up new instances when needed.

Right now I'm leaning towards Algorithmia, as it seems cost-effective and basically designed to do what I want. But I'm unsure how it handles long model loading times, or whether the major cloud providers have similar services.

I'm quite new to this kind of architecture and would appreciate some thoughts on the best way to accomplish this!
I work on a free and open source project called Cortex that deploys PyTorch models (as well as models from other frameworks) as scalable APIs. It sounds perfect for what you're looking for: https://github.com/cortexlabs/cortex

Cortex automates all of the devops work: containerizing your model, orchestrating Kubernetes, and autoscaling instances to meet demand. We have a bunch of PyTorch examples in our repo, if you're interested: https://github.com/cortexlabs/cortex/tree/master/examples
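To give a feel for it, a deployment is mostly a Python predictor class plus a small config file. A rough sketch (the `model_path` config key and file names are placeholders, and the exact interface can vary between versions; the examples in the repo are the authoritative reference):

```python
# predictor.py -- Cortex constructs this class once per replica, so the
# one-minute model load happens at startup, not on every request.
import torch

class PythonPredictor:
    def __init__(self, config):
        # config values come from the deployment's config file;
        # "model_path" is a placeholder key for this sketch
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model = torch.load(config["model_path"], map_location=self.device)
        self.model.eval()

    def predict(self, payload):
        # payload is the parsed JSON body of the request
        inputs = torch.tensor(payload["inputs"]).to(self.device)
        with torch.no_grad():
            outputs = self.model(inputs)
        return outputs.tolist()
```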
You can use SageMaker to deploy ML models.

https://aws.amazon.com/sagemaker/

SageMaker takes care of the infrastructure for you. It also integrates with orchestrators like Kubernetes, Airflow, etc.

https://sagemaker.readthedocs.io/en/stable/amazon_sagemaker_operators_for_kubernetes_jobs.html#real-time-inference
Maybe you could try saving the pre-trained model to a storage bucket (e.g. S3) and then using Flask (or whatever framework you like) to create the endpoints. When the Flask app starts, the model can be loaded into memory from the storage bucket, and then you could create, for example, a /predict endpoint that accepts whatever data is needed to make the prediction. Deploy this to some PaaS (Heroku, AWS Elastic Beanstalk, GCP App Engine) that has autoscaling as a feature and you're sorted.
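Something like this minimal sketch (bucket and key names are placeholders); the important part is loading the model once at startup so the ~1-minute load isn't paid on every request:

```python
# Minimal sketch: load the model once at startup, serve predictions via Flask.
import boto3
import torch
from flask import Flask, jsonify, request

app = Flask(__name__)

# Download and load the model when the process starts (~1 min),
# not per request. Bucket/key are placeholders.
s3 = boto3.client("s3")
s3.download_file("my-model-bucket", "model.pt", "/tmp/model.pt")
model = torch.load("/tmp/model.pt", map_location="cpu")
model.eval()

@app.route("/predict", methods=["POST"])
def predict():
    inputs = torch.tensor(request.get_json()["inputs"])
    with torch.no_grad():
        outputs = model(inputs)
    return jsonify({"outputs": outputs.tolist()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```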
So you can go with Kubernetes. This is my preferred tool.

With Kubernetes, you can either wrap your model inside a container or mount it into the container from a persistent volume.

As for scaling, you have two options:

1) Horizontal Pod Autoscaler (a minimal sketch follows below): https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/

2) Knative, which is a serverless layer on top of Kubernetes that you can also run on-prem.
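For option 1, the HPA itself is just a small piece of config, normally applied as a YAML manifest. Here's the same thing sketched with the official Kubernetes Python client, assuming a Deployment named `model-api` already exists (the name and thresholds are made up for illustration):

```python
# Sketch: attach a Horizontal Pod Autoscaler to an existing Deployment
# named "model-api" (hypothetical name) via the official Python client.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod

hpa = client.V1HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(name="model-api"),
    spec=client.V1HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V1CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name="model-api"
        ),
        min_replicas=2,    # keep warm replicas so the one-minute
        max_replicas=50,   # model load doesn't block live traffic
        target_cpu_utilization_percentage=70,
    ),
)

client.AutoscalingV1Api().create_namespaced_horizontal_pod_autoscaler(
    namespace="default", body=hpa
)
```

Keeping min_replicas above 1 matters here: new pods take a minute to load the model, so you want scale-ups to happen before existing replicas saturate.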
Algorithmia here. What are you concerned about license-wise? You own all IP, always. There are some restrictions if you choose to commercialize on our service (mostly that you guarantee you won't take it down on users). The system was built for this. Happy to answer questions.
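For what it's worth, calling a hosted model from your backend is a few lines with the Python client (the API key and algorithm path below are placeholders):

```python
# Sketch: calling an Algorithmia-hosted model from a backend service.
import Algorithmia

client = Algorithmia.client("YOUR_API_KEY")       # placeholder key
algo = client.algo("your_user/your_model/1.0")    # placeholder algorithm path
response = algo.pipe({"inputs": [0.1, 0.2, 0.3]})
print(response.result)
```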