We currently have an internal API that's core to our business. The models are loaded from .pkl files with joblib and served via Flask with Gunicorn using the gevent worker class. We've tried Tornado as a worker class and CherryPy as a replacement for Gunicorn -- neither produced significant performance benefits.<p>We're hosting it in a Kubernetes cluster with large nodes (140GB of RAM). Each container uses ~5GB of RAM, and given the response time (~750ms), each node we add ($1.5k) buys us only about 30 req/sec. Each request appears to be CPU-bound, which makes it difficult to scale out.<p>This is cost-prohibitive, and it feels like we need to move towards other tools/approaches.<p>As the person managing the infrastructure, I'm less familiar with the current ecosystem of larger-scale tooling. Ideally, the next iteration would keep the HTTP transport layer to allow for minimal changes to the rest of the system.<p>What would be a logical next step for us to scale the existing scikit-learn/Flask API?
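<p>For reference, the serving path is roughly this shape (illustrative names, not our actual code; the stand-in model here is tiny, whereas ours is a ~5GB .pkl):

```python
# Minimal sketch of the current serving path: a scikit-learn model
# persisted with joblib, served as a Flask endpoint under Gunicorn.
import joblib
from flask import Flask, jsonify, request
from sklearn.linear_model import LogisticRegression

# Stand-in for the real artifact; in production this is a large .pkl
# trained offline and baked into the container image.
clf = LogisticRegression().fit([[0.0], [1.0]], [0, 1])
joblib.dump(clf, "model.pkl")

app = Flask(__name__)
model = joblib.load("model.pkl")  # loaded once per worker process at startup

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]   # e.g. [[1.0], [0.2]]
    prediction = model.predict(features)        # CPU-bound inference step
    return jsonify({"prediction": prediction.tolist()})

# Run under Gunicorn with the gevent worker class, e.g.:
#   gunicorn -k gevent -w 4 app:app
```

The gevent worker only helps with I/O concurrency, so the CPU-bound `model.predict` call still serializes requests within a worker, which matches the scaling behavior we're seeing.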