I have a feeling the root cause is not in FastAPI or Flask, but in the architecture of the system itself.

Why? You are doing the inference in the same request, which is synchronous from the perspective of the caller. The request can be memory-intensive or CPU-intensive, and the issue is that you can't efficiently serve all of those workloads on a single machine without being bottlenecked by Python.

I would say the problem is in your approach: trying to use the webapp hammer on all the different flavors of nails in your system, in a language that isn't suited for concurrency. What I would do is decouple the validation/interface logic from your models via a queue (a minimal sketch follows below). This way you can scale your capacity according to workload and make sure each workload runs on the hardware most relevant to the job.

I have a feeling that throwing a webapp at the problem might not solve your root issue, only delay it.
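As a rough illustration of the decoupling I mean (a sketch, not a design: the Redis list-as-queue, the key names, and the `predict` stub are all placeholders), the web tier only validates and enqueues, and a separate worker process on inference-appropriate hardware pulls jobs off the queue:

```python
# web tier: validate the request, enqueue the job, return immediately
import json
import uuid

import redis
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
queue = redis.Redis()  # assumes a Redis instance reachable on localhost

class PredictRequest(BaseModel):
    text: str

@app.post("/predict")
def submit(req: PredictRequest):
    # plain def: FastAPI runs this in a threadpool, so the sync Redis call is fine
    job_id = str(uuid.uuid4())
    queue.lpush("inference:jobs", json.dumps({"id": job_id, "text": req.text}))
    return {"job_id": job_id, "status": "queued"}
```

```python
# worker tier: runs on whatever hardware the model actually needs
import json

import redis

queue = redis.Redis()

def predict(text: str) -> str:
    # stand-in for the real model call
    return f"prediction for: {text!r}"

while True:
    # BRPOP blocks until a job is available
    _, raw = queue.brpop("inference:jobs")
    job = json.loads(raw)
    result = predict(job["text"])
    # store the result keyed by job id so the web tier can poll for it
    queue.set(f"inference:result:{job['id']}", result)
```

The point isn't Redis specifically; it's that the web tier and the inference workers can now be scaled and placed independently.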
The jump from model -> webserver by placing the webserver in the same process as the model is enticing because you can get it working in under an hour by adding flask/django/fastapi to the env and decorating a function. The problem is that your model and your webserver do NOT scale in the same way, and if you don't realize this fast, you'll be trying to fit a square peg through a round hole once adoption picks up and you're stuck trying to make it work.

All models at scale eventually need to be executed by an async queue processor, which is fundamentally different from a request/response REST API. Keeping this outside the process handling the web request also makes it much simpler to debug issues when people start asking why they are getting 502 responses. If you are forced to use Python for this, I would always suggest going to celery/huey/dramatiq as the immediate next step after the REST API MVP (a minimal dramatiq sketch is below). I hear Celery is getting better, but I have run into issues with it over the years, so it pains me to recommend it.
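For concreteness, that next step with dramatiq might look roughly like this (module and task names, and the Redis broker, are just assumptions; huey and celery have an equivalent shape):

```python
# tasks.py -- the inference work lives in a dramatiq actor, run by separate worker processes
import dramatiq
from dramatiq.brokers.redis import RedisBroker

dramatiq.set_broker(RedisBroker(host="localhost"))

@dramatiq.actor(max_retries=3)
def run_inference(job_id: str, payload: dict):
    # load/call the model here; write results somewhere the API can read them back from
    ...
```

```python
# api.py -- the REST endpoint only validates and enqueues, then returns
import uuid

from fastapi import FastAPI

from tasks import run_inference

app = FastAPI()

@app.post("/predict")
async def predict(payload: dict):
    job_id = str(uuid.uuid4())
    run_inference.send(job_id, payload)  # enqueue; executed by the dramatiq workers
    return {"job_id": job_id, "status": "queued"}
```

The workers are started separately (e.g. `dramatiq tasks`), so they can be scaled and put on GPU boxes independently of the web tier.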
"FastAPI is not perfect for ML serving"<p>Yup. There's a huge amount of work that you need to do to do the whole ML lifecycle, and FastAPI doesn't support that out of the box like a full fledged ML Platform.<p>But you probably don't actually want a full ML Platform because they're all opinionated and if you try and fight them it's often worse than just serving it as an API via FastAPI...
Forgive me, I don't mean this flippantly, but it sounds like you implemented queueing and multiprocessing consumers on a Starlette webserver, with "micro batching" as a feature enabled by the queueing. The GPU/CPU abstraction is nice, but I feel it's buried under the "FastAPI isn't good enough" digression. If it were framed as "here's what we added to the Starlette ecosystem", I would have approached it much more agreeably.

It would've been delightful to see "instantiate a runner in your existing Starlette application". I don't want to instantiate a Bento service. Perhaps I can mount the Bento service on the Starlette application? (Something like the sketch below.)

Apologies if I am still grossly misunderstanding. I tried to look through some of the _internal codebase to see how the Runner is implemented, but the constructor signatures are very complex and the indirection to RunnerMethod had me cross-eyed.
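What I'm picturing is roughly this. The Starlette side is standard; whether the Bento service can hand you an ASGI callable to mount is purely an assumption on my part, so the `bento_asgi_app` stand-in below is hypothetical:

```python
# a plain Starlette app that keeps its own routes and mounts the model service under /model
from starlette.applications import Starlette
from starlette.responses import JSONResponse, PlainTextResponse
from starlette.routing import Mount, Route

async def health(request):
    return JSONResponse({"ok": True})

async def bento_asgi_app(scope, receive, send):
    # hypothetical stand-in for whatever ASGI app the Bento service would expose;
    # the mounting pattern is the point, not this body
    await PlainTextResponse("model service placeholder")(scope, receive, send)

app = Starlette(routes=[
    Route("/health", health),
    Mount("/model", app=bento_asgi_app),  # model endpoints live under their own prefix
])
```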
> While FastAPI does support async calls at the web request level, there is no way to call model predictions in an async manner.

This confuses me. How is that FastAPI's fault? Can't you just asynchronously delegate them to a concurrent.futures.ThreadPoolExecutor or concurrent.futures.ProcessPoolExecutor? What does Starlette provide here that FastAPI doesn't? If the FastAPI limitations are due to ASGI, shouldn't Starlette have the same limitations?
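For what it's worth, this is the kind of delegation I mean; a minimal sketch, assuming a synchronous, CPU-bound `predict_sync` (swap in a ThreadPoolExecutor if the model releases the GIL):

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
executor = ProcessPoolExecutor(max_workers=2)  # sized arbitrarily for the sketch

class PredictRequest(BaseModel):
    text: str

def predict_sync(text: str) -> dict:
    # stand-in for the real, blocking model call
    return {"label": "positive", "input": text}

@app.post("/predict")
async def predict(req: PredictRequest):
    loop = asyncio.get_running_loop()
    # offload the blocking prediction so the event loop keeps serving other requests
    result = await loop.run_in_executor(executor, predict_sync, req.text)
    return result
```

The caveat, and maybe what the article is getting at, is that an in-process pool still ties prediction capacity to the web workers, which is where a separate queue starts to look attractive.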
We have been using async Python for GPU pydata, including fronting dask/dask_cuda for sharing and bigger-than-memory scenarios, so a lot of this rings true.

For model serving, we were thinking Triton (native vs. Python server), as the problem is tightly scoped and Triton is optimized for it: any perf comparison there?