Breaking up with Flask and FastAPI: Why they don’t scale for ML model serving

38 points by yubozhao almost 3 years ago

10 comments

kroolik almost 3 years ago
I have a feeling the root cause is not in FastAPI or Flask, but in the architecture of the system itself.

Why? You are doing the inference in the same request, which is synchronous from the perspective of the caller. The request can be memory-intensive or CPU-intensive, and the issue is that you can't efficiently serve all those workloads from a single machine without being bottlenecked by Python.

I would say that the problem is in your approach of trying to use the webapp hammer for all the different flavors of nails in your system, using a language that isn't suited for concurrency. What I would do is decouple the validation/interface logic from your models via a queue. This way you can scale your capacity according to workload and make sure each workload runs on the hardware most relevant to the job.

I have a feeling trying to throw a webapp at the problem might not solve your root issue, only delay it.
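For illustration, a minimal sketch of the queue-based decoupling described above, using Python's standard multiprocessing module; load_model() and the worker count are placeholders, not anything from the article:

    import multiprocessing as mp

    def load_model():
        # Stand-in for however your framework loads the real model.
        return lambda features: sum(features)

    def inference_worker(task_queue, result_queue):
        model = load_model()  # heavy initialization happens once per worker process
        while True:
            request_id, features = task_queue.get()
            if request_id is None:  # shutdown sentinel
                break
            result_queue.put((request_id, model(features)))

    if __name__ == "__main__":
        task_queue, result_queue = mp.Queue(), mp.Queue()
        # Workers scale independently of the web tier, e.g. one per GPU or CPU core.
        workers = [mp.Process(target=inference_worker, args=(task_queue, result_queue))
                   for _ in range(2)]
        for w in workers:
            w.start()
        # The web layer (Flask, FastAPI, anything) only validates and enqueues:
        task_queue.put(("req-1", [1.0, 2.0, 3.0]))
        print(result_queue.get())  # -> ('req-1', 6.0)
        for _ in workers:
            task_queue.put((None, None))
        for w in workers:
            w.join()

In a real deployment the in-process queues would be an external broker (Redis, RabbitMQ, SQS, etc.) so the web tier and the model workers can run on different machines.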
detroitcoder almost 3 years ago
The jump of going from model -> webserver by placing the webserver in the same process as the model is enticing because you can get it to work in under an hour by adding flask/django/fastapi to the env and decorating a function. The problem is that your model and webserver do NOT scale in the same way, and if you don't realize this fast, you are going to be trying to fit a square peg through a round hole once you have adoption.

All models at scale eventually need to be executed by an async queue processor, which is fundamentally different from a request-response REST API. For simplicity, managing this outside of the process making the web request will help you debug issues when people start asking why they are getting 502 responses. If you are forced to use Python for this, I would always suggest going to celery/huey/dramatiq as an immediate next step after the REST API MVP. I hear Celery is getting better, but I have run into issues over the years, so it pains me to recommend it.
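As a rough sketch of that next step (not the article's code): a single Celery task that owns the model, with the broker URL and the get_model() helper as placeholders for your own setup:

    from celery import Celery

    app = Celery("inference",
                 broker="redis://localhost:6379/0",
                 backend="redis://localhost:6379/1")

    _model = None

    def get_model():
        # Placeholder: load the real model once per worker process and cache it.
        global _model
        if _model is None:
            _model = lambda features: sum(features)
        return _model

    @app.task
    def predict(features):
        return get_model()(features)

    # The web handler only enqueues and returns a task id:
    #   task = predict.delay([1.0, 2.0, 3.0])
    #   return {"task_id": task.id}
    # A separate worker process (`celery -A <module> worker`) runs the inference.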
beckingz almost 3 years ago
"FastAPI is not perfect for ML serving"

Yup. There's a huge amount of work you need to do to cover the whole ML lifecycle, and FastAPI doesn't support that out of the box the way a full-fledged ML platform does.

But you probably don't actually want a full ML platform, because they're all opinionated, and if you try to fight them it's often worse than just serving it as an API via FastAPI...
ttymck almost 3 years ago
Forgive me, I don't mean this flippantly, but it sounds like you implemented queuing and multiprocessing consumers on a Starlette webserver. "Micro-batching" is a feature enabled by the queueing. The GPU/CPU abstraction is nice, but I feel it's buried by the "FastAPI isn't good enough" digression. If it were framed as "here's what we added to the Starlette ecosystem", I would have approached it much more agreeably.

It would've been delightful to see "instantiate a runner in your existing Starlette application". I don't want to instantiate a Bento service. Perhaps I can mount the Bento service on the Starlette application?

Apologies if I am still grossly misunderstanding. I tried to look through some of the _internal codebase to see how the Runner is implemented; the constructor signatures are very complex and the indirection to RunnerMethod had me cross-eyed.
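For reference, Starlette itself can mount any ASGI sub-application; whether a Bento service exposes one in that form isn't shown here, so the mounted app below is a stand-in:

    from starlette.applications import Starlette
    from starlette.responses import JSONResponse
    from starlette.routing import Mount, Route

    async def health(request):
        return JSONResponse({"ok": True})

    # Stand-in ASGI app for whatever the model-serving framework exposes.
    async def model_asgi_app(scope, receive, send):
        await send({"type": "http.response.start", "status": 200,
                    "headers": [(b"content-type", b"application/json")]})
        await send({"type": "http.response.body", "body": b'{"prediction": 0.5}'})

    app = Starlette(routes=[
        Route("/health", health),
        Mount("/model", app=model_asgi_app),
    ])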
anderskaseorg almost 3 years ago
> While FastAPI does support async calls at the web request level, there is no way to call model predictions in an async manner.

This confuses me. How is that FastAPI's fault? Can't you just asynchronously delegate them to a concurrent.futures.ThreadPoolExecutor or concurrent.futures.ProcessPoolExecutor? What does Starlette provide here that FastAPI doesn't? If the FastAPI limitations are due to ASGI, shouldn't Starlette have the same limitations?
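The kind of delegation being suggested, sketched with a placeholder predict() so the event loop isn't blocked by the model call:

    import asyncio
    from concurrent.futures import ProcessPoolExecutor

    from fastapi import FastAPI

    app = FastAPI()
    executor = ProcessPoolExecutor(max_workers=2)

    def predict(features: list[float]) -> float:
        # Placeholder for a CPU-heavy, blocking model call.
        return sum(features) / len(features)

    @app.post("/predict")
    async def predict_endpoint(features: list[float]):
        loop = asyncio.get_running_loop()
        # Runs in a separate process, so the request handler awaits instead of blocking.
        result = await loop.run_in_executor(executor, predict, features)
        return {"prediction": result}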
sgt101 almost 3 years ago
Advert disguised as experience report.
isoprophlex almost 3 years ago
Nice advertorial, but what about a queue and some machines running torchserve?
lmeyerov almost 3 years ago
We have been using async Python for GPU pydata, including fronting dask/dask_cuda for sharing and bigger-than-memory scenarios, so a lot of this rings true.

For model serving, we were thinking Triton (native vs. Python server) as it is a tightly scoped and optimized problem: any perf comparison there?
andrewstuart almost 3 years ago
FastAPI is collapsing under the weight of its GitHub issues (1,100) and pull requests (483).
timliu99 almost 3 years ago
Wait... so you're telling me FastAPI is slow...