LLM-D: Kubernetes-Native Distributed Inference

119 points | posted by smarterclayton | 5 days ago

4 comments

rdli | 5 days ago
This is really interesting. For SOTA inference systems, I've seen two general approaches:

* The "stack-centric" approach such as vLLM production stack, AIBrix, etc. These set up an entire inference stack for you including KV cache, routing, etc.

* The "pipeline-centric" approach such as NVidia Dynamo, Ray, BentoML. These give you more of an SDK so you can define inference pipelines that you can then deploy on your specific hardware.

It seems like LLM-d is the former. Is that right? What prompted you to go down that direction, instead of the direction of Dynamo?
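
To make the "pipeline-centric" half of that distinction concrete, here is a minimal Ray Serve sketch in which inference stages are composed in code and then deployed. The class names and the stub generation step are illustrative placeholders (not part of llm-d or Dynamo), and the handle-based call assumes the newer Ray Serve DeploymentHandle API.

```python
# Minimal "pipeline-centric" sketch using Ray Serve deployment composition.
# Stage names and the stub model logic are placeholders for illustration.
from ray import serve


@serve.deployment
class StubLLM:
    """Stands in for a real model server (e.g. a vLLM engine)."""

    async def __call__(self, prompt: str) -> str:
        return f"echo: {prompt}"  # placeholder "generation"


@serve.deployment
class Pipeline:
    """Composes a preprocessing step and the model into one deployable graph."""

    def __init__(self, llm_handle):
        self.llm = llm_handle  # handle to the StubLLM deployment

    async def __call__(self, prompt: str) -> str:
        cleaned = prompt.strip()               # trivial preprocessing stage
        return await self.llm.remote(cleaned)  # call the downstream deployment


# Bind the graph and deploy it; callers reach it through the returned handle.
app = Pipeline.bind(StubLLM.bind())
handle = serve.run(app)
# print(handle.remote("  hello  ").result())  # blocking handle call (Serve >= 2.7)
```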
Kemschumam | 4 days ago
What would be the benefit of this project over hosting VLLM in Ray?
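
For readers unfamiliar with the setup this question refers to, "hosting vLLM in Ray" typically means wrapping a vLLM engine in a Ray Serve deployment, roughly like the sketch below. The model name, GPU count, and wrapper class are assumptions for illustration, and exact vLLM/Ray Serve APIs vary by version.

```python
# Rough sketch: wrapping vLLM's offline LLM engine in a Ray Serve deployment.
# Model name and resource settings are placeholders.
from ray import serve
from vllm import LLM, SamplingParams


@serve.deployment(ray_actor_options={"num_gpus": 1})
class VLLMDeployment:
    def __init__(self) -> None:
        # Load the model once per replica.
        self.llm = LLM(model="facebook/opt-125m")  # placeholder model
        self.params = SamplingParams(max_tokens=64)

    def __call__(self, prompt: str) -> str:
        # vLLM batches requests internally; return the first completion.
        outputs = self.llm.generate([prompt], self.params)
        return outputs[0].outputs[0].text


handle = serve.run(VLLMDeployment.bind())
# print(handle.remote("What is llm-d?").result())  # blocking handle call
```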
dzr0001 | 5 days ago
I did a quick scan of the repo and didn't see any reference to Ray. Would this indicate that llm-d lacks support for pipeline parallelism?
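
As background for the question, in vLLM itself pipeline parallelism is requested through engine arguments, with Ray acting as the distributed executor backend for the multi-GPU workers. A minimal sketch follows; the model name and parallel sizes are placeholders, and the argument names assume a recent vLLM release and may differ across versions.

```python
# Background sketch: enabling pipeline parallelism in vLLM.
# Model name and sizes are placeholders; argument names assume a recent release.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # placeholder model
    tensor_parallel_size=4,               # shard each layer across 4 GPUs
    pipeline_parallel_size=2,             # split the layer stack into 2 stages
    distributed_executor_backend="ray",   # Ray coordinates the worker processes
)

outputs = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```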
anttiharju | 5 days ago
I wonder if this is preferable to kServe