Powerful or easy to use, pick one. That seems to be the big trade-off when looking at LLM inference options. Which is strange, because compared to other ML models, LLMs should actually be pretty standardised: a single dominant model architecture, almost all of it running on GPUs from one of the largest market-cap companies in the world, taking unstructured text as input. So the input schema is basically a primitive built-in type of the language all these models are written in.

However, when you look at inference optimisation, there are basically more frameworks than there are applications. Most of them are abstractions over an inference engine like vLLM, which means vLLM is now unintentionally gatekeeping what people can do with LLMs in production. How did we end up here?

Rather than introduce yet another (TM) inference framework, I can't help but wonder why nobody has figured out how to run TensorRT-LLM on K8s. Or why NVIDIA hasn't realised that if you build what is potentially the most feature-rich LLM inference implementation and it still isn't picked up all that much, you might need to work on the ease of implementation. Somebody please make it easy to run TensorRT-LLM on k8s and call me.
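
To make the ask concrete, here's a rough sketch of the kind of thing I'd expect to just exist: the Kubernetes Python client standing up a Triton + TensorRT-LLM deployment on a GPU node. The image tag, the /models path, and the "trtllm-models" PVC are my assumptions, not an official NVIDIA recipe:

    # Hypothetical sketch, not an official recipe: deploy Triton with the
    # TensorRT-LLM backend onto a GPU node via the Kubernetes Python client.
    from kubernetes import client, config

    config.load_kube_config()  # or load_incluster_config() inside the cluster

    container = client.V1Container(
        name="trtllm",
        # Assumed NGC image with the TensorRT-LLM backend baked in.
        image="nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3",
        command=["tritonserver"],
        args=["--model-repository=/models"],  # pre-built TensorRT-LLM engines live here
        ports=[client.V1ContainerPort(container_port=8000)],  # Triton HTTP port
        resources=client.V1ResourceRequirements(limits={"nvidia.com/gpu": "1"}),
        volume_mounts=[client.V1VolumeMount(name="models", mount_path="/models")],
    )

    template = client.V1PodTemplateSpec(
        metadata=client.V1ObjectMeta(labels={"app": "trtllm"}),
        spec=client.V1PodSpec(
            containers=[container],
            volumes=[client.V1Volume(
                name="models",
                # Assumed PVC holding the compiled engines and Triton model repo.
                persistent_volume_claim=client.V1PersistentVolumeClaimVolumeSource(
                    claim_name="trtllm-models"),
            )],
        ),
    )

    deployment = client.V1Deployment(
        api_version="apps/v1",
        kind="Deployment",
        metadata=client.V1ObjectMeta(name="trtllm-server"),
        spec=client.V1DeploymentSpec(
            replicas=1,
            selector=client.V1LabelSelector(match_labels={"app": "trtllm"}),
            template=template,
        ),
    )

    client.AppsV1Api().create_namespaced_deployment(namespace="default", body=deployment)

And even this glosses over the real friction: you still have to compile TensorRT engines for your exact GPU and lay out a Triton model repository before the pod can serve a single token, which is precisely the setup cost vLLM never asks of you.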