Hey everyone, I've been diving into the world of generative AI inference engines for quite some time at NLP Cloud, and I wanted to share some insights from a comparison I put together. I looked at four popular options (NVIDIA's TensorRT-LLM, vLLM, Hugging Face's Text Generation Inference (TGI), and LMDeploy) and ran some benchmarks to see how they stack up for real-world use cases. Thought this might spark some discussion here, since I know a lot of you are working with LLMs or optimizing inference pipelines.

TensorRT-LLM
------------

NVIDIA's beast for GPU-accelerated inference. Built on TensorRT, it optimizes models with layer fusion, precision tuning (FP16, INT8, even FP8), and custom CUDA kernels.

Pros: Blazing fast on NVIDIA GPUs: sub-50 ms latency for single requests on an A100 and ~700 tokens/sec at 100 concurrent users for LLaMA-3 70B Q4 (per BentoML benchmarks). Dynamic batching and tight integration with Triton Inference Server make it a throughput monster.

Cons: Setup can be complex if you're not already in the NVIDIA ecosystem. You need to compile models into engines before serving, and it's not super flexible for quick prototyping.

vLLM
----

Open-source champion for high-throughput inference. Uses PagedAttention to manage the KV cache in fixed-size blocks, cutting memory waste and boosting speed.

Pros: Easy to spin up (pip install, Python-friendly), and it's flexible: it runs on NVIDIA, AMD, even CPU. Throughput is solid (~600-650 tokens/sec at 100 users for LLaMA-3 70B Q4), and continuous batching keeps it humming. Latency is decent at 60-80 ms for a single request.

Cons: It's less optimized for single-request latency, so if you're building a chatbot that serves one user at a time, it might not shine as much. It's also still maturing; some edge cases (like exotic model architectures) might not be supported.

Hugging Face TGI
----------------

Hugging Face's production-ready inference server. It ties into their model hub (LLaMA, Mistral, and other popular hub models) and uses Rust for the serving layer, with continuous batching to keep GPUs busy.

Pros: Docker setup is quick, and it scales well. Latency is 50-70 ms, and throughput matches vLLM (~600-650 tokens/sec at 100 users). Bonus: built-in output filtering for safety. Perfect if you're already in the HF ecosystem.

Cons: Less raw speed than TensorRT-LLM, and memory can bloat with big batches. It feels a bit restrictive outside HF's world.

LMDeploy
--------

A toolkit from the MMRazor/MMDeploy crew, focused on fast, efficient LLM deployment. It features TurboMind (a high-performance engine) and a PyTorch fallback, with persistent batching and blocked KV caching for speed.

Pros: Decoding speed is nuts: up to 1.8x more requests/sec than vLLM on an A100. TurboMind pushes 4-bit inference 2.4x faster than FP16, hitting ~700 tokens/sec at 100 users (LLaMA-3 70B Q4). Low latency (40-60 ms), easy one-command server setup, and it even handles multi-round chats efficiently by caching the conversation history.

Cons: TurboMind is picky: it doesn't support sliding window attention (e.g., Mistral) yet. Non-NVIDIA users get stuck with the slower PyTorch engine. Still, on NVIDIA GPUs it's a performance beast.

What's your experience with these tools? Any hidden issues I missed? Or are there other inference engines that should be mentioned? Would love to hear your thoughts! (I've tacked some minimal usage sketches onto the end of the post in case anyone wants to kick the tires.)

Julien
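Here are the sketches, one per engine, with model names, ports, and parameters as placeholders you'd swap for your own. TensorRT-LLM first: recent releases expose a high-level Python LLM API that hides the engine build behind a vLLM-style interface. I'm assuming that API below (older versions need an explicit checkpoint conversion plus trtllm-build step), so treat the exact entry points as version-dependent rather than gospel.

    # TensorRT-LLM: assumes a recent release with the high-level LLM API and an
    # NVIDIA GPU; install per NVIDIA's instructions (pip wheel or NGC container).
    from tensorrt_llm import LLM, SamplingParams

    # Placeholder model; the first run builds a TensorRT engine, so expect a wait.
    llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
    params = SamplingParams(temperature=0.8, top_p=0.95)

    outputs = llm.generate(["Explain layer fusion in one sentence."], params)
    print(outputs[0].outputs[0].text)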
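vLLM is the easiest of the bunch to try offline. The snippet below uses its Python API; the same model can also be exposed over vLLM's OpenAI-compatible HTTP server if you'd rather hit it from another process.

    # vLLM: pip install vllm (NVIDIA GPU assumed here; AMD and CPU builds exist too).
    from vllm import LLM, SamplingParams

    # Placeholder model; any supported Hugging Face causal LM works the same way.
    llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
    params = SamplingParams(temperature=0.7, max_tokens=128)

    outputs = llm.generate(["Explain PagedAttention in one sentence."], params)
    print(outputs[0].outputs[0].text)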
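TGI is a server rather than a library, so the usual flow is to launch the official Docker image with your model ID and then talk to it over HTTP. This sketch assumes a container is already listening on localhost:8080 and queries it with huggingface_hub's InferenceClient.

    # TGI client side: pip install huggingface_hub.
    # Assumes a TGI container is already running, e.g. launched from the
    # ghcr.io/huggingface/text-generation-inference image with --model-id set,
    # and listening on port 8080 (the port and model are placeholders).
    from huggingface_hub import InferenceClient

    client = InferenceClient("http://localhost:8080")
    answer = client.text_generation(
        "Explain continuous batching in one sentence.",
        max_new_tokens=128,
        temperature=0.7,
    )
    print(answer)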
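LMDeploy's pipeline API is the quickest way to see TurboMind in action; the one-command server I mentioned is "lmdeploy serve api_server <model>", which serves the same model over HTTP. Again a sketch, with a placeholder model.

    # LMDeploy: pip install lmdeploy (the TurboMind path needs an NVIDIA GPU).
    from lmdeploy import pipeline

    # Placeholder model; TurboMind is used when the architecture is supported,
    # otherwise LMDeploy falls back to its slower PyTorch engine.
    pipe = pipeline("internlm/internlm2-chat-7b")
    responses = pipe(["Explain blocked KV caching in one sentence."])
    print(responses[0].text)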