I support any progress to erode the Nvidia monopoly.

That said, from what I'm seeing here, the free and open source (less other aspects of the CUDA stack, of course) TensorRT-LLM[0] almost certainly bests this implementation on the same Nvidia hardware they reference for comparison. Compare against real (datacenter) Nvidia GPUs that aren't three years old and prepare to get your hair blown back.

I don't have an A6000, but as an example: with the tensorrt_llm backend[1] for Nvidia Triton Inference Server (also free and open source) I get roughly 30 req/s with Mistral 7B on my RTX 4090, at significantly lower latency, and I'm still in the early stages of tuning (a minimal client sketch is at the bottom of this comment). Comparison benchmarks are tough, especially when published benchmarks like these are fairly scant on the real details.

TensorRT-LLM has only been public for a few months, and if you peruse the docs, PRs, etc. you'll see they have many more optimizations in the works.

In typical Nvidia fashion, TensorRT-LLM runs on *any* Nvidia GPU (from laptop to datacenter) going back to Turing (five-year-old cards), assuming you have the VRAM. It even works on their Jetson line of hardware.

You can download and run this today, free and "open source" for these implementations at least. I'm extremely skeptical of the claim that "MK1 Flywheel has the Best Throughput and Latency for LLM Inference on NVIDIA"[2]. You'll note they compare to vLLM, which is an excellent and incredible project, but if you put vLLM up against Triton with TensorRT-LLM the performance improvements are dramatic.

The H100/H200 is of course the latest and greatest ($$$$$$ and unobtanium), but one look at its performance[3] shows what happens when the vendor has a robust software ecosystem to help sell their hardware. Pay the Nvidia tax on the frontend for the hardware, get it back and then some as a dividend via the software, especially when anything that comes close (assuming this even does) is another paid product/SaaS/whatever their monetization strategy is.

At the risk of this turning into an Nvidia sales pitch: Triton will do the same thing for absolutely any model via the ONNX, TensorRT, PyTorch, TensorFlow, OpenVINO, etc. backends.

I have an implementation generating embeddings via bge-large-v1.5 that's also the fastest thing out there. Same for Whisper, vision models, whatever you want.

I feel like MK1 must be aware of TensorRT-LLM/Triton, but of course those comparison benchmarks won't help sell their startup.

[0] - https://github.com/NVIDIA/TensorRT-LLM

[1] - https://github.com/triton-inference-server/tensorrtllm_backend

[2] - https://mkone.ai/blog/mk1-flywheel-race-tuned-and-track-ready

[3] - https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/Falcon180B-H200.md#llama-70b-on-h200-up-to-67x-a100
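
Since I brought up the Triton route, here's roughly what hitting such a deployment looks like from the client side. This is a minimal sketch, not a drop-in recipe: the model name ("ensemble") and tensor names ("text_input", "max_tokens", "text_output") follow the tensorrtllm_backend examples, and your own model repository's config.pbtxt is the source of truth for all of them.

    # Minimal sketch of a Triton HTTP client request (pip install tritonclient[http]).
    # Assumes a Triton server on localhost:8000; model/tensor names are placeholders
    # taken from the tensorrtllm_backend examples -- check your config.pbtxt.
    import numpy as np
    import tritonclient.http as httpclient

    client = httpclient.InferenceServerClient(url="localhost:8000")

    # Triton string tensors go over the wire as BYTES, built from a numpy object array.
    prompt = np.array([["Explain KV caching in one sentence."]], dtype=object)
    text_input = httpclient.InferInput("text_input", list(prompt.shape), "BYTES")
    text_input.set_data_from_numpy(prompt)

    max_tokens = np.array([[128]], dtype=np.int32)
    max_tokens_in = httpclient.InferInput("max_tokens", list(max_tokens.shape), "INT32")
    max_tokens_in.set_data_from_numpy(max_tokens)

    # Run inference against the ensemble model and pull the generated text back out.
    result = client.infer(
        model_name="ensemble",
        inputs=[text_input, max_tokens_in],
        outputs=[httpclient.InferRequestedOutput("text_output")],
    )
    print(result.as_numpy("text_output"))

Wrap that call in a thread pool firing concurrent requests and you have the crude req/s number I'm quoting above.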