I'm skeptical of these benchmarks for a number of reasons.

1. They're only comparing against vLLM, which isn't SOTA for latency-focused inference. For example, their vLLM benchmark on 2 GPUs sees 102 tokens/s at BS=1, while gpt-fast gets around 190 tok/s: https://github.com/pytorch-labs/gpt-fast
2. As others have pointed out, they're comparing two H100s running with TP=2 vs. 2 AMD GPUs running independently.

Specifically:

> To make an accurate comparison between the systems with different settings of tensor parallelism, we extrapolate throughput for the MI300X by 2.

This is uhh... very misleading, for a number of reasons. For one, at BS=1, what does running with 2 GPUs even mean? Do they mean that they're getting the results for one AMD GPU at BS=1 and then... doubling that? Isn't that just... running at BS=2? (See the first sketch at the end of this comment.)