AMD's MI300X Outperforms Nvidia's H100 for LLM Inference

280 点作者 fvv11 个月前

24 条评论

m_a_g11 个月前

"TensorWave is a cloud provider specializing in AI workloads. Their platform leverages AMD’s Instinct™ MI300X accelerators, designed to deliver high performance for generative AI workloads and HPC applications."I suggest taking the report with a grain of salt.

评论 #40668588 未加载

评论 #40667590 未加载

评论 #40667643 未加载

qeternity11 个月前

Why the hell are we doing 128 input token benchmarks in 2024. This is not representative of most workloads, and prefill perf is incredibly important.

评论 #40667856 未加载

sva_11 个月前

I try to be optimistic about this. Competition is absolutely needed in this space - $NVDA market cap is insane right now, about $0.6 trillion more than the entire Frankfurt Stock Exchange.

评论 #40667676 未加载

评论 #40668995 未加载

评论 #40667804 未加载

评论 #40669466 未加载

评论 #40667841 未加载

mistymountains11 个月前

I’m a AI Scientist and train a lot of models. Personally I think AMD is undervalued relative to Nvidia. No, chips aren’t as fast as Nvidia’s latest and yes, there are some hoops to get things working. But for most workloads in most industries (ignoring for the moment that AI is likely a poor use of capital), it will be much more cost effective and achieve about the same results.

tgtweak11 个月前

The market (and selling price) is reflecting the perceived value of nvidia's solution vs AMDs - comprehensively including tooling, software, TCO and managability.Also curious how many companies are dropping that much money on those kind of accelerators just to run 8x 7B param models in parallel... You're also talking about being able to train a 14B model on a single accelerator. I'd be curious to see how "full-accelerator train and inferrence" workloads would look ie: Training a 14B param model then inferrence throughput on a 4x14B workload.AMD (and almost every other inferrence claim maker so far... intel and apple specifically) have consistently cherry picked the benchmarks to claim a win over, and ignored the remainder which all show nvidia in the lead - and they've used mid-gen comparison models as many commenters here pointed out in this article.

评论 #40670637 未加载

评论 #40670698 未加载

评论 #40671313 未加载

michaelnny11 个月前

I'm wondering if the tensor parallel settings have any impact on the performance. My naive guess is yes but not sure.According to the article: """ AMD Configuration: Tensor parallelism set to 1 (tp=1), since we can fit the entire model Mixtral 8x7B in a single MI300X’s 192GB of VRAM.NVIDIA Configuration: Tensor parallelism set to 2 (tp=2), which is required to fit Mixtral 8x7B in two H100’s 80GB VRAM. """

评论 #40668135 未加载

huntertwo11 个月前

AMD has better seemingly better hardware - but not the production capacity to compete with Nvidia yet. Will be interesting to see margins compress when real competition catches up.Everybody thinks it’s CUDA that makes Nvidia the dominant player. It’s not - almost 40% of their revenue this year comes from mega corporations that use their own custom stack to interact with GPUs. It’s only a matter of time before competition catches up and gives us cheaper GPUs.

评论 #40668811 未加载

评论 #40668494 未加载

评论 #40668288 未加载

mark_l_watson11 个月前

A good start for AMD. I am also enthusiastic about another non-NVidea inference option: Groq (which I sometimes use).NVidia relies on TMSC for manufacturing. Samsung is building competing manufacturing infrastructure which is also a good thing, so Taiwan is not a single point of failure.

lccerina11 个月前

Without proper statistical metrics (why use average when 95% percentile is widely used?) and performance/watt this is a useless comparison.

评论 #40668647 未加载

评论 #40667993 未加载

iAkashPaul11 个月前

INT8/FP8 benchmarks would've been great, both cards could have loaded them with around 60GB VRAM instead of TP=2 on H100.

latchkey11 个月前

We just got higher performance out of open source. No need for MK1.<a href="https://www.reddit.com/r/AMD_MI300/comments/1dgimxt/benchmarking_brilliance_single_amd_mi300x_vllm/" rel="nofollow">https://www.reddit.com/r/AMD_MI300/comments/1dgimxt/benchmar...</a>

rjzzleep11 个月前

> Hardware: TensorWave node equipped with 8 MI300X accelerators, 2 AMD EPYC CPU Processors (192 cores), and 2.3 TB of DDR5 RAM.> MI300X Accelerator: 192GB VRAM, 5.3 TB/s, ~1300 TFLOPS for FP16> Hardware: Baremetal node with 8 H100 SXM5 accelerators with NVLink, 160 CPU cores, and 1.2 TB of DDR5 RAM.> H100 SXM5 Accelerator: 80GB VRAM, 3.35 TB/s, ~986 TFLOPS for FP16I really wonder about the pricing. In theory the MI300X is supposed to be cheaper, but whether is that is really the case in practice remains to be seen.

评论 #40667592 未加载

评论 #40667765 未加载

chillee11 个月前

I'm skeptical of these benchmarks for a number of reasons.1. They're only comparing against VLLM, which isn't SOTA for latency-focused inference. For example, their vllm benchmark on 2 GPUs sees 102 tokens/s for BS=1, gpt-fast gets around 190 tok/s. <a href="https://github.com/pytorch-labs/gpt-fast">https://github.com/pytorch-labs/gpt-fast</a> 2. As others have pointed out, they're comparing H100 running with TP=2 vs. 2 AMD GPUs running independently.Specifically,> To make an accurate comparison between the systems with different settings of tensor parallelism, we extrapolate throughput for the MI300X by 2.This is uhh.... very misleading, for a number of reasons. For one, at BS=1, what does running with 2 GPUs even mean? Do they mean that they're getting the results for one AMD GPUs at BS=1 and then... doubling that? Isn't that just... running at BS=2?3. It's very strange to me that their throughput nearly doubles going from BS=1 to BS=2. MoE models have an interesting property that low amounts of batching doesn't actually significantly improve their throughput, and so on their Nvidia vllm benchmark they just go from 102 => 105 tokens/s throughput when going from BS=1 to BS=2. But on AMD GPUs they go from 142 to 280? That doesn't make any sense to me.

zxexz11 个月前

Is this an ad for a new, closed-source, GPGPU backend?

评论 #40701831 未加载

评论 #40667555 未加载

zhyder11 个月前

Shouldn't the right benchmark be performance per watt? It's easy enough to add more chips to do LLM training or inference in parallel.Maybe the benchmark should be performance per $... though I suspect power consumption will eclipse the cost of purchasing the chips from NVDA or AMD (and costs of chips will vary over time and with discounts). EDIT: was wrong on eclipsing; still am looking for a more durable benchmark (performance per billion transistors?) given it's suspected NVDA's chips are over-priced due to demand outstripping supply for now, and AMD's are under- to get a foothold in this market.

评论 #40670474 未加载

instagraham11 个月前

Given that a lot of projects are written or optimised for CUDA, would it require an industry shift if AMD were to become a competitive source of GPUs for AI training?

评论 #40667684 未加载

评论 #40668243 未加载

DrNosferatu11 个月前

The comparison is between setups with different amounts of GPU RAM and there's no quantification of final performance/price.

评论 #40668192 未加载

jvlake11 个月前

At this point in history were still at ROCm vs CUDA... Schmicko hardware is only as good as the software you can write for it.

nextworddev11 个月前

We need more competition in the training space, not inference.For consumer grade inference, there's already many options available.

KaoruAoiShiho11 个月前

Pretty bad benchmarks to the point of being deliberately misleading. They benchmarked vllm which is less than half the speed of the inference leader lmdeploy: <a href="https://bentoml.com/blog/benchmarking-llm-inference-backends" rel="nofollow">https://bentoml.com/blog/benchmarking-llm-inference-backends</a>They also used Flywheel for AMD while not bothering to turn on Flywheel for Nvidia, which is crazy since Flywheel improves Nvidia performance by 70%. <a href="https://mk1.ai/blog/flywheel-launch" rel="nofollow">https://mk1.ai/blog/flywheel-launch</a>In this context the 33% performance lead by AMD looks terrible, and straight up looks slower.

DarkmSparks11 个月前

hopper (H100) is the predecessor to the current blackwell architecture.This is a new AMD vs last generation nvidia benchmark.

评论 #40667729 未加载

robblbobbl11 个月前

1.Investing (wasting) the billions. 2. Receive downvotes on ycombinator lol

jvlake11 个月前

Cool story. How supported is OpenCL compared to CUDA again?

amelius11 个月前

Are these fabbed at the same process node?(Otherwise it's apples and oranges)

评论 #40668422 未加载

24 条评论

m_a_g11 个月前

评论 #40668588 未加载

评论 #40667590 未加载

评论 #40667643 未加载

qeternity11 个月前

Why the hell are we doing 128 input token benchmarks in 2024. This is not representative of most workloads, and prefill perf is incredibly important.

评论 #40667856 未加载

sva_11 个月前

I try to be optimistic about this. Competition is absolutely needed in this space - $NVDA market cap is insane right now, about $0.6 trillion more than the entire Frankfurt Stock Exchange.

评论 #40667676 未加载

评论 #40668995 未加载

评论 #40667804 未加载

评论 #40669466 未加载

评论 #40667841 未加载

mistymountains11 个月前

tgtweak11 个月前

评论 #40670637 未加载

评论 #40670698 未加载

评论 #40671313 未加载

michaelnny11 个月前

评论 #40668135 未加载

huntertwo11 个月前

评论 #40668811 未加载

评论 #40668494 未加载

评论 #40668288 未加载

mark_l_watson11 个月前

lccerina11 个月前

Without proper statistical metrics (why use average when 95% percentile is widely used?) and performance/watt this is a useless comparison.

评论 #40668647 未加载

评论 #40667993 未加载

iAkashPaul11 个月前

INT8/FP8 benchmarks would've been great, both cards could have loaded them with around 60GB VRAM instead of TP=2 on H100.

latchkey11 个月前

rjzzleep11 个月前

评论 #40667592 未加载

评论 #40667765 未加载

chillee11 个月前

zxexz11 个月前

Is this an ad for a new, closed-source, GPGPU backend?

评论 #40701831 未加载

评论 #40667555 未加载

zhyder11 个月前

评论 #40670474 未加载

instagraham11 个月前

Given that a lot of projects are written or optimised for CUDA, would it require an industry shift if AMD were to become a competitive source of GPUs for AI training?

评论 #40667684 未加载

评论 #40668243 未加载

DrNosferatu11 个月前

The comparison is between setups with different amounts of GPU RAM and there's no quantification of final performance/price.

评论 #40668192 未加载

jvlake11 个月前

At this point in history were still at ROCm vs CUDA... Schmicko hardware is only as good as the software you can write for it.

nextworddev11 个月前

We need more competition in the training space, not inference.For consumer grade inference, there's already many options available.

KaoruAoiShiho11 个月前

DarkmSparks11 个月前

hopper (H100) is the predecessor to the current blackwell architecture.This is a new AMD vs last generation nvidia benchmark.

评论 #40667729 未加载

robblbobbl11 个月前

1.Investing (wasting) the billions. 2. Receive downvotes on ycombinator lol

jvlake11 个月前

Cool story. How supported is OpenCL compared to CUDA again?

amelius11 个月前

Are these fabbed at the same process node?(Otherwise it's apples and oranges)

评论 #40668422 未加载