"TensorWave is a cloud provider specializing in AI workloads. Their platform leverages AMD’s Instinct™ MI300X accelerators, designed to deliver high performance for generative AI workloads and HPC applications."<p>I suggest taking the report with a grain of salt.
I try to be optimistic about this. Competition is absolutely needed in this space - $NVDA's market cap is insane right now, about $0.6 trillion more than that of the entire Frankfurt Stock Exchange.
I’m an AI scientist and train a lot of models. Personally I think AMD is undervalued relative to Nvidia. No, its chips aren’t as fast as Nvidia’s latest, and yes, there are some hoops to jump through to get things working. But for most workloads in most industries (ignoring for the moment that AI is likely a poor use of capital), it will be much more cost-effective and achieve about the same results.
The market (and selling price) reflects the perceived value of Nvidia's solution vs. AMD's - comprehensively, including tooling, software, TCO, and manageability.<p>Also curious how many companies are dropping that much money on those kinds of accelerators just to run 8x 7B-param models in parallel... You're also talking about being able to train a 14B model on a single accelerator. I'd be curious to see how "full-accelerator train and inference" workloads would look, i.e. training a 14B-param model, then inference throughput on a 4x14B workload.<p>AMD (and almost every other inference claim maker so far... Intel and Apple specifically) have consistently cherry-picked the benchmarks they claim a win on and ignored the remainder, which all show Nvidia in the lead - and they've used mid-gen comparison models, as many commenters here pointed out in this article.
I'm wondering if the tensor parallel settings have any impact on the performance. My naive guess is yes, but I'm not sure.<p>According to the article:
"""
AMD Configuration: Tensor parallelism set to 1 (tp=1), since we can fit the entire model Mixtral 8x7B in a single MI300X’s 192GB of VRAM.<p>NVIDIA Configuration: Tensor parallelism set to 2 (tp=2), which is required to fit Mixtral 8x7B in two H100’s 80GB VRAM.
"""
AMD seemingly has better hardware - but not the production capacity to compete with Nvidia yet. It will be interesting to see margins compress when real competition catches up.<p>Everybody thinks it’s CUDA that makes Nvidia the dominant player. It’s not - almost 40% of their revenue this year comes from mega-corporations that use their own custom stacks to interact with the GPUs. It’s only a matter of time before competition catches up and gives us cheaper GPUs.
A good start for AMD. I am also enthusiastic about another non-Nvidia inference option: Groq (which I sometimes use).<p>Nvidia relies on TSMC for manufacturing. Samsung is building competing manufacturing infrastructure, which is also a good thing, so that Taiwan is not a single point of failure.
We just got higher performance out of open source. No need for MK1.<p><a href="https://www.reddit.com/r/AMD_MI300/comments/1dgimxt/benchmarking_brilliance_single_amd_mi300x_vllm/" rel="nofollow">https://www.reddit.com/r/AMD_MI300/comments/1dgimxt/benchmar...</a>
> Hardware: TensorWave node equipped with 8 MI300X accelerators, 2 AMD EPYC CPU Processors (192 cores), and 2.3 TB of DDR5 RAM.<p>> MI300X Accelerator: 192GB VRAM, 5.3 TB/s, ~1300 TFLOPS for FP16<p>> Hardware: Baremetal node with 8 H100 SXM5 accelerators with NVLink, 160 CPU cores, and 1.2 TB of DDR5 RAM.<p>> H100 SXM5 Accelerator: 80GB VRAM, 3.35 TB/s, ~986 TFLOPS for FP16<p>I really wonder about the pricing. In theory the MI300X is supposed to be cheaper, but whether that is really the case in practice remains to be seen.
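The per-accelerator ratios implied by those quoted specs (simple arithmetic, nothing beyond the numbers above):

```python
# MI300X / H100 SXM5 ratios, from the specs quoted in the parent comment
bw_ratio    = 5.3 / 3.35     # memory bandwidth (TB/s)
flops_ratio = 1300 / 986     # FP16 TFLOPS
vram_ratio  = 192 / 80       # VRAM capacity (GB)

print(f"bandwidth {bw_ratio:.2f}x, FP16 {flops_ratio:.2f}x, VRAM {vram_ratio:.2f}x")
```

Decode-phase LLM inference tends to be memory-bandwidth-bound, so the ~1.58x bandwidth edge (plus the 2.4x capacity edge that lets the model avoid tensor parallelism) is where a per-chip win would plausibly come from - if the price is right.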
I'm skeptical of these benchmarks for a number of reasons.<p>1. They're only comparing against vLLM, which isn't SOTA for latency-focused inference. For example, their vLLM benchmark on 2 GPUs sees 102 tokens/s at BS=1, while gpt-fast gets around 190 tok/s. <a href="https://github.com/pytorch-labs/gpt-fast">https://github.com/pytorch-labs/gpt-fast</a>
2. As others have pointed out, they're comparing an H100 pair running with TP=2 vs. 2 AMD GPUs running independently.<p>Specifically,<p>> To make an accurate comparison between the systems with different settings of tensor parallelism, we extrapolate throughput for the MI300X by 2.<p>This is uhh.... very misleading, for a number of reasons. For one, at BS=1, what does running with 2 GPUs even mean? Do they mean that they're getting the results for one AMD GPU at BS=1 and then... doubling that? Isn't that just... running at BS=2?<p>3. It's very strange to me that their throughput nearly doubles going from BS=1 to BS=2. MoE models have an interesting property that low amounts of batching don't actually significantly improve their throughput, and so on their Nvidia vLLM benchmark they just go from 102 => 105 tokens/s when going from BS=1 to BS=2. But on AMD GPUs they go from 142 to 280? That doesn't make any sense to me.
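To make point 3 concrete, here's a toy model of why MoE decode throughput shouldn't double at BS=2. It assumes (my assumptions, not the article's) uniform random top-2 routing over 8 experts and decode time dominated by streaming expert weights; real routers aren't uniform, so treat this as a rough bound:

```python
from math import comb

NUM_EXPERTS = 8   # Mixtral 8x7B: 8 experts per layer
TOP_K = 2         # each token is routed to 2 experts

def expected_distinct_experts(batch_size: int) -> float:
    """Expected distinct experts touched per layer, under the
    assumed uniform random top-2 routing model."""
    # P(one token misses a given expert) = C(7,2)/C(8,2) = 0.75
    p_miss = comb(NUM_EXPERTS - 1, TOP_K) / comb(NUM_EXPERTS, TOP_K)
    return NUM_EXPERTS * (1 - p_miss ** batch_size)

# If decode time scales with expert weights streamed, the throughput
# gain from batching is roughly B * distinct(1) / distinct(B):
for bs in (1, 2):
    gain = bs * expected_distinct_experts(1) / expected_distinct_experts(bs)
    print(f"BS={bs}: ~{expected_distinct_experts(bs):.2f} experts, gain ~{gain:.2f}x")
```

A ~1.14x gain at BS=2 lines up with the 102 => 105 tok/s Nvidia result; a clean 2x (142 => 280) looks more like two independent BS=1 runs added together.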
Shouldn't the right benchmark be performance per watt? It's easy enough to add more chips to do LLM training or inference in parallel.<p>Maybe the benchmark should be performance per $... though I suspect power consumption will eclipse the cost of purchasing the chips from NVDA or AMD (and the costs of chips will vary over time and with discounts). EDIT: I was wrong on eclipsing; I'm still looking for a more durable benchmark (performance per billion transistors?), given it's suspected NVDA's chips are over-priced due to demand outstripping supply for now, and AMD's are under-priced to get a foothold in this market.
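For what it's worth, a crude tokens-per-watt sketch using the BS=1 figures quoted elsewhere in the thread (142 tok/s on one MI300X, 102 tok/s on two H100s at tp=2) and nominal board powers - the 750 W / 700 W TDPs are my assumptions, not measured draw:

```python
# Crude perf-per-watt comparison; nominal TDPs, not measured power.
MI300X_TDP_W, H100_TDP_W = 750, 700        # assumed board powers

amd_tok_per_w  = 142 / MI300X_TDP_W        # one MI300X serving the whole model
nvda_tok_per_w = 102 / (2 * H100_TDP_W)    # two H100s sharing it via tp=2

print(f"MI300X: {amd_tok_per_w:.3f} tok/s/W, 2x H100: {nvda_tok_per_w:.3f} tok/s/W")
```

Under these assumptions the single-chip deployment wins mostly because it needs half the silicon powered up, not because of any per-transistor efficiency claim.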
Given that a lot of projects are written or optimised for CUDA, would it require an industry shift if AMD were to become a competitive source of GPUs for AI training?
Pretty bad benchmarks, to the point of being deliberately misleading. They benchmarked vLLM, which is less than half the speed of the inference leader lmdeploy: <a href="https://bentoml.com/blog/benchmarking-llm-inference-backends" rel="nofollow">https://bentoml.com/blog/benchmarking-llm-inference-backends</a><p>They also used Flywheel for AMD while not bothering to turn on Flywheel for Nvidia, which is crazy since Flywheel improves Nvidia performance by 70%. <a href="https://mk1.ai/blog/flywheel-launch" rel="nofollow">https://mk1.ai/blog/flywheel-launch</a><p>In this context AMD's 33% performance lead looks terrible, and it's straight up slower once you account for that.