One nice thing about this (and the new offerings from AMD) is that they will be using the "open accelerator module (OAM)" interface, which standardizes the connector used to mount them on baseboards, similar to Nvidia's SXM connections, which use MegArray connectors to their baseboards.<p>With Nvidia, the SXM connection pinouts have always been held proprietary and confidential. For example, P100s and V100s have standard PCI-e lanes connected to one of the two sides of their MegArray connectors, and if you know that pinout you could literally build PCI-e cards with SXM2/3 connectors to repurpose those now-obsolete chips (this has been done by one person).<p>There are thousands, maybe tens of thousands, of P100s you could pick up for literally <$50 apiece these days, which technically gives you more Tflops/$ than anything on the market, but they are useless because their interface was never made open, it has not been openly reverse engineered, and the OEM baseboards (mainly Dell and Supermicro) are still hideously expensive outside China.<p>I'm one of those people who finds 'retro-super-computing' a cool hobby, so interfaces like OAM being open means these devices may actually have a life for hobbyists in 8~10 years, instead of being sent straight to the bin due to secret interfaces and obfuscated backplane specifications.
A bit surprised that they're using HBM2e, which is what Nvidia A100 (80GB) used back in 2020. But Intel is using 8 stacks here, so Gaudi 3 achieves comparable total bandwidth (3.7TB/s) to H100 (3.4TB/s) which uses 5 stacks of HBM3. Hopefully the older HBM has better supply - HBM3 is hard to get right now!<p>The Gaudi 3 multi-chip package also looks interesting. I see 2 central compute dies, 8 HBM die stacks, and then 6 small dies interleaved between the HBM stacks - curious to know whether those are also functional, or just structural elements for mechanical support.
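A rough per-stack comparison, just dividing the total bandwidth figures quoted above by the stack counts (back-of-the-envelope, using the approximate numbers from the announcement and Nvidia's specs):

```python
# Back-of-the-envelope per-stack HBM bandwidth, using the totals quoted above.
gaudi3_total_tb_s, gaudi3_stacks = 3.7, 8   # 8 stacks of HBM2e
h100_total_tb_s, h100_stacks = 3.4, 5       # 5 stacks of HBM3

print(f"Gaudi 3: {gaudi3_total_tb_s / gaudi3_stacks:.2f} TB/s per stack")  # ~0.46
print(f"H100:    {h100_total_tb_s / h100_stacks:.2f} TB/s per stack")      # ~0.68
```

So each HBM2e stack delivers roughly two-thirds the bandwidth of an HBM3 stack, and Intel makes up the difference by using more of them, which is also where the 128GB capacity comes from.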
This is a bit snarky — but will Intel actually keep this product line alive for more than a few years? Having been bitten by building products around some of their non-x86 offerings where they killed good IP off and then failed to support it… I’m skeptical.<p>I truly do hope it is successful so we can have some alternative accelerators.
> Twenty-four 200 gigabit (Gb) Ethernet ports are integrated into every Intel Gaudi 3 accelerator<p>WHAT‽ It's basically got the equivalent of a 24-port, 200-gigabit switch built into it. How does that make sense? Can you imagine stringing 24 Cat 8 cables between servers in a single rack? Wait: how do you even <i>decide</i> where those cables go? Do you buy 24 Gaudi 3 accelerators and run cables directly between every single one of them so they can all talk 200-gigabit Ethernet to each other?<p>Also: if you've got that many Cat 8 cables coming out the back of the thing, <i>how do you even access it</i>? You'll have to unplug half of them (better keep track of which was connected to what port!) just to be able to grab the shell of the device in the rack. 24 ports is usually enough to take up the majority of horizontal space in a rack, so maybe this thing requires a minimum of 2-4U just to use it? That would make more sense, but it doesn't help in the density department.<p>I'm imagining a lot of orders for "a gradient" of colors of cables so the data center folks wiring the things can keep track of which cable is supposed to go where.
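For scale, a quick tally of the aggregate link bandwidth those ports add up to (just arithmetic on the announced figures):

```python
# Aggregate Ethernet bandwidth per accelerator, from the announced port count and speed.
ports = 24
gbit_per_port = 200

total_gbit = ports * gbit_per_port   # 4800 Gb/s
total_gbyte = total_gbit / 8         # 600 GB/s

print(f"{total_gbit} Gb/s aggregate (~{total_gbyte:.0f} GB/s)")
```

My guess is that much bandwidth leaves the chassis through a few high-density optical cages or a baseboard fabric rather than 24 separate copper runs per card, but that's speculation on my part.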
128GB in one chip seems important with the rise of sparse architectures like MoE. Hopefully these are competitive with Nvidia's offerings, though in the end they will be competing for the same fab space as Nvidia if I'm not mistaken.
Wow, I very much appreciate the use of the 5 Ws and H [1] in this announcement. Thank you Intel for not subjecting my eyes to corp BS<p>[1] <a href="https://en.wikipedia.org/wiki/Five_Ws" rel="nofollow">https://en.wikipedia.org/wiki/Five_Ws</a>
> the only MLPerf-benchmarked alternative for LLMs on the market<p>I hope to work on this for AMD MI300x soon. My company just got added to the MLCommons organization.
Has anyone here bought an AI accelerator to run their AI SaaS service from their home to customers, instead of trying to make a profit on top of OpenAI or Replicate?<p>Seems like an okay $8,000 - $30,000 investment, and bare-metal server maintenance isn’t that complicated these days.
> Memory Boost for LLM Capacity Requirements: 128 gigabytes (GB) of HBMe2 memory capacity, 3.7 terabytes (TB) of memory bandwidth ...<p>I didn't know "terabytes (TB)" was a unit of memory bandwidth...
This appears to be manufactured by TSMC (or Samsung). The press release says it will use a 5nm process, which is not on Intel's roadmap.<p>"The Intel Gaudi 3 accelerator, architected for efficient large-scale AI compute, is manufactured on a 5 nanometer (nm) process"
I wonder if someone knowledgeable could comment on OneAPI vs Cuda. I feel like if Intel is going to be a serious competitor to Nvidia, both software and hardware are going to be equally important.
If your metric is memory bandwidth or memory size, then this announcement gives you some concrete information. But - suppose my metric for performance is matrix-multiply-add (or just matrix-multiply) bandwidth. What MMA primitives does Gaudi offer (i.e. type combinations and matrix dimension combinations), and how many of such ops per second, in practice? The linked page says "64,000 in parallel", but that does not actually tell me much.
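To illustrate why a bare unit count doesn't pin anything down: assuming "64,000 in parallel" means 64,000 MAC units (an assumption on my part; the page doesn't define it), the peak throughput still depends entirely on clock speed and data type:

```python
# Hypothetical back-of-envelope: peak throughput if "64,000 in parallel"
# means 64,000 MAC units (an assumption - the announcement doesn't define the term).
mac_units = 64_000
ops_per_mac = 2            # one multiply + one add per MAC

for clock_ghz in (1.0, 1.5, 2.0):
    tflops = mac_units * ops_per_mac * clock_ghz * 1e9 / 1e12
    print(f"{clock_ghz} GHz -> {tflops:.0f} TFLOPS peak")
```

And even that says nothing about which type combinations (FP8, BF16, etc.) or matrix dimensions the engines support natively, which is the actual question.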
Gaudi 3 has PCIe 4.0 (vs. PCIe 5.0 on the H100, i.e. twice the bandwidth). Probably not a deal-breaker, but it's strange for Intel (of all vendors) to lag behind on PCIe.
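For reference, approximate usable x16 bandwidth per direction (rough numbers, ignoring protocol overhead beyond line encoding):

```python
# Approximate PCIe x16 bandwidth per direction, accounting for 128b/130b encoding.
lanes = 16
for gen, gt_per_s in {"PCIe 4.0": 16, "PCIe 5.0": 32}.items():
    gb_per_s = gt_per_s * (128 / 130) / 8 * lanes   # GT/s -> GB/s per lane, times 16 lanes
    print(f"{gen} x16: ~{gb_per_s:.0f} GB/s per direction")   # ~32 vs ~63
```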
Is there an equivalent to this reference for Intel Gaudi?<p><a href="https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#" rel="nofollow">https://docs.nvidia.com/cuda/parallel-thread-execution/index...</a>
Anyone have experience and suggestions for an AI accelerator?<p>Think prototype consumer product with total cost preferably < $500, definitely less than $1000.
><i>Intel Gaudi software integrates the PyTorch framework and provides optimized Hugging Face community-based models – the most-common AI framework for GenAI developers today. This allows GenAI developers to operate at a high abstraction level for ease of use and productivity and ease of model porting across hardware types. </i><p>What is the programming interface here? This is not CUDA, right? So how is this being done?
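Not CUDA, no. As far as I can tell from Intel's Gaudi (formerly Habana SynapseAI) software stack, the device shows up in stock PyTorch as an "hpu" device via a bridge package, and a graph compiler maps the ops onto the matrix/tensor engines. A minimal sketch of what that looks like (module and API names as in the Habana PyTorch bridge docs; details may differ across releases):

```python
import torch
import habana_frameworks.torch.core as htcore  # Gaudi's PyTorch bridge, ships with the Gaudi software stack

device = torch.device("hpu")                    # Gaudi is exposed as the "hpu" device

model = torch.nn.Linear(128, 10).to(device)     # toy model, just to show the flow
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

x = torch.randn(32, 128, device=device)         # synthetic batch
y = torch.randint(0, 10, (32,), device=device)

optimizer.zero_grad()
loss = torch.nn.functional.cross_entropy(model(x), y)
loss.backward()
htcore.mark_step()   # in lazy mode, flushes the accumulated graph to the device
optimizer.step()
htcore.mark_step()
```

So day to day you stay at the framework level; as I understand it, custom kernels are written in a C-derived TPC kernel language rather than anything CUDA-like.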
I feel a little misled by the speedup numbers. They are comparing lower batch size H100/H200 numbers to higher batch size Gaudi 3 numbers for throughput (which is heavily improved by increasing batch size). I feel like there are some inference scenarios where this is better, but it's really hard to tell from the numbers in the paper.
Vector floating point performance comes in at 14 Tflop/s for FP32 and 28 Tflop/s for FP16.<p>Not the best of times for stuff that doesn't fit matrix processing units.
<i>> Twenty-four 200 gigabit (Gb) Ethernet ports are integrated into every Intel Gaudi 3 accelerator</i><p>How much does a single 200Gbit active (or passive) cable cost? Probably thousands of dollars, making even the cabling for each card Very Expensive. Never mind the network switches themselves...<p>Simultaneously impressive and disappointing.