Intel Gaudi 3 AI Accelerator

435 points by goldemerald, about 1 year ago

31 comments

mk_stjames, about 1 year ago

One nice thing about this (and the new offerings from AMD) is that they will be using the "open accelerator module" (OAM) interface, which standardizes the connector used to put them on baseboards, similar to Nvidia's SXM connections that use MegArray connectors to their baseboards.

With Nvidia, the SXM connection pinouts have always been held proprietary and confidential. For example, P100s and V100s have standard PCIe lanes connected to one of the two sides of their MegArray connectors, and if you knew that pinout you could literally build PCIe cards with SXM2/3 connectors to repurpose those now-obsolete chips (this has been done by one person).

There are thousands, maybe tens of thousands, of P100s you could pick up for literally <$50 apiece these days, which technically gives you more Tflops/$ than anything on the market, but they are useless: their interface was never made open, it has not been openly reverse engineered, and the OEM baseboards (mainly Dell and Supermicro) are still hideously expensive outside China.

I'm one of those people who finds 'retro-supercomputing' a cool hobby, so open interfaces like OAM mean these devices may actually have a life for hobbyists in 8-10 years instead of being sent directly to the bins due to secret interfaces and obfuscated backplane specifications.

neilmovva, about 1 year ago

A bit surprised that they're using HBM2e, which is what the Nvidia A100 (80GB) used back in 2020. But Intel is using 8 stacks here, so Gaudi 3 achieves total bandwidth (3.7 TB/s) comparable to the H100 (3.4 TB/s), which uses 5 stacks of HBM3. Hopefully the older HBM has better supply; HBM3 is hard to get right now!

The Gaudi 3 multi-chip package also looks interesting. I see 2 central compute dies, 8 HBM die stacks, and then 6 small dies interleaved between the HBM stacks. Curious to know whether those are also functional, or just structural elements for mechanical support.
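
For concreteness, here is the per-stack bandwidth implied by those totals (my own arithmetic from the figures quoted above):

```python
# Per-stack bandwidth implied by the totals quoted in the comment above.
configs = {
    "Gaudi 3 (8 stacks HBM2e)": (3.7, 8),  # total TB/s, number of stacks
    "H100 (5 stacks HBM3)": (3.4, 5),
}
for name, (total_tb_s, stacks) in configs.items():
    print(f"{name}: {total_tb_s / stacks:.2f} TB/s per stack")
# Gaudi 3: 0.46 TB/s per HBM2e stack; H100: 0.68 TB/s per HBM3 stack,
# i.e. the older, slower HBM2e is compensated for by using more stacks.
```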

kylixz, about 1 year ago

This is a bit snarky, but will Intel actually keep this product line alive for more than a few years? Having been bitten by building products around some of their non-x86 offerings, where they killed off good IP and then failed to support it, I'm skeptical.

I truly do hope it is successful so we can have some alternative accelerators.

riskable, about 1 year ago

> Twenty-four 200 gigabit (Gb) Ethernet ports are integrated into every Intel Gaudi 3 accelerator

WHAT‽ It's basically got the equivalent of a 24-port, 200-gigabit switch built into it. How does that make sense? Can you imagine stringing 24 Cat 8 cables between servers in a single rack? Wait: how do you even *decide* where those cables go? Do you buy 24 Gaudi 3 accelerators and run cables directly between every single one of them so they can all talk 200-gigabit Ethernet to each other?

Also: if you've got that many Cat 8 cables coming out the back of the thing, *how do you even access it*? You'll have to unplug half of them (better keep track of which was connected to what port!) just to be able to grab the shell of the device in the rack. 24 ports is usually enough to take up the majority of horizontal space in a rack, so maybe this thing requires a minimum of 2-4U just to use it? That would make more sense, but it doesn't help in the density department.

I'm imagining a lot of orders for "a gradient" of colors of cables so the data center folks wiring the things can keep track of which cable is supposed to go where.
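
For scale, the aggregate I/O implied by that spec (straight multiplication, nothing vendor-specific):

```python
# Aggregate bandwidth of the 24 integrated 200 GbE ports.
ports = 24
gbit_per_port = 200
total_gbit = ports * gbit_per_port   # 4800 Gb/s
total_gbyte = total_gbit / 8         # 600 GB/s
print(f"{total_gbit} Gb/s total, i.e. {total_gbyte:.0f} GB/s of NIC bandwidth per accelerator")
```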

sairahul82, about 1 year ago

Can we expect the price of the 'Gaudi 3 PCIe' card to be reasonable enough to put in a workstation? That would be a game changer for local LLMs.

rileyphone, about 1 year ago

128GB in one chip seems important with the rise of sparse architectures like MoE. Hopefully these are competitive with Nvidia's offerings, though in the end they will be competing for the same fab space as Nvidia, if I'm not mistaken.
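
Rough weight-capacity arithmetic for that 128 GB figure (my own back-of-the-envelope, ignoring activations and KV cache):

```python
# How many parameters fit in 128 GB of device memory, counting weights only.
capacity_gb = 128
bytes_per_param = {"fp32": 4, "fp16/bf16": 2, "int8": 1}
for dtype, nbytes in bytes_per_param.items():
    print(f"{dtype}: ~{capacity_gb / nbytes:.0f}B parameters")
# fp32: ~32B, fp16/bf16: ~64B, int8: ~128B -- comfortable headroom for a
# large MoE's weights on one device, before counting runtime overheads.
```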

kaycebasques, about 1 year ago

Wow, I very much appreciate the use of the 5 Ws and H [1] in this announcement. Thank you, Intel, for not subjecting my eyes to corporate BS.

[1] https://en.wikipedia.org/wiki/Five_Ws

latchkey, about 1 year ago

> the only MLPerf-benchmarked alternative for LLMs on the market

I hope to work on this for the AMD MI300x soon. My company just got added to the MLCommons organization.

yieldcrv, about 1 year ago

Has anyone here bought an AI accelerator to run their AI SaaS service from their home to customers, instead of trying to make a profit on top of OpenAI or Replicate?

Seems like an okay $8,000 - $30,000 investment, and bare metal server maintenance isn't that complicated these days.

1024core, about 1 year ago

> Memory Boost for LLM Capacity Requirements: 128 gigabytes (GB) of HBMe2 memory capacity, 3.7 terabytes (TB) of memory bandwidth ...

I didn't know "terabytes (TB)" was a unit of memory bandwidth...

throwaway4good, about 1 year ago

Worth noting that it is fabbed by TSMC.

InvestorType, about 1 year ago

This appears to be manufactured by TSMC (or Samsung). The press release says it will use a 5nm process, which is not on Intel's roadmap.

"The Intel Gaudi 3 accelerator, architected for efficient large-scale AI compute, is manufactured on a 5 nanometer (nm) process"

geertj, about 1 year ago

I wonder if someone knowledgeable could comment on oneAPI vs CUDA. I feel like if Intel is going to be a serious competitor to Nvidia, software and hardware are going to be equally important.

einpoklum, about 1 year ago

If your metric is memory bandwidth or memory size, then this announcement gives you some concrete information. But suppose my metric for performance is matrix-multiply-add (or just matrix-multiply) throughput. What MMA primitives does Gaudi offer (i.e. type combinations and matrix dimension combinations), and how many of such ops per second can it do in practice? The linked page says "64,000 in parallel", but that does not actually tell me much.
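
For reference, the standard way to turn a MAC-unit count into a FLOP/s figure; the clock below is an assumption, since the announcement states neither a clock nor the op mix:

```python
# Generic peak matrix-engine throughput: 2 * MACs per cycle * clock.
# "64,000 in parallel" is read here as 64,000 MACs per cycle -- an assumption.
macs_per_cycle = 64_000
clock_hz = 1.5e9  # hypothetical clock; Intel's page does not state one
peak_flops = 2 * macs_per_cycle * clock_hz  # multiply and add counted separately
print(f"~{peak_flops / 1e12:.0f} TFLOP/s under these assumptions")
```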

alecco, about 1 year ago

Gaudi 3 has PCIe 4.0 (vs. PCIe 5.0 on the H100, which has twice the bandwidth). Probably not a deal-breaker, but it's strange for Intel (of all vendors) to lag behind in PCIe.
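
The standard x16 numbers behind that comparison (generic PCIe figures, not from the article):

```python
# Usable bandwidth of an x16 link: transfer rate * lanes * 128b/130b encoding.
lanes = 16
for gen, gt_per_lane in {"PCIe 4.0": 16, "PCIe 5.0": 32}.items():
    gb_s = gt_per_lane * lanes * (128 / 130) / 8  # GB/s, per direction
    print(f"{gen} x16: ~{gb_s:.0f} GB/s")
# PCIe 4.0 x16: ~32 GB/s; PCIe 5.0 x16: ~63 GB/s
```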

ancharm, about 1 year ago

Is the scheduling / bare-metal software open source through oneAPI? If so, can someone post a link showing it?

cavisne, about 1 year ago

Is there an equivalent to this reference for Intel Gaudi?

https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#

AnonMO, about 1 year ago

It's crazy that Intel can't manufacture its own chips at the moment, but it looks like that might change in the coming years as new fabs come online.

colechristensen, about 1 year ago

Anyone have experience and suggestions for an AI accelerator?

Think prototype consumer product with total cost preferably <$500, definitely less than $1,000.

MrYellowP, about 1 year ago

https://www.dwds.de/wb/Gaudi

That's amusing. :D

sandGorgon, about 1 year ago

> *Intel Gaudi software integrates the PyTorch framework and provides optimized Hugging Face community-based models – the most-common AI framework for GenAI developers today. This allows GenAI developers to operate at a high abstraction level for ease of use and productivity and ease of model porting across hardware types.*

What is the programming interface here? This is not CUDA, right? So how is this being done?
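
To the question above: Gaudi is programmed through Habana's SynapseAI stack, which plugs into PyTorch as a custom device rather than through CUDA. A minimal sketch, assuming the habana_frameworks package is installed (module path and lazy-mode behavior vary by release):

```python
# Minimal PyTorch-on-Gaudi sketch using Habana's plugin, not CUDA.
import torch
import habana_frameworks.torch.core as htcore  # registers the "hpu" device

device = torch.device("hpu")
model = torch.nn.Linear(1024, 1024).to(device)
x = torch.randn(8, 1024, device=device)
y = model(x)
htcore.mark_step()  # in lazy mode, flushes the accumulated graph for execution
print(y.shape)
```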

chessgecko, about 1 year ago

I feel a little misled by the speedup numbers. They compare lower-batch-size H100/H200 numbers to higher-batch-size Gaudi 3 numbers for throughput (which improves heavily with batch size). I feel like there are some inference scenarios where this is better, but it's really hard to tell from the numbers in the paper.
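
A toy latency model showing why batch size dominates throughput comparisons (illustrative constants, not measurements from either chip):

```python
# Per-step time = fixed overhead (weight streaming) + per-sample compute.
# Throughput grows with batch size until compute dominates the overhead.
overhead_ms, per_sample_ms = 20.0, 1.0  # illustrative only
for batch in (1, 8, 32, 128):
    step_ms = overhead_ms + per_sample_ms * batch
    print(f"batch={batch:4d}: {batch / (step_ms / 1000):7.0f} samples/s")
# batch=1: ~48/s vs batch=128: ~865/s -- a >17x spread from batching alone,
# easily larger than any real difference between two accelerators.
```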

andersa, about 1 year ago

Price?

amelius, about 1 year ago

Missing from these pictures are the thermal management solutions.

KeplerBoy, about 1 year ago

Vector floating-point performance comes in at 14 TFLOP/s for FP32 and 28 TFLOP/s for FP16.

Not the best of times for stuff that doesn't fit matrix processing units.

mpreda, about 1 year ago

How much does one such card cost?

metadat, about 1 year ago

> *Twenty-four 200 gigabit (Gb) Ethernet ports are integrated into every Intel Gaudi 3 accelerator*

How much does a single 200Gbit active (or passive) fiber cable cost? Probably thousands of dollars, making even the cabling for each card Very Expensive. Never mind the network switches themselves.

Simultaneously impressive and disappointing.

YetAnotherNick, about 1 year ago

So now hardware companies have stopped reporting FLOP/s numbers and instead report an arbitrary unit of parallel operations per second.

m3kw9, about 1 year ago

Can you run CUDA on it?

brcmthrowaway, about 1 year ago

Does this support Apple silicon?

whalesalad, about 1 year ago

https://www.merriam-webster.com/dictionary/gaudy