According to the company, the new chip will enable training of AI models with up to 24 trillion parameters. Let me repeat that, in case you're as excited as I am: <i>24. Trillion. Parameters.</i> For comparison, the largest AI models currently in use have around 0.5 trillion parameters, which makes them roughly 48x smaller.<p>Each parameter is a <i>connection between artificial neurons</i>. For example, inside an AI model, a linear layer that transforms an input vector with 1024 elements into an output vector with 2048 elements has 1024×2048 = ~2M parameters in a weight matrix. Each parameter specifies how much a given element of the input vector contributes to (or subtracts from) a given element of the output vector. Each output vector element is a weighted sum (AKA a linear combination) of all the input vector elements (there's a tiny NumPy sketch at the end of this comment if you want to see that concretely).<p>A human brain has an estimated 100-500 trillion synapses connecting biological neurons. Each synapse is quite a complicated biological structure[a], but if we oversimplify things and assume that every synapse can be modeled as a single parameter in a weight matrix, then the largest AI models in use today have roughly 200x to 1000x (that's 100T to 500T divided by 0.5T) fewer connections between neurons than the human brain. If the company's claims prove true, this new chip will enable training of AI models that have only about 4x to 20x fewer connections than the human brain.<p>We sure live in interesting times!<p>---<p>[a] <a href="https://en.wikipedia.org/wiki/Synapse" rel="nofollow">https://en.wikipedia.org/wiki/Synapse</a>
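<p>PS: if the "weighted sum" part is easier to see in code, here's a minimal NumPy sketch of the 1024-to-2048 example layer from above. The random weights, the missing bias term, and the layer sizes are just there to illustrate the arithmetic in this comment, not anything specific to the chip or to any real model:<p><pre><code>  import numpy as np

  in_dim, out_dim = 1024, 2048
  rng = np.random.default_rng(0)

  # One parameter per (input element, output element) pair.
  W = rng.standard_normal((out_dim, in_dim))
  print(W.size)                    # 2097152, i.e. the ~2M parameters mentioned above

  x = rng.standard_normal(in_dim)  # input vector with 1024 elements
  y = W @ x                        # each y[i] is a weighted sum of all the x[j]
  print(y.shape)                   # (2048,)

  # Same back-of-the-envelope synapse ratios as in the comment:
  print(100e12 / 0.5e12, 500e12 / 0.5e12)  # 200.0  1000.0  -> today's ~0.5T models
  print(100e12 / 24e12, 500e12 / 24e12)    # ~4.2   ~20.8   -> a 24T-parameter model
</code></pre>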