
Ask HN: How much performance can be gained by etching LLM weights into hardware?

3 points | by hexomancer | 5 months ago
I am not very familiar with hardware design, so I would really appreciate it if someone with knowledge in this area could tell me how much performance we could gain by creating LLM-specific inference hardware. I don't mean, e.g., a chip optimized for general transformers; I mean going beyond that and hard-coding the weights of a trained model into the hardware.

2 comments

mikewarot | 5 months ago
Half of the bits in any weight will be zero, on average, so those bits of the multiply chains can be removed. Lots of optimization could take place.

If you're going to go for the absolute maximum performance, you're going to convert an entire layer from multiply-accumulates, etc., to a directed acyclic graph of bitwise logical operations (and, or, xor, nor, nand, etc.), then optimize out all of the gates you possibly can before building it into a part of the ASIC. In theory, you could get 100% utilization of the chip area, and one token per clock cycle out. Your limiting factor is going to be power consumption, as 50% of the gates will be toggling every clock (on average).

Nobody will do this, though... because developing an ASIC takes 6 months to a year, and the chip would be completely useless for anything else.

You could get close with a huge grid of LUTs that only talks to its neighbors. It could compute the optimized graph from above, or any other, while keeping all the wires short, and thus all the capacitances low, and thus lower power and higher frequency.
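The zero-bit savings in the first paragraph can be made concrete with a small sketch (function names are my own, not from the comment): multiplying by a hard-coded weight reduces to one shifted addend per set bit of the weight, so zero bits contribute no hardware at all, and a uniformly random 8-bit weight needs only about four adders instead of eight.

```python
import random

def shift_add_terms(weight: int, bits: int = 8) -> list[int]:
    """Shift amounts of the partial products needed to multiply by a
    hard-coded weight. Each set bit contributes one shifted addend;
    zero bits contribute nothing, so their adder chains vanish."""
    return [i for i in range(bits) if (weight >> i) & 1]

def multiply_fixed(x: int, weight: int, bits: int = 8) -> int:
    # The "hardware" for this one weight: a fixed set of shifts and adds.
    return sum(x << i for i in shift_add_terms(weight, bits))

random.seed(0)
weights = [random.randrange(256) for _ in range(10_000)]

# Sanity check: the shift-and-add network computes an ordinary multiply.
assert all(multiply_fixed(3, w) == 3 * w for w in weights)

# On average, about half the weight bits are set, so roughly half the
# partial-product adders disappear compared to a general multiplier.
avg_adders = sum(len(shift_add_terms(w)) for w in weights) / len(weights)
print(f"average adders per 8-bit weight: {avg_adders:.2f} of 8")
```

This is only the first, easiest step; the comment's full proposal goes further, folding entire layers into one optimized gate graph rather than per-weight multipliers.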
bjourne | 5 months ago
Not a lot. The limiting factor is the wiring, not the size of the elements themselves. So one bit of ROM might be much smaller than one bit of RAM, but it's irrelevant because the size and length of the wires transferring that bit to the processing elements remain the same.