
Ask HN: How much performance can be gained by etching LLM weights into hardware?

3 points | by hexomancer | 5 months ago
I am not very familiar with hardware design, so I would really appreciate it if someone with knowledge in this area could tell me how much performance we could gain by creating LLM-specific inference hardware. I don't mean, e.g., a chip optimized for general transformers; I mean going beyond that and hard-coding the weights of a trained model into the hardware.

2 comments

mikewarot | 5 months ago
Half of the bits in any weight will be zero, on average, so those bits of the multiply chains can be removed. Lots of optimization could take place.

If you're going to go for the absolute maximum performance, you're going to convert an entire layer from multiply-accumulates, etc., to a directed acyclic graph of bitwise logical operations (and, or, xor, nor, nand, etc.), then optimize out all of the gates you possibly can before building it into a part of the ASIC. In theory, you could get 100% utilization of the chip area, and one token per clock cycle out. Your limiting factor is going to be power consumption, as 50% of the gates will be toggling every clock (on average).

Nobody will do this, though... because developing an ASIC takes 6 months to a year, and the chip would be completely useless for anything else.

You could get close with a huge grid of LUTs that only talks to its neighbors. It could compute the optimized graph from above, or any other, while keeping all the wires short, and thus all the capacitances low, and thus lower power and higher frequency.
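The zero-bit savings in the first paragraph can be made concrete with a small sketch (function names are my own, not from the comment): multiplying by a hard-coded weight reduces to one shifted addend per set bit of the weight, so zero bits contribute no hardware at all, and a uniformly random 8-bit weight needs only about four adders instead of eight.

```python
import random

def shift_add_terms(weight: int, bits: int = 8) -> list[int]:
    """Shift amounts of the partial products needed to multiply by a
    hard-coded weight. Each set bit contributes one shifted addend;
    zero bits contribute nothing, so their adder chains vanish."""
    return [i for i in range(bits) if (weight >> i) & 1]

def multiply_fixed(x: int, weight: int, bits: int = 8) -> int:
    # The "hardware" for this one weight: a fixed set of shifts and adds.
    return sum(x << i for i in shift_add_terms(weight, bits))

random.seed(0)
weights = [random.randrange(256) for _ in range(10_000)]

# Sanity check: the shift-and-add network computes an ordinary multiply.
assert all(multiply_fixed(3, w) == 3 * w for w in weights)

# On average, about half the weight bits are set, so roughly half the
# partial-product adders disappear compared to a general multiplier.
avg_adders = sum(len(shift_add_terms(w)) for w in weights) / len(weights)
print(f"average adders per 8-bit weight: {avg_adders:.2f} of 8")
```

This is only the first, easiest step; the comment's full proposal goes further, folding entire layers into one optimized gate graph rather than per-weight multipliers.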
bjourne | 5 months ago
Not a lot. The limiting factor is the wiring, not the size of the elements themselves. So one bit of ROM might be much smaller than one bit of RAM, but it's irrelevant because the size and length of the wires transferring that bit to the processing elements remain the same.