Half of the bits in any weight will be zero, on average, so with the weights fixed in hardware, those bits of the multiply chains can be removed. A lot more optimization could take place on top of that.<p>If you're going for the absolute maximum performance, you convert an entire layer from multiply-accumulates, etc. into a directed acyclic graph of bitwise logic operations (AND, OR, XOR, NOR, NAND, etc.), then optimize out every gate you possibly can before building it into a part of the ASIC. In theory, you could get 100% utilization of the chip area and one token out per clock cycle. Your limiting factor is going to be power consumption, since 50% of the gates will be toggling every clock (on average).<p>Nobody will do this, though, because developing an ASIC takes 6 months to a year, and the chip would be completely useless for anything else.<p>You could get close with a huge grid of LUTs that only talk to their neighbors. It could compute the optimized graph from above, or any other, while keeping all the wires short, which keeps the capacitances low and so allows lower power and higher frequency.
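<p>To make the first point concrete, here is a rough Python sketch of multiplying by a fixed weight using only shifts and adds, one per set bit of the weight. It assumes unsigned integer weights, and the function names (nonzero_bit_positions, constant_multiply, adder_count) are made up for illustration; real hardware would also deal with signed formats and smarter encodings.

```python
def nonzero_bit_positions(weight: int) -> list[int]:
    """Bit positions where the constant weight has a 1 (assumes weight >= 0)."""
    return [i for i in range(weight.bit_length()) if (weight >> i) & 1]


def constant_multiply(x: int, weight: int) -> int:
    """Multiply x by a fixed weight using one shift-and-add per set bit.

    Zero bits of the weight contribute nothing, so their partial
    products are never generated at all.
    """
    acc = 0
    for i in nonzero_bit_positions(weight):
        acc += x << i  # partial product for bit i of the weight
    return acc


def adder_count(weight: int) -> int:
    """Number of partial products that survive for this weight."""
    return len(nonzero_bit_positions(weight))


def average_adders(bits: int) -> float:
    """Average surviving partial products over all unsigned weights of this width."""
    total = sum(adder_count(w) for w in range(1 << bits))
    return total / (1 << bits)
```

For example, constant_multiply(13, 0b10010110) needs only the four partial products for the set bits, and average_adders(8) is half the bit width when averaged over every possible 8-bit weight. Real trained weights are not uniformly distributed, so treat that average as a ballpark.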