Author here:
Let me try to give an overview, as I saw some questions coming up repeatedly.<p>* This accelerator targets the edge/inference case, so there is no training on this chip.<p>* We introduce a differentiable form of Maddness, which allows Maddness to be used in end-to-end training, and we present an application to ResNet.<p>* We are still in the process of understanding how this will translate to transformers.<p>* The goal was to show that Maddness is feasible with good hardware co-design.<p>* Compared to other extreme quantisation (BNN/TNN) and pruning schemes, this is more general, as it replaces the matmul with an approximate matmul.<p>* The model architecture is not fixed in hardware. It is „just“ a matmul unit.<p>I hope this helps :-)
I am surprised that they do not mention a comparison against quantized matrix multiplication, because their "encoding" appears to be something like a quantization step with unevenly sized buckets. Their approximate multiplication step then looks to me like multiplying a quantized input vector by a 1-bit quantized matrix.<p>But overall this is an extremely exciting development because it shows how one could convert a NN into an efficient hardware implementation. And because they work only on quantized data with LUTs, one can also embed low-dimensional matrices directly into the silicon.<p>My prediction is that we will soon be able to buy $1 hardware accelerators for things like word embedding, grammar, and general language understanding. You would then need those expensive GPUs only for the last few layers of your LLM, thereby massively reducing deployment costs.<p>EDIT: Reading the actual paper, I saw that this work is also related to LoRA, because they convert high-dimensional input vectors to a quantized value based on a lower-dimensional embedding, which they call "prototypes". So it's a bit like doing LoRA with 1-bit quantization, but instead of representing it as 8x 1-bit flags you represent it as 1x 8-bit integer.
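To make the "encode, then look up in a LUT" idea above concrete, here is a minimal NumPy sketch of a product-quantization-style approximate matmul. It is not the paper's method: Maddness encodes with a learned hashing tree, whereas this sketch swaps in plain nearest-prototype search, and the prototypes are simply sampled training rows. All function names (`fit_prototypes`, `build_luts`, `encode`, `approx_matmul`) are hypothetical.

```python
import numpy as np

def fit_prototypes(A_train, n_subspaces, n_protos, seed=0):
    """Assumption: sample training rows as prototypes (a stand-in for learned ones)."""
    rng = np.random.default_rng(seed)
    n, d = A_train.shape
    sub = d // n_subspaces
    idx = rng.choice(n, size=n_protos, replace=False)
    # Split each sampled row into subspaces: result is (n_subspaces, n_protos, sub).
    return A_train[idx].reshape(n_protos, n_subspaces, sub).transpose(1, 0, 2)

def build_luts(protos, B):
    """Precompute prototype-times-weight products, one table per subspace."""
    C, K, sub = protos.shape
    B_sub = B.reshape(C, sub, -1)                     # (C, sub, m)
    return np.einsum("cks,csm->ckm", protos, B_sub)   # (C, K, m)

def encode(A, protos):
    """Map each row's subvectors to the index of the nearest prototype."""
    C, K, sub = protos.shape
    A_sub = A.reshape(A.shape[0], C, sub)             # (n, C, sub)
    d2 = ((A_sub[:, :, None, :] - protos[None]) ** 2).sum(-1)  # (n, C, K)
    return d2.argmin(-1)                              # (n, C) integer codes

def approx_matmul(codes, luts):
    """Approximate A @ B with table lookups and additions only (no multiplies)."""
    C = luts.shape[0]
    return sum(luts[c, codes[:, c]] for c in range(C))
```

Once `build_luts` has been run offline, inference needs only `encode` plus lookups and accumulation, which is why the weights can be baked into tables. The result is exact when every input subvector coincides with a prototype and approximate otherwise.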
Ok, I get excited by seeing the numbers, but can someone please explain in a single sentence where this can be used and how big the overall impact would be?
This is a dumb question but I guess this means that you can't make something like a LoRA in software, right? Because the network is physically hardcoded?
Based on the first figure in the paper, it seems that this scheme effectively turns 8 input values into a 4-bit number, thus giving an effective 0.5-bit quantization.<p>Considering that current aggressive quantization for LLM transformers uses 4 bits, does such a 0.5-bit quantization produce an effective neural network?<p>Does the scheme stay competitive if it is changed to use 4-bit quantization instead of 0.5-bit?
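The arithmetic behind the "0.5-bit" figure can be sketched as follows, assuming (as the comment above reads the figure) that a group of 8 input values is mapped to one of 16 leaves by a depth-4 binary decision tree. The split dimensions and thresholds below are made-up illustration values, not taken from the paper, and `tree_encode` and `bits_per_value` are hypothetical names.

```python
import math
import numpy as np

def tree_encode(x, split_dims, thresholds):
    """Walk a balanced binary tree, one comparison per level; return the leaf index."""
    node = 0
    for level, dim in enumerate(split_dims):
        go_right = x[dim] > thresholds[level][node]
        node = 2 * node + int(go_right)
    return node

def bits_per_value(group_size, n_leaves):
    """Code length in bits divided by the number of input values it summarizes."""
    return math.log2(n_leaves) / group_size

# Hypothetical tree: depth 4, one split dimension per level, all thresholds at 0.
split_dims = [0, 2, 5, 7]
thresholds = [np.zeros(2 ** level) for level in range(4)]

x = np.array([1.0, 0.0, -1.0, 0.0, 0.0, 1.0, 0.0, -1.0])
code = tree_encode(x, split_dims, thresholds)   # 4-bit code in range 0..15
rate = bits_per_value(group_size=8, n_leaves=16)
```

Note that the code is an index into a lookup table rather than a rounded copy of the inputs, so the 0.5 bits per value is not directly comparable to a 4-bit per-weight quantization, where each value keeps its own 4 bits.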