Author here. Ask me anything; happy to answer questions.

Also, if you like this kind of work, you might like what I've been building for the past year: Composer [1]. It speeds up neural net training by a lot (e.g., 7x faster for ResNet-50) [2] and, in contrast to Bolt/MADDNESS, is polished, documented code you can get working in under 5 minutes.

[1] https://github.com/mosaicml/composer

[2] https://www.mosaicml.com/blog/mosaic-resnet
MADDNESS is their more recent work and yields up to 100x speedups: https://arxiv.org/pdf/2106.10860.pdf

The code for MADDNESS is in the same GitHub repo if you search for "Mithral".

SIMD instructions can work wonders in the right context.
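To get a feel for how these methods dodge multiplications, here's a toy sketch of the shared lookup-table idea. This is an illustrative product-quantization scheme, not the real thing: Bolt learns its prototypes with k-means and MADDNESS replaces the nearest-prototype search with learned hash functions, so treat every choice below as an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, M = 1000, 64, 8   # rows of A, shared dim, columns of B
C, K = 8, 16            # subspaces (codebooks), prototypes per subspace
d = D // C              # dims per subspace

A = rng.standard_normal((N, D))
B = rng.standard_normal((D, M))

# "Training": pick K prototypes per subspace. (Toy choice: random rows of A.
# Bolt learns these with k-means; MADDNESS learns hash functions instead.)
protos = np.stack([A[rng.choice(N, K, replace=False), c*d:(c+1)*d]
                   for c in range(C)])                       # (C, K, d)

# Encoding: each row of A becomes C small integer codes (nearest prototype).
codes = np.empty((N, C), dtype=np.uint8)
for c in range(C):
    sub = A[:, c*d:(c+1)*d]                                  # (N, d)
    dists = ((sub[:, None, :] - protos[c][None, :, :])**2).sum(-1)
    codes[:, c] = dists.argmin(1)

# Lookup tables: partial dot products of every prototype with every column of B.
tables = np.stack([protos[c] @ B[c*d:(c+1)*d] for c in range(C)])  # (C, K, M)

# Approximate A @ B using only table lookups and adds per row, no multiplies.
approx = np.zeros((N, M))
for c in range(C):
    approx += tables[c][codes[:, c]]                         # gather, (N, M)

exact = A @ B
print(np.linalg.norm(approx - exact) / np.linalg.norm(exact))
```

The speedups come from doing those gathers and adds with vectorized byte shuffles (e.g., vpshufb) over tiny 8-bit tables, which is where the SIMD point above bites.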
> If you have a large collection of mostly-dense vectors and can tolerate lossy compression, Bolt can probably save you 10-200x space and compute time.

Space. It can save space.

The main limitation of fast ML models nowadays is how many parameters you can fit in GPU memory, and those are mostly weight matrices.

200x would allow me to run GPT-3 on my old GTX 1050: its 175B parameters are roughly 350 GB at fp16, and 350 GB / 200 is about 1.75 GB, which would actually fit in the 1050's VRAM (ignoring activations).

Frameworks, please implement this NOW!
This is actually from a paper published last year:

https://www.reddit.com/r/MachineLearning/comments/pffoo8/r_multiplying_matrices_without_multiplying/

A few questions:

- Do some ML frameworks implement it already?
- It promises up to 200x compression; is it reasonable to expect that to let us run GPT-3 on smaller, mainstream GPUs?
This sounds and looks impressive, but this part struck me:

"If you ... and can tolerate lossy compression"

What does this mean? I wouldn't have thought that matrix operations could be lossy. Does anybody know to what extent they are lossy, and where that would be acceptable?
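The multiplications themselves aren't lossy; the loss comes from compressing the input vectors before multiplying. Here's a crude illustration using uniform int8 quantization, which is an assumption for demonstration only (Bolt uses learned vector quantization, which is far more accurate per bit):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((256, 64))
B = rng.standard_normal((64, 32))

# Quantize A to 8-bit integers, then multiply with the dequantized values.
scale = np.abs(A).max() / 127.0
A_q = np.round(A / scale).astype(np.int8)

exact = A @ B
approx = (A_q.astype(np.float64) * scale) @ B

# The product is "lossy" exactly to the extent the inputs were compressed.
print(np.linalg.norm(approx - exact) / np.linalg.norm(exact))
```

Whether that's acceptable depends on the application: nearest-neighbor search and a lot of ML inference tolerate a small relative error in each dot product, whereas something like a linear solver generally won't.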
This looks good. Why do the vectors have to be dense? Is it just that the speed gain over exact methods is smallest for sparse inputs? I'm asking whether you could use it universally, for all operations, when you don't know the density in advance.
I guess the naive approach, if we wanted a quick lossy matrix multiply, would be to take the truncated SVD and use that. How does this library compare to that boring strategy, I wonder?
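For reference, the boring strategy looks like this. A minimal numpy sketch, where `k` (the rank kept) is an arbitrary choice for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
N, D, M, k = 1000, 128, 64, 32
A = rng.standard_normal((N, D))
B = rng.standard_normal((D, M))

# One-time cost: rank-k truncated SVD of A, i.e. A ~= (U_k * s_k) @ Vt_k.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Per-multiply cost: k*D*M + N*k*M multiplies instead of N*D*M.
approx = (U_k * s_k) @ (Vt_k @ B)

exact = A @ B
print(np.linalg.norm(approx - exact) / np.linalg.norm(exact))
```

The catch is that truncated SVD only pays off when A has a rapidly decaying spectrum; on the random A above the error is large at any useful rank. The selling point of the quantization-based approach is that it doesn't need low-rank structure, and if I remember right the MADDNESS paper includes rank-reduction baselines in its comparisons.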