I'd like to mention a thought I had some time ago regarding the idea of using a byte FP format for ML training: instead of describing a byte in a sign/exponent/mantissa format, it might be advantageous to map the 256 possible byte values to ideally chosen FP values via a lookup table. The curve implemented could be a sigmoid, for example. This would reduce quantization effects, likely resulting not only in better convergence, but in more consistent convergence as well.

Maybe it would be necessary to adjust the curve to make the reverse lookup cheaper and reduce the time and silicon needed.
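A rough sketch of what I mean (NumPy, the tanh warp, and the particular scale/range are just illustrative choices, not an existing format):

    import numpy as np

    def build_table(scale=4.0, max_abs=8.0):
        """Map byte codes 0..255 onto values along a sigmoid-shaped curve."""
        codes = np.arange(256)
        u = (codes - 127.5) / 127.5                    # uniform grid in (-1, 1)
        # Warp the uniform grid through a sigmoid (tanh); the shape of the curve
        # determines where along the range the 256 representable values are packed.
        warped = np.tanh(u * scale) / np.tanh(scale)
        return warped * max_abs                        # representable value per byte code

    TABLE = build_table()

    def decode(byte_val):
        """Byte code -> float: a single table lookup."""
        return TABLE[byte_val]

    def encode(x):
        """Float -> nearest byte code: the reverse lookup, here a naive nearest-value search."""
        return int(np.argmin(np.abs(TABLE - x)))

    # Round-trip a few values through the 8-bit code.
    for x in (0.01, 0.5, 5.0):
        code = encode(x)
        print(x, code, decode(code))

In hardware the decode side is just a 256-entry ROM; it's the encode (reverse lookup) side that the curve would have to be shaped to keep cheap.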
Interesting read. I wonder whether this is just a bandwidth optimization that lets you throw more hardware at the problem, or an actual shift in perspective, e.g. there is no NaN/Inf; overflow instead clamps to maxval. Could this introduce artifacts and force math libraries to code around it, or will it enable some new insight?
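To make the clamping concern concrete, here's a tiny sketch (NumPy, with a made-up maxval; not the article's actual format) of how saturating overflow differs from the IEEE behavior a math library might rely on:

    import numpy as np

    MAXVAL = 448.0  # hypothetical maxval for the narrow format, chosen only for illustration

    def to_narrow(x):
        """Quantize to the hypothetical format: clamp to [-MAXVAL, MAXVAL], no Inf/NaN."""
        return np.clip(x, -MAXVAL, MAXVAL)

    # An intermediate result that overflows the narrow range:
    prod = 300.0 * 300.0             # 90000.0 in the wide accumulator
    print(to_narrow(prod))           # 448.0 -- the overflow is silently absorbed at maxval

    # Under IEEE semantics the same value would become Inf in the narrow format and
    # propagate, so downstream checks would flag it; with clamping, those checks pass
    # and any workaround has to watch for values pinned at maxval instead.
    print(np.isinf(to_narrow(prod)), np.isnan(to_narrow(prod)))  # False False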