I'd like to mention a thought I had some time ago about using a one-byte FP format for ML training: instead of interpreting the byte as sign/exponent/mantissa fields, it might be advantageous to map each of the 256 possible byte values, via a lookup table, to ideally chosen real values. The curve implemented could be a sigmoid, for example. This would reduce quantization error, likely leading not only to better convergence, but to more consistent convergence as well.

Maybe it would be necessary to adjust the curve to make the reverse lookup (quantizing a float back to a byte) cheaper, reducing the time and silicon needed.
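A minimal sketch of the idea, with assumptions of mine: the 256 levels are placed on a logit (inverse-sigmoid) curve so they cluster near zero, the `SCALE` knob is a hypothetical tuning parameter, and the reverse lookup is done by nearest-neighbor search on the sorted table rather than anything hardware-friendly.

```python
import numpy as np

# Hypothetical codebook: the 256 levels are the logit (inverse sigmoid)
# of 256 evenly spaced points in (0, 1), so they are dense near zero
# and spread out toward the tails.
SCALE = 1.0  # assumed knob controlling the dynamic range
probs = (np.arange(256) + 0.5) / 256.0
TABLE = SCALE * np.log(probs / (1.0 - probs))  # sorted ascending

def decode(codes: np.ndarray) -> np.ndarray:
    """Forward lookup: byte codes -> float values."""
    return TABLE[codes]

def encode(values: np.ndarray) -> np.ndarray:
    """Reverse lookup: snap each value to the nearest table entry.
    TABLE is sorted, so a binary search plus a neighbor check suffices."""
    idx = np.searchsorted(TABLE, values).clip(1, 255)
    left, right = TABLE[idx - 1], TABLE[idx]
    idx -= (values - left) < (right - values)  # pick the closer neighbor
    return idx.astype(np.uint8)

x = np.array([-3.0, -0.01, 0.0, 0.01, 2.5])
x_q = decode(encode(x))  # each entry snapped to one of the 256 levels
```

In this sketch the reverse lookup is a binary search; choosing a curve whose inverse has a cheap closed form (as the comment suggests) would let hardware compute the code directly instead of searching.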