What's not mentioned is that in most cases you have a <i>constant</i> divisor which lets you replace division by multiplication with the reciprocal. The reciprocal can be rounded to a nearby dyadic rational, letting you do the division with a right-shift.<p>For example, 8-bit division by 3 is equivalent to widening multiplication by 171 followed by a right-shift of 9, as 171/2^9 = 0.333984375 which is close enough to 1/3 that the results match exactly.
We implemented something like this in CBQN last year (mainly for modulus, as floor division isn't a primitive). Commit is <a href="https://github.com/dzaima/CBQN/commit/d333902">https://github.com/dzaima/CBQN/commit/d333902</a>, some proofs of when and why it works at <a href="https://mlochbaum.github.io/BQN/implementation/primitive/arithmetic.html#integer-division-with-floats" rel="nofollow">https://mlochbaum.github.io/BQN/implementation/primitive/ari...</a>.
For CPUs that support AVX-512 VBMI, there's a faster reciprocal-based approach: <a href="https://avereniect.github.io/2023/04/29/uint8_division_using_avx512.html" rel="nofollow">https://avereniect.github.io/2023/04/29/uint8_division_using...</a><p>A VBMI2 example implementation can be found in the function named divide_with_lookup_16b: <a href="https://godbolt.org/z/xdE1dx5Pj" rel="nofollow">https://godbolt.org/z/xdE1dx5Pj</a>
I'm not certain it'll actually be faster, but you should be able to merge the reciprocal, multiplication and rounding adjustment into a single fma in log space via evil floating point bit hacks. Then you'd just be paying the cost of converting to and from the float representation of the integer.
Here's a version of the vrcpps idea, doing whole vectors of 32 u8's, bypassing lane crossing and somewhat simplifying the packing/unpacking, that's ~1.5x faster on Haswell: <a href="https://godbolt.org/z/TExGahv3z" rel="nofollow">https://godbolt.org/z/TExGahv3z</a>
I think the approximate reciprocal approach is interesting here. The doc mentions multiplying the dividend by ~1.00025 in the math to avoid FP error so you don’t end up off-by-one after truncation, but I think this hack is still incomplete! On some inputs (like 255, or other unlucky divisors near powers of two), you might get borderline rounding behaviour that flips a bit of the final integer. It’s easy to forget that single-precision floats don’t line up neatly with every 8bit integer ratio in real code, and a single off-by-one can break pixel ops or feed subtle bugs into a bigger pipeline.<p>I suspect a hybrid scheme like using approximate reciprocals for most values but punting to scalar for unlucky ones could handle these corner cases without killing performance. That’d be interesting to benchmark
Newton Raphson could be used to calculate a reciprocal. (in a bit width larger than 8). If starting from a good reciprocal approximation, convergence to bit accurate reciprocal should not take many iterations. Then multiply the reciprocal with the numerator to perform the divide
I do a fair amount of decompiling and the list of constants is awesome! I usually can make a reasonable guess or have intuition, but frequently end up stepping through several integers to find the right value. Having a handy lookup table will be great.
Think the SSE2 implementation could be tightened up by using the same register for the dividend and the quotient, shifting the quotient bits into the dividend as the dividend bits are shifted out. This was common practice in software CPU division routines.
It's odd that the only reason AVX-512 long division wins in the end is that it's compared to AVX2 reciprocal. Would it be possible to compare it to AVX-512 reciprocal computation?
Heh?
Surely fast convert 8-bit int to 16-bit FP,rcp+mul/div is a no-brainer?
edit make that fast convert,rcp,fma (float 16 constant 1.0) and xor (same constant)
> SIMD ISAs usually do not provide the integer division; the only known exception is RISC-V Vector Extension<p>It's kind of funny to read "the only known exception is..." in this context. What would an unknown exception be - an ISA that accidentally implements this but that the author believes nobody is aware of yet?<p>More seriously, I actually don't understand the intended meaning here. I assume the author means "out of all the ISAs I know"? What is that set of ISAs?