Dividing unsigned 8-bit numbers

162 pointsby mfiguiere5 months ago

16 comments

orlp5 months ago

What's not mentioned is that in most cases you have a constant divisor which lets you replace division by multiplication with the reciprocal. The reciprocal can be rounded to a nearby dyadic rational, letting you do the division with a right-shift.For example, 8-bit division by 3 is equivalent to widening multiplication by 171 followed by a right-shift of 9, as 171/2^9 = 0.333984375 which is close enough to 1/3 that the results match exactly.

评论 #42482159 未加载

评论 #42488066 未加载

评论 #42483213 未加载

评论 #42482581 未加载

评论 #42482571 未加载

评论 #42494035 未加载

评论 #42482565 未加载

评论 #42483414 未加载

评论 #42488798 未加载

评论 #42486537 未加载

评论 #42483639 未加载

mlochbaum5 months ago

We implemented something like this in CBQN last year (mainly for modulus, as floor division isn't a primitive). Commit is <a href="https://github.com/dzaima/CBQN/commit/d333902">https://github.com/dzaima/CBQN/commit/d333902</a>, some proofs of when and why it works at <a href="https://mlochbaum.github.io/BQN/implementation/primitive/arithmetic.html#integer-division-with-floats" rel="nofollow">https://mlochbaum.github.io/BQN/implementation/primitive/ari...</a>.

RaisingSpear5 months ago

For CPUs that support AVX-512 VBMI, there's a faster reciprocal-based approach: <a href="https://avereniect.github.io/2023/04/29/uint8_division_using_avx512.html" rel="nofollow">https://avereniect.github.io/2023/04/29/uint8_division_using...</a>A VBMI2 example implementation can be found in the function named divide_with_lookup_16b: <a href="https://godbolt.org/z/xdE1dx5Pj" rel="nofollow">https://godbolt.org/z/xdE1dx5Pj</a>

AlotOfReading5 months ago

I'm not certain it'll actually be faster, but you should be able to merge the reciprocal, multiplication and rounding adjustment into a single fma in log space via evil floating point bit hacks. Then you'd just be paying the cost of converting to and from the float representation of the integer.

评论 #42481901 未加载

dzaima5 months ago

Here's a version of the vrcpps idea, doing whole vectors of 32 u8's, bypassing lane crossing and somewhat simplifying the packing/unpacking, that's ~1.5x faster on Haswell: <a href="https://godbolt.org/z/TExGahv3z" rel="nofollow">https://godbolt.org/z/TExGahv3z</a>

评论 #42483581 未加载

foundry275 months ago

I think the approximate reciprocal approach is interesting here. The doc mentions multiplying the dividend by ~1.00025 in the math to avoid FP error so you don’t end up off-by-one after truncation, but I think this hack is still incomplete! On some inputs (like 255, or other unlucky divisors near powers of two), you might get borderline rounding behaviour that flips a bit of the final integer. It’s easy to forget that single-precision floats don’t line up neatly with every 8bit integer ratio in real code, and a single off-by-one can break pixel ops or feed subtle bugs into a bigger pipeline.I suspect a hybrid scheme like using approximate reciprocals for most values but punting to scalar for unlucky ones could handle these corner cases without killing performance. That’d be interesting to benchmark

评论 #42482149 未加载

ryao5 months ago

Why not use libdivide?<a href="https://libdivide.com/" rel="nofollow">https://libdivide.com/</a>

评论 #42482167 未加载

synthos5 months ago

Newton Raphson could be used to calculate a reciprocal. (in a bit width larger than 8). If starting from a good reciprocal approximation, convergence to bit accurate reciprocal should not take many iterations. Then multiply the reciprocal with the numerator to perform the divide

jonhohle5 months ago

I do a fair amount of decompiling and the list of constants is awesome! I usually can make a reasonable guess or have intuition, but frequently end up stepping through several integers to find the right value. Having a handy lookup table will be great.

ack_complete5 months ago

Think the SSE2 implementation could be tightened up by using the same register for the dividend and the quotient, shifting the quotient bits into the dividend as the dividend bits are shifted out. This was common practice in software CPU division routines.

purplesyringa5 months ago

It's odd that the only reason AVX-512 long division wins in the end is that it's compared to AVX2 reciprocal. Would it be possible to compare it to AVX-512 reciprocal computation?

Cold_Miserable5 months ago

Heh? Surely fast convert 8-bit int to 16-bit FP,rcp+mul/div is a no-brainer? edit make that fast convert,rcp,fma (float 16 constant 1.0) and xor (same constant)

评论 #42483918 未加载

评论 #42483906 未加载

xpe5 months ago

The bit-division burns my eyes and makes a mockery of my puny intellect, my liege. Mine fellow vassals possess skills most frightful in their potency.

abcd_f5 months ago

Since it's 8bit by 8bit, a precomputed lookup table (64K in size) will be another option.

评论 #42482147 未加载

评论 #42481822 未加载

评论 #42481859 未加载

afstr5 months ago

M I’m looking I’m I’m O ; r

dataflow5 months ago

> SIMD ISAs usually do not provide the integer division; the only known exception is RISC-V Vector ExtensionIt's kind of funny to read "the only known exception is..." in this context. What would an unknown exception be - an ISA that accidentally implements this but that the author believes nobody is aware of yet?More seriously, I actually don't understand the intended meaning here. I assume the author means "out of all the ISAs I know"? What is that set of ISAs?

评论 #42482316 未加载

评论 #42482177 未加载

评论 #42482283 未加载

评论 #42482121 未加载

评论 #42482148 未加载

评论 #42482254 未加载