The author says who needs SIMD registers! And indeed, early processors like the 68000 had no dedicated SIMD registers, so to do SIMD within a register (SWAR) you had to use some variant of the techniques presented.<p>This was used in a fast 4-sample player, where it could do math on four 8-bit wave-table indices in one 32-bit operation on the 68000. I had forgotten the details, so this was a nice reminder of how it must have worked; I would otherwise have had to disassemble it, since the source no longer exists. Speed was necessary: it was for the Atari ST, and the work was done in an interrupt which fired quite frequently, so it was important to keep the overhead as minimal as possible.
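For anyone curious, the usual shape of that kind of SWAR update looks roughly like this in C. This is a generic sketch, not a reconstruction of the original player (that source is gone): the high bit of each 8-bit lane is masked off before the add so a wrapping index can't carry into its neighbour.

```c
#include <stdint.h>

/* Add four packed 8-bit lanes at once, with no carry between lanes.
   Each lane wraps modulo 256 independently. */
static uint32_t swar_add4(uint32_t indices, uint32_t steps) {
    const uint32_t H = 0x80808080u;                 /* MSB of each 8-bit lane */
    uint32_t low = (indices & ~H) + (steps & ~H);   /* add the low 7 bits per lane */
    return low ^ ((indices ^ steps) & H);           /* fold lane MSBs back in via XOR */
}
```

The XOR at the end adds the two MSBs (plus the carry out of the low 7 bits) without generating a carry of its own, which is what keeps the lanes independent.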
You can get a fast equivalent to his zero-padding method on recent-ish x86 and ARM SVE. The x86 assembly would look like (untested):<p><pre><code> compare: ;; lhs in eax, rhs in ebx
    mov ecx, 0b111110111111011111
pdep eax, eax, ecx
pdep ebx, ebx, ecx
sub eax, ebx
    test eax, 0b1000001000000100000
mov eax, 0
setnz al
ret</code></pre>
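For targets without pdep, the same zero-padding idea can be sketched in portable C. This assumes the article's 5-6-5 layout (b in bits 0-4, g in 5-10, r in 11-15); `pad565` and `any_field_lt` are names of my own invention.

```c
#include <stdint.h>

/* Re-deposit each 5-6-5 field with a spare zero bit above it, so a borrow
   out of a field lands in that spare bit rather than in the next field. */
static uint32_t pad565(uint32_t v) {
    return  (v & 0x001Fu)           /* b stays in bits 0-4   */
         | ((v & 0x07E0u) << 1)     /* g moves to bits 6-11  */
         | ((v & 0xF800u) << 2);    /* r moves to bits 13-17 */
}

/* Nonzero iff some field of x is below the corresponding field of y. */
static uint32_t any_field_lt(uint32_t x, uint32_t y) {
    return (pad565(x) - pad565(y)) & 0x41020u;  /* spare bits 5, 12, 18 */
}
```

As in the assembly version, a borrow can propagate past a spare bit into the next field, but that only happens when some lower field already compared below, so the boolean result is unaffected.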
I just benchmarked the three versions (in Rust, with criterion) on my Coffee Lake CPU. The [lower bound, estimate, upper bound] triples are:<p>Naive version, best case: [481.40 ps 487.84 ps 494.56 ps]<p>Naive version, mid case: [751.56 ps 758.84 ps 766.17 ps]<p>Naive version, worst case: [953.71 ps 970.95 ps 994.65 ps]<p>Packed bitfield version: [685.98 ps 698.25 ps 711.57 ps]<p>So on average, the naive and packed bitfield versions are within 10% of one another. Modern CPUs frequently don't benefit much from these kinds of tricks anymore.
The suggested code tests for bitfield borrows with<p><pre><code> auto c = (~x & y) | (~(x ^ y) & (x - y));
c &= 0x8410;
return c == 0;
</code></pre>
where the literal bitmap contains the most significant bit of each bitfield.<p>Couldn't this be written equivalently as<p><pre><code> auto c = x ^ y ^ (x - y);
c &= 0x10820;
</code></pre>
where the literal bitmap now contains the bits just left of each bitfield?
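A sampled check in C suggests they agree, since the borrow out of bit i is exactly the borrow into bit i+1 (helper names are mine; this is not exhaustive over all 2^32 pairs):

```c
#include <stdint.h>

/* The article's test: c holds the borrow OUT of each bit of x - y,
   masked to the most significant bit of each 5-6-5 field. */
static uint32_t borrow_msb(uint32_t x, uint32_t y) {
    uint32_t c = (~x & y) | (~(x ^ y) & (x - y));
    return c & 0x8410u;
}

/* Proposed simplification: x ^ y ^ (x - y) holds the borrow INTO each bit,
   masked to the bit just left of each field (needs >16-bit arithmetic
   so that bit 16 exists). */
static uint32_t borrow_pad(uint32_t x, uint32_t y) {
    uint32_t c = x ^ y ^ (x - y);
    return c & 0x10820u;
}

/* Borrow-out of bit i equals borrow-in to bit i+1, so the first mask
   shifted left once should reproduce the second exactly. */
static int formulas_agree(void) {
    for (uint32_t x = 0; x <= 0xFFFFu; x += 0x101u)
        for (uint32_t y = 0; y <= 0xFFFFu; y += 7u)
            if ((borrow_msb(x, y) << 1) != borrow_pad(x, y))
                return 0;
    return 1;
}
```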
Many years ago I saw, on a text-only page of wild bitshift methods, a way to compute an approximate distance between two 2D vectors using only bitwise operations. Anybody know it? I've been unable to find it for about 8 years.
Something I've been greatly enjoying in using Zig is builtin support for packed bitfields in the form of packed structs.<p>The data structure from the article would be:<p><pre><code> const RGB = packed struct(u16) {
b: u5,
g: u6,
r: u5,
 };
</code></pre>
So the test becomes<p><pre><code> const gte: bool = x.r >= y.r and x.g >= y.g and x.b >= y.b;
</code></pre>
Okay, so this doesn't get us the optimal form described in the article.<p>Or does it? Since the compiler knows about packed structs, it could perform the optimization for me.<p>Does it, right this instant? Eh, probably not. But compilers have a tendency to improve, and this is a local pattern which would be fairly easy to recognize. The recognition is the point: it's much, much harder to recognize the intent in the initial implementations described in the article, and then replace them with the optimal version. The way I was able to write it in Zig, in addition to being far more expressive of the algorithm's intent, conveys to the compiler: compare the values of these bitfields, I don't care how. The compiler doesn't have to prove that I have no other reason for the other operations, besides making said comparison possible: it can just emit the optimal code.
Interesting optimization, particularly for something like an FPGA implementation but, as the article says, also useful for improving an inner loop. Thanks for sharing it.