The author says who needs SIMD registers! And indeed, early processors like the 68000 had no dedicated SIMD registers, so to do SIMD within a register (SWAR) you had to use some variant of the techniques presented.<p>This was used in a fast 4-sample player, where it could do math on four 8-bit wave-table indices in one 32-bit operation on the 68000. I had forgotten the details, so this was a nice reminder of how it must have worked; I would otherwise have had to disassemble it, since the source no longer exists. Speed was necessary: it was for the Atari ST, and the work was done in an interrupt which fired quite frequently, so it was important to keep the overhead as minimal as possible.
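For anyone curious, the usual shape of that kind of SWAR update looks roughly like this in C. This is a generic sketch, not a reconstruction of the original player (that source is gone): the high bit of each 8-bit lane is masked off before the add so a wrapping index can't carry into its neighbour.

```c
#include <stdint.h>

/* Add four packed 8-bit lanes at once, with no carry between lanes.
   Each lane wraps modulo 256 independently. */
static uint32_t swar_add4(uint32_t indices, uint32_t steps) {
    const uint32_t H = 0x80808080u;                 /* MSB of each 8-bit lane */
    uint32_t low = (indices & ~H) + (steps & ~H);   /* add the low 7 bits per lane */
    return low ^ ((indices ^ steps) & H);           /* fold lane MSBs back in via XOR */
}
```

The XOR at the end adds the two MSBs (plus the carry out of the low 7 bits) without generating a carry of its own, which is what keeps the lanes independent.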
You can get a fast equivalent to his zero-padding method on recent-ish x86 and ARM SVE. The x86 assembly would look like (untested):<p><pre><code> compare: ;; lhs in eax, rhs in ebx
    mov ecx, 0b111110111111011111
pdep eax, eax, ecx
pdep ebx, ebx, ecx
sub eax, ebx
    test eax, 0b1000001000000100000
mov eax, 0
setnz al
ret</code></pre>
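For targets without pdep, the same zero-padding idea can be sketched in portable C. This assumes the article's 5-6-5 layout (b in bits 0-4, g in 5-10, r in 11-15); `pad565` and `any_field_lt` are names of my own invention.

```c
#include <stdint.h>

/* Re-deposit each 5-6-5 field with a spare zero bit above it, so a borrow
   out of a field lands in that spare bit rather than in the next field. */
static uint32_t pad565(uint32_t v) {
    return  (v & 0x001Fu)           /* b stays in bits 0-4   */
         | ((v & 0x07E0u) << 1)     /* g moves to bits 6-11  */
         | ((v & 0xF800u) << 2);    /* r moves to bits 13-17 */
}

/* Nonzero iff some field of x is below the corresponding field of y. */
static uint32_t any_field_lt(uint32_t x, uint32_t y) {
    return (pad565(x) - pad565(y)) & 0x41020u;  /* spare bits 5, 12, 18 */
}
```

As in the assembly version, a borrow can propagate past a spare bit into the next field, but that only happens when some lower field already compared below, so the boolean result is unaffected.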
I just benchmarked the three versions (in Rust, with criterion) on my Coffee Lake CPU. The [lower bound, estimate, upper bound] triples are:<p>Naive version, best case: [481.40 ps 487.84 ps 494.56 ps]<p>Naive version, mid case: [751.56 ps 758.84 ps 766.17 ps]<p>Naive version, worst case: [953.71 ps 970.95 ps 994.65 ps]<p>Packed bitfield version: [685.98 ps 698.25 ps 711.57 ps]<p>So on average, the naive and packed bitfield versions are within 10% of one another. Modern CPUs frequently don't benefit much from these kinds of tricks anymore.
The suggested code tests for bitfield borrows with<p><pre><code> auto c = (~x & y) | (~(x ^ y) & (x - y));
c &= 0x8410;
return c == 0;
</code></pre>
where the literal bitmap contains the most significant bit of each bitfield.<p>Couldn't this be written equivalently as<p><pre><code> auto c = x ^ y ^ (x - y);
c &= 0x10820;
</code></pre>
where the literal bitmap now contains the bits just left of each bitfield?
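A sampled check in C suggests they agree, since the borrow out of bit i is exactly the borrow into bit i+1 (helper names are mine; this is not exhaustive over all 2^32 pairs):

```c
#include <stdint.h>

/* The article's test: c holds the borrow OUT of each bit of x - y,
   masked to the most significant bit of each 5-6-5 field. */
static uint32_t borrow_msb(uint32_t x, uint32_t y) {
    uint32_t c = (~x & y) | (~(x ^ y) & (x - y));
    return c & 0x8410u;
}

/* Proposed simplification: x ^ y ^ (x - y) holds the borrow INTO each bit,
   masked to the bit just left of each field (needs >16-bit arithmetic
   so that bit 16 exists). */
static uint32_t borrow_pad(uint32_t x, uint32_t y) {
    uint32_t c = x ^ y ^ (x - y);
    return c & 0x10820u;
}

/* Borrow-out of bit i equals borrow-in to bit i+1, so the first mask
   shifted left once should reproduce the second exactly. */
static int formulas_agree(void) {
    for (uint32_t x = 0; x <= 0xFFFFu; x += 0x101u)
        for (uint32_t y = 0; y <= 0xFFFFu; y += 7u)
            if ((borrow_msb(x, y) << 1) != borrow_pad(x, y))
                return 0;
    return 1;
}
```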
Many years ago I saw, on a text-only page of wild bitshift methods, a way to compute an approximate distance between two 2D vectors using only bitwise operations. Anybody know it? I've been unable to find it for about 8 years.
Something I've been greatly enjoying in using Zig is builtin support for packed bitfields in the form of packed structs.<p>The data structure from the article would be:<p><pre><code> const RGB = packed struct(u16) {
b: u5,
g: u6,
r: u5,
 };
</code></pre>
So the test becomes<p><pre><code> const gte: bool = x.r >= y.r and x.g >= y.g and x.b >= y.b;
</code></pre>
Okay, so this doesn't get us the optimal form described in the article.<p>Or does it? Since the compiler knows about packed structs, it could perform the optimization for me.<p>Does it, right this instant? Eh, probably not. But compilers have a tendency to improve, and this is a local pattern which would be fairly easy to recognize. The recognition is the point: it's much, much harder to recognize the intent in the initial implementations described in the article, and then replace them with the optimal version. The way I was able to write it in Zig, in addition to being far more expressive of the algorithm's intent, conveys to the compiler: compare the values of these bitfields, I don't care how. The compiler doesn't have to prove that I have no other reason for the other operations, besides making said comparison possible: it can just emit the optimal code.
Interesting optimization, particularly for something like an FPGA implementation but, as the article says, also useful for improving an inner loop. Thanks for sharing it.