I was stumped as to how this was possible, until I saw that it relied on memory access relative to a computed offset, for which as far as I know there are no SIMD instructions. I guess it could be argued that a texture in a shader would provide similar functionality.<p>Also of interest are sorting networks <a href="https://en.wikipedia.org/wiki/Sorting_network" rel="nofollow">https://en.wikipedia.org/wiki/Sorting_network</a> which can be used to perform branchless sort on parallel hardware.
I think in order to really reap the benefits of this you'll want to actually reorganize the in-memory layout of the trees to make sure all elements needed for the comparison end up in the same cache line.<p>I haven't been following this closely, but the last time I checked scatter-gather loads were really, really slow.<p>Chapter 3.3, from page 27 (PDF page 43) on, of this might be interesting: <a href="https://www.researchgate.net/profile/Florian_Gross/publication/275971053_Index_Search_Algorithms_for_Databases_and_Modern_CPUs/links/554cffca0cf29f836c9cd539.pdf" rel="nofollow">https://www.researchgate.net/profile/Florian_Gross/publicati...</a><p>It also contains a survey of some other related data structures and algorithms.
Is it reasonable to estimate that the performance of this code will be bounded by log2(n) memory round-trips per lookup?<p>If so, could you double performance to log2(n)/2 round-trips by issuing parallel memory requests for step n and also both candidates for step n+1?
Does anyone know of any comprehensive resources that cover various techniques to do things branchless or otherwise optimize for gpu? I learned about masking the other day, but it seems this sort of stuff is only covered in dispersed blog posts and institutional knowledge.
Huh, neat. This is really similar to the underlying mechanism used in the compiler that only used x86 mov instructions.<p><a href="https://github.com/xoreaxeaxeax/movfuscator" rel="nofollow">https://github.com/xoreaxeaxeax/movfuscator</a>
If I'm doing my mental compilation correctly, these can't actually be parallelized, due to 'ret' being referenced and set by each line. I'm curious if this is actually the case, and if an actual parallel implementation is possible (and how fast).
Reminds me a little of <a href="https://en.wikipedia.org/wiki/Sorting_network" rel="nofollow">https://en.wikipedia.org/wiki/Sorting_network</a>
Technically a comparison operator would be a form of branching, because you have comparisons and jumps based on flags, same as what happens when you compile an if statement.<p>It becomes more evident when you see the generated code. <a href="https://godbolt.org/g/Mu9KBY" rel="nofollow">https://godbolt.org/g/Mu9KBY</a><p>e.g:<p><pre><code> cmp rax, QWORD PTR [rbp-24] ; compare a and b
 ja .L2 ; is a above b? if so, jump to .L2; otherwise proceed
</code></pre>
In fact this statement:<p><pre><code> size_t ret = (haystack[4] <= needle) * 4;
</code></pre>
...is semantically equivalent to this:<p><pre><code> register size_t ret;
 if (haystack[4] &lt;= needle) {
     ret = 4;
 } else {
     ret = 0;
 }
</code></pre>
So the code is not really branchless, just if-less.