I was stumped as to how this was possible, until I saw that it relied on memory access relative to a computed offset, for which as far as I know there are no SIMD instructions. I guess it could be argued that a texture in a shader would provide similar functionality.<p>Also of interest are sorting networks <a href="https://en.wikipedia.org/wiki/Sorting_network" rel="nofollow">https://en.wikipedia.org/wiki/Sorting_network</a> which can be used to perform branchless sort on parallel hardware.
I think in order to really reap the benefits of this you'll want to actually reorganize the in-memory layout of the trees to make sure all elements needed for the comparison end up in the same cache line.<p>I haven't been following this closely, but the last time I checked scatter-gather loads were really, really slow.<p>Chapter 3.3, from page 27 (PDF page 43) on, of this might be interesting: <a href="https://www.researchgate.net/profile/Florian_Gross/publication/275971053_Index_Search_Algorithms_for_Databases_and_Modern_CPUs/links/554cffca0cf29f836c9cd539.pdf" rel="nofollow">https://www.researchgate.net/profile/Florian_Gross/publicati...</a><p>It also contains a survey of some other related data structures and algorithms.
Is it reasonable to estimate that the performance of this code will be bounded by log2(n) memory round-trips per lookup?<p>If so, could you double performance to log2(n)/2 round-trips by issuing parallel memory requests for step n and also both candidates for step n+1?
Does anyone know of any comprehensive resources that cover various techniques to do things branchless or otherwise optimize for gpu? I learned about masking the other day, but it seems this sort of stuff is only covered in dispersed blog posts and institutional knowledge.
Huh, neat. This is really similar to the underlying mechanism used in the compiler that only used x86 mov instructions.<p><a href="https://github.com/xoreaxeaxeax/movfuscator" rel="nofollow">https://github.com/xoreaxeaxeax/movfuscator</a>
If I'm doing my mental compilation correctly, these can't actually be parallelized, due to 'ret' being referenced and set by each line. I'm curious if this is actually the case, and if an actual parallel implementation is possible (and how fast).
Reminds me a little of <a href="https://en.wikipedia.org/wiki/Sorting_network" rel="nofollow">https://en.wikipedia.org/wiki/Sorting_network</a>
Technically a comparison operator would be a form of branching, because you have comparisons and jumps based on flags, same as what happens when you compile an if statement.<p>It becomes more evident when you see the generated code. <a href="https://godbolt.org/g/Mu9KBY" rel="nofollow">https://godbolt.org/g/Mu9KBY</a><p>e.g:<p><pre><code> cmp rax, QWORD PTR [rbp-24] ; compare a and b
 ja .L2 ; is a above b? if so, jump to .L2; otherwise proceed
</code></pre>
In fact this statement:<p><pre><code> size_t ret = (haystack[4] <= needle) * 4;
</code></pre>
...is semantically equivalent to this:<p><pre><code> register size_t ret;
 if (haystack[4] &lt;= needle) {
     ret = 4;
 } else {
     ret = 0;
 }
</code></pre>
So the code is not really branchless, just if-less.