TE
TechEcho
Home24h TopNewestBestAskShowJobs
GitHubTwitter
Home

TechEcho

A tech news platform built with Next.js, providing global tech news and discussions.

GitHubTwitter

Home

HomeNewestBestAskShowJobs

Resources

HackerNews APIOriginal HackerNewsNext.js

© 2025 TechEcho. All rights reserved.

SIMD / GPU Friendly Branchless Binary Search

82 pointsby Atrix256almost 8 years ago

10 comments

0xfadedalmost 8 years ago
I was stumped as to how this was possible, until I saw that it relied on memory access relative to a computed offset, for which as far as I know there are no SIMD instructions. I guess it could be argued that a texture in a shader would provide similar functionality.<p>Also of interest are sorting networks <a href="https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Sorting_network" rel="nofollow">https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Sorting_network</a> which can be used to perform branchless sort on parallel hardware.
评论 #14601847 未加载
评论 #14600062 未加载
flgralmost 8 years ago
I think in order to really reap the benefits of this you&#x27;ll want to actually reorganize the in-memory layout of the trees in order to make sure all elements needed for the comparison end up in the same cache line.<p>I haven&#x27;t been following this closely, but the last time I checked scatter-gather loads were really really slow.<p>Chapter 3.3 from page 27 (PDF page 43) on of this be interesting: <a href="https:&#x2F;&#x2F;www.researchgate.net&#x2F;profile&#x2F;Florian_Gross&#x2F;publication&#x2F;275971053_Index_Search_Algorithms_for_Databases_and_Modern_CPUs&#x2F;links&#x2F;554cffca0cf29f836c9cd539.pdf" rel="nofollow">https:&#x2F;&#x2F;www.researchgate.net&#x2F;profile&#x2F;Florian_Gross&#x2F;publicati...</a><p>Also contains a survey of some other related data structures and algorithms.
评论 #14600007 未加载
评论 #14599574 未加载
lukegoalmost 8 years ago
Is it reasonable to estimate that the performance of this code will be bounded by log2(n) memory round-trips per lookup?<p>If so then could you double performance to log2(n)&#x2F;2 lookups by issuing parallel memory requests for location n and also both candidates for location n+1?
评论 #14602374 未加载
roel_valmost 8 years ago
Does anyone know of any comprehensive resources that cover various techniques to do things branchless or otherwise optimize for gpu? I learned about masking the other day, but it seems this sort of stuff is only covered in dispersed blog posts and institutional knowledge.
评论 #14610887 未加载
maxk42almost 8 years ago
This is not branchless.<p>It&#x27;s if-less. But the ternary operator (?:) is a branching instruction.
评论 #14599303 未加载
评论 #14601168 未加载
评论 #14599416 未加载
评论 #14603258 未加载
monocasaalmost 8 years ago
Huh, neat. This is really similar to the underlying mechanism used in the compiler that only used x86 mov instructions.<p><a href="https:&#x2F;&#x2F;github.com&#x2F;xoreaxeaxeax&#x2F;movfuscator" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;xoreaxeaxeax&#x2F;movfuscator</a>
评论 #14601338 未加载
falcolasalmost 8 years ago
If I&#x27;m doing my mental compilation correctly, these can&#x27;t actually be parallelized, due to &#x27;ret&#x27; being referenced and set by each line. I&#x27;m curious if this is actually the case, and if an actual parallel implementation is possible (and how fast).
评论 #14598975 未加载
评论 #14599176 未加载
gwernalmost 8 years ago
Reminds me a little of <a href="https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Sorting_network" rel="nofollow">https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Sorting_network</a>
FrozenVoidalmost 8 years ago
Try GCC vector intrinsics which allows comparing scalar to vector.
partycoderalmost 8 years ago
Technically a comparison operator would be a form of branching, because you have comparisons and jumps based off flags, same as what happens when you compile an if statement.<p>It becomes more evident when you see the generated code. <a href="https:&#x2F;&#x2F;godbolt.org&#x2F;g&#x2F;Mu9KBY" rel="nofollow">https:&#x2F;&#x2F;godbolt.org&#x2F;g&#x2F;Mu9KBY</a><p>e.g:<p><pre><code> cmp rax, QWORD PTR [rbp-24] ; compare a and b ja .L2 ; is a above b?, if so jump to .L2, otherwise proceed </code></pre> In fact this statement:<p><pre><code> size_t ret = (haystack[4] &lt;= needle) * 4; </code></pre> ...is 100% equivalent to this once compiled:<p><pre><code> register size_t ret; if (haystack[4] &lt;= needle) { ret = 4; } else { ret = 0; } </code></pre> So the code is not really branchless, just if-less.
评论 #14601311 未加载