TE
科技回声
首页24小时热榜最新最佳问答展示工作
GitHubTwitter
首页

科技回声

基于 Next.js 构建的科技新闻平台,提供全球科技新闻和讨论内容。

GitHubTwitter

首页

首页最新最佳问答展示工作

资源链接

HackerNews API原版 HackerNewsNext.js

© 2025 科技回声. 版权所有。

Fastest sort of fixed length 6 int array

80 点作者 g-garron超过 13 年前

7 条评论

psykotic超过 13 年前
His xor-laden branchless min/max code in the fastest version so far, sort6_sorting_network_v4, is some seriously janky shit. It's making it harder for the compiler to generate good platform-specific machine code. That initial impression was confirmed after looking at the x64 assembly code generated by gcc -O3. It was awful. I rewrote his min/max macros to use plain old ternary operators. The generated assembly code looked a lot better, and my code beats his by 2-2.5x on my late 2008 MacBook Pro:<p><pre><code> $ gcc -O3 sortnet -o sortnet &#38;&#38; ./sortnet sort6_sorting_network_v4: time = 988 cycles sort6_sorting_network_per: time = 390 cycles </code></pre> That win is without even trying. Just from eyeballing those cycle counts, I'd expect you could wring a lot more speed out of this. My coworker Fabian (@rygorous) immediately suggested using the old difference trick for SORT2(x,y) rather than calculating min and max independently. That brings it down further by about 50% in my benchmark. Fabian estimates that's still about 2.5 times off from optimal: "Best bit-magic solution is ~5 cycle dep chain, stupid simple mov / cmp / cmovg / cmovg boils down to 2."<p>His benchmarking harness also leaves a lot to be desired. But that's another issue. He also says that this is eventually intended for use on a GPU, but he's ranking the implementations by performance on a deeply pipelined out-of-order CPU. Modern GPU cores have some minimal scoreboarding but none of them do the full dance of register renaming and decoupled execution units.
5hoom超过 13 年前
This is the sort of stuff that's great to read because you can't help but learn something.<p>I'd never never heard of sorting networks or oblivious sorting before, and it's good to see some answers other than "Quicksort!"
评论 #2920803 未加载
skimbrel超过 13 年前
Wow, thanks for the link. I'd never heard of sorting networks before. The first thing I thought of was "use an XOR swap to cut memory accesses"...
评论 #2921019 未加载
评论 #2920688 未加载
hxa7241超过 13 年前
To be a little more speculative . . . if it is on GPU, is there some way of using rendering operations? The 'hidden-surface problem' in graphics has been described as basically a sorting problem: so represent each number as a triangle Z . . . well, maybe it would not be very fast, but you could do a few million at once!
评论 #2921217 未加载
drv超过 13 年前
The premise is interesting, but why compare results running on general-purpose CPUs when the goal is to run on a GPU? At this level of optimization, comparing two vastly different architectures and compilers would make any results pretty irrelevant.
mrcapers超过 13 年前
could anyone recommend some reading on theory and implementation of sorting networks?
评论 #2920648 未加载
pmiller2超过 13 年前
Related: Minimum-comparison sorting (<a href="http://www.mimuw.edu.pl/~marpe/research/index.html#MCS" rel="nofollow">http://www.mimuw.edu.pl/~marpe/research/index.html#MCS</a>)