TE
科技回声
首页24小时热榜最新最佳问答展示工作
GitHubTwitter
首页

科技回声

基于 Next.js 构建的科技新闻平台,提供全球科技新闻和讨论内容。

GitHubTwitter

首页

首页最新最佳问答展示工作

资源链接

HackerNews API原版 HackerNewsNext.js

© 2025 科技回声. 版权所有。

Sort Faster with FPGAs

86 点作者 szczys超过 9 年前

13 条评论

vvanders超过 9 年前
I wonder how quickly routing resources will cause the fan-out time to hurt the speed of sort as well.<p>Also fun is Radix Sort. If you can guarantee a fixed size key(say 32bits) a Radix Sort will do O(kN) where k is usually 1-4 based on your cache line size. Destroys any other sort due to linear data reads by almost an order of magnitude in some cases. I&#x27;ve used on on some previous gigs to sort thousands of quads front-to-back in record time on some older ARM chips.
评论 #10941732 未加载
评论 #10941843 未加载
kbwt超过 9 年前
A better solution which only uses local connections between adjacent cells is Odd-Even Transposition Sort.<p>When the cells are arranged in a 2D grid, Shearsort sorts all the values in snake-like order in O(sqrt(n) log(n)). A slightly more complicated algorithm based on Shearsort is that of Schnorr and Shamir, which runs in O(sqrt(n)).
theocean154超过 9 年前
There&#x27;s a far better way to do this, which essentially decomposes into an insertion sort. Details here: <a href="https:&#x2F;&#x2F;github.com&#x2F;theocean154&#x2F;dau&#x2F;tree&#x2F;master&#x2F;sort" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;theocean154&#x2F;dau&#x2F;tree&#x2F;master&#x2F;sort</a>
评论 #10942070 未加载
alain94040超过 9 年前
O(N) is not particularly insightful for fixed&#x2F;limited values of N. This would be more interesting if the author reported sorts&#x2F;s with the FPGA vs. software. Because obviously, as N grows, the FPGA only performs the full sort in &quot;one cycle&quot; if you are willing to lower the cycle time more and more (until the FPGA blows up).
评论 #10942274 未加载
mabbo超过 9 年前
&gt; each sorting step is fast, bubbling down to N identical and <i>independent questions asked in parallel</i> at every clock cycle.<p>&gt; The first empty cell could take the new data since it’s empty, but since 4 just got kicked out from the cell above, this empty cell gets the 4 instead<p>These are incompatible. The result in cell 3 is dependent on the result of cell 2 (and so on up the chain of cells).<p>Therefore, the steps have to run in order, making this O(n) just for adding one item to the set.<p>The end goal might still be possible, but I don&#x27;t see how this can work.
评论 #10943496 未加载
评论 #10943353 未加载
jcbeard超过 9 年前
Here&#x27;s an old but great reference on the subject: <a href="http:&#x2F;&#x2F;www.ccrc.wustl.edu&#x2F;~roger&#x2F;papers&#x2F;cg09.pdf" rel="nofollow">http:&#x2F;&#x2F;www.ccrc.wustl.edu&#x2F;~roger&#x2F;papers&#x2F;cg09.pdf</a><p>I used to teach digital design, sorting was one of the first topics for FPGA&#x27;s :)
e19293001超过 9 年前
A much more faster sorting algorithm implementation for hardware is sorting networks[1]. This can be implemented with just comparators and wires. Thus, there&#x27;s no clock involved since it is a combinational circuit. It&#x27;s quite simple circuit than can be built from bottom to top with logic gates. Knuth&#x27;s TAOCP has a good explanation. Additionally, there are lots of information on the internet.<p>---<p><a href="https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Sorting_network" rel="nofollow">https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Sorting_network</a>
melted超过 9 年前
We&#x27;ll see a lot of FPGA use in the coming years. Consider this: newest FPGAs have 1TB&#x2F;s (that&#x27;s a terabyte per second, not terabit) memory bandwidth and 1KB-wide memory bus. Short of developing your own silicon (which takes years and tons of money), you simply can&#x27;t get this kind of throughput with anything else. That&#x27;s basically why Intel bought Altera.
评论 #10942632 未加载
bedros超过 9 年前
I built a test chip for a class project in college that did sorting in linear time, without all the routing issues of this algorithm.
评论 #10941738 未加载
jhallenworld超过 9 年前
The article referenced in the comments is very cool:<p><a href="http:&#x2F;&#x2F;citeseerx.ist.psu.edu&#x2F;viewdoc&#x2F;download?doi=10.1.1.83.5786&amp;rep=rep1&amp;type=pdf" rel="nofollow">http:&#x2F;&#x2F;citeseerx.ist.psu.edu&#x2F;viewdoc&#x2F;download?doi=10.1.1.83....</a><p>It seems related to certain pipeline bypass structures.
keithwhor超过 9 年前
I have the pleasure of working with Josh, and first heard him talk about this in a car ride to get lunch. :) Very cool stuff, and very excited to see the things he builds in the future.
theocean154超过 9 年前
rearranged my repo, here is the scalable, pipelined way of doing this, it ends up being an insertion sort with the benefit of small fanout (for lower energy), better scalability (faster due to lower fanout and shorter critical path with pipelining), and maintains O(1) throughput (no fill&#x2F;drain cycle).<p><a href="https:&#x2F;&#x2F;github.com&#x2F;theocean154&#x2F;dau&#x2F;blob&#x2F;master&#x2F;docs&#x2F;part_sort_block_diagram.pdf" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;theocean154&#x2F;dau&#x2F;blob&#x2F;master&#x2F;docs&#x2F;part_sor...</a>
评论 #10944071 未加载
TickleSteve超过 9 年前
Trades time for parallelism.<p>This isn&#x27;t going to scale to large data sets...