TE
科技回声
首页24小时热榜最新最佳问答展示工作
GitHubTwitter
首页

科技回声

基于 Next.js 构建的科技新闻平台,提供全球科技新闻和讨论内容。

GitHubTwitter

首页

首页最新最佳问答展示工作

资源链接

HackerNews API原版 HackerNewsNext.js

© 2025 科技回声. 版权所有。

How we made Haskell search strings as fast as Rust

246 点作者 duijf大约 6 年前

11 条评论

carterschonwald大约 6 年前
as maintainer of two of the haskell libraries they use, a member of the core libraries committee, and a hackage trustee, i&#x27;d like to throw in my two cents:<p>1) i see they are doing fully qualified imports. please add major bounds to your version deps, stack doesn&#x27;t excuse not tracking what versions you used!<p>2) they have<p>&quot;data Next a = Done !a | Step !a &quot; in their code,<p>this tells me they could have possibly benefited using the stream abstraction in Vector!<p><a href="https:&#x2F;&#x2F;hackage.haskell.org&#x2F;package&#x2F;vector-0.12.0.2&#x2F;docs&#x2F;Data-Vector-Fusion-Stream-Monadic.html" rel="nofollow">https:&#x2F;&#x2F;hackage.haskell.org&#x2F;package&#x2F;vector-0.12.0.2&#x2F;docs&#x2F;Dat...</a> is a link to the latter
评论 #19396690 未加载
whalesalad大约 6 年前
I can&#x27;t speak to the content or Haskell at all but I really like the clean formatting of the post, particularly the charts and images. Does anyone know what tool might have been used to produce them? So clean and crisp.
评论 #19384136 未加载
评论 #19385303 未加载
评论 #19384113 未加载
评论 #19384487 未加载
paulddraper大约 6 年前
Impressive work! The clincher is at the end<p>&gt; Relying so much on optimizations to ensure that very high level code compiles to the right low-level code is fragile. Case in point: after upgrading our Stackage snapshot from LTS 10 with GHC 8.2.2 to LTS 13 with GHC 8.6.3, we saw the running time of our benchmark double.<p>This would scare we about doing any real performance optimization in a language as high-level as Haskell: am I just fiddling with compiler internals now?
dmitriid大约 6 年前
What actually amazes me is that... Python is good enough:<p>&gt; A naive Python script that loops over all needles and calls str.find in a loop to find all matches.<p>&gt; working through all 3.5 gigabytes of test data once<p>&gt; Time is ~6.5s.<p>I love these kinds of benchmarks. They consistently show that for specialised cases you do need to drop down to C&#x2F;Rust&#x2F;hand-written Haskell implementations. In many-many more cases, well, even Python with a naive implementation is good enough.<p>Not to diss Haskell or Rust. Only saying: chose your tools as you see fit, and benchmark.<p>Also kudos to the authors to use mean instead of average in the graphs.
评论 #19384112 未加载
评论 #19383090 未加载
评论 #19384264 未加载
评论 #19383343 未加载
评论 #19387204 未加载
timerol大约 6 年前
&gt; Clearly there was plenty of room for speeding up our string matching. We considered binding the Rust library, but due to the way that Haskell’s foreign function interface interacts with the garbage collector, passing a string to a foreign function requires copying it. That copy alone puts any foreign library at a significant disadvantage, so we decided to see how close we could get in Haskell alone.<p>It&#x27;s a shame Haskell copies strings across FFI boundaries, because binding the Rust version would have been almost as fast, with (presumably) significantly less effort.
评论 #19381711 未加载
评论 #19385049 未加载
wuschel大约 6 年前
Interesting article.<p>Out of curiosity: how much optimizing potential is in the software that was programmed by Burntsushi in Rust?
评论 #19381685 未加载
评论 #19381598 未加载
评论 #19385320 未加载
评论 #19381468 未加载
sevensor大约 6 年前
Why UTF-16? Seems like an odd choice of character encodings.
评论 #19381747 未加载
评论 #19381779 未加载
eridius大约 6 年前
Converting the function to act as a fold instead of just returning a lazy list, is that necessary because all the unboxing and strictness internally meant the matches couldn&#x27;t actually be lazy?
评论 #19383307 未加载
评论 #19385068 未加载
glangdale大约 6 年前
It&#x27;s a pity that Aho-Corasick always seems to be the starting point for this kind of work, because it&#x27;s as slow as a wet week and&#x2F;or the tables are huge (pick your poison).<p>To do this right you need a fast prefilter and a reasonable fast string match. Our string matcher in Hyperscan blew AC away on size and speed, typically. Unfortunately, I never had a chance to build its successor. But AC is a disaster of one dependent load after another and is mediocre at best in terms of performance.
评论 #19388061 未加载
olliej大约 6 年前
This does seem to actively circumvent a number of the things that are always held up as why Haskell is a great language: it uses strict types in a number of places, has explicit inline and no-inline annotations everywhere.<p>It also seemed to require a lot of work to construct the code to make tail call optimization happen, but that’s par for the course in Haskell.
评论 #19383353 未加载
评论 #19383308 未加载
评论 #19382626 未加载
评论 #19384965 未加载
评论 #19384290 未加载
wmu大约 6 年前
&quot;A naive Python script that loops over all needles and calls str.find in a loop to find all matches. This is not Aho–Corasick [...]&quot;.<p>I&#x27;m the author of pyahocrasick, a python C extension that implements the AC. I&#x27;d love to see it in the comparison. :)