TE
科技回声
首页24小时热榜最新最佳问答展示工作
GitHubTwitter
首页

科技回声

基于 Next.js 构建的科技新闻平台,提供全球科技新闻和讨论内容。

GitHubTwitter

首页

首页最新最佳问答展示工作

资源链接

HackerNews API原版 HackerNewsNext.js

© 2025 科技回声. 版权所有。

Show HN: Krep a High-Performance String Search Utility Written in C

173 点作者 daviducolo2 个月前

19 条评论

johnisgood2 个月前
Those CPU features (AVX2 and whatnot) need to be detected at runtime, too, however.<p>Those ifdefs only detect if the compiler supports them, i.e. at build-time only.<p>So... your program only compiles with AVX2 and others if the compiler supports them; so you should compile where the compiler has all those features (because you want everything to be compiled into one executable, of course), and then use runtime checks to make sure the CPU on which the program is run has actually support for AVX2, for example, as it can select the best implementation based on the available CPU features.<p>To make things a bit more complicated, let me quote a part from one of the projects he has: &quot;The detection is performed at configure time through both CPUID flags and actual instruction execution tests on the host machine, verifying support in both the CPU and operating system.&quot;. Currently what you are doing is the &quot;OS&quot;, or rather, compiler, since you are using only macro definitions.<p>Once you add this, then &quot;Automatically leverages SSE4.2 and AVX2 instructions when available for maximum throughput.&quot; from the list of features on the website will be correct &#x2F; accurate.<p>If interested, someone I know (or rather, follow) has a single header file for detecting CPU features at runtime (for C), and he also has a build-time detection one, but that has much more features.
评论 #43339808 未加载
评论 #43339194 未加载
daviducolo2 个月前
You can read my blog post about the project at <a href="https:&#x2F;&#x2F;dev.to&#x2F;daviducolo&#x2F;introducing-krep-building-a-high-performance-string-search-utility-2pdo" rel="nofollow">https:&#x2F;&#x2F;dev.to&#x2F;daviducolo&#x2F;introducing-krep-building-a-high-p...</a>
评论 #43337748 未加载
评论 #43335277 未加载
评论 #43335647 未加载
评论 #43339241 未加载
kazinator2 个月前
Where are the test cases?<p>E.g. the chunk boundary stuff in the multi-threaded file search is something that would make me nervous.<p>It brings new edge cases into a simple search, and those edge cases are not directly related to features in the data; just to the happenstance of where the chunk boundaries land.<p>Just by adding one character to a file, a character that obviously lies far outside of any match, we shift the boundaries such that a match could break megabytes away from the insertion.
评论 #43336635 未加载
forgotpwd162 个月前
Homepage shows it significantly faster than ripgrep. Impressive. Would like to see how it compares across the entire ripgrep&#x27;s benchmark suite[1], which also includes a few other similar utilities.<p>edit: Getting an error related to madvise(). Had to insert &#x27;-D_GNU_SOURCE&#x27; in Makefile&#x27;s CFLAGS.<p>[1]: <a href="https:&#x2F;&#x2F;github.com&#x2F;BurntSushi&#x2F;ripgrep&#x2F;blob&#x2F;master&#x2F;benchsuite&#x2F;benchsuite" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;BurntSushi&#x2F;ripgrep&#x2F;blob&#x2F;master&#x2F;benchsuite...</a>
评论 #43334283 未加载
评论 #43334718 未加载
burntsushi2 个月前
This wouldn&#x27;t build for me, so I had to apply the patch suggested by a sibling comment.<p>Once I got it building, my first benchmark attempt shows it as being slower:<p><pre><code> $ curl -LO &#x27;https:&#x2F;&#x2F;burntsushi.net&#x2F;stuff&#x2F;subtitles2016-sample.en.gz&#x27; % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 265M 100 265M 0 0 48.6M 0 0:00:05 0:00:05 --:--:-- 49.9M $ gzip -d subtitles2016-sample.en.gz $ hyperfine --ignore-failure &quot;rg -c &#x27;ZQZQZQZQ&#x27; subtitles2016-sample.en&quot; &quot;krep -c &#x27;ZQZQZQZQ&#x27; subtitles2016-sample.en&quot; Benchmark 1: rg -c &#x27;ZQZQZQZQ&#x27; subtitles2016-sample.en Time (mean ± σ): 80.7 ms ± 1.6 ms [User: 57.7 ms, System: 22.7 ms] Range (min … max): 75.3 ms … 83.3 ms 35 runs Warning: Ignoring non-zero exit code. Benchmark 2: krep -c &#x27;ZQZQZQZQ&#x27; subtitles2016-sample.en Time (mean ± σ): 122.8 ms ± 1.4 ms [User: 372.6 ms, System: 24.4 ms] Range (min … max): 120.2 ms … 125.5 ms 24 runs Summary rg -c &#x27;ZQZQZQZQ&#x27; subtitles2016-sample.en ran 1.52 ± 0.03 times faster than krep -c &#x27;ZQZQZQZQ&#x27; subtitles2016-sample.en </code></pre> That&#x27;s a benchmark with no matches, which is the best case essentially for throughput. Now I want to try a benchmark with a high match frequency:<p><pre><code> $ hyperfine &quot;rg -c &#x27;the&#x27; subtitles2016-sample.en&quot; &quot;krep -c &#x27;the&#x27; subtitles2016-sample.en&quot; Benchmark 1: rg -c &#x27;the&#x27; subtitles2016-sample.en Time (mean ± σ): 411.8 ms ± 3.6 ms [User: 389.7 ms, System: 21.1 ms] Range (min … max): 404.8 ms … 415.7 ms 10 runs Benchmark 2: krep -c &#x27;the&#x27; subtitles2016-sample.en Time (mean ± σ): 121.2 ms ± 1.9 ms [User: 364.6 ms, System: 24.9 ms] Range (min … max): 113.2 ms … 123.0 ms 24 runs Summary krep -c &#x27;the&#x27; subtitles2016-sample.en ran 3.40 ± 0.06 times faster than rg -c &#x27;the&#x27; subtitles2016-sample.en </code></pre> Which is very nice. So I decided to poke at it:<p><pre><code> $ krep -c the subtitles2016-sample.en Found 29794426 matches $ rg -c the subtitles2016-sample.en 6123710 $ grep -c the subtitles2016-sample.en 6123710 </code></pre> The counts are way off here. At first I thought maybe it was counting every occurrence of `the` instead of every matching line, but when I ask ripgrep to do that, it gets a different answer:<p><pre><code> $ rg -oc the subtitles2016-sample.en 7739791 $ rg -o the subtitles2016-sample.en | wc -l 7739791 $ grep -o the subtitles2016-sample.en | wc -l 7739791 </code></pre> So not sure what&#x27;s going on here, but it looks like `krep` might not be giving accurate results.<p>Pushing it a bit more, it seems like it just kind of falls over?<p><pre><code> $ time rg -c &#x27;You read Sherlock Holmes to deduce that\?&#x27; subtitles2016-sample.en 10 real 0.076 user 0.049 sys 0.026 maxmem 923 MB faults 0 $ time krep -c &#x27;You read Sherlock Holmes to deduce that?&#x27; subtitles2016-sample.en Found 0 matches real 0.935 user 3.597 sys 0.029 maxmem 918 MB faults 0 </code></pre> I ran the above benchmarks in `&#x2F;dev&#x2F;shm` on Linux with an i9-12900K.<p>In terms of the approach here, ripgrep is already using a pretty sophisticated substring search algorithm: <a href="https:&#x2F;&#x2F;github.com&#x2F;BurntSushi&#x2F;memchr?tab=readme-ov-file#algorithms-used" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;BurntSushi&#x2F;memchr?tab=readme-ov-file#algo...</a><p>And it uses memory maps (sometimes, when it thinks it will be fast, but it will do so in the single file case on Linux).<p>ripgrep also uses parallelism, but at inter-file level. It sounds like `krep` also uses parallelism, but will use multiple threads when searching a single file. I&#x27;ve considered doing the same in ripgrep, but haven&#x27;t done enough experiments (or seen enough from someone else) to be convinced that it&#x27;s the right way to go in general. It might edge out single threaded search in some cases for sure though.<p>EDIT: Looking at the timings in <a href="https:&#x2F;&#x2F;dev.to&#x2F;daviducolo&#x2F;introducing-krep-building-a-high-performance-string-search-utility-2pdo" rel="nofollow">https:&#x2F;&#x2F;dev.to&#x2F;daviducolo&#x2F;introducing-krep-building-a-high-p...</a>, I see, for example, ripgrep taking &gt;40 seconds to search for the literal pattern `error` in a 5GB file. Even if you&#x27;re reading from disk (which the OP is using an SSD), that does not seem right at all. Even for an exceptionally common word like `the` in this haystack, ripgrep can chew through a 13GB file in 5 seconds on my machine:<p><pre><code> $ time rg -c the full.txt 83499915 real 5.404 user 5.092 sys 0.302 maxmem 12511 MB faults 0 </code></pre> Even if I force reading from disk, we get nowhere near 40 seconds:<p><pre><code> $ sudo sh -c &#x27;echo 3 &gt; &#x2F;proc&#x2F;sys&#x2F;vm&#x2F;drop_caches&#x27; $ time rg -c the full.txt 83499915 real 10.577 user 5.191 sys 2.105 maxmem 12511 MB faults 42 </code></pre> I&#x27;m not saying the benchmark results are definitely wrong, but something <i>looks</i> off here that I can&#x27;t easily explain. OP, can you please share a way to fully reproduce your benchmark? (Like I did above for `subtitles2016-sample.en`.)
评论 #43334854 未加载
评论 #43338802 未加载
hn_acc12 个月前
Do you have an explanation for the obviously wrong answers in simple examples shown here?
评论 #43337844 未加载
creaktive2 个月前
Very cool! The repo is a reference for minimalism. Also, TIL about ifeq in Makefile. So, many thanks!
评论 #43335387 未加载
daviducolo2 个月前
Source Code on GitHub: <a href="https:&#x2F;&#x2F;github.com&#x2F;davidesantangelo&#x2F;krep" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;davidesantangelo&#x2F;krep</a>
groos2 个月前
Minor nit: the &quot;re&quot; part of grep stands for &quot;regular expression&quot;. That doesn&#x27;t seem to be the case with krep so it&#x27;s a bit misnamed maybe?
评论 #43336711 未加载
malkia2 个月前
How do you handle disk errors and file mapping?<p>To give more context, if there is a disk error (logical, physical, etc.), an &quot;fread&quot; would simply return an error, it won&#x27;t interfere with the rest.<p>But with memory mapped files, you have to deal this in someway. For example on Windows, through SEH (__try &#x2F; __except) around blocks reading (or writing) to memory mapped files.<p>Just wondering...
daviducolo2 个月前
try the latest version 0.1.5: <a href="https:&#x2F;&#x2F;github.com&#x2F;davidesantangelo&#x2F;krep&#x2F;releases&#x2F;tag&#x2F;v0.1.5" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;davidesantangelo&#x2F;krep&#x2F;releases&#x2F;tag&#x2F;v0.1.5</a>
jurschreuder2 个月前
Finally something not-Rust!
评论 #43336004 未加载
simlevesque2 个月前
I love the install process.
评论 #43336366 未加载
OhMeadhbh2 个月前
#include &lt;snarky_comment_about_not_using_rust.h&gt;<p>Seriously though... thx! this is directly applicable to current interests and the code is not a jumbled mess.
torlok2 个月前
This is only for string matching? I can&#x27;t find any mentions of regular expression support. Why use the &quot;re&quot; naming scheme?
评论 #43335297 未加载
评论 #43335300 未加载
stefanos822 个月前
It&#x27;s weird that the_silver_searcher, also known as `ag` [1] is not mentioned in benchmarks, which is also implemented in C.<p>I wonder why...<p>[1] <a href="https:&#x2F;&#x2F;github.com&#x2F;ggreer&#x2F;the_silver_searcher" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;ggreer&#x2F;the_silver_searcher</a>
评论 #43335298 未加载
评论 #43335153 未加载
buggz2k2 个月前
did I possibly just download malware? this does not work on M4 Mac, nor Intel windows. anyone else get this to work?
deepanwadhwa2 个月前
Is ahocorasick in a different plane than this?
评论 #43337891 未加载
oulipo2 个月前
Nice, but why not just do a PR a ripgrep to add your algo?
评论 #43334868 未加载