Fix Boyer-Moore searcher with the Rytter correction

193 点作者 HenryR大约 5 年前

16 条评论

What is a minimal test case (needle + haystack) for which the old code yields a different result than the new code?

评论 #22898522 未加载

评论 #22896981 未加载

oefrha大约 5 年前

The papers:- Boyle, Moore (1977) <a href="https://www.cs.utexas.edu/users/moore/publications/fstrpos.pdf" rel="nofollow">https://www.cs.utexas.edu/users/moore/publications/fstrpos.p...</a>- Knuth, Morris, Pratt (1977) <a href="https://pdfs.semanticscholar.org/4479/9559a1067e06b5a6bf052f8f10637707928f.pdf" rel="nofollow">https://pdfs.semanticscholar.org/4479/9559a1067e06b5a6bf052f...</a>- Rytter (1980) <a href="https://doi.org/10.1137/0209037" rel="nofollow">https://doi.org/10.1137/0209037</a> (paywalled without subscription, but you know where to find it cough scihub cough).

dang大约 5 年前

Submitted title was "A C++17 STL bug that can be attributed to Knuth, and its 40-year-old fix", which is nice additional context, but can now go here rather than there.

tacitusarc大约 5 年前

What I really enjoy about this merge request are these comments:> Rename _Suffix_fn to _Fx, again following the papers. Note that in the usage below, there is a semantic change: _Suffix_fn stored 0-based values, while _Fx stores 1-based values. This undoes a micro-optimization (_Suffix_fn avoided unnecessarily translating between the 0-based and 1-based domains), but with the increased usage of f in the Rytter correction, I wanted greater correspondence with the published algorithms in order to verify the implementation.And> Rename 1-based _Idx to 1-based _Jx. (While the code was correct, I found it confusing that _Idx was 0-based in other loops but 1-based in this loop.)Naming is one of the most strangely difficult aspects of programming, but if there is one rule it's be consistent. I really hate code that uses a variable name to mean one thing in one place and other somewhere else. I just learned what that meant! Why change it?! And yet this is common, and sometimes a comment about it in code review might get a response like "the code works, why does it matter?" This. This is why it matters. It causes confusion, which makes code difficult to verify, which causes bugs.

评论 #22899623 未加载

评论 #22903489 未加载

dvt大约 5 年前

What I find most interesting about this bugfix is simply how terrible of a source Wikipedia is and, at the same time, how valuable professional insight can be (e.g. someone that knows what they're talking about). And to add insult to injury, someone quickly edited Wikipedia (for the "street cred" I'm sure):> The original paper contained errors for computing the table of pattern shifts which was corrected by Wojciech Rytter in 1980 [1]But that's still actually wrong. Computing the table was not even discussed in the "original paper" -- see the papers linked by @oefrha in this thread. They're two different 1977 papers![1] <a href="https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string-search_algorithm" rel="nofollow">https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string-sea...</a>

评论 #22899664 未加载

评论 #22899238 未加载

评论 #22896883 未加载

praptak大约 5 年前

Lol, I was taught by Rytter like 20 years ago. He did much more in text algorithms than just correcting B-M. The two books by Crochemore & Rytter are basically the books on text algorithms.

评论 #22899254 未加载

skyde大约 5 年前

This guy is a hero! I have so much respect for his work.By i’m still wondering if this bug could have been detected by automatic fuzzing testing ?

评论 #22898657 未加载

评论 #22896502 未加载

zvrba大约 5 年前

> However, the published algorithm for dd' was incorrect! This was discovered and fixed by Rytter in 1980, which we weren't aware of until we received a bug report.Fits what Knuth said: "Beware of bugs in the above code; I have only proved it correct, not tried it."

karma_fountain大约 5 年前

Nice fix! It's been a while since I've done c++ so does the following code<pre><code> const auto elapsed = steady_clock::now() - start; if (elapsed > 10s) { </code></pre> actually use a units quantifier on the 10, i.e. does 10s mean 10 seconds? Is that a library thing? Cool if so.

评论 #22897005 未加载

评论 #22897089 未加载

rurban大约 5 年前

Nice, but using decade old and overly slow string search algorithms still is questionable. There are hundreds of faster algos, the fastest and most stable currently being epsm (2013). <a href="https://www.dmi.unict.it/~faro/smart/algorithms.php?algorithm=EPSM&code=epsm" rel="nofollow">https://www.dmi.unict.it/~faro/smart/algorithms.php?algorith...</a>

mcbain大约 5 年前

A while back I filed an issue on the MSVC std::vector implementation as the behaviour of a weird corner case differed from GCC/Clang.The reply I got from STL himself was fantastic, pointing out that it was [paraphrasing here] a wart in the spec, and that neither were actually incorrect.To my lasting regret, I didn't keep an ahem offsite backup of some of those corp emails. Hence I can't recall the exact details.

评论 #22898733 未加载

skyde大约 5 年前

Any one know why the Turbo Boyer-Moore algorithm is not more popular. Any case when Boyer Moore would be better?

评论 #22897638 未加载

评论 #22901978 未加载

评论 #22898420 未加载

kazinator大约 5 年前

> I find it very curious that it isn't constantly mentioned when explaining Boyer-MoorePapers need "superseded by ..." type forward references, like IETF RFC's. Or rather "corrected by", in this case.Maybe important algorithms need to be summarized in RFC-like documents.

评论 #22899818 未加载

NCG_Mike大约 5 年前

I remember Dr Dobbs, way back when, having an article on the search algorithm. I think they had C code for it.I did a version in 68K on the Mac (MPW Shell days) for a SGML editor I was working on.

kzrdude大约 5 年前

So there is potentially a simpler fix than the Rytter correction, is that also yet to discover in published papers somewhere, or do we need to ask Knuth himself?

评论 #22899085 未加载

fithisux大约 5 年前

Is Dlang affected?

16 条评论

bumbledraven大约 5 年前

What is a minimal test case (needle + haystack) for which the old code yields a different result than the new code?

评论 #22898522 未加载

评论 #22896981 未加载

oefrha大约 5 年前

dang大约 5 年前

Submitted title was "A C++17 STL bug that can be attributed to Knuth, and its 40-year-old fix", which is nice additional context, but can now go here rather than there.

tacitusarc大约 5 年前

评论 #22899623 未加载

评论 #22903489 未加载

dvt大约 5 年前

评论 #22899664 未加载

评论 #22899238 未加载

评论 #22896883 未加载

praptak大约 5 年前

Lol, I was taught by Rytter like 20 years ago. He did much more in text algorithms than just correcting B-M. The two books by Crochemore & Rytter are basically the books on text algorithms.

评论 #22899254 未加载

skyde大约 5 年前

This guy is a hero! I have so much respect for his work.By i’m still wondering if this bug could have been detected by automatic fuzzing testing ?

评论 #22898657 未加载

评论 #22896502 未加载

zvrba大约 5 年前

karma_fountain大约 5 年前

评论 #22897005 未加载

评论 #22897089 未加载

rurban大约 5 年前

mcbain大约 5 年前

评论 #22898733 未加载

skyde大约 5 年前

Any one know why the Turbo Boyer-Moore algorithm is not more popular. Any case when Boyer Moore would be better?

评论 #22897638 未加载

评论 #22901978 未加载

评论 #22898420 未加载

kazinator大约 5 年前

评论 #22899818 未加载

NCG_Mike大约 5 年前

I remember Dr Dobbs, way back when, having an article on the search algorithm. I think they had C code for it.I did a version in 68K on the Mac (MPW Shell days) for a SGML editor I was working on.

kzrdude大约 5 年前

So there is potentially a simpler fix than the Rytter correction, is that also yet to discover in published papers somewhere, or do we need to ask Knuth himself?

评论 #22899085 未加载

fithisux大约 5 年前

Is Dlang affected?