TE
TechEcho
Home24h TopNewestBestAskShowJobs
GitHubTwitter
Home

TechEcho

A tech news platform built with Next.js, providing global tech news and discussions.

GitHubTwitter

Home

HomeNewestBestAskShowJobs

Resources

HackerNews APIOriginal HackerNewsNext.js

© 2025 TechEcho. All rights reserved.

Regular Expression Matching Can Be Simple and Fast (2007)

78 pointsby Toast_25over 7 years ago

10 comments

twicover 7 years ago
There&#x27;s probably someone who posts this comment every time this article comes up, but:<p><pre><code> #![feature(test)] extern crate regex; extern crate test; use regex::Regex; pub fn check(pattern: &amp;Regex, input: &amp;str) -&gt; bool { pattern.is_match(input) } #[cfg(test)] mod tests { use super::*; use test::Bencher; #[bench] fn bench_check(b: &amp;mut Bencher) { let n = 29; let pattern = Regex::new(&amp;format!(&quot;{}{}&quot;, &quot;a?&quot;.repeat(n), &quot;a&quot;.repeat(n))).unwrap(); let input = &quot;a&quot;.repeat(n); b.iter(|| check(&amp;pattern, &amp;input)); } } </code></pre> And then:<p><pre><code> $ cat rust-toolchain nightly-2018-02-09 $ cargo bench Compiling void v1.0.2 Compiling lazy_static v1.0.0 Compiling regex-syntax v0.4.2 Compiling libc v0.2.36 Compiling utf8-ranges v1.0.0 Compiling unreachable v1.0.0 Compiling thread_local v0.3.5 Compiling memchr v2.0.1 Compiling aho-corasick v0.6.4 Compiling regex v0.2.6 Compiling regexps v0.1.0 (file:&#x2F;&#x2F;&#x2F;home&#x2F;twic&#x2F;Code&#x2F;Regexps) Finished release [optimized] target(s) in 20.11 secs Running target&#x2F;release&#x2F;deps&#x2F;regexps-9c084e63d7d31ac3 running 1 test test tests::bench_check ... bench: 212 ns&#x2F;iter (+&#x2F;- 8) test result: ok. 0 passed; 0 failed; 0 ignored; 1 measured; 0 filtered out </code></pre> Rust&#x27;s standard regex library isn&#x27;t the fastest in the world, and the lack of backtracking can be limiting, but it is highly optimised for Rust&#x27;s main use case of posting showoff comments on Hacker News.<p>I also tried this in Java (OpenJDK 1.8.0_161-b14), and it took about 24 seconds per iteration at 29 characters.
评论 #16344345 未加载
评论 #16345746 未加载
thedirt0115over 7 years ago
This is an excellent read. I would suggest reading this also (maybe even before), as it provides one of the simplest (incomplete, but still useful) regex implementations you&#x27;ll ever see: <a href="https:&#x2F;&#x2F;www.cs.princeton.edu&#x2F;courses&#x2F;archive&#x2F;spr09&#x2F;cos333&#x2F;beautiful.html" rel="nofollow">https:&#x2F;&#x2F;www.cs.princeton.edu&#x2F;courses&#x2F;archive&#x2F;spr09&#x2F;cos333&#x2F;be...</a> It suffers from the problems remedied by the original submission (backtracking instead of DFA), but the code is shorter and simpler. They&#x27;re great to compare.
glangdaleover 7 years ago
The biggest gap in the references&#x2F;history section is the failure to mention V. Glushkov, whose NFA construction has much appeal and certainly predates Thompsons&#x27;. Reading the references, one might also draw the conclusion that regex implementation almost ceased between 1968 and 2007.<p>In our experience, regex matching <i>can</i> be simple and fast, but only sometimes. RE2&#x27;s approach is straightforward but naive; we built a considerably faster (on some metrics) and more scalable system in Hyperscan. The cost of this turned out to considerable complexity - and there are still many regexes and inputs that don&#x27;t perform particularly well in either system.
patrickmayover 7 years ago
I thought this was going to be about compiling regular expressions, which is how CL-PPCRE (<a href="https:&#x2F;&#x2F;edicl.github.io&#x2F;cl-ppcre&#x2F;" rel="nofollow">https:&#x2F;&#x2F;edicl.github.io&#x2F;cl-ppcre&#x2F;</a>) manages to be faster than C. Better algorithms beat compilation, though.
carapaceover 7 years ago
Slightly off-topic, <i>derivatives of regular expressions</i> is fun and interesting:<p><a href="https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Brzozowski_derivative" rel="nofollow">https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Brzozowski_derivative</a><p><a href="http:&#x2F;&#x2F;lambda-the-ultimate.org&#x2F;node&#x2F;2293&#x2F;" rel="nofollow">http:&#x2F;&#x2F;lambda-the-ultimate.org&#x2F;node&#x2F;2293&#x2F;</a>
harpocratesover 7 years ago
The comments from the original post[1] of this are still quite relevant. I feel like I&#x27;ve seen this link posted here several times (much more recently than 2009), but I can&#x27;t find the links.<p><pre><code> [1]: https:&#x2F;&#x2F;news.ycombinator.com&#x2F;item?id=466845</code></pre>
评论 #16343254 未加载
评论 #16343274 未加载
anilshanbhagover 7 years ago
There are modules that implement the Thompson&#x27;s algorithm For Python: <a href="https:&#x2F;&#x2F;github.com&#x2F;xysun&#x2F;regex" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;xysun&#x2F;regex</a>
u801eover 7 years ago
The vim text editor actually can use the NFA regular expression engine [1] for some searches. I&#x27;m not sure if that&#x27;s an option for some of the languages mentioned in the article.<p>[1] <a href="http:&#x2F;&#x2F;vimhelp.appspot.com&#x2F;options.txt.html#%27regexpengine%27" rel="nofollow">http:&#x2F;&#x2F;vimhelp.appspot.com&#x2F;options.txt.html#%27regexpengine%...</a>
gumbyover 7 years ago
This should be labeled (2007) and has been on HN several times before. It&#x27;s a good article.<p>The article (and the wikipedia) say Thomson put regexps into a version of QED for CTSS, but CTSS was only used at MIT. He must have traveled between MIT and Bell labs as part of the Multics project. The CTSS reference is pretty obscure.
评论 #16343823 未加载
leraxover 7 years ago
Yes, I know. But regex for everything is bad. What about Parsec? Anyone? <a href="http:&#x2F;&#x2F;book.realworldhaskell.org&#x2F;read&#x2F;using-parsec.html" rel="nofollow">http:&#x2F;&#x2F;book.realworldhaskell.org&#x2F;read&#x2F;using-parsec.html</a>
评论 #16343523 未加载