The article by Russ Cox [0] (from which the initial Perl vs. Thompson NFA graph is taken) is a much more informative read. Also, I'm not sure why this always gets brought up as being a big deal: CS is all about making the right tradeoffs, and I agree that sometimes there are cool tricks that let you _almost_ have your cake and eat it too. But here, there is no such trick: backtracking has always been exponential in the worst case and finite automata have always been linear.<p>No surprises.<p>[0] <a href="https://swtch.com/~rsc/regexp/regexp1.html" rel="nofollow">https://swtch.com/~rsc/regexp/regexp1.html</a>
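If you want to see that asymmetry concretely, here's a quick sketch of mine (not from the article): it times Go's automaton-based regexp package on the classic pathological pattern from Cox's piece, (a?){n}a{n} against a^n, where a backtracking engine needs roughly 2^n steps but an automaton stays near-linear:<p><pre><code> package main

import (
	"fmt"
	"regexp"
	"strings"
	"time"
)

func main() {
	// (a?){n}a{n} matched against "a"*n: a backtracker explores ~2^n
	// paths, while an automaton-based engine like Go's regexp runs in
	// time roughly linear in n.
	for _, n := range []int{10, 20, 30} {
		pattern := "^" + strings.Repeat("a?", n) + strings.Repeat("a", n) + "$"
		re := regexp.MustCompile(pattern)
		input := strings.Repeat("a", n)

		start := time.Now()
		matched := re.MatchString(input)
		fmt.Printf("n=%d matched=%v in %v\n", n, matched, time.Since(start))
	}
}
</code></pre>
On a backtracking engine the n=30 case would take minutes; here all three finish in microseconds.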
I stopped reading when a post in 2017 compared anything to Perl 5.8.7, which is older than what CentOS 5 shipped.<p>Perl 5.8.x was EOL in 2008. 5.10 was EOL in 2009.<p>The currently maintained versions are 5.22.3 and 5.24.1, and in the meantime major redesign, refactoring, and rework have gone into the interpreter, including the regex engine: it went from being a primarily recursive, backtracking matcher to one that uses a DFA when possible, is primarily iterative, and recurses only when necessary.<p>Anyone intentionally picking a version more than a decade old, whose major version line was end-of-lifed nearly a decade ago, to compare against their new hotness has completely invalidated the evidence for their argument. All the great things the slides say may still be true, but they are unsupported by these charts.
Slightly off-topic: BurntSushi wrote a Go binding[1] for the Rust regex engine[2] (which he also authored). It shares the same approach (finite automata) but has a more polished, highly optimized implementation, and could be useful for people facing performance issues with Go's regex engine.<p>[1] <a href="https://github.com/BurntSushi/rure-go" rel="nofollow">https://github.com/BurntSushi/rure-go</a>
[2] <a href="https://github.com/rust-lang/regex" rel="nofollow">https://github.com/rust-lang/regex</a>
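If anyone wants to try it, here's a minimal usage sketch, assuming the MustCompile/IsMatch/Find API shown in the rure-go README; treat the exact signatures as assumptions and check the repo before relying on them (the binding also needs the Rust regex C library installed):<p><pre><code> package main

import (
	"fmt"

	"github.com/BurntSushi/rure-go" // package name "rure" (assumed from the README)
)

func main() {
	// Compile once; matching is done by Rust's finite-automata engine.
	re := rure.MustCompile(`\w+@\w+\.\w+`)
	haystack := "contact: someone@example.com"

	fmt.Println(re.IsMatch(haystack)) // true

	// Find is assumed here to return byte offsets of the leftmost match.
	if start, end, ok := re.Find(haystack); ok {
		fmt.Println(haystack[start:end]) // someone@example.com
	}
}
</code></pre>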
<i>"[A regex is] a style of describing character strings. If a string successfully describes a regex, then it is called a match"</i><p>Shouldn't that be "if a regex successfully describes a string"? I'm confused.
I think the Thompson NFA matcher could be pushed further if someone bothered to do it. For example, since it finds all solutions to a given regular expression, it could be used within a larger backtracking matcher to support back-references.<p>Consider:<p><pre><code> (a+)(a+)=\1
</code></pre>
With an input like:<p><pre><code> aaaaa=aa
</code></pre>
The matcher will find these strings by the time it gets to the '=':<p><pre><code> (a)(aaaa)=
(aa)(aaa)=
(aaa)(aa)=
(aaaa)(a)=
</code></pre>
Each match will end up as a separate still-live thread in the Thompson matcher. Instead of pruning them as early as possible, keep them all and recursively match the rest of the string with the back-reference bound in turn to each candidate until a match is found.
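Here's a minimal sketch of that idea (my own simplification, not a real Thompson VM): hand-enumerate the candidate group bindings for (a+)(a+)=\1, one per surviving thread, and check the \1 back-reference against each in turn:<p><pre><code> package main

import (
	"fmt"
	"strings"
)

// matchBackref hand-simulates (a+)(a+)=\1: the "threads" are the ways
// the run of 'a's before '=' can split into two non-empty groups; the
// back-reference check then tries each binding of \1 in turn.
func matchBackref(input string) (g1, g2 string, ok bool) {
	eq := strings.IndexByte(input, '=')
	if eq < 0 || strings.Trim(input[:eq], "a") != "" {
		return "", "", false
	}
	prefix, rest := input[:eq], input[eq+1:]
	for i := 1; i < len(prefix); i++ { // one iteration per surviving thread
		g1, g2 = prefix[:i], prefix[i:]
		if rest == g1 { // the \1 check against this thread's binding
			return g1, g2, true
		}
	}
	return "", "", false
}

func main() {
	fmt.Println(matchBackref("aaaaa=aa")) // aa aaa true
}
</code></pre>
A real implementation would get the candidate bindings from the live threads of the NFA simulation rather than enumerating splits by hand, but the control flow is the same: linear-time enumeration of candidates, backtracking only over the back-reference.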
TL;DR: we use a subset of today's common regexps, which allowed us to go lightning fast (as long as you don't need lookahead). Oh, and we'll neglect to be clear about that.<p>Nothing wrong with making the tradeoff, but I didn't see it acknowledged in the text.
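Concretely, Go's RE2-style engine rejects lookahead at compile time rather than silently falling back to backtracking; a quick check (a small sketch of my own):<p><pre><code> package main

import (
	"fmt"
	"regexp"
)

func main() {
	// RE2 syntax has no lookahead: compiling one fails outright.
	_, err := regexp.Compile(`foo(?=bar)`)
	fmt.Println(err)
	// prints something like:
	// error parsing regexp: invalid or unsupported Perl syntax: `(?=`
}
</code></pre>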
I do not get why there is still so much hype for regular expressions. They are hard to read and maintain, their syntax is relatively large, and edge cases are easy to get wrong.
Seriously, can we stop fetishising old approaches and papers? Yes, there are some tricks we've missed along the way, but there's a thread of alchemical thinking in our community that is just plain unproductive.