For another real-world example, see "Details of the Cloudflare outage on July 2, 2019" [1]<p>Python `re` module now supports possessive quantifiers and atomic grouping (I wrote a blog post here [2]). So, for example, `(a+|\w+)*+:` instead of `(a+|\w+)*:` will avoid such catastrophic behavior.<p>[1] <a href="https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/" rel="nofollow">https://blog.cloudflare.com/details-of-the-cloudflare-outage...</a><p>[2] <a href="https://learnbyexample.github.io/python-regex-possessive-quantifier/" rel="nofollow">https://learnbyexample.github.io/python-regex-possessive-qua...</a>
The article starts with the famous quote by Jamie Zawinski:<p><pre><code> Some people, when confronted with a problem, think ‘I know, I’ll use regular
expressions.’ Now they have two problems.
</code></pre>
It then goes on to discuss the perils of regular expressions, when they encounter unanticipated data and run into catastrophic backtracking because of greedy expressions.<p>The article is short and to the point.
I've been in this situation many times.<p>> However, the regular expression module is a C module that doesn’t release the GIL before running.<p>I would love to know why.
The simple example is fixable by anchoring the inner greedy expression, i.e.<p>regex = re.compile(r'(^a+)+b')<p>Perhaps the more general problem was that "<i>The regular expression that was causing the issue here was rather long and convoluted</i>".<p>A good rule seems to be never have long and convoluted regexs, instead have a pipeline of simple-to-understand regexes that can be independently debugged (I don't know if this could cause performance slowdowns however).
Need I point out that one can avoid both the issues highlighted here by just using Tcl?<p>Regexps in Tcl use a non-backtracking implementation which handles such cases with minimal slowdown, e.g. the example shown with 16 and 32-character inputs:<p><pre><code> (dep) 7 % timerate {regexp (a+)+b aaaaaaaaaaaaaaaa}
0.353737 µs/# 2826961 # 2826961 #/sec 1000.000 net-ms
(dep) 8 % timerate {regexp (a+)+b aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}
0.412612 µs/# 2423587 # 2423587 #/sec 1000.000 net-ms</code></pre>
One discussion of this can be found at <a href="https://comp.lang.tcl.narkive.com/XPVME8zS/tcl-regexp-performance" rel="nofollow">https://comp.lang.tcl.narkive.com/XPVME8zS/tcl-regexp-perfor...</a> .<p>Tcl's threading implementation has no GIL, which not only avoids the type of lock-up described here, but also allows performance to scale to use all cores of your machine, see: <a href="https://www.hammerdb.com/blog/uncategorized/why-tcl-is-700-faster-than-python-for-database-benchmarking/" rel="nofollow">https://www.hammerdb.com/blog/uncategorized/why-tcl-is-700-f...</a> .
The real problem is using multithreading in CPython for any set of tasks that has a nonzero chance to block one of the tasks due to a CPU-bound issue. The rule of thumb here - and I learned that the hard way - is to use async/await for anything I/O bound, and multiprocessing for anything CPU-bound.
Why is the run time exponential in the length of the string of a's? It seems like it should be at most O(n^3).<p>2^n is the number of subsets of the string of a's with no contiguity requirements. Nothing like this number of possible matches should need to be checked. What am I missing?
Why Perl version of regexps returns near-instantly for same regexp ?<p>I tested few other on regex101.com and none of them ran longer than a millisecond or two
It really feels like they buried the lede here, as almost all discussions about catastrophic backtracking seem to do. Here's the second-to-last paragraph:<p>> Alternatively you could switch to a regular expression engine that doesn’t exhibit this kind of behaviour. RE2 is an excellent library that uses a finite automata approach to avoid this type of catastrophic backtracking. Using this library on the same 32 character input reduces the running time from 12 minutes to basically instant. The downside being that RE2 doesn’t support all the functionality of the builtin re package.<p>Perl Compatible Regular Expressions (the flavor used by most engines, including Python's) _require_ this exponential worst-case behavior (using an implementation called backtracking). But the theory behind regex does not require this, and if you eliminate just a couple (infrequently used) features of PCRE, you can have a regex engine that runs in linear time (with respect to the size of the input string). See [2], this is by the authors of RE2. The features which are incompatible with this sort of implementation are:<p>1. Lookahead assertions.
2. Backreferences to matched groups.<p>If you don't know what these are, or you rarely ever use them, then you really have no business using this kind of regex engine. (And if you are using them, then I'd argue you're shoving a bit too much logic into a regex, but that's beside the point). Nonetheless, every popular programming language (except Go[1]) has included an exponential backtracking regex implementation in their standard library, and exposed the entire industry to this stupidity, all for a few backreferences and lookahead assertions.<p>What's especially crazy is this: it's easy for a regex engine to <i>detect</i> the use of these constructs, because it has to parse the regex anyway! So it's feasible for the engine to optimistically try to use an efficient implementation, and only fall back to using a backtracking implementation when you're actually using these features. This is what the Python re2 module[3] does, it uses RE2 by default and supports falling back to the backtracking implementation if necessary.<p>Instead, we're stuck reading the same postmortem every few years describing how "catastrophic backtracking" ruined another company/person's day, when the problem has been solved for decades, and language/library creators have just failed to include that solution.<p>[1]: Rob Pike invented, or at least popularized, the algorithm used by RE2. He was also involved in the creation of Go, as was Russ Cox[2].
[2]: <a href="https://swtch.com/~rsc/regexp/" rel="nofollow">https://swtch.com/~rsc/regexp/</a>
[3]: <a href="https://pypi.org/project/re2/" rel="nofollow">https://pypi.org/project/re2/</a>