科技回声

11 条评论

asicsp超过 2 年前

For another real-world example, see "Details of the Cloudflare outage on July 2, 2019" [1]Python `re` module now supports possessive quantifiers and atomic grouping (I wrote a blog post here [2]). So, for example, `(a+|\w+)*+:` instead of `(a+|\w+)*:` will avoid such catastrophic behavior.[1] <a href="https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/" rel="nofollow">https://blog.cloudflare.com/details-of-the-cloudflare-outage...</a>[2] <a href="https://learnbyexample.github.io/python-regex-possessive-quantifier/" rel="nofollow">https://learnbyexample.github.io/python-regex-possessive-qua...</a>

评论 #33593535 未加载

评论 #33592958 未加载

评论 #33594144 未加载

评论 #33594580 未加载

xrayarx超过 2 年前

The article starts with the famous quote by Jamie Zawinski:<pre><code> Some people, when confronted with a problem, think ‘I know, I’ll use regular expressions.’ Now they have two problems. </code></pre> It then goes on to discuss the perils of regular expressions, when they encounter unanticipated data and run into catastrophic backtracking because of greedy expressions.The article is short and to the point.

评论 #33592290 未加载

Narann超过 2 年前

I've been in this situation many times.> However, the regular expression module is a C module that doesn’t release the GIL before running.I would love to know why.

评论 #33592171 未加载

oweiler超过 2 年前

I sometimes think regular expressions being greedy is a bad default.

评论 #33592970 未加载

评论 #33595774 未加载

评论 #33592972 未加载

photochemsyn超过 2 年前

The simple example is fixable by anchoring the inner greedy expression, i.e.regex = re.compile(r'(^a+)+b')Perhaps the more general problem was that "The regular expression that was causing the issue here was rather long and convoluted".A good rule seems to be never have long and convoluted regexs, instead have a pipeline of simple-to-understand regexes that can be independently debugged (I don't know if this could cause performance slowdowns however).

评论 #33595691 未加载

cmacleod4超过 2 年前

Need I point out that one can avoid both the issues highlighted here by just using Tcl?Regexps in Tcl use a non-backtracking implementation which handles such cases with minimal slowdown, e.g. the example shown with 16 and 32-character inputs:<pre><code> (dep) 7 % timerate {regexp (a+)+b aaaaaaaaaaaaaaaa} 0.353737 µs/# 2826961 # 2826961 #/sec 1000.000 net-ms (dep) 8 % timerate {regexp (a+)+b aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa} 0.412612 µs/# 2423587 # 2423587 #/sec 1000.000 net-ms</code></pre> One discussion of this can be found at <a href="https://comp.lang.tcl.narkive.com/XPVME8zS/tcl-regexp-performance" rel="nofollow">https://comp.lang.tcl.narkive.com/XPVME8zS/tcl-regexp-perfor...</a> .Tcl's threading implementation has no GIL, which not only avoids the type of lock-up described here, but also allows performance to scale to use all cores of your machine, see: <a href="https://www.hammerdb.com/blog/uncategorized/why-tcl-is-700-faster-than-python-for-database-benchmarking/" rel="nofollow">https://www.hammerdb.com/blog/uncategorized/why-tcl-is-700-f...</a> .

BerislavLopac超过 2 年前

The real problem is using multithreading in CPython for any set of tasks that has a nonzero chance to block one of the tasks due to a CPU-bound issue. The rule of thumb here - and I learned that the hard way - is to use async/await for anything I/O bound, and multiprocessing for anything CPU-bound.

评论 #33600677 未加载

wikfwikf超过 2 年前

Why is the run time exponential in the length of the string of a's? It seems like it should be at most O(n^3).2^n is the number of subsets of the string of a's with no contiguity requirements. Nothing like this number of possible matches should need to be checked. What am I missing?

adql超过 2 年前

Why Perl version of regexps returns near-instantly for same regexp ?I tested few other on regex101.com and none of them ran longer than a millisecond or two

评论 #33593071 未加载

brenns10超过 2 年前

It really feels like they buried the lede here, as almost all discussions about catastrophic backtracking seem to do. Here's the second-to-last paragraph:> Alternatively you could switch to a regular expression engine that doesn’t exhibit this kind of behaviour. RE2 is an excellent library that uses a finite automata approach to avoid this type of catastrophic backtracking. Using this library on the same 32 character input reduces the running time from 12 minutes to basically instant. The downside being that RE2 doesn’t support all the functionality of the builtin re package.Perl Compatible Regular Expressions (the flavor used by most engines, including Python's) _require_ this exponential worst-case behavior (using an implementation called backtracking). But the theory behind regex does not require this, and if you eliminate just a couple (infrequently used) features of PCRE, you can have a regex engine that runs in linear time (with respect to the size of the input string). See [2], this is by the authors of RE2. The features which are incompatible with this sort of implementation are:1. Lookahead assertions. 2. Backreferences to matched groups.If you don't know what these are, or you rarely ever use them, then you really have no business using this kind of regex engine. (And if you are using them, then I'd argue you're shoving a bit too much logic into a regex, but that's beside the point). Nonetheless, every popular programming language (except Go[1]) has included an exponential backtracking regex implementation in their standard library, and exposed the entire industry to this stupidity, all for a few backreferences and lookahead assertions.What's especially crazy is this: it's easy for a regex engine to detect the use of these constructs, because it has to parse the regex anyway! So it's feasible for the engine to optimistically try to use an efficient implementation, and only fall back to using a backtracking implementation when you're actually using these features. This is what the Python re2 module[3] does, it uses RE2 by default and supports falling back to the backtracking implementation if necessary.Instead, we're stuck reading the same postmortem every few years describing how "catastrophic backtracking" ruined another company/person's day, when the problem has been solved for decades, and language/library creators have just failed to include that solution.[1]: Rob Pike invented, or at least popularized, the algorithm used by RE2. He was also involved in the creation of Go, as was Russ Cox[2]. [2]: <a href="https://swtch.com/~rsc/regexp/" rel="nofollow">https://swtch.com/~rsc/regexp/</a> [3]: <a href="https://pypi.org/project/re2/" rel="nofollow">https://pypi.org/project/re2/</a>

评论 #33599074 未加载

cutler超过 2 年前

Just use Perl.

评论 #33594690 未加载

11 条评论

asicsp超过 2 年前

评论 #33593535 未加载

评论 #33592958 未加载

评论 #33594144 未加载

评论 #33594580 未加载

xrayarx超过 2 年前

评论 #33592290 未加载

Narann超过 2 年前

I've been in this situation many times.> However, the regular expression module is a C module that doesn’t release the GIL before running.I would love to know why.

评论 #33592171 未加载

oweiler超过 2 年前

I sometimes think regular expressions being greedy is a bad default.

评论 #33592970 未加载

评论 #33595774 未加载

评论 #33592972 未加载

photochemsyn超过 2 年前

评论 #33595691 未加载

cmacleod4超过 2 年前

BerislavLopac超过 2 年前

评论 #33600677 未加载

wikfwikf超过 2 年前

adql超过 2 年前

Why Perl version of regexps returns near-instantly for same regexp ?I tested few other on regex101.com and none of them ran longer than a millisecond or two

Python, Catastrophic Regular Expressions and the GIL (2013)

11 条评论

Python, Catastrophic Regular Expressions and the GIL (2013)

11 条评论