If you're interested in regular expressions and their place in automata, Jeff Ullman's <i>Automata</i> course starts today on Coursera: <a href="https://www.coursera.org/course/automata" rel="nofollow">https://www.coursera.org/course/automata</a><p>The recent HN discussion of its announcement is here: <a href="https://news.ycombinator.com/item?id=10089092" rel="nofollow">https://news.ycombinator.com/item?id=10089092</a><p>Ullman is also coauthor of "The Dragon Book".
Google has a similar library with similar goals: <a href="https://github.com/google/re2/wiki/CplusplusAPI" rel="nofollow">https://github.com/google/re2/wiki/CplusplusAPI</a>. It also avoids backtracking.<p>The idea is that backtracking can kill performance, so a specially crafted text that triggers a lot of backtracking can be used as a DoS attack.
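To make the backtracking concern concrete, here is a minimal Python sketch. It is not RE2's or Pire's code, and both function names are hypothetical. It hand-codes a naive backtracking matcher for the textbook pathological pattern (a+)+b, plus a three-state DFA for the same language, and counts how much work each does on an input that cannot match.

```python
def backtrack_attempts(text: str) -> tuple[bool, int]:
    """Naive backtracking full match of (a+)+b; returns (matched, attempts)."""
    attempts = 0

    def group(pos: int) -> bool:
        # Try to match one "a+" group starting at pos, then the rest.
        nonlocal attempts
        attempts += 1
        end = pos
        while end < len(text) and text[end] == "a":
            end += 1
        # Greedy first: let the group take as many a's as possible,
        # then give them back one at a time (this is the backtracking).
        for stop in range(end, pos, -1):
            if text[stop:] == "b":   # the final b closes the match
                return True
            if group(stop):          # otherwise try another a+ group
                return True
        return False

    return group(0), attempts


def dfa_steps(text: str) -> tuple[bool, int]:
    """DFA for the same language (equivalent to a+b); returns (matched, steps)."""
    # States: 0 = start, 1 = seen one or more a's, 2 = accepted.
    table = {(0, "a"): 1, (1, "a"): 1, (1, "b"): 2}
    state, steps = 0, 0
    for ch in text:
        steps += 1
        if (state, ch) not in table:
            return False, steps      # dead state: reject, never rewind
        state = table[(state, ch)]
    return state == 2, steps


print(backtrack_attempts("a" * 10 + "c"))
print(dfa_steps("a" * 10 + "c"))
```

On ten a's followed by c, the toy backtracker makes 1024 attempts (the count doubles with each extra a), while the DFA reads each of the 11 characters once.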
Wow, really impressive. Sometimes specializing by cutting out functionality is the right approach. In this case, eliminating greedy/non-greedy matching (among other features) means this can work as a high-level triage step, and something more specific can do the precision work once you have a candidate match.<p>It looks like this could have a good place in a real-time streaming architecture somewhere.
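A rough Python sketch of that triage-then-precision idea follows. The assumptions are loud ones: Python's standard `re` module stands in for both stages (in practice the first stage would be a DFA-only matcher), and `extract`, `TRIAGE`, and `PRECISE` are hypothetical names. The first pattern is the project's example regex written as a raw string; the second adds a non-greedy capture group, which is the kind of feature the fast stage gives up.

```python
import re

# Stage 1: cheap yes/no filter with no captures.
TRIAGE = re.compile(r"hello\s+w.+d$")
# Stage 2: precise pattern with a non-greedy capture group,
# run only on lines that passed the filter.
PRECISE = re.compile(r"hello\s+(w.+?d)$")


def extract(lines):
    """Return the captured word from every line that passes both stages."""
    results = []
    for line in lines:
        if TRIAGE.search(line) is None:
            continue                  # most input is rejected cheaply here
        match = PRECISE.search(line)
        if match:
            results.append(match.group(1))
    return results


print(extract(["hello   world", "goodbye world", "hello wd"]))
```

Only the first line survives both stages, so the call returns `['world']`.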
README.ru has the real documentation; Google Translate does a pretty good job with it. It mentions that the algorithms are from the Dragon Book.<p>I didn't try the code, but I think it's missing full Unicode character class support (for example, when you use \w). But I see it handles Russian :-)<p><a href="https://github.com/yandex/pire/blob/master/pire/classes.cpp#L82" rel="nofollow">https://github.com/yandex/pire/blob/master/pire/classes.cpp#...</a>
See also <a href="https://swtch.com/%7Ersc/regexp/" rel="nofollow">https://swtch.com/%7Ersc/regexp/</a>
What I don't get is that the example given:<p><pre><code> hello\\s+w.+d$
</code></pre>
is 100% Perl-compatible, which makes it seem more like a "subset" than "incompatible". I've seen comments that say it's a "joke". Can anyone confirm that the title was indeed a joke?<p>Edit: I know what DFAs and NFAs are and how they relate to formal language theory and regular languages. The question still stands: how can a subset be called "incompatible"?
What was wrong with the GNU basic regex?<p>If you're going to write a stripped-down string matching syntax aimed more strictly at "regular" text, then why bother mentioning Perl?
Scary to think that a major search engine really uses regular expressions heavily. Regexps are great for quick scripts, but one would expect major production applications to use better, higher-level parsing algorithms. It must be a nightmare to debug if you have a lot of regexps interacting in a large code base.