Regular expressions match regular languages (hence the name). If your language involves arbitrarily nested pairs of things (e.g. HTML), it's not regular. Perl hacked support for this in via backreferences and other extensions, but these are slow and illegible. Use a proper context-free grammar parser if you need to parse a context-free grammar, you know?<p>More broadly, people fear and misunderstand regexes because they have no idea how they work. It becomes much easier if you understand how they map to deterministic finite state machines. Recommended reading: <a href="https://swtch.com/~rsc/regexp/regexp1.html" rel="nofollow">https://swtch.com/~rsc/regexp/regexp1.html</a><p>Once you understand how they work, you can basically read a regex left to right and intuitively know all the strings it would match. There is no such thing as an unmaintainable/illegible basic regex - they're just words with some placeholders in them - it's when you cram in extended functionality (which is basically a programming language where all the keywords are single characters) that shit hits the fan.
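As a rough illustration of the state-machine view, here is a minimal Python sketch of a hand-built DFA for the basic regex ab*c, compared against Python's re module. The state names and the dfa_accepts helper are made up for this example and are not taken from the linked article.

```python
import re

# Hand-built DFA for the regex ab*c. State names are illustrative only.
TRANSITIONS = {
    ("start", "a"): "seen_a",    # the leading 'a'
    ("seen_a", "b"): "seen_a",   # any number of 'b's loops in place
    ("seen_a", "c"): "accept",   # the closing 'c'
}

def dfa_accepts(text):
    """Walk the table one character at a time; no backtracking is needed."""
    state = "start"
    for ch in text:
        state = TRANSITIONS.get((state, ch))
        if state is None:        # no transition defined: reject
            return False
    return state == "accept"

# The DFA and the regex engine should agree on every sample.
for sample in ["ac", "abc", "abbbc", "a", "abca", "", "bc"]:
    assert dfa_accepts(sample) == bool(re.fullmatch(r"ab*c", sample))
```

Reading the regex left to right corresponds to walking the table: each character either moves to a next state or rejects.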
Here is the <i>TL;DR</i>. This regex matches Tarzan but not "Tarzan":<p><pre><code> "Tarzan"|(Tarzan)
</code></pre>
You can also include more than one case of what you don't want to match. This one also finds only the cases of Tarzan that don't match the first three patterns:<p><pre><code> Tarzania|--Tarzan--|"Tarzan"|(Tarzan)
</code></pre>
You can even use more complex regexes. This matches all words not in an image tag:<p><pre><code> <img[^>]+>|(\w+)
</code></pre>
And likewise this matches anything not surrounded by <b> tags:<p><pre><code> <b>[^<]*</b>|([\w\s]+)</code></pre>
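To make the mechanics concrete, here is a minimal Python sketch of how the patterns above are typically consumed: every match is inspected, and only those where capture group 1 participated are kept. The helper name captured_only and the sample strings are assumptions for illustration.

```python
import re

def captured_only(pattern, text):
    """Return group 1 of every match in which group 1 actually participated.

    Matches produced by the unwanted alternatives leave group 1 unset (None),
    so they are consumed by the scan and then thrown away.
    """
    return [m.group(1) for m in re.finditer(pattern, text) if m.group(1) is not None]

# Tarzan, but not "Tarzan"
print(captured_only(r'"Tarzan"|(Tarzan)', 'Tarzan met "Tarzan" and then Tarzan again'))
# ['Tarzan', 'Tarzan']

# Words that are not inside an image tag
print(captured_only(r'<img[^>]+>|(\w+)', 'hello <img src="x.png" alt="cat"> world'))
# ['hello', 'world']
```

The overall match always succeeds on the unwanted text too; it is the check on group 1 that does the filtering.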
I once used the wonderful Perl module <a href="https://metacpan.org/pod/Regexp::Assemble" rel="nofollow">Regexp::Assemble</a> to produce a regexp to match <a href="https://gist.githubusercontent.com/singingfish/d43c884fbac0089d8523/raw/63eeaf88a2e8d896c07da3b2440080233dd48395/regex%2520for%2520every%2520suburb%2520name%2520in%2520australia.txt" rel="nofollow">every single suburb/town name in Australia</a> from a CSV file downloaded from the post office website. It was blazingly fast ... considering (better than the recdescent parser I'd been previously experimenting with).<p>Here's the code that generated the regex:<p><pre><code> use strict;
use warnings;
use Text::CSV;   # assumed: any module with the same parse/fields interface works
use Regexp::Assemble;

my $csv = Text::CSV->new({ binary => 1 });
# 'suburbs.csv' is a placeholder name for the post office download
open my $FH, '<', 'suburbs.csv' or die "Cannot open CSV: $!";

my $ra = Regexp::Assemble->new;
while (<$FH>) {
    next if $. == 1;              # skip the header row
    $csv->parse($_) or next;      # skip lines that fail to parse
    my @fields = $csv->fields;
    $ra->add($fields[1]);         # second column is added as a pattern
}
my $suburbs = $ra->as_string;     # one combined pattern for all names</code></pre>
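For readers without Perl to hand, a loose Python analogue of the idea might look like the sketch below. It is not Regexp::Assemble's algorithm (which, as I understand it, merges shared prefixes into a more compact pattern); it simply joins escaped names into one alternation. The assemble helper and the three sample names are assumptions for illustration.

```python
import re

def assemble(names):
    """Join literal names into a single alternation.

    Longest names go first so that a longer name is preferred over a
    shorter name that happens to be its prefix.
    """
    ordered = sorted(set(names), key=len, reverse=True)
    return re.compile("|".join(re.escape(name) for name in ordered))

# Sample data only; the real input was a CSV with one row per locality.
suburbs = assemble(["Kingston", "Kingston South East", "St Kilda"])
print(suburbs.search("Moving to Kingston South East next year").group())
# Kingston South East
```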
This technique is unreliable in practice, and the author's discussion is confused.<p>First, their explanation doesn't make sense. They're supposing that there's some determinacy in the order in which a matcher can be expected to examine the different possible matches. But that's provably not the case: if it were, then deterministic and nondeterministic finite automata would be inequivalent.<p>But the technique in question does seem to require some determinacy as to which of several alternatives will match against a string. Where does that determinacy come from? The semantics of the alternation operator (the '|') as usually formulated don't specify any preference among alternatives. For that reason, POSIX <i>additionally</i> requires that a matcher return the leftmost match (and if several matches start at that position, the longest of them is what must be returned). Where you do find an explicit guarantee concerning which of several different possible ways of matching will be preferred, it's almost certainly because the engine is aiming at POSIX compliance.<p>Such compliance has a significant cost, though, as it requires the matcher to consider <i>all</i> possible matches (in order to find the longest). For that reason, most regex engines forgo strict POSIX compliance and only guarantee that some match will be returned if one exists, not that that match will be the leftmost-longest. Some engines offer the option of requesting strict POSIX behavior, but the default will always be to eagerly return the first match encountered (and recall the point above that there provably can't be a guarantee about the order in which matches are encountered, in general).<p>You should never do this in production code unless you're sure that your matcher is POSIX-compliant.
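For what it's worth, the behavior is easy to probe on a given engine. The sketch below checks only Python's re module, whose documentation says alternatives are tried from left to right; it says nothing about other engines, and a POSIX leftmost-longest matcher would be expected to report ab in both cases.

```python
import re

# Same two alternatives, listed in opposite orders, against the same input.
short_first = re.search(r"a|ab", "ab").group()
long_first = re.search(r"ab|a", "ab").group()

print(short_first)  # a   (the first alternative that matches is accepted)
print(long_first)   # ab
```

If the two results differ, as they do here, the engine is resolving the alternation by the order in which it is written rather than by match length.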
I almost wrote this off as it seemed to be about how to write unmaintainable regex soup but the author pulled something quite elegant out of the hat at the end.
This one is great, too. It matches all printable ASCII characters:<p>[ -~]<p><a href="http://www.catonmat.net/blog/my-favorite-regex/" rel="nofollow">http://www.catonmat.net/blog/my-favorite-regex/</a>
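The class works because space (0x20) and tilde (0x7E) are the first and last printable characters in the ASCII table, so the range between them covers all of them. A quick Python check, with variable names chosen for this example:

```python
import re

PRINTABLE = re.compile(r"[ -~]")

# Test every 7-bit code point against the character class.
matched = [chr(code) for code in range(128) if PRINTABLE.fullmatch(chr(code))]

print(len(matched))                # 95
print(repr(matched[0]), repr(matched[-1]))  # ' ' '~'
```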
> Match Tarzan but not "Tarzan"<p>Unfortunately it doesn't work.<p>Let's say I wanted to match a string following Tarzan but not "Tarzan", I will try his technique:<p><pre><code> ("Tarzan"|(Tarzan))\s+and JillOfTheJungle
</code></pre>
Unfortunately this matches both:<p><pre><code> "Tarzan" and JillOfTheJungle
</code></pre>
and<p><pre><code> Tarzan and JillOfTheJungle
</code></pre>
Or maybe he meant:<p>> Capture Tarzan but not "Tarzan"
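A minimal Python sketch of the difference between matching and capturing here: both sample strings match the combined pattern, and only the inner capture group tells them apart. The sample strings come from the comment above; the variable names are assumptions.

```python
import re

PATTERN = re.compile(r'("Tarzan"|(Tarzan))\s+and JillOfTheJungle')

quoted = PATTERN.search('"Tarzan" and JillOfTheJungle')
bare = PATTERN.search('Tarzan and JillOfTheJungle')

# Both inputs produce a match object...
print(quoted is not None, bare is not None)  # True True

# ...but the inner group (group 2) is only set for the unquoted form.
print(quoted.group(2))  # None
print(bare.group(2))    # Tarzan
```

So the pattern cannot be used as a plain yes/no test; the caller has to inspect the group.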
This "trick" is simply exploiting a bug in regex implementations.<p>The regex<p><pre><code> "Tarzan"|(Tarzan)
</code></pre>
should match the string<p><pre><code> "Tarzan"
</code></pre>
in <i>two</i> ways: first, matching the entire string; and second, matching the substring "Tarzan" in the whole string "\"Tarzan\"". But most regex implementations drop extra overlapping matches. I argue this is incorrect behavior, because it complicates understanding what a regex <i>means</i> - you have to understand the <i>order</i> in which your regular expression matcher interprets your regular expression, which is an implementation detail. I conjecture that a DFA-based regex engine would not be able to exhibit this order-biased behavior, at least not with the standard approach.<p>However, it's interesting that this "bug" turns out to be a "feature" for the case of excluding other behavior. I'm not sure what conclusion to draw from this.
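The two ways of matching can be made visible in Python by comparing a normal left-to-right scan with an attempt anchored at every start position. This is only a sketch of one engine's behavior, and the variable names are made up for the example.

```python
import re

PATTERN = re.compile(r'"Tarzan"|(Tarzan)')
text = '"Tarzan"'

# A normal scan resumes after each match, so overlapping matches are dropped.
scanned = [m.group() for m in PATTERN.finditer(text)]

# Trying the pattern anchored at every offset reveals the overlapping match.
anchored = []
for pos in range(len(text)):
    m = PATTERN.match(text, pos)
    if m:
        anchored.append((pos, m.group()))

print(scanned)   # ['"Tarzan"']
print(anchored)  # [(0, '"Tarzan"'), (1, 'Tarzan')]
```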
That reminds me of something else with regex which I thought was extremely clever: implementing an A* search: <a href="http://realgl.blogspot.com/2013/08/battlecode.html" rel="nofollow">http://realgl.blogspot.com/2013/08/battlecode.html</a>
Meh.<p>I think an even greater Regexp trick is the regular expression that determines primality:<p><a href="http://stackoverflow.com/questions/3296050/how-does-this-regex-find-primes" rel="nofollow">http://stackoverflow.com/questions/3296050/how-does-this-reg...</a>
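The comment does not quote the pattern itself. The version sketched below is the widely circulated one, written for Python's fullmatch so the anchors are dropped: a number n is written as n ones, and the backreference succeeds exactly when that string splits into two or more equal chunks of length two or more. The names NOT_PRIME and is_prime are assumptions for this example.

```python
import re

# Matches unary strings of length 0 or 1, or any length that is a product
# of two integers >= 2 (a chunk of 2+ ones repeated 2+ times in total).
NOT_PRIME = re.compile(r"1?|(11+?)\1+")

def is_prime(n):
    """n is prime when its unary form fails to match the pattern."""
    return NOT_PRIME.fullmatch("1" * n) is None

print([n for n in range(20) if is_prime(n)])
# [2, 3, 5, 7, 11, 13, 17, 19]
```

It relies on backreferences, so it is a backtracking-engine trick rather than a regular expression in the formal sense.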
Very nice trick. While using foo|(bar) is very simple, I somehow don't see this approach used very often, and it looks like it could simplify a number of things.
Maybe I'm too old. I tend to think of a regex as either matching or not matching.<p>Finding a bit of code that uses a capture to determine whether a match was found seems like it would easily be confusing/unobvious.<p>Some pretty clear commenting and it would be ok... maybe.<p>Also... I wonder how well it would work as part of a larger regex, one that already uses captures (or non-capturing groups)? The examples are all nice, short and sweet... but how often do regex-based solutions stay short and sweet? A few maintenance cycles/years and suddenly you've got this funky regex/capture thing that only Bob understands and he's way too busy to talk to you for 5 minutes... and once you change things then Bob suddenly finds time to review your code to complain how you broke it for such a simple change. There goes your bonus you told the wife you were sure to get so you could take her and the kids on vacation. The day after your divorce is finalized, Bob sends you a fix request to use that improved scheme of yours because the old regex one isn't flexible enough anymore.
I thought the answer was going to be "tricking the world into thinking regex was a good idea". I've always considered regex to have the rare and elusive "write only" flag. Write only. As opposed to read only. Because once you write a regex that's it. You will never know what it does ever again.
Unimpressive. The author of this article obviously didn't take a compilers class, where one learns that regexes are basically glorified NFAs that are deterministically convertible into much more efficient DFA state machines (read: PCRE JIT), and instead assumes regexes are processed by O(N^2) algorithms.
Generally, I find it strange how difficult it is to search for everything except something.<p>The script looks elegant but, as the author mentions, it doesn't work in a text editor, so I wouldn't consider it the greatest.
More like the author tricked you by changing the problem halfway through the very long article. Try this instead:<p><pre><code> (?:(?<!")|(?!Tarzan"))Tarzan</code></pre>
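A quick way to see what the lookaround version does is to run it in Python, which supports both constructs. The sample text is an assumption chosen to cover the four quoting cases; note that the pattern rejects Tarzan only when it has a quote on both sides.

```python
import re

# Accept Tarzan if it is NOT preceded by a quote, OR NOT followed by one.
PATTERN = re.compile(r'(?:(?<!")|(?!Tarzan"))Tarzan')

# Cases: bare, fully quoted, leading quote only, trailing quote only.
text = 'Tarzan "Tarzan" "Tarzan Tarzan"'
starts = [m.start() for m in PATTERN.finditer(text)]

print(starts)  # [0, 17, 24]
```

Unlike the capture-group trick, every match here is a wanted one, so it also works in tools that only report whole matches.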
Reminds me of this quote from Jamie Zawinski: "Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems."
Parsing HTML with a regex? You should read this answer on Stack Overflow: <a href="http://stackoverflow.com/a/1732454/84250" rel="nofollow">http://stackoverflow.com/a/1732454/84250</a>