The Greatest Regex Trick Ever (2014)

227 pointsby user_235711over 9 years ago

24 comments

wonnageover 9 years ago

Regular expressions match regular languages (hence the name). If your language involves pairs of things (e.g HTML), it's not regular. Perl hacked support for this in via backreferences and other extensions, but these are slow and illegible. Use a proper context-free grammar parser if you need to parse a context free grammar, you know?More broadly, people fear and misunderstand regexes because they have no idea how they work. It becomes much easier if you understand how they map to deterministic finite state machines. Recommended reading: <a href="https://swtch.com/~rsc/regexp/regexp1.html" rel="nofollow">https://swtch.com/~rsc/regexp/regexp1.html</a>Once you understand how they work, you can basically read a regex left to right and intuitively know all the strings they'd match. There is no such thing as an unmaintainable/illegible basic regex - they're just words with some placeholders in them - it's when you cram in extended functionality (which is basically a programming language where all the keywords are single characters) that shit hits the fan.

评论 #10282632 未加载

评论 #10282802 未加载

评论 #10283195 未加载

评论 #10282751 未加载

评论 #10283148 未加载

评论 #10282643 未加载

评论 #10283177 未加载

评论 #10282627 未加载

评论 #10284175 未加载

JoeCoder_over 9 years ago

Here is the TL;DR. This regex matches Tarzan but not "Tarzan":<pre><code> "Tarzan"|(Tarzan) </code></pre> You can also include more than one case of what you don't want to match. This one also finds only the cases of Tarzan that don't match the first three patterns:<pre><code> Tarzania|--Tarzan--|"Tarzan"|(Tarzan) </code></pre> You can even use more complex regexes. This matches all words not in an image tag:<pre><code> <img[^>]+>|(\w+) </code></pre> And likewise this matches anything not surrounded by tags:<pre><code> [^<]*|([\w\s]+)</code></pre>

评论 #10283482 未加载

评论 #10283723 未加载

singingfishover 9 years ago

I once used the wonderful perl module [Regexp::Assemble](<a href="https://metacpan.org/pod/Regexp::Assemble" rel="nofollow">https://metacpan.org/pod/Regexp::Assemble</a>) to produce a regexp to match [every single suburb/town name in Australia](<a href="https://gist.githubusercontent.com/singingfish/d43c884fbac0089d8523/raw/63eeaf88a2e8d896c07da3b2440080233dd48395/regex%2520for%2520every%2520suburb%2520name%2520in%2520australia.txt" rel="nofollow">https://gist.githubusercontent.com/singingfish/d43c884fbac00...</a>) from a csv file download from the post office website. It was blazingly fast ... considering (better than the recdescent parser I'd been previously experimenting with).Here's the code that generated the regex:<pre><code> use Regexp::Assemble; my $ra = Regexp::Assemble->new; while (<$FH>) { $csv->parse($_); next if $. == 1; my @fields = $csv->fields; $ra->add($fields[1]); } my $suburbs = $ra->as_string;</code></pre>

评论 #10282335 未加载

bzaover 9 years ago

This technique is unreliable in practice, and the author's discussion is confused.First, their explanation doesn't make sense. They're supposing that there's some determinacy in the order in which a matcher can be expected to examine the different possible matches. But that's provably not the case: if it were, then deterministic and non-determinsitic finite automata would be inequivalent.But the technique in question does seem to require some determinacy as to which of several alternatives will match against a string. Where does that determinacy come from? The semantics of the alternation operator (the '|') as usually formulated don't specify any preference among alternations. For that reason, POSIX additionally requires that a matcher return the longest possible match (and if there are several such, the leftmost is what must be returned). Where you do find an explicit guarantee concerning which of several different possible ways of matching will be preferred, it's almost certainly because the engine is aiming at POSIX compliance.Such compliance has a significant cost, though, as it requires the matcher to consider all possible matches (in order to find the largest). For that reason, most regex engines forego strict POSIX compliance and only guarantee that some match will be returned if one exists, not that that match will be the leftmost longest. Some engines offer the option of requesting strict POSIX behavior, but the default will always be to eagerly return the first match encountered (and recall the point above that there provably can't be a guarantee about the order in which matches are encountered, in general).You should never do this in production code unless you're sure that your matcher is POSIX-compliant.

edoloughlinover 9 years ago

I almost wrote this off as it seemed to be about how to write unmaintainable regex soup but the author pulled something quite elegant out of the hat at the end.

leeoniyaover 9 years ago

this one is great, too. matches all printable ascii characters:[ -~]<a href="http://www.catonmat.net/blog/my-favorite-regex/" rel="nofollow">http://www.catonmat.net/blog/my-favorite-regex/</a>

评论 #10282424 未加载

smegelover 9 years ago

> Match Tarzan but not "Tarzan"Unfortunately it doesn't work.Let's say I wanted to match a string following Tarzan but not "Tarzan", I will try his technique:<pre><code> ("Tarzan"|(Tarzan))\s+and JillOfTheJungle </code></pre> Unfortunately this matches both:<pre><code> "Tarzan" and JillOfTheJungle </code></pre> and<pre><code> Tarzan and JillOfTheJungle </code></pre> Or maybe he meant:> Capture Tarzan but not "Tarzan"

评论 #10282435 未加载

评论 #10283165 未加载

rntzover 9 years ago

This "trick" is simply exploiting a bug in regex implementations.The regex<pre><code> "Tarzan"|(Tarzan) </code></pre> should match the string<pre><code> "Tarzan" </code></pre> in two ways: first, matching the entire string; and second, matching the substring "Tarzan" in the whole string "\"Tarzan\"". But most regex implementations drop extra overlapping matches. I argue this is incorrect behavior, because it complicates understanding what a regex means - you have to understand the /order/ in which your regular expression matcher interprets your regular expression, which is an implementation detail. I conjecture that a DFA-based regex engine would not be able to exhibit this order-biased behavior, at least not with the standard approach.However, it's interesting that this "bug" turns out to be a "feature" for the case of excluding other behavior. I'm not sure what conclusion to draw from this.

评论 #10284495 未加载

aaronbrethorstover 9 years ago

This article desperately needs a tl;dr, but major props for the two regex tricks at the very top of the page.

userbinatorover 9 years ago

That reminds me of something else with regex which I thought was extremely clever: implementing an A* search: <a href="http://realgl.blogspot.com/2013/08/battlecode.html" rel="nofollow">http://realgl.blogspot.com/2013/08/battlecode.html</a>

评论 #10283732 未加载

inceptedover 9 years ago

Meh.I think an even greater Regexp trick is the regular expression that determines primality:<a href="http://stackoverflow.com/questions/3296050/how-does-this-regex-find-primes" rel="nofollow">http://stackoverflow.com/questions/3296050/how-does-this-reg...</a>

评论 #10283503 未加载

smsm42over 9 years ago

Very nice trick, while using of foo|(bar) is very simple, somehow I don't see such approach being used very often, and it looks like it could simplify a number of things.

rbobbyover 9 years ago

Maybe I'm too old. I tend to think of a regex as either matching or not matching.Finding a bit of code that uses a capture to determine whether a match was found seems like it would easily be confusing/inobvious.Some pretty clear commenting and it would be ok... maybe.Also... I wonder how well it would work as part of a larger regex, one that already uses captures (or non-capturing groups)? The examples are all nice, short and sweet... but how often do regex based solutions stay short and sweet? A few maintenance cycles/years and suddenly you've got this funky regex/capture thing that only Bob understands and he's way to busy to talk to you for 5 minutes... and once you change things then Bob suddenly finds time to review your code to complain how you broke it for such a simple change. There goes your bonus you told the wife you were sure to get so you could take her and the kids on vacation. The day after your divorce finalized Bob sends you a fix request to use that improved scheme of yours because the old regex one isn't flexible enough anymore.

forrestthewoodsover 9 years ago

I thought the answer was going to be "tricking the world into thinking regex was a good idea". I've always considered regex to have the rare and elusive "write only" flag. Write only. As opposed to read only. Because once you write a regex that's it. You will never know what it does ever again.

评论 #10282471 未加载

评论 #10282840 未加载

评论 #10282352 未加载

评论 #10282303 未加载

评论 #10282405 未加载

评论 #10282308 未加载

评论 #10282457 未加载

评论 #10282584 未加载

breyover 9 years ago

Very elegant, I like it.Except I'm now having a major case of semantic satiation for the word "Tarzan"...

chris_wotover 9 years ago

I love regexes but they make my head hurt. I really rather badly need to spend some time properly learning it inside and out.

评论 #10283013 未加载

bro-stickover 9 years ago

Unimpressive. The author of this article obviously didn't have a compiler class where one learns how regexes are basically glorified NFAs that are deterministicly convertible into a much more efficient DFA state machines (read: PCRE JIT), instead of assuming regexes are processed by O(N^2) algorithms.

评论 #10283704 未加载

wodenokotoover 9 years ago

Generally I found it strange how difficult it is to do a search for everything except something.The script looks elegant, but like the author mentions, doesn't work in a text-editor, so I would consider it the greatest.

评论 #10282639 未加载

amlutoover 9 years ago

How does this handle '"Tarzan" Tarzan'?

jfbover 9 years ago

Convincing the world that they don't exist?

lognover 9 years ago

More like the author tricked you by changing the problem halfway through the very long article. Try this instead:<pre><code> (?:(?<!")|(?!Tarzan"))Tarzan</code></pre>

jwhiteover 9 years ago

... and now you have two problems.

pkayeover 9 years ago

Reminds me of this quote from Jamie Zawinski: "Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems."

评论 #10283154 未加载

评论 #10283111 未加载

edwardover 9 years ago

Parsing HTML with a regex? You should read this answer on Stack Overflow: <a href="http://stackoverflow.com/a/1732454/84250" rel="nofollow">http://stackoverflow.com/a/1732454/84250</a>