TE
科技回声
首页24小时热榜最新最佳问答展示工作
GitHubTwitter
首页

科技回声

基于 Next.js 构建的科技新闻平台,提供全球科技新闻和讨论内容。

GitHubTwitter

首页

首页最新最佳问答展示工作

资源链接

HackerNews API原版 HackerNewsNext.js

© 2025 科技回声. 版权所有。

Describing and Inventing a New Regular Expression Quantifier

21 点作者 robertelder超过 4 年前

5 条评论

jlokier超过 4 年前
From the article:<p>&gt; I will suggest, rather opinionatedly, that possessive quantifiers aren&#x27;t particularly useful.<p>I will suggest the opposite, that possessive quantifiers are so useful they should have been the default, as backtracking causes problems of performance and accidental mismatching that you usually don&#x27;t want.<p>Backtracking is a really useful invention. But it should not be used on every term.<p>I default to using possessives more often than not these days. I&#x27;ve writte a lot of parsers over the years where regexes played a role, and this is what I found in practice.<p>Now, a non-possessive quantifier makes me stop and think if it should be there, and what kind of problem is it hiding. Usually it doesn&#x27;t matter, especially when the regex is being used on &quot;known&quot; data, but occasionally it&#x27;s hiding a subtle parsing bug.<p>For example, any regex that is matching whitespace-separated terms. For the whitespace, which makes more sense, &quot;\s+&quot; or &quot;\s++&quot;?<p>The latter. Backtracking over whitespace is just a waste of time; it won&#x27;t change what is matched, and if it does, that&#x27;s usually a bug in the rest of the regex.<p>What about matching words in a parser of a typical programming language, does it make more sense to write &quot;\w+&quot; or &quot;\w++&quot;? The latter again. You do not want to accidentally treat &quot;iffy&quot; as though it&#x27;s an &quot;if&quot; followed by &quot;fy&quot;.<p>You can explicitly require a word boundary or a non-word character after the word. But which is more straightforward (and faster) out of &quot;\w++&quot;, &quot;\w+\b&quot; and &quot;\w+(?:\W|$)&quot;?
eru超过 4 年前
Note: the author is not describing regular expressions (ie those that correspond to regular languages and nothing else), but seems to be describing some mixed beast. See eg their mention of backtracking, which is an implementation technique used in some implementations of regular expressions (where it adds nothing); and also in many implementations of not-quite-regular-expressions, where it can make a real difference.
评论 #24540536 未加载
评论 #24540402 未加载
lifthrasiir超过 4 年前
Yeah, the possessive quantifier looks like that it simply disables backtracking but it also silently makes the matching greedy. It should really have been orthogonal: `a+` is greedy and backtrackable, `a+?` is non-greedy and backtrackable, `a++` is greedy and atomic, and `a+?+` is non-greedy and atomic. It&#x27;s not pretty, but multiple quantifiers were never pretty after all.<p>Or better, ditch the opportunistic use of otherwise invalid quantifier sequences and stick to the atomic group. Vim got this almost correct [1]:<p><pre><code> multi ~ &#x27;magic&#x27; &#x27;nomagic&#x27; matches of the preceding atom ~ |&#x2F;\{| \{n,m} \{n,m} n to m as many as possible \{n} \{n} n exactly \{n,} \{n,} at least n as many as possible \{,m} \{,m} 0 to m as many as possible \{} \{} 0 or more as many as possible (same as *) |&#x2F;\{-| \{-n,m} \{-n,m} n to m as few as possible \{-n} \{-n} n exactly \{-n,} \{-n,} at least n as few as possible \{-,m} \{-,m} 0 to m as few as possible \{-} \{-} 0 or more as few as possible |&#x2F;\@&gt;| \@&gt; \@&gt; 1, like matching a whole pattern </code></pre> So a non-greedy and atomic pattern would be `\%(a\{-}\)\@&gt;`. Unfortunately you can&#x27;t write `a\{-}\@&gt;` and have to wrap `a\{-}` into a non-capturing group though.<p>[1] <a href="https:&#x2F;&#x2F;vimhelp.org&#x2F;pattern.txt.html#pattern-overview" rel="nofollow">https:&#x2F;&#x2F;vimhelp.org&#x2F;pattern.txt.html#pattern-overview</a>
wodenokoto超过 4 年前
This one was pretty cool and new to me. I&#x27;m not sure if I would have read it as how it works or as &quot;Any one, but only one, of these, one or more times&quot;.<p><pre><code> (cat|dog|bird){1,2} will match any of the following strings: cat dog bird catcat catdog catbird dogcat dogdog dogbird birdcat birddog birdbird </code></pre> Regardless, its pretty neat!
compsciphd超过 4 年前
Story time about an interview experience I one had.<p>So I once interviewed at Google (well, I&#x27;ve interviewed at google many times, and have never gotten passed the hiring committee, but my resume looks decent so recruiters keep coming after me)<p>Anyways, on said interview, I was asked to program a relatively simple regex matcher (characters and star, perhaps ., don&#x27;t remember).<p>Anyways, I totally bombed that question (I&#x27;m a big believer that 45m isn&#x27;t enough time to organize ones thoughts on said Q as phrased, as it gives you nothing to build on, but I&#x27;ll get back to that in a bit). This of course through me off for the rest of the interview day, because I tend to chew on problems that I have trouble with and tend to not let go of them quickly until I have some better understanding of the issue (including perhaps my limitations in being able to do anything about it).<p>So after the interview, as I tend to do, I did a bit of research, and came across this article of Kernighan<p><a href="https:&#x2F;&#x2F;www.cs.princeton.edu&#x2F;courses&#x2F;archive&#x2F;spr09&#x2F;cos333&#x2F;beautiful.html" rel="nofollow">https:&#x2F;&#x2F;www.cs.princeton.edu&#x2F;courses&#x2F;archive&#x2F;spr09&#x2F;cos333&#x2F;be...</a><p>which put everything into context beautifully<p>now, why do I think its a terrible interview Q? Because as Kernighan writes<p><i>&quot;Rob disappeared into his office, and at least as I remember it now, appeared again in no more than an hour or two with the 30 lines of C code that subsequently appeared in Chapter 9 of TPOP.&quot;</i><p>if it would take Rob Pike and hour or 2 to come up with a nice solution, what chance do us mere mortals have in the context of an interview with people staring at us.<p>With that said, I do think this serves as a basis for a good interview question. Take a simple character by character string matcher as Rob Pike wrote, and ask the interviewee to expand on it, how would they add different functionality to it. 1) could they add &#x27;.&#x27;, 2) could they enable it to match anywhere in the text, 3) could they add anchors, 4) and if really good, could they add star support. You get to see how the interviewee thinks about code, thinks to modify code, and being regex, still provides a good basis for thinking about tests. I hope people I interview find this to be a tractable Q.<p>Getting back to my story, in order to prove to myself I wasn&#x27;t an idiot (and to learn some java, I was mostly a C programmer in the linux kernel at the time), I decided to take Rob Pike&#x27;s approach and try to implement as much of a regex matching engine as I could<p>which turned into <a href="https:&#x2F;&#x2F;github.com&#x2F;sjpotter&#x2F;regex" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;sjpotter&#x2F;regex</a>.<p>then when I needed to improve my Go programming skills (and explore possibly bad ways of doing Go programming), I ported it to <a href="https:&#x2F;&#x2F;github.com&#x2F;sjpotter&#x2F;regex-go" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;sjpotter&#x2F;regex-go</a><p>now, what&#x27;s the point of this story in regards to the article? Implementing the quantifiers gave me a better appreciation of how they worked, both greedy and not (and the performance impacts of them, especially with my naive implementation of a matcher).<p>By following <a href="https:&#x2F;&#x2F;www.regular-expressions.info&#x2F;reference.html" rel="nofollow">https:&#x2F;&#x2F;www.regular-expressions.info&#x2F;reference.html</a> and adding as much as I could (include backreferences, lookahead&#x2F;behinds and conditionals) it gave me a much better understanding of the power in regular expressions (at least of the PCRE type, I&#x27;m not sure my old discrete structures teacher would be happy calling all of these &quot;regular expressions&quot;).
评论 #24540660 未加载