TE
TechEcho
Home24h TopNewestBestAskShowJobs
GitHubTwitter
Home

TechEcho

A tech news platform built with Next.js, providing global tech news and discussions.

GitHubTwitter

Home

HomeNewestBestAskShowJobs

Resources

HackerNews APIOriginal HackerNewsNext.js

© 2025 TechEcho. All rights reserved.

Glob Matching Can Be Simple and Fast Too

333 pointsby secureabout 8 years ago

18 comments

js2about 8 years ago
<i>I have not looked at the other linear-time implementations to see what they do, but I expect they all use one of these two approaches.</i><p>Python&#x27;s glob() (fnmatch really) translates the glob to a regular expression then uses its re library:<p><a href="https:&#x2F;&#x2F;github.com&#x2F;python-git&#x2F;python&#x2F;blob&#x2F;715a6e5035bb21ac49382772076ec4c630d6e960&#x2F;Lib&#x2F;fnmatch.py#L72" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;python-git&#x2F;python&#x2F;blob&#x2F;715a6e5035bb21ac49...</a>
评论 #14185943 未加载
评论 #14186473 未加载
评论 #14186460 未加载
dexenabout 8 years ago
Previously: &quot;Regular Expression Matching Can Be Simple And Fast&quot; (2007) <a href="https:&#x2F;&#x2F;swtch.com&#x2F;~rsc&#x2F;regexp&#x2F;regexp1.html" rel="nofollow">https:&#x2F;&#x2F;swtch.com&#x2F;~rsc&#x2F;regexp&#x2F;regexp1.html</a> The paper deals with &quot;Thompson NFA&quot; approach to regex, with low computational complexity.<p>Other Russ&#x27; papers on regular expression matching: <a href="https:&#x2F;&#x2F;swtch.com&#x2F;~rsc&#x2F;regexp&#x2F;" rel="nofollow">https:&#x2F;&#x2F;swtch.com&#x2F;~rsc&#x2F;regexp&#x2F;</a>
评论 #14187358 未加载
FreeFullabout 8 years ago
Interesting how the Rust implementation of glob currently seems to be the slowest out of the linear time implementations. I guess maybe not too much optimisation effort was put into it?
评论 #14186089 未加载
avarabout 8 years ago
There&#x27;s another way for glob() implementations to mitigate these sort of patterns that Russ doesn&#x27;t discuss, but can be inferred from a careful reading of the different examples in this new glob() article &amp; the 2007 regex article.<p>In the regex article he notes that e.g. perl is subject to pathological behavior when you match a?^na^n against an a^n:<p><pre><code> $ time perl -wE &#x27;my $l = shift; my $str = &quot;a&quot; x $l; my $rx = &quot;a?&quot; x $l . $str; $str =~ &#x2F;${rx}&#x2F;&#x27; 28 real 0m13.278s </code></pre> However changing the pattern to &#x2F;${rx}b&#x2F; makes it execute almost instantly. This is because the matcher will look ahead for fixed non-pattern strings found in the pattern, and deduce that whatever globbing we&#x27;re trying to match now it can&#x27;t possibly matter if the string doesn&#x27;t have a &quot;b&quot; in it.<p>I wonder if any globbing implementations take advantage of that class of optimization, and if there&#x27;s any cases where Russ&#x27;s suggested solution of not backtracking produces different results than you&#x27;d get by backtracking, in particular with some of the extended non-POSIX glob syntax out there.
评论 #14188746 未加载
评论 #14193658 未加载
eriknstrabout 8 years ago
OP, what version(s) of the BSD libc did you test? What OS, which version of the OS.<p>macOS only? NetBSD? FreeBSD? OpenBSD?<p>If you tested on FreeBSD, please file a bug at <a href="https:&#x2F;&#x2F;bugs.freebsd.org&#x2F;bugzilla&#x2F;enter_bug.cgi?product=Base%20System" rel="nofollow">https:&#x2F;&#x2F;bugs.freebsd.org&#x2F;bugzilla&#x2F;enter_bug.cgi?product=Base...</a><p>I&#x27;m not a project member but I&#x27;m a user of the system so it&#x27;s in my interest that issues like this are resolved.<p>Please let me know whether or not you file a bug so that if you do I don&#x27;t duplicate bug reports and if you don&#x27;t I can do some benchmarking myself.
评论 #14187948 未加载
评论 #14189300 未加载
avarabout 8 years ago
Slightly off-topic, but anyone know what he&#x27;s using to generate those inline SVG graphs? I&#x27;ve been looking for some easy to use graphing library like that to present similar performance numbers on a webpage.
评论 #14187894 未加载
评论 #14186433 未加载
评论 #14186392 未加载
lexparabout 8 years ago
Not sure if OP is author, but if you are, just to inform you, there is a small typo in this paragraph:<p>&quot;Unfortunately, none of <i>tehse</i> protections address the cost of matching a single path element of a single file name. In 2005, CVE-2005-0256 was issued for a DoS vulnerability in WU-FTPD 2.6.2, because it ran for a very long time finding even a single match during:&quot;<p>Very informative article. Thanks for it!
评论 #14186667 未加载
tyingqabout 8 years ago
The bsd derived glob has other functionality that I assume isn&#x27;t simple or fast:<p><pre><code> perl -MFile::Glob=bsd_glob -e &#x27;print bsd_glob(&quot;{{a,b,c}{1,2,3}{{yuck,Yuck},{urgh,URGH}}}\n&quot;)&#x27; </code></pre> Produces 36 lines representing all the iterations. Nest a bit deeper and it gets unwieldy.
评论 #14232691 未加载
评论 #14187964 未加载
mawekiabout 8 years ago
I wonder whether it would help to match from both sides (start and end) simultaneously, since you know you&#x27;re not looking in the middle of the string. You also don&#x27;t care about capture groups.
评论 #14189448 未加载
mixuabout 8 years ago
For fun, I ran this against node-glob ( <a href="https:&#x2F;&#x2F;github.com&#x2F;isaacs&#x2F;node-glob" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;isaacs&#x2F;node-glob</a> ).<p>Looks like it exhibits the slower behavior:<p><pre><code> n,elapsed 1,0.07 2,0.07 3,0.07 4,0.07 5,0.16 6,1.43 7,19.90 8,240.76 </code></pre> See this gist for the script <a href="https:&#x2F;&#x2F;gist.github.com&#x2F;mixu&#x2F;e4803da16e42439480eba2b29fa44484" rel="nofollow">https:&#x2F;&#x2F;gist.github.com&#x2F;mixu&#x2F;e4803da16e42439480eba2b29fa4448...</a>
JdeBPabout 8 years ago
&gt; <i>Graphical FTP clients typically use the MLST and MLSD commands</i><p>Do not count WWW browsers amongst the number of those graphical FTP clients. The common WWW browsers that speak FTP use LIST or LIST -l . With the exception of Google Chrome when it thinks that it is talking to a VMS program, they do not pass pattern arguments, though.
libre-manabout 8 years ago
I tested Common Lisp. SBCL seems to be exponential while Clozure CL is not.<p>However it should be noted that it is non portable to do globbing in Common Lisp, so I expect most users implement it using something CL-FAD or OSICAT and CL-PPCRE, and CL-PPCRE is efficient.
E6300about 8 years ago
I&#x27;ve been playing around with my own glob implementation. From what I&#x27;ve seen, the simplified algorithm mentioned in the article wouldn&#x27;t be able to handle question marks. In particular, I don&#x27;t think a non-backtracking algorithm can handle a pattern like &quot;?a<i>?a</i>?a<i>?a</i>?b&quot;. I&#x27;ve been working to minimize the worst-case behavior, but it&#x27;s tricky.
评论 #14192214 未加载
评论 #14193703 未加载
mlghabout 8 years ago
Sorry, but the implementation posted is O(|pattern| * |name|), not linear. <a href="http:&#x2F;&#x2F;ideone.com&#x2F;2xCXyY" rel="nofollow">http:&#x2F;&#x2F;ideone.com&#x2F;2xCXyY</a>
评论 #14198783 未加载
jankedeenabout 8 years ago
How about the default sort? Ouch or no ouch?
BuuQu9huabout 8 years ago
We independently reinvented an adaptation of this algorithm for Monte&#x27;s &quot;simple&quot; quasiliteral, which does simple string interpolation and matching. The code at <a href="https:&#x2F;&#x2F;github.com&#x2F;monte-language&#x2F;typhon&#x2F;blob&#x2F;master&#x2F;mast&#x2F;prelude&#x2F;simple.mt#L68-L121" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;monte-language&#x2F;typhon&#x2F;blob&#x2F;master&#x2F;mast&#x2F;pr...</a> is somewhat similar in appearance and structure to the examples in the post.<p><pre><code> def name := &quot;Hackernews&quot; # greeting == &quot;Hi Hackernews!&quot; def greeting := `Hi $name!` # language == &quot;Lojban&quot; def `@language is awesome` := &quot;Lojban is awesome&quot; </code></pre> A quirk of our presentation is that adjacent zero-or-more patterns degenerate, with each subsequent pattern matching the empty string. This mirrors the observation in the post that some systems can coalesce adjacent stars without changing the semantics:<p><pre><code> # one == &quot;&quot;, two == &quot;cool&quot; def `adjacent @one@two patterns` := &quot;adjacent cool patterns&quot;</code></pre>
评论 #14185412 未加载
oconnoreabout 8 years ago
Why write a glob engine at all when you already have a fast regex implementation that can match both exact paths and plausible subtrees?<p>The bulk of the haskell code to do this:<p><pre><code> parseGlob :: Char -&gt; Char -&gt; String -&gt; Parser Glob parseGlob escC sepC forbid = many1&#x27; (gpart &lt;|&gt; sep &lt;|&gt; glob &lt;|&gt; alt) &gt;&gt;= return . GGroup . V.fromList where gpart = globPart escC (sepC : (forbid ++ &quot;{*&quot;)) &gt;&gt;= return . GPart sep = satisfy (== ch2word sepC) &gt;&gt; return GSeparator alt = do _ &lt;- AttoC.char &#x27;{&#x27; choices &lt;- sepBy&#x27; (GEmpty `option` parseGlob escC sepC (&quot;,}&quot; ++ forbid)) (char &#x27;,&#x27;) _ &lt;- AttoC.char &#x27;}&#x27; return $ GAlternate $ V.fromList choices glob = do res &lt;- takeWhile1 (== ch2word &#x27;*&#x27;) if B.length res == 1 then return GSingle else return GDouble wrapParens s = T.concat [&quot;(&quot;, s, &quot;)&quot;] globRegex :: Char -&gt; Glob -&gt; T.Text globRegex sep GSingle = T.concat [&quot;([^&quot;, T.singleton sep, &quot;]*|\\&quot;, T.singleton sep, &quot;)&quot;] globRegex _ GDouble = &quot;.*&quot; globRegex _ GEmpty = &quot;&quot; globRegex sep GSeparator = T.singleton sep globRegex sep (GRepeat a) = T.concat [&quot;(&quot;, T.concat (V.toList $ fmap (globRegex sep) a), &quot;)*&quot;] globRegex sep (GGroup a) = T.concat $ V.toList $ fmap (globRegex sep) a globRegex _ (GPart p) = T.concatMap efun base where base = TE.decodeUtf8 p escChars = S.fromList &quot;.[]()\\{}^$*+&quot; efun c = if S.member c escChars then T.concat [&quot;\\&quot;, T.singleton c] else T.singleton c globRegex sep (GAlternate a) = if V.null alts then &quot;&quot; else T.concat [altsStr, if hasEmpty then &quot;?&quot; else &quot;&quot;] where hasEmpty = isJust $ V.find (== GEmpty) a alts = fmap (globRegex sep) $ V.filter (&#x2F;= GEmpty) a altsStr = wrapParens $ T.intercalate &quot;|&quot; $ V.toList alts</code></pre>
评论 #14185686 未加载
评论 #14187400 未加载
gwu78about 8 years ago
<a href="https:&#x2F;&#x2F;github.com&#x2F;skarnet&#x2F;execline&#x2F;raw&#x2F;master&#x2F;src&#x2F;execline&#x2F;elglob.c" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;skarnet&#x2F;execline&#x2F;raw&#x2F;master&#x2F;src&#x2F;execline&#x2F;...</a><p><a href="https:&#x2F;&#x2F;github.com&#x2F;skarnet&#x2F;execline&#x2F;raw&#x2F;master&#x2F;src&#x2F;libexecline&#x2F;exlsn_elglob.c" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;skarnet&#x2F;execline&#x2F;raw&#x2F;master&#x2F;src&#x2F;libexecli...</a><p>Simple.<p><a href="http:&#x2F;&#x2F;www.in-ulm.de&#x2F;~mascheck&#x2F;various&#x2F;argmax&#x2F;" rel="nofollow">http:&#x2F;&#x2F;www.in-ulm.de&#x2F;~mascheck&#x2F;various&#x2F;argmax&#x2F;</a><p><pre><code> execlineb -c &#x27;elglob a &#x2F;*&#x2F;*&#x2F;*&#x2F;* ls $a&#x27; </code></pre> (statically-linked execlineb)<p>If I am not mistken, ARG_MAX will be the limit.<p>Straightforward.
评论 #14187111 未加载