Glob Matching Can Be Simple and Fast Too

333 points by secure about 8 years ago

18 comments

js2 about 8 years ago

> I have not looked at the other linear-time implementations to see what they do, but I expect they all use one of these two approaches.

Python's glob() (fnmatch really) translates the glob to a regular expression, then uses its re library: <a href="https://github.com/python-git/python/blob/715a6e5035bb21ac49382772076ec4c630d6e960/Lib/fnmatch.py#L72" rel="nofollow">https://github.com/python-git/python/blob/715a6e5035bb21ac49...</a>
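
The translation step can be seen directly from Python. This is a minimal sketch using the standard library's fnmatch.translate; the pattern and file names are made up for illustration, and the exact regular expression text differs between Python versions, so it is printed rather than assumed.

```python
import fnmatch
import re

# Hypothetical pattern, chosen only for illustration.
pattern = "*.txt"
regex = fnmatch.translate(pattern)  # glob -> regular-expression source string
compiled = re.compile(regex)


def glob_match(name):
    # The translated expression is anchored at the end, and re.match
    # anchors at the start, so this is a whole-string match.
    return compiled.match(name) is not None


print(regex)  # exact text varies between Python versions
print(glob_match("notes.txt"), glob_match("notes.txt.bak"))
```

Whether the resulting match is linear-time then depends on the regex engine behind re, not on the glob syntax.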

dexen about 8 years ago

Previously: "Regular Expression Matching Can Be Simple And Fast" (2007) <a href="https://swtch.com/~rsc/regexp/regexp1.html" rel="nofollow">https://swtch.com/~rsc/regexp/regexp1.html</a>

The paper deals with the "Thompson NFA" approach to regex matching, which has low computational complexity.

Other papers by Russ on regular expression matching: <a href="https://swtch.com/~rsc/regexp/" rel="nofollow">https://swtch.com/~rsc/regexp/</a>

FreeFull about 8 years ago

Interesting how the Rust implementation of glob currently seems to be the slowest out of the linear time implementations. I guess maybe not too much optimisation effort was put into it?

avar about 8 years ago

There's another way for glob() implementations to mitigate these sorts of patterns that Russ doesn't discuss, but that can be inferred from a careful reading of the different examples in this new glob() article & the 2007 regex article.

In the regex article he notes that e.g. perl is subject to pathological behavior when you match a?^na^n against a^n:

<pre><code>$ time perl -wE 'my $l = shift; my $str = "a" x $l; my $rx = "a?" x $l . $str; $str =~ /${rx}/' 28
real    0m13.278s
</code></pre>

However, changing the pattern to /${rx}b/ makes it execute almost instantly. This is because the matcher looks ahead for fixed non-pattern strings found in the pattern, and deduces that whatever globbing we're trying to match, it can't possibly matter if the string doesn't have a "b" in it.

I wonder if any globbing implementations take advantage of that class of optimization, and if there are any cases where Russ's suggested solution of not backtracking produces different results than you'd get by backtracking, in particular with some of the extended non-POSIX glob syntax out there.
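
For plain globs, that required-literal check could look roughly like the Python sketch below. It is only an illustration of the idea, not what perl or any libc actually does: the helper names literal_runs and may_match are hypothetical, and the sketch assumes patterns that use nothing but * and ? (no bracket expressions, escapes, or braces).

```python
import re


def literal_runs(pattern):
    # Split on wildcards; what remains are the fixed strings that every
    # matching name must contain, in this order.
    return [run for run in re.split(r"[*?]+", pattern) if run]


def may_match(pattern, name):
    # Cheap prefilter: returns False only when no match is possible.
    # A True result still has to be confirmed by the real matcher.
    pos = 0
    for run in literal_runs(pattern):
        found = name.find(run, pos)
        if found < 0:
            return False
        pos = found + len(run)
    return True


print(may_match("a*a*a*b", "aaaaaaaa"))  # the name has no "b" at all
print(may_match("a*b", "axxb"))
```

Taking the leftmost occurrence of each run is safe here, because any real match places the runs at the same or later positions.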

eriknstr about 8 years ago

OP, what version(s) of the BSD libc did you test? Which OS, and which version of it? macOS only? NetBSD? FreeBSD? OpenBSD?

If you tested on FreeBSD, please file a bug at <a href="https://bugs.freebsd.org/bugzilla/enter_bug.cgi?product=Base%20System" rel="nofollow">https://bugs.freebsd.org/bugzilla/enter_bug.cgi?product=Base...</a>

I'm not a project member, but I'm a user of the system, so it's in my interest that issues like this are resolved.

Please let me know whether or not you file a bug, so that if you do I don't duplicate the report, and if you don't I can do some benchmarking myself.

avar about 8 years ago

Slightly off-topic, but anyone know what he's using to generate those inline SVG graphs? I've been looking for some easy to use graphing library like that to present similar performance numbers on a webpage.

lexpar about 8 years ago

Not sure if OP is the author, but if you are, just to inform you, there is a small typo in this paragraph:

"Unfortunately, none of tehse protections address the cost of matching a single path element of a single file name. In 2005, CVE-2005-0256 was issued for a DoS vulnerability in WU-FTPD 2.6.2, because it ran for a very long time finding even a single match during:"

Very informative article. Thanks for it!

tyingq about 8 years ago

The BSD-derived glob has other functionality that I assume isn't simple or fast:

<pre><code>perl -MFile::Glob=bsd_glob -e 'print bsd_glob("{{a,b,c}{1,2,3}{{yuck,Yuck},{urgh,URGH}}}\n")'
</code></pre>

This produces 36 lines representing all the combinations. Nest a bit deeper and it gets unwieldy.
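
The count of 36 is just the product of the group sizes. A small Python sketch, assuming the nested {{yuck,Yuck},{urgh,URGH}} group flattens to its four alternatives (the list names are illustrative, and this is not how bsd_glob is implemented):

```python
from itertools import product

# The three brace groups from the pattern above, with the nested
# group written out as its four alternatives.
groups = [
    ["a", "b", "c"],
    ["1", "2", "3"],
    ["yuck", "Yuck", "urgh", "URGH"],
]

# Every expansion picks one alternative from each group.
expansions = ["".join(parts) for parts in product(*groups)]

print(len(expansions))  # 3 * 3 * 4 = 36
print(expansions[:3])
```

Each additional group multiplies the total, which is why deeper nesting gets out of hand quickly.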

maweki about 8 years ago

I wonder whether it would help to match from both sides (start and end) simultaneously, since you know you're not looking in the middle of the string. You also don't care about capture groups.

mixu about 8 years ago

For fun, I ran this against node-glob ( <a href="https://github.com/isaacs/node-glob" rel="nofollow">https://github.com/isaacs/node-glob</a> ).

Looks like it exhibits the slower behavior:

<pre><code>n,elapsed
1,0.07
2,0.07
3,0.07
4,0.07
5,0.16
6,1.43
7,19.90
8,240.76
</code></pre>

See this gist for the script: <a href="https://gist.github.com/mixu/e4803da16e42439480eba2b29fa44484" rel="nofollow">https://gist.github.com/mixu/e4803da16e42439480eba2b29fa4448...</a>

JdeBP about 8 years ago

> Graphical FTP clients typically use the MLST and MLSD commands

Do not count WWW browsers amongst the number of those graphical FTP clients. The common WWW browsers that speak FTP use LIST or LIST -l. With the exception of Google Chrome when it thinks that it is talking to a VMS program, they do not pass pattern arguments, though.

libre-man about 8 years ago

I tested Common Lisp. SBCL seems to be exponential while Clozure CL is not.

However, it should be noted that globbing is not portable in Common Lisp, so I expect most users implement it using something like CL-FAD or OSICAT and CL-PPCRE, and CL-PPCRE is efficient.

E6300 about 8 years ago

I've been playing around with my own glob implementation. From what I've seen, the simplified algorithm mentioned in the article wouldn't be able to handle question marks. In particular, I don't think a non-backtracking algorithm can handle a pattern like "?a?a?a?a?b". I've been working to minimize the worst-case behavior, but it's tricky.
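
For experimenting with such patterns, here is a Python sketch of a restart-at-the-last-star matcher in the spirit of the article's approach. It is a transcription from memory rather than the author's code, the name glob_match is mine, and it only supports * and ?, so it says nothing about extended syntax.

```python
def glob_match(pattern, name):
    """Match name against pattern: * matches any run, ? any one character."""
    px = nx = 0
    next_px = next_nx = 0  # where to resume if the current attempt fails
    while px < len(pattern) or nx < len(name):
        if px < len(pattern):
            c = pattern[px]
            if c == "*":
                # Try matching zero characters first; on failure, retry
                # with the star swallowing one more character of name.
                next_px, next_nx = px, nx + 1
                px += 1
                continue
            if nx < len(name) and (c == "?" or c == name[nx]):
                px += 1
                nx += 1
                continue
        # Mismatch: resume at the most recent star, if there is one.
        if 0 < next_nx <= len(name):
            px, nx = next_px, next_nx
            continue
        return False
    return True


print(glob_match("?a?a?a?a?b", "xaxaxaxaxb"))
print(glob_match("*?a?b", "xxaab"))
```

A question mark always consumes exactly one character, so it never creates a choice point; only a star does, and only the most recent star is ever revisited.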

mlgh about 8 years ago

Sorry, but the implementation posted is O(|pattern| * |name|), not linear. <a href="http://ideone.com/2xCXyY" rel="nofollow">http://ideone.com/2xCXyY</a>

jankedeen about 8 years ago

How about the default sort? Ouch or no ouch?

BuuQu9hu about 8 years ago

We independently reinvented an adaptation of this algorithm for Monte's "simple" quasiliteral, which does simple string interpolation and matching. The code at <a href="https://github.com/monte-language/typhon/blob/master/mast/prelude/simple.mt#L68-L121" rel="nofollow">https://github.com/monte-language/typhon/blob/master/mast/pr...</a> is somewhat similar in appearance and structure to the examples in the post.

<pre><code>def name := "Hackernews"
# greeting == "Hi Hackernews!"
def greeting := `Hi $name!`
# language == "Lojban"
def `@language is awesome` := "Lojban is awesome"
</code></pre>

A quirk of our presentation is that adjacent zero-or-more patterns degenerate, with each subsequent pattern matching the empty string. This mirrors the observation in the post that some systems can coalesce adjacent stars without changing the semantics:

<pre><code># one == "", two == "cool"
def `adjacent @one@two patterns` := "adjacent cool patterns"
</code></pre>
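
In a plain glob, coalescing adjacent stars is a one-line preprocessing step. A Python sketch, unrelated to the Monte code above; the function name coalesce_stars is hypothetical, and it assumes * is never escaped and never appears inside a bracket expression:

```python
import re


def coalesce_stars(pattern):
    # A run of stars matches exactly what a single star matches,
    # so collapsing the run leaves the set of matching names unchanged.
    return re.sub(r"\*+", "*", pattern)


print(coalesce_stars("a**b***c"))
```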

oconnore about 8 years ago

Why write a glob engine at all when you already have a fast regex implementation that can match both exact paths and plausible subtrees?

The bulk of the Haskell code to do this:

<pre><code>parseGlob :: Char -> Char -> String -> Parser Glob
parseGlob escC sepC forbid = many1' (gpart <|> sep <|> glob <|> alt) >>= return . GGroup . V.fromList
  where
    gpart = globPart escC (sepC : (forbid ++ "{*")) >>= return . GPart
    sep = satisfy (== ch2word sepC) >> return GSeparator
    alt = do
      _ <- AttoC.char '{'
      choices <- sepBy' (GEmpty `option` parseGlob escC sepC (",}" ++ forbid)) (char ',')
      _ <- AttoC.char '}'
      return $ GAlternate $ V.fromList choices
    glob = do
      res <- takeWhile1 (== ch2word '*')
      if B.length res == 1 then return GSingle else return GDouble

wrapParens s = T.concat ["(", s, ")"]

globRegex :: Char -> Glob -> T.Text
globRegex sep GSingle = T.concat ["([^", T.singleton sep, "]*|\\", T.singleton sep, ")"]
globRegex _ GDouble = ".*"
globRegex _ GEmpty = ""
globRegex sep GSeparator = T.singleton sep
globRegex sep (GRepeat a) = T.concat ["(", T.concat (V.toList $ fmap (globRegex sep) a), ")*"]
globRegex sep (GGroup a) = T.concat $ V.toList $ fmap (globRegex sep) a
globRegex _ (GPart p) = T.concatMap efun base
  where
    base = TE.decodeUtf8 p
    escChars = S.fromList ".[]()\\{}^$*+"
    efun c = if S.member c escChars then T.concat ["\\", T.singleton c] else T.singleton c
globRegex sep (GAlternate a) = if V.null alts then "" else T.concat [altsStr, if hasEmpty then "?" else ""]
  where
    hasEmpty = isJust $ V.find (== GEmpty) a
    alts = fmap (globRegex sep) $ V.filter (/= GEmpty) a
    altsStr = wrapParens $ T.intercalate "|" $ V.toList alts
</code></pre>

gwu78 about 8 years ago

<a href="https://github.com/skarnet/execline/raw/master/src/execline/elglob.c" rel="nofollow">https://github.com/skarnet/execline/raw/master/src/execline/...</a>

<a href="https://github.com/skarnet/execline/raw/master/src/libexecline/exlsn_elglob.c" rel="nofollow">https://github.com/skarnet/execline/raw/master/src/libexecli...</a>

Simple.

<a href="http://www.in-ulm.de/~mascheck/various/argmax/" rel="nofollow">http://www.in-ulm.de/~mascheck/various/argmax/</a>

<pre><code>execlineb -c 'elglob a /*/*/*/* ls $a'
</code></pre>

(statically-linked execlineb)

If I am not mistaken, ARG_MAX will be the limit.

Straightforward.
