<i>I have not looked at the other linear-time implementations to see what they do, but I expect they all use one of these two approaches.</i><p>Python's glob() (fnmatch really) translates the glob to a regular expression then uses its re library:<p><a href="https://github.com/python-git/python/blob/715a6e5035bb21ac49382772076ec4c630d6e960/Lib/fnmatch.py#L72" rel="nofollow">https://github.com/python-git/python/blob/715a6e5035bb21ac49...</a>
Previously: "Regular Expression Matching Can Be Simple And Fast" (2007) <a href="https://swtch.com/~rsc/regexp/regexp1.html" rel="nofollow">https://swtch.com/~rsc/regexp/regexp1.html</a>
The paper deals with "Thompson NFA" approach to regex, with low computational complexity.<p>Other Russ' papers on regular expression matching: <a href="https://swtch.com/~rsc/regexp/" rel="nofollow">https://swtch.com/~rsc/regexp/</a>
Interesting how the Rust implementation of glob currently seems to be the slowest out of the linear time implementations. I guess maybe not too much optimisation effort was put into it?
There's another way for glob() implementations to mitigate these sort of patterns that Russ doesn't discuss, but can be inferred from a careful reading of the different examples in this new glob() article & the 2007 regex article.<p>In the regex article he notes that e.g. perl is subject to pathological behavior when you match a?^na^n against an a^n:<p><pre><code> $ time perl -wE 'my $l = shift; my $str = "a" x $l; my $rx = "a?" x $l . $str; $str =~ /${rx}/' 28
real 0m13.278s
</code></pre>
However changing the pattern to /${rx}b/ makes it execute almost instantly. This is because the matcher will look ahead for fixed non-pattern strings found in the pattern, and deduce that whatever globbing we're trying to match now it can't possibly matter if the string doesn't have a "b" in it.<p>I wonder if any globbing implementations take advantage of that class of optimization, and if there's any cases where Russ's suggested solution of not backtracking produces different results than you'd get by backtracking, in particular with some of the extended non-POSIX glob syntax out there.
OP, what version(s) of the BSD libc did you test? What OS, which version of the OS.<p>macOS only? NetBSD? FreeBSD? OpenBSD?<p>If you tested on FreeBSD, please file a bug at <a href="https://bugs.freebsd.org/bugzilla/enter_bug.cgi?product=Base%20System" rel="nofollow">https://bugs.freebsd.org/bugzilla/enter_bug.cgi?product=Base...</a><p>I'm not a project member but I'm a user of the system so it's in my interest that issues like this are resolved.<p>Please let me know whether or not you file a bug so that if you do I don't duplicate bug reports and if you don't I can do some benchmarking myself.
Slightly off-topic, but anyone know what he's using to generate those inline SVG graphs? I've been looking for some easy to use graphing library like that to present similar performance numbers on a webpage.
Not sure if OP is author, but if you are, just to inform you, there is a small typo in this paragraph:<p>"Unfortunately, none of <i>tehse</i> protections address the cost of matching a single path element of a single file name. In 2005, CVE-2005-0256 was issued for a DoS vulnerability in WU-FTPD 2.6.2, because it ran for a very long time finding even a single match during:"<p>Very informative article. Thanks for it!
The bsd derived glob has other functionality that I assume isn't simple or fast:<p><pre><code> perl -MFile::Glob=bsd_glob -e 'print bsd_glob("{{a,b,c}{1,2,3}{{yuck,Yuck},{urgh,URGH}}}\n")'
</code></pre>
Produces 36 lines representing all the iterations. Nest a bit deeper and it gets unwieldy.
I wonder whether it would help to match from both sides (start and end) simultaneously, since you know you're not looking in the middle of the string. You also don't care about capture groups.
For fun, I ran this against node-glob ( <a href="https://github.com/isaacs/node-glob" rel="nofollow">https://github.com/isaacs/node-glob</a> ).<p>Looks like it exhibits the slower behavior:<p><pre><code> n,elapsed
1,0.07
2,0.07
3,0.07
4,0.07
5,0.16
6,1.43
7,19.90
8,240.76
</code></pre>
See this gist for the script <a href="https://gist.github.com/mixu/e4803da16e42439480eba2b29fa44484" rel="nofollow">https://gist.github.com/mixu/e4803da16e42439480eba2b29fa4448...</a>
> <i>Graphical FTP clients typically use the MLST and MLSD commands</i><p>Do not count WWW browsers amongst the number of those graphical FTP clients. The common WWW browsers that speak FTP use LIST or LIST -l . With the exception of Google Chrome when it thinks that it is talking to a VMS program, they do not pass pattern arguments, though.
I tested Common Lisp. SBCL seems to be exponential while Clozure CL is not.<p>However it should be noted that it is non portable to do globbing in Common Lisp, so I expect most users implement it using something CL-FAD or OSICAT and CL-PPCRE, and CL-PPCRE is efficient.
I've been playing around with my own glob implementation. From what I've seen, the simplified algorithm mentioned in the article wouldn't be able to handle question marks. In particular, I don't think a non-backtracking algorithm can handle a pattern like "?a<i>?a</i>?a<i>?a</i>?b". I've been working to minimize the worst-case behavior, but it's tricky.
Sorry, but the implementation posted is O(|pattern| * |name|), not linear.
<a href="http://ideone.com/2xCXyY" rel="nofollow">http://ideone.com/2xCXyY</a>
We independently reinvented an adaptation of this algorithm for Monte's "simple" quasiliteral, which does simple string interpolation and matching. The code at <a href="https://github.com/monte-language/typhon/blob/master/mast/prelude/simple.mt#L68-L121" rel="nofollow">https://github.com/monte-language/typhon/blob/master/mast/pr...</a> is somewhat similar in appearance and structure to the examples in the post.<p><pre><code> def name := "Hackernews"
# greeting == "Hi Hackernews!"
def greeting := `Hi $name!`
# language == "Lojban"
def `@language is awesome` := "Lojban is awesome"
</code></pre>
A quirk of our presentation is that adjacent zero-or-more patterns degenerate, with each subsequent pattern matching the empty string. This mirrors the observation in the post that some systems can coalesce adjacent stars without changing the semantics:<p><pre><code> # one == "", two == "cool"
def `adjacent @one@two patterns` := "adjacent cool patterns"</code></pre>
<a href="https://github.com/skarnet/execline/raw/master/src/execline/elglob.c" rel="nofollow">https://github.com/skarnet/execline/raw/master/src/execline/...</a><p><a href="https://github.com/skarnet/execline/raw/master/src/libexecline/exlsn_elglob.c" rel="nofollow">https://github.com/skarnet/execline/raw/master/src/libexecli...</a><p>Simple.<p><a href="http://www.in-ulm.de/~mascheck/various/argmax/" rel="nofollow">http://www.in-ulm.de/~mascheck/various/argmax/</a><p><pre><code> execlineb -c 'elglob a /*/*/*/* ls $a'
</code></pre>
(statically-linked execlineb)<p>If I am not mistken, ARG_MAX will be the limit.<p>Straightforward.