科技回声

1 comment

dasht大约 15 年前

Mr. Cox generally speaking has his act together on regular expressions - he's more or less joined the very small number of real experts on the topic. I especially admire his courage in promoting a regular expression engine that disregards popular features not found in true regular languages, such as backreferences.That sed, some pedantic gripes:1. It is not true (or is at least confusingly stated) that searches are linear (per the project page) in space and time in the lengths of the regular expression and the length of the target string. The simpler and more accurate statement is that searches are (worst case) linear time-wise in the product of the length of regexp and the search string, and O(1) for space.2. Credit where credit is due: the caching-DFA technique was first widely popularized in old editions of the so-called Dragon Book (Aho, Sethi, and Ullman's "Compilers: Principles, Techniques, and Tools." My copy was published in 1985. In personal correspondence with some old Bell Labs guys, I'm informed that someone (Thompson, as I recall) puzzled out the technique in the 1970s or perhaps the very early 1980s - but set it aside as being too hairy for their needs at the time. Cox should ask Pike to correct my memory on that. The technique has been practiced in a number of matchers since that time. In fact, an O(1) memory DFA caching implementation has been available for more than a decade as free software although its author (uh, me) hastens to add that he has little doubt that Mr. Cox has an implementation with novel charms that was well worth doing. My implementation got too hairy in some ways.3. Perhaps it will change over time but in this release, at least per the documentation, when the cache of DFA states is filled, the entire cache is discarded. This nearly needlessly limits the range of applicability of the matcher. An incremental approach to cache clearing using a (perhaps weighted) LRU strategy would expand the reach of this implementation at relatively little cost. This is not to say that Mr. Cox made a bad decision taking the simpler approach first (mumble mumble premature optimization mumble evil mumble mumble just constant factors something something something the dark side).4. His inner loop looks like it uses way more instructions than are reasonably called for although, to be sure, to be certain of that I'd have to know what requirements he's constrained to satisfy. When running out of the DFA cache, my implementation could (at least when last measured) get by on something like 12 or 20 instructions per character in the string being searched. Perhaps I'm mis-reading the code but this thing looks like it has much higher "constant factors".To be clear: I'm a fan of Mr. Cox's series of articles (written over several years!) about regexps. I like his general approach so much that I would wish I'd tried it first (except that I did, just wrong place, wrong time). I think the topic is sufficiently cool that I mention the above pedantic points mainly for the benefit of third parties, to give them some additional entry points to learn about it. Not trying to damn Mr. Cox at all, yadda yadda. On the contrary, he seems to have arrived in the weird little regular expression engine zone. Nice calling card.

RE2: robust regexp library by Google

1 comment

RE2: robust regexp library by Google

1 comment