A thread from 2016: <a href="https://news.ycombinator.com/item?id=12199836" rel="nofollow">https://news.ycombinator.com/item?id=12199836</a><p>2013: <a href="https://news.ycombinator.com/item?id=5672875" rel="nofollow">https://news.ycombinator.com/item?id=5672875</a><p>2011: <a href="https://news.ycombinator.com/item?id=2723366" rel="nofollow">https://news.ycombinator.com/item?id=2723366</a><p>Maybe others?<p>p.s. these links are just for curious readers; reposts are ok after a year or so—see <a href="https://news.ycombinator.com/newsfaq.html" rel="nofollow">https://news.ycombinator.com/newsfaq.html</a>.
This is chapter 1 of "Beautiful Code" (<a href="http://shop.oreilly.com/product/9780596510046.do" rel="nofollow">http://shop.oreilly.com/product/9780596510046.do</a>)
These posts are great for history, but regular expression implementation has moved on considerably from early Thompson implementations whether backtracking or NFAs. There is a considerable body of literature about regex implementation, including many quite convincing implementations and alternate formulations, some of which weren't even done by people at, or from, Bell Labs!<p>We seem to have something of an Eternal September of regex implementation knowledge (abetted by Russ Cox's amazingly selective bibliography in his posts introducing RE2).
I saw this in the repo of single file language implementations that was posted yesterday. Glad to see it on the front page now.<p>Here’s the repo if anyone is interested in checking out the others:<p><a href="https://github.com/marcpaq/b1fipl" rel="nofollow">https://github.com/marcpaq/b1fipl</a>
Skimming over it, the implementation looks inefficient:<p><pre><code> int matchstar(int c, char *regexp, char *text)
{
do { /* a * matches zero or more instances */
if (matchhere(regexp, text))
return 1;
} while (*text != '\0' && (*text++ == c || c == '.'));
return 0;
}
</code></pre>
That can be used to match short strings, but not to grep through a filesystem. A good regex matcher has runtime O(N*M) where N is the length of the regex (typically very short) and M is the length of the scanned text.
Seeing code like this I can understand why the pioneers though C code could be pretty.<p>But now all I see there is something more akin to a demonstration of ice carving with chainsaws than something that should be used in a production system.