<i>> However, in practice it turns out that LALR(1) isn’t always the answer. For instance, both GCC, Rust and Go have chosen to use handwritten parsers that are not based on a declarative grammar file. I find this disappointing: We have decades of experience with specifying languages in a declarative format, and apparently it’s still easier to manually write parsers. Obviously there’s something lacking with the algorithms we have today.</i><p>Many production compilers have hand-written parsers because of error reporting, not because of issues with parser-generator algorithms.<p>Consider the example of a missing parenthesis. I can specify a grammar with the following:<p><pre><code> EXPR =
...
| '(' EXPR ')'
</code></pre>
The generated parser will not gracefully report a mismatched number of parentheses! Instead, it will report that the user's input is not formatted as expected, which makes it hard for the user to understand what went wrong.<p>It is possible to modify the grammar to look for this case specifically:<p><pre><code> EXPR =
...
| '(' EXPR ')'
| '(' EXPR // report this explicitly
</code></pre>
But now I've just polluted my once-pristine grammar file, and adding all of the necessary error checks throughout the grammar really complicates things.<p>So, many production compilers end up with custom parsers that report unexpected input in a sane way. It is not that the algorithms are lacking; there is simply no mechanism for a generator to know how a mismatched parenthesis should be reported.
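<p>For what it's worth, here's a minimal hand-written sketch in Python (my own illustration, not how GCC, Rust, or Go actually do it) of the kind of targeted message the grammar author wants and a generated parser can't easily produce:<p><pre><code> # Hypothetical sketch: a hand-written recursive-descent parser knows
 # exactly which construct failed and can say so.
 def parse_expr(toks, i=0):
     if i >= len(toks):
         raise SyntaxError("expected an expression, got end of input")
     if toks[i] == '(':
         i, inner = parse_expr(toks, i + 1)
         if i < len(toks) and toks[i] == ')':
             return i + 1, ('paren', inner)
         # Report the mismatched parenthesis explicitly:
         raise SyntaxError("missing ')' to close '(' -- got %s" %
                           (repr(toks[i]) if i < len(toks) else "end of input"))
     return i + 1, ('atom', toks[i])

 # parse_expr(['(', 'x'])  ->  SyntaxError: missing ')' to close '(' -- got end of input
 </code></pre>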
Sounds promising! And a great exposition of the background and development.<p>About the Glushkov construction for regular expressions, here's some Python code that may <i>perhaps</i> help explain it, since the Wikipedia page is hard to follow (at least for me) and the Mastodon thread linked in the post admitted to some confusion about the "Play on Regular Expressions" paper (which I haven't read): <a href="https://github.com/darius/sketchbook/blob/master/regex/nfa_position.py" rel="nofollow">https://github.com/darius/sketchbook/blob/master/regex/nfa_p...</a><p>Sorry it's not pedagogically great either; I was rederiving the method for myself rather than implementing from textbooks. What I called 'follows' in the comment should correspond to 'pair set' in the post; the other fields have the same names.<p>Incidentally, I think the post overstates the problems with PEGs a bit (you shouldn't need left recursion anywhere a hand-written recursive-descent parser would also work; PEGs were meant to streamline and formalize recursive descent, and this failure to model the practice can be fixed without big changes) -- but CFGs are great too, and I'm looking forward to studying Glush. I wish I'd thought of exploring the generalization to context-free grammars.
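<p>In case the full file is too much to wade through, here's an even smaller standalone sketch of those four fields (nullable / first / last / pair set) on a toy tuple-based AST I just made up for illustration:<p><pre><code> # Toy sketch, not the linked code: regexes as nested tuples, where
 # ('lit', i, ch) carries a distinct position number i for each occurrence.
 def nullable(e):
     tag = e[0]
     if tag == 'lit':  return False
     if tag == 'alt':  return nullable(e[1]) or nullable(e[2])
     if tag == 'cat':  return nullable(e[1]) and nullable(e[2])
     if tag == 'star': return True

 def first(e):   # positions that can start a match
     tag = e[0]
     if tag == 'lit':  return {e[1]}
     if tag == 'alt':  return first(e[1]) | first(e[2])
     if tag == 'cat':  return first(e[1]) | (first(e[2]) if nullable(e[1]) else set())
     if tag == 'star': return first(e[1])

 def last(e):    # positions that can end a match
     tag = e[0]
     if tag == 'lit':  return {e[1]}
     if tag == 'alt':  return last(e[1]) | last(e[2])
     if tag == 'cat':  return last(e[2]) | (last(e[1]) if nullable(e[2]) else set())
     if tag == 'star': return last(e[1])

 def pairs(e):   # the "pair set": which position may immediately follow which
     tag = e[0]
     if tag == 'lit':  return set()
     if tag == 'alt':  return pairs(e[1]) | pairs(e[2])
     if tag == 'cat':  return pairs(e[1]) | pairs(e[2]) | {(p, q) for p in last(e[1]) for q in first(e[2])}
     if tag == 'star': return pairs(e[1]) | {(p, q) for p in last(e[1]) for q in first(e[1])}

 # (ab)*c with positions a=1, b=2, c=3:
 # e = ('cat', ('star', ('cat', ('lit', 1, 'a'), ('lit', 2, 'b'))), ('lit', 3, 'c'))
 # first(e) == {1, 3}, last(e) == {3}, pairs(e) == {(1, 2), (2, 1), (2, 3)}
 </code></pre>The NFA then has one state per position plus a start state: the start state has a transition to each position in first() on that position's symbol, p has a transition to q on q's symbol for each (p, q) in the pair set, and the accepting states are last() (plus the start state if the whole regex is nullable).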
That sounds really cool, and I hope it will find success. The field of parsing could really use some fresh ideas, especially around usability.<p>I do have a few questions:<p>1. You claim the algorithm guarantees O(n^3) worst-case for all CFGs, yet differs from Earley and CYK. Are there more details on this algorithm, perhaps with a rigorous mathematical proof of this claim?<p>2. How does Glush's performance compare to Earley's, in terms of constant factors and memory requirements? (Even at low asymptotic complexity, a high constant can be problematic.)<p>3. How do you deal with ambiguity? The blog mentions priority, but is there a way to get something similar to an SPPF?
Thanks for sharing. I'm currently looking for an elegant language for querying JSON structures. In addition to JsonPath and JMESPath, Glush & GROQ are definitely worth a try.
Skipping to "Glush, the algorithm" can be helpful if you already know the history of parsing.<p>TL;DR: Glush grammars consist of "rules" that are regexes + recursion of rule calls. Compiles to NFAs.<p>I'm not sure I see the point in declaring precedence levels inside rules -- maybe it's just a preference thing, but I like having operator precedences in a separate section of the grammar. Yacc does this. Megaparsec for Haskell does this.<p>This reminds me of two things:<p>1. How syntax highlighting is implemented in Vim and GEdit (which is to say GtkSourceView). See for example how "contexts" can be nested here: <a href="https://github.com/rubencaro/gedit_hacks/blob/master/.local/share/gtksourceview-3.0/language-specs/ruby.lang#L469" rel="nofollow">https://github.com/rubencaro/gedit_hacks/blob/master/.local/...</a><p>2. Kleenex: <a href="https://kleenexlang.org" rel="nofollow">https://kleenexlang.org</a> - <a href="https://github.com/diku-kmc/kleenexlang" rel="nofollow">https://github.com/diku-kmc/kleenexlang</a> - a silly example: <a href="https://github.com/athas/EggsML/blob/master/concieggs/compiled/makeDanish.kex" rel="nofollow">https://github.com/athas/EggsML/blob/master/concieggs/compil...</a><p>I am tempted to classify this as "regex extended with recursion" rather than "an alternative to CFGs", since I cannot decipher from the article the exact expressive power.
Nice article, this looks like a useful parser generator. Though, as others have mentioned, error reporting is a very important point that should be considered even at this early stage.<p>Here's something that surprised me though:<p>> a parser toolkit that lets you create efficient parsers in multiple languages (currently JavaScript, Go, and Ruby)<p>> [some other approach] it looked very interesting, but it’s not completely smooth. First of all, all of the implementations have been in highly expressive, garbage collected languages (Haskell, Scala, Racket).<p>This isn't the author's only reason to dismiss that alternative approach, but it's still a strange point to spend almost an entire paragraph on. If all the targets the author currently envisions are GC'd languages, it's premature pessimization to worry about how other, hypothetical, targets might be accommodated.
No mention of Tree-sitter (<a href="http://tree-sitter.github.io/tree-sitter/" rel="nofollow">http://tree-sitter.github.io/tree-sitter/</a>), which this looks very similar to.
I would have <i>really</i> preferred a table of contents, either at the top or in a sidebar. It would have made it easier to skip over all the background material that essentially rehashes an undergrad compiler course.