There was an interesting discussion two years ago regarding nonobvious issues with PEGs:<p><a href="https://news.ycombinator.com/item?id=30414683">https://news.ycombinator.com/item?id=30414683</a><p><a href="https://news.ycombinator.com/item?id=30414879">https://news.ycombinator.com/item?id=30414879</a><p>I spent a year or two working with PEGs, and ran into similar issues multiple times. Adding a new production could totally screw up seemingly unrelated parses that worked fine before.<p>As the author points out, Earley parsing with some disambiguation rules (production precedence, etc.) has been much less finicky/annoying to work with. It's also reasonably fast for small parses, even with a naive implementation. I'd suggest it for prototyping, or for settings where runtime ambiguity is not a showstopper, despite the remaining issues the article describes re: needing a separate lexer.
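A toy sketch of the ordered-choice behavior behind this (the matcher and grammar here are hypothetical, not from any particular PEG library): because a PEG commits to the first alternative that succeeds, adding a production in front of an existing one can silently change a parse that used to work:

```python
def match_choice(alternatives, text):
    """Minimal PEG-style ordered choice: try alternatives left to
    right, commit to the first that matches at position 0, and
    return the number of characters consumed (or None)."""
    for alt in alternatives:
        if text.startswith(alt):
            return len(alt)
    return None

# Original grammar: expr <- "ab"
print(match_choice(["ab"], "ab"))       # consumes 2 characters

# Add a seemingly harmless production in front: expr <- "a" / "ab"
# "a" now succeeds first and "ab" is never tried, so the same input
# only consumes 1 character -- a previously working parse changed.
print(match_choice(["a", "ab"], "ab"))  # consumes 1 character
```

A CFG-based parser (Earley included) would consider both alternatives instead of committing to the first, which is why the failure mode above doesn't occur there.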
Parsing computer languages is an entirely self-inflicted problem. You can easily design a language so it doesn't require any parsing techniques that were not known and practical in 1965, and readability will benefit greatly as well.
Related:<p><i>Parsing: The Solved Problem That Isn't (2011)</i> - <a href="https://news.ycombinator.com/item?id=8505382">https://news.ycombinator.com/item?id=8505382</a> - Oct 2014 (70 comments)<p><i>Parsing: the solved problem that isn't</i> - <a href="https://news.ycombinator.com/item?id=2327313">https://news.ycombinator.com/item?id=2327313</a> - March 2011 (47 comments)
After using Instaparse, it at least felt like a solved problem: <a href="https://github.com/Engelberg/instaparse">https://github.com/Engelberg/instaparse</a>
What I find annoying about using parser generators is that it always feels messy integrating the resulting parser into your application. So you write a file that contains the grammar and generate a parser out of that. Now you build it into your app and call into it to parse some input file, but that ends up giving you some poorly typed AST that is cluttered/hard to work with.<p>Certain parser generators make life easier by supporting actions on parser/lexer rules. This is great and all, but it has the downside that the grammar you provide is no longer reusable. There's no way for others to import that grammar and provide custom actions for them.<p>I don't know. In my opinion parsing theory is already solved. Whether it's PEG, LL, LR, LALR, whatever. One of those is certainly good enough for the kind of data you're trying to parse. I think the biggest annoyance is the tooling.
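One way around the reusability problem (a hypothetical sketch, not any particular generator's API) is to have the generated parser emit a generic tagged tree and keep the actions outside the grammar in a dispatch table, so different consumers can reuse the same grammar with their own actions:

```python
# Suppose the generated parser hands back a generic, untyped tree:
tree = ("add", ("num", "1"), ("num", "2"))

def walk(node, actions):
    """Fold the tree bottom-up, looking each node's action up by tag.
    Leaf strings are passed through to their parent's action."""
    tag, *children = node
    results = [c if isinstance(c, str) else walk(c, actions) for c in children]
    return actions[tag](*results)

# Two independent sets of actions over the same grammar's output:
evaluate = {"add": lambda a, b: a + b, "num": int}
to_sexpr = {"add": lambda a, b: f"(+ {a} {b})", "num": str}

print(walk(tree, evaluate))  # 3
print(walk(tree, to_sexpr))  # (+ 1 2)
```

The grammar stays a standalone artifact; the actions are just data that anyone importing it can swap out.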
Common example of complications of two grammars being combined: C code and character strings.<p>Double quotes in C code mean begin and end of a string. But strings contain quotes too. And newlines. Etc.<p>So we got the cumbersome invention of escape codes, and so character strings in source (itself a character string) are not literally the strings they represent.
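To make that concrete (using Python for brevity, since it inherits C's escape convention): the literal as it appears in the source file and the string it denotes are different lengths, because escapes like <i>\"</i> and <i>\n</i> are two source characters encoding one string character:

```python
# The literal exactly as it sits in the source file (raw string,
# so the backslashes and quotes are kept verbatim):
source_text = r'"He said \"hi\"\n"'

# The string that literal actually denotes:
represented = "He said \"hi\"\n"

print(len(source_text))  # 18 characters of source
print(len(represented))  # 13 characters of actual string
```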
My current view of what makes parsing so difficult is that people want to jump straight over a ton of intermediate things from parsing to execution. That is, we often know what we want to happen at the end. And we know what we are given. It is hoped that it is a trivially mechanical problem to go from one to the other.<p>But this ignores all sorts of other steps you can take. Targeting multiple execution environments is an obvious step. Optimization is another. Trivial local optimizations like shifts over multiplications by 2 and fusing operations to take advantage of the machine that is executing it. Less trivial full program optimizations that can propagate constants across source files.<p>And preemptive execution is a huge consideration, of course. Very little code runs in a way that can't be interrupted for some other code to run in the meantime. To the point that we don't even think of what this implies anymore. Despite accumulators being a very basic execution unit on most every computer. (Though, I think I'm thankful that reentrancy is the norm nowadays in functions.)
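One of those intermediate steps fits in a few lines. This is a hypothetical local strength-reduction pass over a toy tuple-based AST (not any real compiler's IR), turning the multiplication-by-2 case mentioned above into a shift:

```python
def strength_reduce(node):
    """Recursively rewrite ("mul", x, 2) into ("shl", x, 1);
    leave everything else structurally intact."""
    if isinstance(node, tuple):
        op, *args = node
        args = [strength_reduce(a) for a in args]
        if op == "mul" and args[1] == 2:
            return ("shl", args[0], 1)
        return (op, *args)
    return node  # leaf: variable name or constant

print(strength_reduce(("mul", "x", 2)))
# ('shl', 'x', 1)
print(strength_reduce(("add", ("mul", "x", 2), 5)))
# ('add', ('shl', 'x', 1), 5)
```

A real pipeline would run many such passes between the parse tree and execution, which is the point: parsing is only the first of the steps being skipped over.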