TE
TechEcho
Home24h TopNewestBestAskShowJobs
GitHubTwitter
Home

TechEcho

A tech news platform built with Next.js, providing global tech news and discussions.

GitHubTwitter

Home

HomeNewestBestAskShowJobs

Resources

HackerNews APIOriginal HackerNewsNext.js

© 2025 TechEcho. All rights reserved.

Glush: A robust parser compiler built using non-deterministic automatons

84 pointsby evenwover 5 years ago

10 comments

chrisaycockover 5 years ago
<i>&gt; However, in practice it turns out that LALR(1) isn’t always the answer. For instance, both GCC, Rust and Go have chosen to use handwritten parsers that are not based on a declarative grammar file. I find this disappointing: We have decades of experience with specifying languages in a declarative format, and apparently it’s still easier to manually write parsers. Obviously there’s something lacking with the algorithms we have today.</i><p>Many production compilers have hand-written parsers because of error reporting, not because of issues with parser-generator algorithms.<p>Consider the example of missing parenthesis. I can specify a grammar with the following:<p><pre><code> EXPR = ... | &#x27;(&#x27; EXPR &#x27;)&#x27; </code></pre> The generated parser will not gracefully report a mismatched number of parentheses! Instead, it will report that the user&#x27;s input is not formatted as expected, which makes it hard for the user to understand what went wrong.<p>It is possible to modify the grammar to look for this case specifically:<p><pre><code> EXPR = ... | &#x27;(&#x27; EXPR &#x27;)&#x27; | &#x27;(&#x27; EXPR &#x2F;&#x2F; report this explicitly </code></pre> But now I&#x27;ve just polluted my once-pristine grammar file. And adding all of the necessary error checks throughout the grammar will really make things complicated.<p>So, many production compilers end-up with custom parsers to report unexpected input in a sane way. It is not that the algorithms are lacking; there is simply no mechanism for a generator to know what a mismatched parenthesis should be reported as.
评论 #21788875 未加载
评论 #21788220 未加载
评论 #21801461 未加载
评论 #21790082 未加载
评论 #21788773 未加载
abecedariusover 5 years ago
Sounds promising! And a great exposition of the background and development.<p>About the Glushkov construction for regular expressions, here&#x27;s some Python code that may <i>perhaps</i> help to explain it, since the Wikipedia page is hard to follow (at least for me) and the Mastodon thread linked in the post admitted to some confusion about the &quot;Play on Regular Expressions&quot; paper (which I haven&#x27;t read): <a href="https:&#x2F;&#x2F;github.com&#x2F;darius&#x2F;sketchbook&#x2F;blob&#x2F;master&#x2F;regex&#x2F;nfa_position.py" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;darius&#x2F;sketchbook&#x2F;blob&#x2F;master&#x2F;regex&#x2F;nfa_p...</a><p>Sorry it&#x27;s not pedagogically great either; I was rederiving the method for myself rather than implementing from textbooks. What I called &#x27;follows&#x27; in the comment should correspond to &#x27;pair set&#x27; in the post; the other fields have the same names.<p>Incidentally I think the post a bit overstates the problems with PEGs (you shouldn&#x27;t need left recursion anywhere a hand-written recursive-descent parser would also work; PEGs were meant to streamline and formalize recursive descent, and this failure to model the practice can be fixed without big changes) -- but CFGs are great too and I&#x27;m looking forward to studying Glush. I wish I&#x27;d thought of exploring the generalization to context-free grammars.
erezshover 5 years ago
That sounds really cool, and I hope it will find success. The field of parsing truly needs some refreshment, especially in usability.<p>I do have a few questions:<p>1. You claim the algorithm used guarantees O(n^3) for all CFGs, but is different from Earley and CYK. Are there more details for this algorithm, perhaps with a rigorous mathematical proof for this claim?<p>2. How is Glush&#x27;s performance in relation to Earley, in terms of constant time and memory requirements? (even for low complexity, a high constant can be problematic)<p>3. How do you deal with ambiguity? The blog mentions priority, but is there a way to get something similar to a SPPF?
评论 #21801401 未加载
anhldbkover 5 years ago
Thanks for sharing. Currently I&#x27;m looking for an elegant language for querying Json structures. In addition to JsonPath, JMESPath, Glush &amp; GROQ is definitely worth a try
评论 #21787711 未加载
评论 #21788420 未加载
评论 #21787537 未加载
sshineover 5 years ago
Skipping to &quot;Glush, the algorithm&quot; can be helpful if you already know the history of parsing.<p>TL;DR: Glush grammars consist of &quot;rules&quot; that are regexes + recursion of rule calls. Compiles to NFAs.<p>I&#x27;m not sure I see the point in declaring precedence levels inside rules -- maybe it&#x27;s just a preference thing, but I like having operator precedences in a separate section of the grammar. Yacc does this. Megaparsec for Haskell does this.<p>This reminds me of two things:<p>1. How syntax highlighting is implemented in Vim and GEdit (which is to say GtkSourceView). See for example how &quot;contexts&quot; can be nested here: <a href="https:&#x2F;&#x2F;github.com&#x2F;rubencaro&#x2F;gedit_hacks&#x2F;blob&#x2F;master&#x2F;.local&#x2F;share&#x2F;gtksourceview-3.0&#x2F;language-specs&#x2F;ruby.lang#L469" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;rubencaro&#x2F;gedit_hacks&#x2F;blob&#x2F;master&#x2F;.local&#x2F;...</a><p>2. Kleenex: <a href="https:&#x2F;&#x2F;kleenexlang.org" rel="nofollow">https:&#x2F;&#x2F;kleenexlang.org</a> - <a href="https:&#x2F;&#x2F;github.com&#x2F;diku-kmc&#x2F;kleenexlang" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;diku-kmc&#x2F;kleenexlang</a> - a silly example: <a href="https:&#x2F;&#x2F;github.com&#x2F;athas&#x2F;EggsML&#x2F;blob&#x2F;master&#x2F;concieggs&#x2F;compiled&#x2F;makeDanish.kex" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;athas&#x2F;EggsML&#x2F;blob&#x2F;master&#x2F;concieggs&#x2F;compil...</a><p>I am tempted to classify this as &quot;regex extended with recursion&quot; rather than &quot;an alternative to CFGs&quot;, since I cannot decipher from the article the exact expressive power.
评论 #21788969 未加载
tom_melliorover 5 years ago
Nice article, this looks like a useful parser generator. Though, as others have mentioned, error reporting is a very important point that should be considered even at this early stage.<p>Here&#x27;s something that surprised me though:<p>&gt; a parser toolkit that lets you create efficient parsers in multiple languages (currently JavaScript, Go, and Ruby)<p>&gt; [some other approach] it looked very interesting, but it’s not completely smooth. First of all, all of the implementations have been in highly expressive, garbage collected languages (Haskell, Scala, Racket).<p>This isn&#x27;t the author&#x27;s only reason to dismiss that alternative approach, but it&#x27;s still a strange point to spend almost an entire paragraph on. If all the targets the author currently envisions are GC&#x27;d languages, it&#x27;s premature pessimization to worry about how other, hypothetical, targets might be accommodated.
评论 #21801504 未加载
IshKebabover 5 years ago
No mention of Tree-sitter (<a href="http:&#x2F;&#x2F;tree-sitter.github.io&#x2F;tree-sitter&#x2F;" rel="nofollow">http:&#x2F;&#x2F;tree-sitter.github.io&#x2F;tree-sitter&#x2F;</a>), which this looks very similar to.
vinodkdover 5 years ago
Fascinating journey through various parsing techniques, aside from describing glush itelf. bookmarked!
7532yahoogmailover 5 years ago
Intrigued .. keep posting. I tried a bison&#x2F;flex parser and was stuck in shift&#x2F;reduce hell.
评论 #21787819 未加载
lopsidedBrainover 5 years ago
I would have <i>really</i> preferred a table of contents, either at the top or in a sidebar. It would have made it easier to skip over all the background materials that essentially rehashes an undergrad complier course.
评论 #21787805 未加载
评论 #21787776 未加载
评论 #21787768 未加载