Ohm: Parsing Made Easy

248 points by pdubroy over 7 years ago

17 comments

pdubroy over 7 years ago

Hi HN, I'm a researcher at HARC (<a href="https://harc.ycr.org/" rel="nofollow">https://harc.ycr.org/</a>) and one of the authors of Ohm. We've used it to power several of our programming language investigations, such as Seymour (which was on HN yesterday: <a href="https://news.ycombinator.com/item?id=15471954" rel="nofollow">https://news.ycombinator.com/item?id=15471954</a>) and Chorus (<a href="http://www.chorus-home.org/" rel="nofollow">http://www.chorus-home.org/</a>).

If you're interested, here's the grammar for the language used in the Seymour demo: <a href="https://github.com/harc/seymour/blob/6f55361ad3410f42f67f183da7b7549418884e50/lang/grammar.js" rel="nofollow">https://github.com/harc/seymour/blob/6f55361ad3410f42f67f183...</a>

Happy to answer any questions that you have!

bd82 over 7 years ago

Ohm is very impressive. Specifically:

<pre><code>1. The separation of Grammar and Semantics.
2. Handling left recursion in a top down (PEG) parser.
3. Incremental parsing.
</code></pre>

I think that the one feature missing to make it applicable for more than rapid prototyping and teaching purposes is performance.

In this benchmark I've authored (<a href="http://sap.github.io/chevrotain/performance/" rel="nofollow">http://sap.github.io/chevrotain/performance/</a>), which uses a simple JSON grammar, it is about two orders of magnitude slower than most other parsing libraries in JavaScript.

So I am sure there is a great deal of room for optimization.
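
For readers wondering how point 2 can work at all, here is a minimal sketch of "seed growing", one known way to support direct left recursion in a memoizing top-down parser. It is an assumption that this resembles what Ohm does internally; the function names (`parse_num`, `expr_body`, `parse_expr`) are made up for illustration, and the grammar is just `expr = expr "-" num | num`.

```python
def parse_num(text, pos):
    """Match one or more digits at pos; return (value, end) or None."""
    end = pos
    while end < len(text) and text[end].isdigit():
        end += 1
    if end == pos:
        return None
    return int(text[pos:end]), end


def expr_body(text, pos, memo):
    """One attempt at the rule body: expr "-" num | num."""
    left = parse_expr(text, pos, memo)  # left-recursive call, answered from memo
    if left is not None:
        value, p = left
        if p < len(text) and text[p] == "-":
            right = parse_num(text, p + 1)
            if right is not None:
                return value - right[0], right[1]
    return parse_num(text, pos)  # second alternative


def parse_expr(text, pos, memo):
    """Grow the memoized result at pos until it stops consuming more input."""
    # The memo is keyed by position only because this sketch has a single
    # left-recursive rule; a real parser would key on (rule, position).
    if pos in memo:
        return memo[pos]
    memo[pos] = None  # seed: pretend the rule fails here
    while True:
        result = expr_body(text, pos, memo)
        previous = memo[pos]
        if result is None or (previous is not None and result[1] <= previous[1]):
            break  # no progress, so the previous result is final
        memo[pos] = result
    return memo[pos]


print(parse_expr("9-3-2", 0, {}))
```

The result for `9-3-2` is built as `(9-3)-2`, so the left-recursive rule gives left-associative subtraction without rewriting the grammar.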

iamleppert over 7 years ago

I've really tried to get on with parser generators, but I've found they are hard to use, hard to debug and the languages/DSLs are clunky and weird. Except for cleanroom academic implementations, or for language designers who can afford the time and resources to learn and get good at a parser generator, I've found it's better to simply use regular expressions to do matching and a functional language that can build up a data structure recursively.

Another problem is that a lot of them resort to clunky code generation from the grammar file, and when something goes wrong you're not debugging the grammar per se, you are stepping through a bunch of machine-generated code that you didn't write yourself. So your debug process looks like: make a change to the grammar, regenerate the parser, try parsing again, loop, etc. It replaces the entire file too, so it's not like you can isolate areas of the code and work on them like you would regular code. And generating the parser is often slow.

Also, when runtime parsing errors do happen, incorrect line/column numbers are often reported, and getting good descriptive parser errors is a project in and of itself after you have your grammar written and working.

kvlr over 7 years ago

Hey, I’m one of the founders of Nextjournal, the coding, writing and publishing platform this article was written in.

This probably isn’t obvious: you can get a copy of the article and play with it if you click remix and sign in/up.

There’s some more context about what we’re trying to build and why in our launch post: <a href="https://medium.com/nextjournal/launch-nextjournal-public-beta-for-open-research-a55d15bfa95f" rel="nofollow">https://medium.com/nextjournal/launch-nextjournal-public-bet...</a>

chrislloyd over 7 years ago

I’ve used Ohm for a few small parsers. What’s great about the editor is that you can take it and share it with somebody else and they’re given insights into _how_ the parser works.

dman over 7 years ago

Any thoughts on how to handle parsing for the IDE use case, where the document being edited might have errors in it? I would usually expect an area around the cursor that receives edits and hence contains errors. I would also expect a header and footer surrounding the edited area that are structurally okay, since they are unchanged from a previously sound version of the file.
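
The "sound header and footer around a damaged middle" idea can be sketched in a few lines. This is only an illustration of finding the edited region by comparing the previous and current text; it is not how Ohm's incremental parsing is implemented, and `damaged_region` is a hypothetical helper name.

```python
def damaged_region(old: str, new: str):
    """Return (start, old_end, new_end) such that old[start:old_end]
    was replaced by new[start:new_end]; everything outside is unchanged."""
    limit = min(len(old), len(new))

    # Length of the shared header.
    prefix = 0
    while prefix < limit and old[prefix] == new[prefix]:
        prefix += 1

    # Length of the shared footer, never overlapping the header.
    suffix = 0
    while (suffix < limit - prefix
           and old[len(old) - 1 - suffix] == new[len(new) - 1 - suffix]):
        suffix += 1

    return prefix, len(old) - suffix, len(new) - suffix


old = "a = 1;\nb = 2;\nc = 3;\n"
new = "a = 1;\nb = (2;\nc = 3;\n"  # the user just typed an unbalanced "("
start, old_end, new_end = damaged_region(old, new)
print(repr(old[start:old_end]), "->", repr(new[start:new_end]))
```

A parser that caches results by position could, under this assumption, keep anything that lies entirely in the header, shift cached results in the footer by the change in length, and only re-parse (or report errors in) what overlaps the damaged span.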

simplify over 7 years ago

I've had great experience using PEG.js, another PEG-based parser generator. How does Ohm compare?

richard_shelton over 7 years ago

I really liked what was done in the STEPS project. I learned a lot from their reports. For example, this paper by Ian Piumarta is absolutely beautiful [1]. I also spent a lot of time learning the OMeta [3] system by Alessandro Warth.

And, honestly, now I see nothing really new in Ohm. Basically, it's just some tweaking of the same tech. Moreover, Ohm was made for the isolated task of parsing. For me it's a step back. My point is that parsing alone is not a very interesting thing; for making DSLs you need to have other tools too. In Ian Piumarta's paper we had a minimalistic program transformation system [2]. Remember the original META II [4]? It was a compiler-compiler (metacompiler), not just a parser generator. I'm really curious to know why the authors decided this time to limit themselves to parsing only.

[1] <a href="http://www.vpri.org/pdf/tr2010003_PEG.pdf" rel="nofollow">http://www.vpri.org/pdf/tr2010003_PEG.pdf</a>
[2] <a href="https://en.wikipedia.org/wiki/List_of_program_transformation_systems" rel="nofollow">https://en.wikipedia.org/wiki/List_of_program_transformation...</a>
[3] <a href="http://www.vpri.org/pdf/tr2008003_experimenting.pdf" rel="nofollow">http://www.vpri.org/pdf/tr2008003_experimenting.pdf</a>
[4] <a href="http://www.ibm-1401.info/Meta-II-schorre.pdf" rel="nofollow">http://www.ibm-1401.info/Meta-II-schorre.pdf</a>

derriz over 7 years ago

Sorry to be negative, and this comment probably doesn't belong in a discussion about a specific parsing toolkit, but I've become unconvinced that parser generators are useful. My experience is limited to Yacc/lex back in the old days (quickly jumped to Bison/flex), more recently Antlr and a couple of functional parser combinator libraries. In nearly all cases it was to deal with "real world" (i.e. not toy) programming languages.

The last time I needed a parser (in Java), I started studying the Antlr docs (it's changed quite a bit since I used it last) but became disillusioned quickly with the amount of reading and studying I would have to do to get something working.

So I quickly wrote a "hand crafted" tokenizer and recursive descent parser. I found this so satisfying that it made me wonder why I had bothered learning relatively complex tools in the past, particularly since I had been exposed to recursive descent parsing as an undergrad.

Advantages that pop into my head:

- The code was clean, readable and very concise. For debugging, the stacktraces were helpful and I could use my regular debugger/IDE to step through the parsing process. The method names in my Parser class mostly matched the names of corresponding grammar rules.

- You can code around the theoretical limitations of recursive descent parsing in a very intuitive manner (e.g. "if (tokens.peekAhead(1).getType() == Token.LEFT_BRACE) { parseX(); } else { parseY(); }"). In theory it might seem this would lead to a mess but it actually allows very flexible and natural abstractions.

- You have complete control over the building of the AST - the parseX(...) methods can take arguments or the calling parse method can manipulate the returned AST - doing stuff like flattening (normalising) node trees or re-ordering child nodes, etc. The shape of the AST can be independent of the structure of the grammar rules.

- It's easy to provide helpful error messages and even error recovery without fighting with the toolkit. Better still, you can start with a fairly lazy generic error handler and later, in a natural style, add special cases to make the messages more and more helpful for specific common user mistakes. I sneakily logged all parse failures by users to constantly improve error reporting. After a while the parser seemed almost like an AI when reporting errors.

- For parsing expressions, there is a relatively well-known way to deal with operators with different arities and associativity rules (by adding a numeric "context binding strength" parameter to your parseExpr() method) - a quick google provided the template.

- The entire parser was self contained in a small number of reasonably compact classes: a Lexer/Tokenizer class, a Parser class and a SymbolTable class (and of course a TokenType enum and an ASTNode class). Other developers could grok the code because it was compact and self contained without having to learn a parsing toolkit.

- You feel in control; i.e. you can add features to the language and the parser incrementally without fearing that sinking feeling you get when you think you're 99% of the way there only to realize that the tool you're using makes the last 1% impossible, forcing you to rethink/rewrite already "banked" functionality.

- Zero dependencies and trivial to integrate into the build and test process.

edit: paragraphs
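
The "binding strength" trick from the expression-parsing bullet is usually called precedence climbing. A minimal sketch follows, in Python rather than the Java of the comment; the operator table, the class layout and every name in it (`Parser`, `parse_expr`, `BINARY`, `parse`) are illustrative assumptions, not the commenter's actual code.

```python
import re

TOKEN = re.compile(r"\s*(\d+|[-+*/^()])")

# operator -> (binding strength, is right-associative)
BINARY = {"+": (10, False), "-": (10, False),
          "*": (20, False), "/": (20, False),
          "^": (30, True)}
UNARY_MINUS_STRENGTH = 25  # tighter than "*", looser than "^"


def tokenize(text):
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character near offset {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


class Parser:
    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def parse_expr(self, min_strength=0):
        """Parse operators that bind at least as tightly as min_strength."""
        left = self.parse_atom()
        while True:
            op = self.peek()
            if op not in BINARY or BINARY[op][0] < min_strength:
                return left
            strength, right_assoc = BINARY[op]
            self.advance()
            # Left-associative operators require a strictly tighter right side.
            right = self.parse_expr(strength if right_assoc else strength + 1)
            left = (op, left, right)

    def parse_atom(self):
        tok = self.advance()
        if tok == "(":
            inner = self.parse_expr()
            if self.advance() != ")":
                raise SyntaxError("expected ')'")
            return inner
        if tok == "-":
            return ("neg", self.parse_expr(UNARY_MINUS_STRENGTH))
        if tok is not None and tok.isdigit():
            return int(tok)
        raise SyntaxError(f"unexpected token {tok!r}")


def parse(text):
    parser = Parser(text)
    tree = parser.parse_expr()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return tree


print(parse("1 + 2 * 3"))
print(parse("2 ^ 3 ^ 2"))
```

Adding an operator means adding one row to the table, which is part of why this style stays compact as the language grows.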

CalChris over 7 years ago

> In many parser generators (e.g. Yacc and ANTLR), a grammar author can specify the language semantics by including semantic actions inside the grammar. A semantic action is a snippet of code (typically written in a different language) that produces a desired value or effect each time a particular rule is matched.

Actually, the need for that went away with ANTLR4. The grammar is now all grammar (and lexer), and the semantic actions are listeners or walkers written separately, calling or overriding methods and classes generated from the grammar.

Much cleaner that way.
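
As a rough picture of what "semantic actions written separately" means, here is a small sketch of the listener idea in plain Python. It does not use ANTLR's generated classes or API; `Node`, `walk`, `Evaluator` and `Printer` are hypothetical names, and the parse tree is built by hand instead of by a parser.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    rule: str                      # name of the grammar rule that matched
    children: list = field(default_factory=list)
    text: str = ""                 # matched text, used for leaf nodes


def walk(node, listener):
    """Depth-first walk calling enter_<rule>/exit_<rule> hooks if defined."""
    enter = getattr(listener, f"enter_{node.rule}", None)
    if enter:
        enter(node)
    for child in node.children:
        walk(child, listener)
    leave = getattr(listener, f"exit_{node.rule}", None)
    if leave:
        leave(node)


class Evaluator:
    """One set of semantics: compute the value with a stack."""
    def __init__(self):
        self.stack = []

    def exit_number(self, node):
        self.stack.append(int(node.text))

    def exit_add(self, node):
        right = self.stack.pop()
        left = self.stack.pop()
        self.stack.append(left + right)


class Printer:
    """A second, independent set of semantics over the same tree."""
    def __init__(self):
        self.parts = []

    def exit_number(self, node):
        self.parts.append(node.text)

    def exit_add(self, node):
        right = self.parts.pop()
        left = self.parts.pop()
        self.parts.append(f"({left} + {right})")


# Hand-built tree for "1 + (2 + 3)".
tree = Node("add", [Node("number", text="1"),
                    Node("add", [Node("number", text="2"),
                                 Node("number", text="3")])])

evaluator, printer = Evaluator(), Printer()
walk(tree, evaluator)
walk(tree, printer)
print(printer.parts[0], "=", evaluator.stack[0])
```

The grammar (here, the shape of the tree) never mentions either listener, so new behaviours can be added without touching it.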

feelin_googley over 7 years ago

IMHO, nothing makes parsing as easy as snobol/spitbol. It is almost as old as lisp, and older than C.

The question I have as a mere mortal user, who is not very interested in theory and debates thereon, is: what has the fastest performance?

If the proponents of post-snobol PEG/packrat were to publish a "parsing challenge" and let us replicate/create benchmarks of different parsers, including some written in snobol, I would find that very useful in determining whether these other parsers are worth a more serious look.

kasbah over 7 years ago

I have been using Nearley.js [1] and have had a lot of fun using it. I actually quite liked being able to mix in the JS post-processing with the grammar definition in Nearley but could be convinced of the advantages of keeping them separate (checking out your paper on DSLs now).

How would you compare it to Nearley? Can Ohm handle ambiguous grammars?

[1]: <a href="http://nearley.js.org" rel="nofollow">http://nearley.js.org</a>

jsierles over 7 years ago

Pretty cool for sharing as it can run in the browser.

So click on the 'Remix' button and you can play around with and run the article's contents.

Is there a way to play with this using Node.js as well?

disconnected over 7 years ago

"Further reading" links at the bottom just link back to the same page.

tomp over 7 years ago

> The Ohm language is based on parsing expression grammars (PEGs), which are a formal way of describing syntax, similar to regular expressions and context-free grammars

Uh-oh. I've voiced my concerns about PEGs (and LL parsers) before, but IMO any grammar "interpreter" that doesn't point out the ambiguities in the grammar and instead relies on some vague, and ultimately arbitrary, notion of "precedence" (e.g. that rules declared first in the grammar file have priority), isn't a good foundation for a serious language (good for throwaway parsers and language experiments, though).
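
The ordering concern can be shown with a toy PEG-style combinator set. This is a minimal sketch with made-up helper names (`literal`, `choice`, `sequence`, `matches`), not Ohm's API; it only shows how the order of alternatives silently changes which inputs are accepted.

```python
def literal(s):
    """Match the exact string s; return the end position or None."""
    def parse(text, pos):
        return pos + len(s) if text.startswith(s, pos) else None
    return parse


def choice(*alternatives):
    """PEG ordered choice: commit to the first alternative that succeeds."""
    def parse(text, pos):
        for alt in alternatives:
            end = alt(text, pos)
            if end is not None:
                return end  # later alternatives are never tried
        return None
    return parse


def sequence(*parts):
    """Match each part in turn; fail if any part fails."""
    def parse(text, pos):
        for part in parts:
            pos = part(text, pos)
            if pos is None:
                return None
        return pos
    return parse


def matches(parser, text):
    return parser(text, 0) == len(text)


# decl = ("in" / "int") ";"   versus   decl = ("int" / "in") ";"
short_first = sequence(choice(literal("in"), literal("int")), literal(";"))
long_first = sequence(choice(literal("int"), literal("in")), literal(";"))

print(matches(short_first, "int;"), matches(long_first, "int;"))
```

With the shorter keyword listed first, `"int;"` is rejected, because the choice commits to `"in"` and the sequence then fails on `"t"`. Swapping the alternatives accepts it, and nothing in the first version warns that its second alternative can never contribute to a match.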

wybiral over 7 years ago

Those popup chats on articles like this gross me out... I'm just trying to read something, stop phishing for my email address.

ohm over 7 years ago

Nice