
Tries and Lexers

67 points by aberatiu, almost 10 years ago

8 comments

haberman, almost 10 years ago
I would recommend that the author read up on NFAs and DFAs -- they are a formalism better suited to lexers than tries.

At a high level, if compiling a lexer for a run-of-the-mill language like JavaScript takes 5 minutes and 2.5GB of RAM, you are most likely doing it wrong. By "doing it wrong," I mean that there is almost certainly a better approach that is far superior in every measurable aspect (CPU, memory, code size, etc).

I don't fully understand what kind of algorithm the author was using, so I can't comment in detail on it, but in general lexers are better thought of as finite automata (NFAs and DFAs) than tries. The two are related, but unlike tries, NFAs and DFAs can have cycles, which are essential for representing a lexing state machine efficiently.

Another observation: it's not too terribly surprising that you could beat preg_match_all() with the latter being given a huge regular expression. Most regex engines are backtracking (RE2 being a notable counterexample), which means that a regular expression with high-cardinality alternation (i.e. A|B|C|D|E|F|G ...etc) is one of their worst cases. This isn't what they are designed for. A backtracking regex engine will basically try each alternative in sequence until that alternative no longer matches, then back up and try the next one. NFA/DFA-based parsers will be much faster for this case.

The right tool for this job, as another commenter mentioned, is Ragel. It's basically designed for exactly this. It doesn't appear to support PHP though...
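To make the trie-vs-automaton point concrete, here is a minimal Python sketch (not from the article or the comment; the state names and the toy token are invented for illustration) of a two-state DFA whose self-loop recognizes arbitrarily long identifiers - something an acyclic trie cannot express in finite space:

```python
import string

START, IN_IDENT = 0, 1
FIRST = set(string.ascii_lowercase + "_")      # characters allowed first
REST = FIRST | set(string.digits)              # characters allowed after that

def dfa_match_identifier(text, pos=0):
    """Return the end offset of the identifier starting at pos, or None."""
    state, i = START, pos
    while i < len(text):
        ch = text[i]
        if state == START and ch in FIRST:
            state = IN_IDENT          # edge out of the start state
        elif state == IN_IDENT and ch in REST:
            pass                      # the cycle: IN_IDENT -> IN_IDENT
        else:
            break
        i += 1
    return i if state == IN_IDENT else None

print(dfa_match_identifier("foo_bar42 = 1"))   # -> 9
```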
djoldman, almost 10 years ago
Want to build an ultra-fast lexer? Ragel is the way.

http://en.wikipedia.org/wiki/Ragel

http://www.colm.net/open-source/ragel/
bane, almost 10 years ago
Tries are amazing data structures, simple and extraordinarily fast -- O(m) on lookups. But they also eat memory at extraordinary rates. They're a classic speed vs. memory data structure.

However, most people use naive tries, just adding elements down a branch until they exhaust the string they're inserting.

One easy optimization to make with tries is to set a maximum branch length (based on some statistical analysis of lexeme usage; for example, make 90% of your lookups reachable under that length). Any lexeme longer than that length simply gets hung off the end of the branch in a more space-efficient data structure (like a hash table).

Your lookup is then still O(m) for anything under the maximum branch length, and things longer are still just O(m)+O(n) or whatever.

But your memory usage will shrink dramatically. And you can improve it further by fiddling with your branch length and choosing, say, 80% reachable without hashing.
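A rough Python sketch of the capped-branch-length idea described above; the cap value, the overflow structure, and all names are assumptions for illustration, not taken from the comment:

```python
MAX_DEPTH = 4   # assumed cap; in practice chosen from lexeme statistics

class CappedTrie:
    def __init__(self):
        self.root = {}

    def insert(self, word):
        node = self.root
        for i, ch in enumerate(word):
            if i == MAX_DEPTH:
                # Hang the remainder off the branch in a compact structure.
                node.setdefault("#overflow", set()).add(word[i:])
                return
            node = node.setdefault(ch, {})
        node["#end"] = True

    def contains(self, word):
        node = self.root
        for i, ch in enumerate(word):
            if i == MAX_DEPTH:
                return word[i:] in node.get("#overflow", set())
            if ch not in node:
                return False
            node = node[ch]
        return "#end" in node

t = CappedTrie()
for w in ("for", "foreach", "function", "final"):
    t.insert(w)
print(t.contains("function"), t.contains("func"))   # True False
```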
bluetech, almost 10 years ago
[For all the bad things I hear about PHP, the code is very readable without any previous experience - nice.]

Here are some things a lexer for a programming language might have to deal with:

1. Comments (some even do nested - which means regular expressions are out for that).

2. Continuation lines.

3. Includes (if done at the lexical level).

4. Filename/line/column numbers for nice error messages (can really hurt with branch mispredictions).

5. Evaluation of literals: decimal/hex/octal/binary integers, floats, strings (with escapes), etc.

6. Identifiers.

So matching keywords is mostly the straightforward part. However, I have found that matching many keywords is the perfect (and in my experience so far, the only) use case for a perfect hashing tool like gperf - it would normally be much faster than any pointer-chasing trie. gperf mostly eliminated keyword matching from the profile of any lexer I've done.
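As a loose illustration of the perfect-hashing idea behind gperf (the real tool picks character positions and weights automatically and emits C; everything below, including the hash formula and keyword set, is a hand-rolled Python assumption):

```python
KEYWORDS = ["if", "else", "while", "for", "return", "break"]

def kw_hash(s):
    # Length plus first and last character, folded into a small table.
    # This particular formula happens to be collision-free for this
    # keyword set; the assert below verifies that at build time.
    return (len(s) + ord(s[0]) + ord(s[-1])) % 17

TABLE = {}
for kw in KEYWORDS:
    h = kw_hash(kw)
    assert h not in TABLE, "hash is not perfect for this keyword set"
    TABLE[h] = kw

def is_keyword(word):
    # One cheap hash plus at most one string comparison.
    return TABLE.get(kw_hash(word)) == word

print(is_keyword("while"), is_keyword("whale"))   # True False
```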
lindig, almost 10 years ago
A lexer for a language with a lot of keywords leads to a large representation as an automaton, as the author has experienced. One way to deal with this is to recognise all identifiers, including keywords, with a single rule (something like "[a-z_][a-zA-Z0-9_]*") and to use a hash table of keywords to check whether a match is a keyword (and which one) or an identifier.

Edit: fixed the regexp to allow for single-char identifiers.
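A minimal Python sketch of the approach described above - one identifier rule plus a keyword hash table; the token names and keyword set are illustrative assumptions:

```python
import re

IDENT = re.compile(r"[a-z_][a-zA-Z0-9_]*")
KEYWORDS = {"if": "IF", "else": "ELSE", "while": "WHILE", "return": "RETURN"}

def lex_word(text, pos):
    """Lex one identifier-or-keyword starting at pos; return (kind, text, end)."""
    m = IDENT.match(text, pos)
    if not m:
        return None
    word = m.group(0)
    # A single hash lookup decides keyword vs. identifier.
    return (KEYWORDS.get(word, "IDENT"), word, m.end())

print(lex_word("while x", 0))    # ('WHILE', 'while', 5)
print(lex_word("whilex y", 0))   # ('IDENT', 'whilex', 6)
```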
jakobegger, almost 10 years ago
So I get that this optimized giant trie might be faster than a regex. But what about a normal lexer, either handwritten or generated? Shouldn't that be faster still than the trie? I mean, that giant amount of memory alone must cause lots of performance issues...
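For reference, a "normal handwritten lexer" in the sense asked about here might look roughly like the following Python sketch (the token categories are invented for illustration):

```python
def tokenize(src):
    tokens, i, n = [], 0, len(src)
    while i < n:
        ch = src[i]
        if ch.isspace():
            i += 1                                  # skip whitespace
        elif ch.isdigit():
            j = i
            while j < n and src[j].isdigit():
                j += 1
            tokens.append(("NUMBER", src[i:j])); i = j
        elif ch.isalpha() or ch == "_":
            j = i
            while j < n and (src[j].isalnum() or src[j] == "_"):
                j += 1
            tokens.append(("WORD", src[i:j])); i = j
        else:
            tokens.append(("PUNCT", ch)); i += 1    # single-char punctuation
    return tokens

print(tokenize("x = foo(42);"))
# [('WORD', 'x'), ('PUNCT', '='), ('WORD', 'foo'), ('PUNCT', '('), ...]
```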
lolptdr, almost 10 years ago
"Parsers work at the grammatical level, lexers work at the word level."

Is this correct to say?
TheLoneWolfling, almost 10 years ago
At least optimize the DFA before you run it...

The equivalent of a DAWG versus a Trie.
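A rough Python sketch of the DAWG idea mentioned here - merging structurally identical trie subtrees bottom-up so shared suffixes are stored once; this is an illustrative assumption of one way to do it, not full DFA minimization (e.g. Hopcroft's algorithm):

```python
def build_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = {}              # end-of-word marker
    return root

def merge_suffixes(node, registry=None):
    """Canonicalize bottom-up: structurally equal subtrees collapse to one object."""
    if registry is None:
        registry = {}
    for ch, child in node.items():
        node[ch] = merge_suffixes(child, registry)
    key = tuple(sorted((ch, id(child)) for ch, child in node.items()))
    return registry.setdefault(key, node)

def count_nodes(node, seen=None):
    seen = set() if seen is None else seen
    if id(node) in seen:
        return 0
    seen.add(id(node))
    return 1 + sum(count_nodes(c, seen) for c in node.values())

words = ["walking", "talking", "walked", "talked"]
trie = build_trie(words)
print(count_nodes(trie))                   # node count of the plain trie
print(count_nodes(merge_suffixes(trie)))   # fewer nodes once suffixes are shared
```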