
Parsoid in PHP, or There and Back Again

87 points by sthottingal over 5 years ago

10 comments

trynewideas over 5 years ago
Reintegrating the parser into MediaWiki's PHP core goes beyond performance. Many prominent MW features -- particularly the visual editor, translation functions, and mobile endpoints -- depend heavily on Parsoid/JS, which required running it as a Node.js microservice, something not all smaller (or especially shared-hosting) wikis could manage for quite some time.

Bringing Parsoid closer to core makes it easier for non-Wikimedia admins to use more of MW's modern core features. The performance gains are a nice bonus, and I suspect are related to the general improvements in PHP 7.2.
mhd over 5 years ago
> We will (re)examine some of the reasons behind this architectural choice in a future blog post, but the loose coupling of Parsoid and the MediaWiki core allowed rapid iteration in what was originally a highly experimental effort to support visual editing.

I'd really like to read that. The decision to have this parser as a completely separate component is the main reason why a lot of local MediaWiki installations completely avoided having a visual editor -- which in turn probably created lots of missing hours and/or missing documentation, because WikiCreole isn't exactly a thing of beauty or something that's used in other places (as opposed to Markdown, which is an ugly beast too, but at least it's the ugly beast you know).

You need a heavy JS frontend for a visual editor anyway, so why not do it client-side?

Having to deploy a separate component, probably in an environment that's not otherwise used at all, is pretty much the worst choice possible. Yes, I'm aware, you readers here probably do all kinds of hip Docker/k8s setups where yet another Node microservice is nothing special (and should've been rewritten in Rust, of course), but a wiki is something on a different level of ubiquity.
cscottnet over 5 years ago
I'm on the team. Part 2 of this post series should have lots of interesting technical details for y'all; be patient, I'm still writing it.

But to whet your appetite: we used https://github.com/cscott/js2php to generate a "crappy first draft" of the PHP code from our JS source. It wasn't going for correctness; instead it tried to match code style and make the syntax changes, so that we could more easily review git diffs from the crappy first draft to the "working" version and concentrate attention on the important bits, not the boring syntax-change-y parts.

The original legacy MediaWiki parser used a big pile of regexps and had all sorts of corner cases caused by the particular order in which the regexps were applied, etc.

Parsoid uses a PEG tokenizer, written with pegjs (we wrote a PHP backend for pegjs for this project). There are still a bunch of regexps scattered throughout the code, because they are still very useful for text processing and a valuable feature of both JavaScript and PHP as programming languages, but they are not the primary parsing mechanism. Translating the regexps was actually one of the more difficult parts, because there are some subtle differences between JS and PHP regexps.

We made a deliberate choice to switch from JS-style loose typing to strict typing in the PHP port. Whatever you consider the long-term merits to be for maintainability, programming-in-the-large, etc., strict types were *extremely useful* for the porting project itself, since they caught a bunch of non-obvious problems where the types of things were slightly different in PHP and JS.

JS used anonymous objects all over the place; we used PHP associative arrays for many of these places, but found it very worthwhile to take the time to create proper typed classes during the translation where possible; it really helped clarify the interfaces and, again, catch a lot of subtle impedance mismatches during the port.

We tried to narrow scope by not converting *every* loose interface or anonymous object to a type -- we actually converted as many things as possible to proper JS classes in the "pregame" before the port, but the important thing was to get the port done and complete as quickly as possible. We'll be continuing to tighten the type system -- as much for code documentation as anything else -- as we address code debt moving forward.

AMA, although I don't check Hacker News frequently so I can't promise to reply.
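To illustrate the kind of "subtle differences between JS and PHP regexps" mentioned above (this is an invented example, not one from the Parsoid port): a JavaScript regexp created with the `/g` flag is stateful, carrying a `lastIndex` between calls, while PHP's `preg_match()` is stateless, so a mechanical translation of code that depends on this behavior can silently diverge.

```javascript
// A /g regexp object in JS is stateful: test() resumes the search at
// lastIndex, so the "same" call can return different results.
const re = /foo/g;

const first = re.test("foo");  // matches at index 0; lastIndex becomes 3
const second = re.test("foo"); // resumes at index 3 -> no match -> false

// A stateless check, closer to PHP preg_match() semantics: rebuild
// the regexp (dropping /g) so no position is carried between calls.
const stateless = (pattern, s) => new RegExp(pattern.source).test(s);
const third = stateless(re, "foo");  // true
const fourth = stateless(re, "foo"); // still true
```

The same trap applies to `exec()` and is one reason regexp-heavy code can't be ported line-by-line without checking each call site.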
lolphp111 over 5 years ago
Why? The editor needs a frontend in JavaScript anyway, so why not handle this all in real time on the client?

Now they rewrote it in PHP, which is probably one of the worst languages out there -- and why not rewrite in something compiled, if speed was the main reason for a rewrite?

For me PHP sits in the middle as a poor language, and it's still slow compared to any compiled language. Also I would want to see some WASM vs PHP benchmarks they did before starting with PHP.

Lots of poor decisions from the wiki team.
tjpnz over 5 years ago
> Parsoid/JS had very few unit tests focused on specific subsections of code. With only integration tests we would find it difficult to test anything but a complete and finished port.

I found this a little frightening, given Parsoid/JS is handling user input.
znpy over 5 years ago
Last time I tried getting MediaWiki up and running as a personal wiki, I found that getting Parsoid working was quite a mess. Hopefully now it will be easier to get a fully fledged MediaWiki installation together with a visual editor.
TicklishTiger over 5 years ago
I would assume that the code is open source?

For some reason, I did not manage to find it. Neither linked from this article, nor via the MediaWiki page:

https://www.mediawiki.org/wiki/Parsoid

Nor via the Phabricator page:

https://phabricator.wikimedia.org/project/profile/487/

What am I missing?
harryf over 5 years ago
Years ago I rewrote the wiki parser for DokuWiki (which is used at https://wiki.php.net/ among other places). Originally the parser scanned a wiki page multiple times using various regular expressions. I used a stack machine as a way to manage the regular expressions, which made it possible to parse a page in a single pass -- it's documented here: https://www.dokuwiki.org/devel:parser

A nice (unexpected) side effect is that it became much easier for people to extend the parser with their own syntax, leading to an explosion of plugins ( https://www.dokuwiki.org/plugins?plugintype=1#extension__table ).

I'm no expert on parsing theory, but I have the impression that applying standard approaches to parsing source code -- building syntax trees, attempting to express it with a context-free grammar, etc. -- is the wrong approach for parsing wiki markup, because it's context-sensitive. There's some discussion of the problem here: https://www.mediawiki.org/wiki/Markup_spec#Feasibility_study

Another challenge for wiki markup, from a usability perspective: if a user gets part of the syntax of a page "wrong", you need to show them the end result so they can fix the problem, rather than have the entire page "fail" with a syntax error.

From looking at many wiki parsers before rewriting the DokuWiki parser, what _tends_ to happen when people try to apply context-free grammars or build syntax trees is that they reach 80% and then stumble at the remaining 20% of edge cases of how wiki markup is actually used in the wild.

Instead of building an object graph, the DokuWiki parser produces a simple flat array representing the source page ( https://www.dokuwiki.org/devel:parser#token_conversion ), which I'd argue makes it simpler to write code for rendering output (hence lots of plugins), as well as more robust at handling "bad" wiki markup it might encounter in the wild -- less chance of some kind of infinite recursion or similar.

Ultimately it's a similar discussion to the SAX vs. DOM debates people used to have around XML parsing ( https://stackoverflow.com/questions/6828703/what-is-the-difference-between-sax-and-dom ). From a glance at the Parsoid source, they seem to be taking a DOM-like approach -- I wish them luck with that; my experience was this will probably lead to a great deal more complexity, especially when it comes to edge cases.
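The flat-token idea described above can be sketched in a few lines. This is a hypothetical toy (invented names and a made-up two-rule syntax, not DokuWiki's actual token format or API): a single left-to-right pass emits a flat array of `[type, payload]` entries instead of a tree, and anything that doesn't match a rule falls through as plain text rather than failing the page.

```javascript
// Single-pass tokenizer: emits a flat token list, never a tree.
// Unrecognized or "bad" markup simply ends up in 'text' tokens.
function tokenize(src) {
  const tokens = [];
  const re = /\*\*([^*]+)\*\*|\[\[([^\]]+)\]\]/g; // bold, internal link
  let last = 0, m;
  while ((m = re.exec(src)) !== null) {
    if (m.index > last) tokens.push(['text', src.slice(last, m.index)]);
    if (m[1] !== undefined) tokens.push(['bold', m[1]]);
    else tokens.push(['link', m[2]]);
    last = re.lastIndex;
  }
  if (last < src.length) tokens.push(['text', src.slice(last)]);
  return tokens;
}

// Rendering is a flat loop over the token list; adding a new token
// type only means adding a new case, which is plugin-friendly.
function render(tokens) {
  return tokens.map(([type, v]) =>
    type === 'bold' ? `<b>${v}</b>` :
    type === 'link' ? `<a href="/${v}">${v}</a>` : v
  ).join('');
}

const toks = tokenize('See [[Parser]] for **details**.');
const html = render(toks);
```

The robustness argument falls out of the structure: since there is no recursion and no nesting to balance, malformed input can degrade to text tokens but cannot blow up the parse.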
TicklishTiger over 5 years ago
Anybody else feeling that strict typing and long var names are not worth all the visual overload?

Example: https://github.com/wikimedia/parsoid/blob/master/src/Parsoid.php#L235

This is how I would write the function definition:

```php
function html2wikitext($config, $html, $options = [], $data = null)
```

This is how Wikimedia did it:

```php
public function html2wikitext(
    PageConfig $pageConfig, string $html, array $options = [],
    ?SelserData $selserData = null
): string
```

I see this "strictness over readability" on the rise in many places and I think it is a net negative.

Not totally sure, but this seems to be the old JS function definition:

https://github.com/abbradar/parsoid/blob/master/lib/parse.js#L70

A bit cryptic, and it suffers from the typical promise/async/callback nesting horror so common in the JavaScript world:

```javascript
_html2wt = Promise.async(function *(obj, env, html, pb)
```
echelon over 5 years ago
I'm actually curious why PHP was chosen instead of Rust or Go, given that the parsing team wasn't familiar with the language. I understand that MediaWiki is written in PHP, but it sounds like they were already comfortable with language heterogeneity.

They claim:

> The two wikitext engines were different in terms of implementation language, fundamental architecture, and modeling of wikitext semantics (how they represented the "meaning" of wikitext). These differences impacted development of new features as well as the conversation around the evolution of wikitext and templating in our projects. While the differences in implementation language and architecture were the most obvious and talked-about issues, this last concern -- platform evolution -- is no less important, and has motivated the careful and deliberate way we have approached integration of the two engines.

Which is, I suppose, a compelling reason for a rewrite if you're understaffed.

I'd still be interested in writing it in Rust and then writing PHP bindings. There's even the possibility of running a WASM engine in the browser and skipping the round trip for evaluation.