There are lots of attempts to write new Wikipedia parsers that just do "the useful stuff", like getting the text. They all fail, for the simple reason that some of the text comes from MediaWiki templates.<p>E.g.<p><pre><code> about {{convert|55|km|0|abbr=on}} east of
</code></pre>
will turn into<p><pre><code> about 55 km (34 mi) east of
</code></pre>
and<p><pre><code> {{As of|2010|7|5}}
</code></pre>
will turn into<p><pre><code> As of 5 July 2010
</code></pre>
and so on (there are thousands of relevant templates). It's simply not possible to get the full plain text without processing the templates, and the only system that can correctly and completely parse the templates is MediaWiki itself.<p>Yes, it's a huge system entirely written in PHP, but you can make a simple command-line parser with it pretty easily (though it took me quite a while to figure out how). The key points are to put something like<p><pre><code> $IP = strval(getenv('MW_INSTALL_PATH')) !== ''
     ? getenv('MW_INSTALL_PATH')
     : '/usr/share/mediawiki';
require_once("$IP/maintenance/commandLine.inc");
</code></pre>
at the start of it, and then use the Parser class. You get HTML out, but it's simple and well-formed (to get text, start with the top-level p tags).<p>To get it to process templates, get a Wikipedia dump, extract the templates, and use the mwdumper tool to import them into your local MediaWiki database.<p>I don't know if this is the best or "right" way to do it, but it's the only way I've found that actually works.
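<p>In case it helps, here's roughly what the whole script ends up looking like. Treat it as a sketch rather than a recipe: the dummy page title is made up, and how you construct the Parser and ParserOptions objects varies between MediaWiki versions (newer releases get a parser from a factory rather than constructing one directly).<p><pre><code> <?php
 // Sketch of a minimal command-line wikitext-to-HTML expander.
 // Assumes maintenance/commandLine.inc exists and the local database
 // already contains the templates imported with mwdumper.
 $IP = strval(getenv('MW_INSTALL_PATH')) !== ''
     ? getenv('MW_INSTALL_PATH')
     : '/usr/share/mediawiki';
 require_once("$IP/maintenance/commandLine.inc");

 // Raw wikitext comes in on stdin.
 $wikitext = file_get_contents('php://stdin');

 // Templates such as {{PAGENAME}} need some title to resolve against;
 // the name used here is arbitrary.
 $title = Title::newFromText('CommandLineParse');
 $options = new ParserOptions();

 // Expand the templates (pulled from the local database) and render HTML.
 $parser = new Parser();
 $output = $parser->parse($wikitext, $title, $options);

 // Well-formed HTML; the article text is in the top-level p elements.
 echo $output->getText();
</code></pre>
Run it with the php command line, feeding an article's wikitext on stdin; HTML comes out on stdout. If templates come back as red-link markup instead of expanded text, they probably never made it into the database.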