Fantastic job, Johnathon. To HN readers, I just want to point out that this awesome library was created by a high school student (who is extremely motivated and builds a bunch of stuff)!<p>This is Hacker News at its best: highlighting creation.<p>Interesting to note that this was submitted a few days ago and received only 3 points. I mention this because this is the sort of thing that should have made the front page the first time it was posted!<p>I'm thinking of writing a weekly post that highlights the things people created and posted to Hacker News, only to have them fall on deaf ears (thankfully this is not the case with this post!).<p>In fact I'm going to go and write it now (and create!).<p>Edit 1: If anyone is on Medium, here's my draft.<p><a href="https://medium.com/p/e394f6d917d3?kme=collabEmail.clicked&km_postIds=e394f6d917d3" rel="nofollow">https://medium.com/p/e394f6d917d3?kme=collabEmail.clicked&km...</a><p>Edit 2: And while I have the chance to keep something from falling on deaf ears, here's something I just wrote that would be interesting to anyone who's annoyed with recruiters and would rather work at SpaceX than Snapchat.<p><a href="https://medium.com/p/de5c73174a4e?kme=collabEmail.clicked&km_postIds=de5c73174a4e" rel="nofollow">https://medium.com/p/de5c73174a4e?kme=collabEmail.clicked&km...</a>
Please note that this API does not make any specific attempt to follow the MediaWiki API etiquette (<a href="http://www.mediawiki.org/wiki/API:Etiquette" rel="nofollow">http://www.mediawiki.org/wiki/API:Etiquette</a>). This sort of API is easy and clean for something like a command-line script, but if you're going to do further automation or crawling, I strongly recommend using the pywikipediabot library (<a href="http://www.mediawiki.org/wiki/Manual:Pywikipediabot" rel="nofollow">http://www.mediawiki.org/wiki/Manual:Pywikipediabot</a>), which includes a very full API, has tunable throttling, and makes a more direct attempt to require a user agent string that is in line with the API etiquette.<p>If you just want a bash script to look things up on Wikipedia, you can always use something like<p>function wp {
curl "http://en.wikipedia.org/wiki/$(echo "$@" | tr ' ' '_')" | gunzip | html2text
}<p>which will work for basic queries (it still lacks URL encoding and needs the words to be properly capitalized).<p>A full API reference is here (<a href="http://en.wikipedia.org/w/api.php" rel="nofollow">http://en.wikipedia.org/w/api.php</a>).
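<p>For what it's worth, here is a rough Python sketch of the same lookup that fills in the URL encoding and sends a descriptive user agent, which is roughly what the etiquette page asks for. Everything named here (page_url, make_request, fetch, the USER_AGENT string, the one-second delay, the https scheme) is my own assumption for illustration, not part of the library under discussion or of pywikipediabot. It only uses the standard library and still leaves capitalization up to you.

```python
import sys
import time
import urllib.parse
import urllib.request

# Hypothetical user agent; replace the contact details with your own.
USER_AGENT = "wp-lookup-sketch/0.1 (contact: you@example.com)"


def page_url(words, lang="en"):
    """Join the words with underscores and percent-encode the title."""
    title = "_".join(" ".join(words).split())
    encoded = urllib.parse.quote(title, safe="")
    return "https://%s.wikipedia.org/wiki/%s" % (lang, encoded)


def make_request(url):
    """Build a request that identifies the client via User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch(url, delay=1.0):
    """Fetch one page, sleeping first as a crude throttle (assumed value)."""
    time.sleep(delay)
    with urllib.request.urlopen(make_request(url)) as response:
        return response.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Example: python wp.py Monty Python
    print(fetch(page_url(sys.argv[1:])))
```

The output is raw HTML, so you would still pipe it through something like html2text, as in the bash version. For anything beyond one-off lookups the advice above stands: use a library with real throttling.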
Nice work on this. It's always good to see people giving more visibility to MediaWiki's capabilities.<p>To engineers who would like to use this library, I would add the caveat that there are far more API actions than the ones enumerated here. So if you're looking to make some contributions to a project, this one is ripe for pull requests.<p>In terms of article access and analysis, I'd recommend looking at Pattern (<a href="https://github.com/clips/pattern" rel="nofollow">https://github.com/clips/pattern</a>) before starting with this library. Not only do you get access to the rest of Pattern's IR/text analysis capabilities, but the approach in Pattern is written to support any site built on MediaWiki, and not just English Wikipedia. This means not only foreign-language Wikipedia instances, but all 350k wikis on Wikia, and Project Gutenberg (which, interestingly enough, runs MW 1.13).
Nice. I can see lots of room for improvement, but it's a great start. Using Requests in particular will streamline things.<p>I'll work in some changes tonight. Let's start with PEP8, shall we? :)
Any advice on stripping wiki markup to obtain plain text from the Wikipedia dump? A friend is doing linguistic research and could benefit from large bodies of text in different languages. Ideally this would be a C# library, but a simple command-line tool in any other language would do as well: accept .xml.bz2, strip the wiki markup, and return something that's easily processable by further tools in a single file. Thanks in advance.
I've used (and patched) this alternative: <a href="https://github.com/richardasaurus/wiki-api" rel="nofollow">https://github.com/richardasaurus/wiki-api</a>