Because I was randomly curious: you can extract this data from the structured HTML with some DOM selectors in a similarly haphazard way.<p>Start with: <a href="https://en.wikipedia.org/api/rest_v1/page/html/List_of_Olympic_medalists_in_judo" rel="nofollow">https://en.wikipedia.org/api/rest_v1/page/html/List_of_Olymp...</a><p>Run this JS one-liner:<p><pre><code>  [].slice.call(document.querySelectorAll('table[typeof="mw:Transclusion mw:ExpandedAttrs"] tr td:nth-child(n+2) > a:nth-child(1), table[typeof="mw:Transclusion mw:ExpandedAttrs"] tr:nth-child(3) td > a:nth-child(1)'))
    .map(function(e) { return e.innerText; })
    .reduce(function(res, el) { res[el] = res[el] ? res[el] + 1 : 1; return res; }, {});
</code></pre><p>The result is an object with the medalists as keys and the counts as values. JS objects are unordered, so sorting is left as an exercise for the reader.
This is you-can't-parse-HTML-with-regex [1] level hideous, only worse, because MediaWiki markup is essentially a Turing-complete programming language thanks to template inclusion, parser functions [2], etc.<p>The only remotely sane way to do this is to use the MediaWiki API [3] to get the pages you want, then use an actual parser like mwlib [4] to extract the content you need. Wikidata and DBpedia are also promising efforts, but both have a long way to go in terms of coverage.<p>[1] <a href="http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags" rel="nofollow">http://stackoverflow.com/questions/1732348/regex-match-open-...</a><p>[2] <a href="https://www.mediawiki.org/wiki/Help:Extension:ParserFunctions" rel="nofollow">https://www.mediawiki.org/wiki/Help:Extension:ParserFunction...</a><p>[3] <a href="https://www.mediawiki.org/wiki/API:Main_page" rel="nofollow">https://www.mediawiki.org/wiki/API:Main_page</a><p>[4] <a href="https://www.mediawiki.org/wiki/Alternative_parsers" rel="nofollow">https://www.mediawiki.org/wiki/Alternative_parsers</a>
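For illustration, a minimal Python sketch of the "get the pages" step (the page title, the requests library, and the output handling are my assumptions, not from the comment above); the wikitext it prints is what you would then hand to a real parser such as mwlib:<p><pre><code>  import requests

  # Ask the MediaWiki API for the latest revision's wikitext, as JSON.
  resp = requests.get(
      "https://en.wikipedia.org/w/api.php",
      params={
          "action": "query",
          "prop": "revisions",
          "rvprop": "content",
          "titles": "List of Olympic medalists in judo",  # illustrative page
          "format": "json",
      },
  )
  page = next(iter(resp.json()["query"]["pages"].values()))
  wikitext = page["revisions"][0]["*"]  # raw wikitext, not HTML
  print(wikitext[:500])
</code></pre>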
...or we can just use SPARQL and DBpedia (<a href="http://wiki.dbpedia.org/" rel="nofollow">http://wiki.dbpedia.org/</a>)! There are questions where you'd have to scrape more than one page to get an answer, and things could get really complicated with shell commands.<p>DBpedia is a triple store that lets us run simple queries against Wikipedia data, like listing the music bands from a particular city:<p><pre><code>  SELECT ?name ?place
  WHERE {
    ?place rdfs:label "Denver"@en .
    ?band dbo:hometown ?place .
    ?band rdf:type dbo:Band .
    ?band rdfs:label ?name .
    FILTER langMatches(lang(?name), 'en')
  }
</code></pre>
...or queries that involve multiple subjects, categories, etc.
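A minimal Python sketch of running that query against DBpedia's public SPARQL endpoint (the endpoint URL and JSON result shape are standard; the rest is illustrative):<p><pre><code>  import requests

  # DBpedia's endpoint returns SPARQL results as JSON when asked to.
  query = """
  SELECT ?name WHERE {
    ?place rdfs:label "Denver"@en .
    ?band dbo:hometown ?place .
    ?band rdf:type dbo:Band .
    ?band rdfs:label ?name .
    FILTER langMatches(lang(?name), 'en')
  }
  """
  resp = requests.get(
      "https://dbpedia.org/sparql",
      params={"query": query, "format": "application/sparql-results+json"},
  )
  for row in resp.json()["results"]["bindings"]:
      print(row["name"]["value"])
</code></pre>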
> You will not need to open an editor and write a long script and then to have an interpreter like Node.js to run it, sometime the bash command line is just enough you need!<p>This is a bad attitude to have for <i>data processing</i> work, where QA is necessary and the accuracy of the output is important. A 50 LOC scraper with comments and explicitly defined function inputs and outputs is far preferable to an 8 LOC scraper that those without bash knowledge will be unable to parse.<p>And the 8 LOC bash script is not much of a time savings, as this post demonstrates; you still have to check each command's output manually to find which data to parse and to handle edge cases.
This is the kind of query that excites me about Wikidata's development.<p><a href="http://wikidata.org" rel="nofollow">http://wikidata.org</a>
There is also the option of using Parsoid.<p><a href="https://github.com/wikimedia/parsoid" rel="nofollow">https://github.com/wikimedia/parsoid</a><p>That is MediaWiki's official off-wiki parser; it can turn wikitext into HTML, or HTML back into wikitext. It would be reasonably simple to hook into its API and use it for data extraction instead.
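Wikipedia exposes Parsoid through its public REST API, so a sketch of the wikitext-to-HTML direction could look like this (assuming the endpoint accepts a form-encoded body; the sample wikitext is illustrative):<p><pre><code>  import requests

  # Parsoid (behind Wikipedia's REST API) turns wikitext into well-formed HTML.
  resp = requests.post(
      "https://en.wikipedia.org/api/rest_v1/transform/wikitext/to/html",
      data={"wikitext": "'''Hello''' [[world]]"},
  )
  print(resp.text)  # clean HTML, safe to feed to any HTML parser
</code></pre>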
My 2 cents: the cut/grep lines could be replaced by a sed/awk one-liner such as:<p><pre><code>  sed -n 's/.*flagIOCmedalist|\[\[\([^]|]*\).*/\1/p'
</code></pre>
There is an active project sponsored by the Wikimedia Foundation called Pywikibot that I've been a contributor to and user of for over a decade now. If you want to do anything and everything with Wikipedia, look no further than: <a href="https://github.com/wikimedia/pywikibot-core" rel="nofollow">https://github.com/wikimedia/pywikibot-core</a>
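A minimal sketch of fetching a page with it (this assumes a configured Pywikibot install, and the page title is illustrative):<p><pre><code>  import pywikibot

  # Connect to English Wikipedia and pull a page's raw wikitext.
  site = pywikibot.Site("en", "wikipedia")
  page = pywikibot.Page(site, "List of Olympic medalists in judo")
  print(page.text[:500])
</code></pre>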
You should try Wikidata for any type of query that can't be answered using Google and where all the information is already on Wikipedia.
It's way faster (if you know SPARQL) and way more powerful and flexible. It's surprising more people aren't talking about it; triple stores are awesome.
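As a taste, a minimal Python sketch against the Wikidata Query Service (the endpoint and JSON shape are standard; the query, counting items that are instances of human, is purely illustrative):<p><pre><code>  import requests

  # Count Wikidata items that are instances of (P31) human (Q5).
  query = "SELECT (COUNT(?person) AS ?count) WHERE { ?person wdt:P31 wd:Q5 . }"
  resp = requests.get(
      "https://query.wikidata.org/sparql",
      params={"query": query, "format": "json"},
      headers={"User-Agent": "example-sparql-client/0.1"},  # WDQS asks for a UA
  )
  print(resp.json()["results"]["bindings"][0]["count"]["value"])
</code></pre>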
... there's an API making half of this superfluous. You can do pretty much any MediaWiki reading or writing through it. (All Wikipedia bots are required to use it, for instance.)<p><a href="https://en.wikipedia.org/w/api.php" rel="nofollow">https://en.wikipedia.org/w/api.php</a><p>The article text is a raw blob of wikitext you have to process, but you don't have to go to stupid lengths trying to parse HTML without a browser.
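The API can also hand back structured data directly; for example, a sketch that lists an article's sections as JSON (the page name is illustrative):<p><pre><code>  import requests

  # action=parse can return an article's section layout, no scraping needed.
  resp = requests.get(
      "https://en.wikipedia.org/w/api.php",
      params={"action": "parse", "page": "Judo",
              "prop": "sections", "format": "json"},
  )
  for s in resp.json()["parse"]["sections"]:
      print(s["index"], s["line"])
</code></pre>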
Extracting data from Wikipedia with type providers: <a href="http://evelinag.com/blog/2015/11-18-f-tackles-james-bond/" rel="nofollow">http://evelinag.com/blog/2015/11-18-f-tackles-james-bond/</a>
A large part of bioinformatics data processing involves these commands. They seem a little cryptic but get the job done. I would also like to mention Datamash: <a href="https://www.gnu.org/software/datamash/" rel="nofollow">https://www.gnu.org/software/datamash/</a>
errrrrrrrrk. Extracting raw wiki-markup and trying to use it? Not the greatest of ideas. The only true parser of that language is MediaWiki itself. Doing it yourself is a recipe for a massive headache.
A couple of years ago I found Perl was fastest at processing Wikipedia dumps.<p>It also didn't require having a JVM preloaded to make startup times acceptable during development (naming no other tools).<p>I do use shell tools to process data, a lot. They're particularly good for exploratory programming and initial analysis of new datasets.
There is also a wikipedia library for Python. An example of its use:<p>Using the wikipedia Python library (to search for oranges :)<p><a href="https://jugad2.blogspot.in/2015/11/using-wikipedia-python-library.html" rel="nofollow">https://jugad2.blogspot.in/2015/11/using-wikipedia-python-li...</a><p>And there may be libraries for other languages too, since the above library wraps the Wikipedia API:<p><a href="https://en.wikipedia.org/wiki/Wikipedia:API" rel="nofollow">https://en.wikipedia.org/wiki/Wikipedia:API</a>
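A minimal sketch with that library (the `wikipedia` package on PyPI; the search term follows the linked post, and the article title is my pick):<p><pre><code>  import wikipedia

  # Search for matching article titles, then pull a short summary.
  print(wikipedia.search("orange"))
  print(wikipedia.summary("Orange (fruit)", sentences=2))
</code></pre>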