Extracting data from Wikipedia using curl, grep, cut and other bash commands

218 points by loige almost 9 years ago

21 comments

lacksconfidence almost 9 years ago
Because I was randomly curious: you can extract this data from the structured HTML with some DOM selectors in a similarly haphazard way.

Start with: https://en.wikipedia.org/api/rest_v1/page/html/List_of_Olympic_medalists_in_judo

Run this JS one-liner:

    [].slice.call(document.querySelectorAll('table[typeof="mw:Transclusion mw:ExpandedAttrs"] tr td:nth-child(n+2) > a:nth-child(1), table[typeof="mw:Transclusion mw:ExpandedAttrs"] tr:nth-child(3) td > a:nth-child(1)')).map(function(e) { return e.innerText; }).reduce(function(res, el) { res[el] = res[el] ? res[el] + 1 : 1; return res; }, {});

The result is an object with the medalists as keys and the counts as values. JS objects are unordered, so sorting is left as an exercise for the reader.
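
For reference, a minimal Python sketch of the same approach outside the browser (assumptions: the requests and beautifulsoup4 packages are installed, and the selectors still match the current markup):

    # Sketch: fetch the same REST HTML and count medalists with the same
    # CSS selectors as the JS one-liner above.
    from collections import Counter

    import requests
    from bs4 import BeautifulSoup

    URL = "https://en.wikipedia.org/api/rest_v1/page/html/List_of_Olympic_medalists_in_judo"
    SELECTOR = (
        'table[typeof="mw:Transclusion mw:ExpandedAttrs"] tr td:nth-child(n+2) > a:nth-child(1), '
        'table[typeof="mw:Transclusion mw:ExpandedAttrs"] tr:nth-child(3) td > a:nth-child(1)'
    )

    soup = BeautifulSoup(requests.get(URL).text, "html.parser")
    counts = Counter(a.get_text() for a in soup.select(SELECTOR))

    # Counter also covers the sorting left as an exercise:
    for name, n in counts.most_common(10):
        print(n, name)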
jpatokal almost 9 years ago
This is you-can't-parse-HTML-with-regex [1] level hideous, only worse, because MediaWiki markup is essentially a Turing-complete programming language thanks to template inclusion, parser functions [2], etc.

The only remotely sane way to do this is to use the MediaWiki API [3] to get the pages you want, then use an actual parser like mwlib [4] to extract the content you need. Wikidata and DBpedia are also promising efforts, but both have a long way to go in terms of coverage.

[1] http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags
[2] https://www.mediawiki.org/wiki/Help:Extension:ParserFunctions
[3] https://www.mediawiki.org/wiki/API:Main_page
[4] https://www.mediawiki.org/wiki/Alternative_parsers
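
For the fetch half of that advice, a minimal sketch against the MediaWiki API [3] (assuming the requests package; parsing the returned wikitext with mwlib [4] or similar is a separate step):

    import requests

    # Ask the MediaWiki API for the raw wikitext of one page
    API = "https://en.wikipedia.org/w/api.php"
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
        "titles": "List_of_Olympic_medalists_in_judo",
        "format": "json",
    }
    data = requests.get(API, params=params).json()
    page = next(iter(data["query"]["pages"].values()))
    wikitext = page["revisions"][0]["slots"]["main"]["*"]
    print(wikitext[:500])  # raw wikitext: hand this to a real parser, not regex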
betolink almost 9 years ago
...or we can just use SPARQL and DBpedia! (http://wiki.dbpedia.org/) There are questions where you'll have to scrape more than one page to get an answer, and things could get really complicated with shell commands.

DBpedia is a triple store that allows us to perform simple queries against Wikipedia data, like listing music bands from a particular city:

    SELECT ?name ?place WHERE {
      ?place rdfs:label "Denver"@en .
      ?band dbo:hometown ?place .
      ?band rdf:type dbo:Band .
      ?band rdfs:label ?name .
      FILTER langMatches(lang(?name), 'en')
    }

or queries that involve multiple subjects, categories, etc.
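
A minimal sketch of running that query from Python against the public DBpedia endpoint (assuming the requests package and the standard endpoint at https://dbpedia.org/sparql):

    import requests

    QUERY = """
    SELECT ?name ?place WHERE {
      ?place rdfs:label "Denver"@en .
      ?band dbo:hometown ?place .
      ?band rdf:type dbo:Band .
      ?band rdfs:label ?name .
      FILTER langMatches(lang(?name), 'en')
    }
    """

    resp = requests.get(
        "https://dbpedia.org/sparql",
        params={"query": QUERY, "format": "application/sparql-results+json"},
    )
    for row in resp.json()["results"]["bindings"]:
        print(row["name"]["value"])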
minimaxir almost 9 years ago
> You will not need to open an editor and write a long script and then to have an interpreter like Node.js to run it, sometime the bash command line is just enough you need!

This is a bad attitude to have for working with data processing, where QA is necessary and the accuracy of the output is important. A 50 LOC scraper with comments and explicitly defined inputs and outputs for its functions is far preferable to an 8 LOC scraper that those without bash knowledge will be unable to parse.

And the 8 LOC bash script is not much of a time savings, as this post demonstrates; you still have to check each function's output manually to find which data to parse / handle edge cases.
ianseyer almost 9 years ago
This is the kind of query that excites me for WikiData's development.

http://wikidata.org
Washuu almost 9 years ago
There is also the option of using Parsoid.

https://github.com/wikimedia/parsoid

That is MediaWiki's official off-wiki parser; it can turn wikitext into HTML, or HTML back into wikitext. It would be reasonably simple to hook into its API and use it for data extraction instead.
orfix almost 9 years ago
My 2 cents: the cut/grep lines could be replaced by a sed/awk one-liner such as:

    sed -n 's/.*flagIOCmedalist|\[\[\([^]|]*\).*/\1/p'
CydeWeys almost 9 years ago
There is an active project sponsored by the Wikimedia Foundation called PyWikiBot that I've been a contributor to and user of for over a decade now. If you want to do anything and everything with Wikipedia, look no further than: https://github.com/wikimedia/pywikibot-core
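
A minimal sketch of fetching a page's wikitext with it (assuming pip install pywikibot and the user-config.py that pywikibot normally expects):

    import pywikibot

    site = pywikibot.Site("en", "wikipedia")  # English Wikipedia
    page = pywikibot.Page(site, "List_of_Olympic_medalists_in_judo")
    print(page.text[:500])  # raw wikitext, ready for further processing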
mickael-kerjean almost 9 years ago
You should try Wikidata for any type of query that can't be answered using Google and where all the information is already on Wikipedia. It's way faster (if you know about SPARQL) and way more powerful and flexible. It's surprising there aren't more people talking about it; triplestores are awesome.
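
A minimal sketch against the Wikidata Query Service (assuming the requests package; the query is the stock "instances of house cat" example from the WDQS documentation, not one from this thread):

    import requests

    QUERY = """
    SELECT ?item ?itemLabel WHERE {
      ?item wdt:P31 wd:Q146 .  # instance of: house cat
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
    }
    LIMIT 10
    """

    resp = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": QUERY, "format": "json"},
        headers={"User-Agent": "example-script/0.1"},  # WDQS asks clients to identify themselves
    )
    for row in resp.json()["results"]["bindings"]:
        print(row["itemLabel"]["value"])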
davidgerard almost 9 years ago
... there's an API making half of this superfluous. You can do pretty much any MediaWiki reading or writing through it. (All Wikipedia bots are required to use it, for instance.)

https://en.wikipedia.org/w/api.php

The article text is a raw blob of wikitext you have to process, but you don't have to go to stupid lengths trying to parse HTML without a browser.
tpetricek almost 9 years ago
Extracting data from Wikipedia with type providers: http://evelinag.com/blog/2015/11-18-f-tackles-james-bond/
turtlebits almost 9 years ago
An XPath like `//table/tr/td[2]/a[1]/text()` seems like it would be a lot simpler.
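
A minimal sketch of that XPath over the REST HTML from the top comment (assuming requests and lxml; rendered Wikipedia tables usually carry a tbody, so the variant below uses //table//tr to be safe):

    import requests
    from lxml import html

    url = "https://en.wikipedia.org/api/rest_v1/page/html/List_of_Olympic_medalists_in_judo"
    doc = html.fromstring(requests.get(url).content)
    names = doc.xpath("//table//tr/td[2]/a[1]/text()")  # tbody-tolerant variant
    print(names[:10])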
kasperset almost 9 years ago
A large part of bioinformatics data processing involves these commands. They seem a little cryptic but they get the job done. I would also like to mention Datamash: https://www.gnu.org/software/datamash/
hbogert almost 9 years ago
Isn't this a poster child example for the semantic web?
ShakataGaNai almost 9 years ago
Errrrrrrrrk. Extracting raw wiki-markup and trying to use it? Not the greatest of ideas. The only true parser of that language is MediaWiki itself. Doing it yourself is a recipe for a massive headache.
yarrel almost 9 years ago
A couple of years ago I found Perl was fastest at processing Wikipedia dumps.

It also didn't require having a JVM preloaded to make startup times acceptable during development (naming no other tools).

I do use shell tools to process data, a lot. They're particularly good for exploratory programming and initial analysis of new datasets.
dangravell almost 9 years ago
Or, for a lot of the structured elements, you could use DBPedia.
vram22 almost 9 years ago
There is also a wikipedia library for Python. An example of its use:

Using the wikipedia Python library (to search for oranges :)
https://jugad2.blogspot.in/2015/11/using-wikipedia-python-library.html

And there may be libraries for other languages too, since the above library wraps a Wikipedia API:

https://en.wikipedia.org/wiki/Wikipedia:API
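
A minimal sketch of that library in action (assuming pip install wikipedia; the page title is an illustrative guess):

    import wikipedia

    print(wikipedia.search("oranges"))                  # matching page titles
    print(wikipedia.summary("Orange (fruit)")[:200])    # plain-text summary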
loige almost 9 years ago
I actually added some of your alternative solutions to the bottom of the article, thanks for commenting :)
lovelearning almost 9 years ago
Python has excellent packages like mwparserfromhell and wikitables for this kind of processing.
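
A minimal mwparserfromhell sketch (assuming pip install mwparserfromhell; the one-line wikitext sample is made up for illustration, echoing the flagIOCmedalist template from the sed comment above):

    import mwparserfromhell

    # Hypothetical wikitext snippet; in practice, fetch it via the API or a dump
    wikitext = "{{flagIOCmedalist|[[Tadahiro Nomura]]|JPN|2004}}"
    code = mwparserfromhell.parse(wikitext)
    for template in code.filter_templates():
        print(template.name, template.params)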
opensourcedude almost 9 years ago
I appreciate this for the novelty factor, but somebody show this dude how to use a spreadsheet!