Ask HN: Has anyone had luck using the Wikipedia data dumps?

46 points by perfect_loop about 6 years ago
Many docs and tutorials are from 10+ years ago. Have you had any luck loading the data dumps (not the API) locally in order to play around with them? If so, I'd very much appreciate it if you could point me in the right direction.

11 comments

thomas536 about 6 years ago
I also didn't find much information about how long it would take to import into a DB, so I used the XML dumps directly [1]. I only needed the wiki content (not the history), so the article XML files worked well for me. And then I used mwparserfromhell [2] to parse and extract from the wiki markup.

[1] https://dumps.wikimedia.org/enwiki/20190301/

[2] https://mwparserfromhell.readthedocs.io/en/latest/
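
For reference, a minimal sketch of what parsing a page's wikitext with mwparserfromhell looks like (the snippet of wikitext below is made up for illustration):

    import mwparserfromhell  # pip install mwparserfromhell

    # Made-up wikitext, as it would appear inside a page's <text> element.
    wikitext = (
        "'''Python''' is a [[programming language]] first released in 1991.\n"
        "{{Infobox programming language|name=Python|year=1991}}"
    )

    code = mwparserfromhell.parse(wikitext)
    print(code.strip_code())  # plain text with the markup stripped
    print([str(link.title) for link in code.filter_wikilinks()])  # ['programming language']
    for template in code.filter_templates():  # e.g. infoboxes
        if template.has("year"):
            print(str(template.get("year").value).strip())  # '1991'
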
diggan about 6 years ago
While building the Wikipedia mirror on IPFS (with search), we tried using the dumps from Wikipedia themselves but ended up using Zim archives from kiwix.org instead. The end result is here: https://github.com/ipfs/distributed-wikipedia-mirror

For actually ingesting the archives, dignifiedquire expanded a Rust utility aptly named Zim, which you can find here: https://github.com/dignifiedquire/zim

Both repos contain information (and code, of course) on how to extract information from the Zim archives.
yk66 about 6 years ago
I use Kiwix to do that; it's much simpler. Plus they provide other dumps too, so you can play with, say, Wikipedia and Stack Overflow simultaneously.
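
If you go the Kiwix route, the openZIM project also publishes Python bindings for reading Zim files directly. A rough sketch, assuming the libzim reader API (the file name and article path below are placeholders, and older archives prefix article paths with "A/"):

    from libzim.reader import Archive  # pip install libzim

    zim = Archive("wikipedia_en_all_nopic.zim")  # placeholder file name
    print(zim.entry_count)  # number of entries in the archive

    entry = zim.get_entry_by_path("A/Python_(programming_language)")  # path is a guess
    item = entry.get_item()
    html = bytes(item.content).decode("utf-8")  # Zim stores the rendered HTML
    print(item.title, len(html))
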
arsenide about 6 years ago
I have toyed around with the Wikipedia dump -- in XML, downloaded through the provided torrent file on Wikipedia.

It took a bit to get accustomed to the format, but after looking at the files and doing a bit of research on the documentation, using Python with lxml made it relatively straightforward to do what I was interested in.

I'd recommend doing the same, only because it worked for me: get the XML dump, manually check out some files to understand what is going on, search for documentation on the file format and maybe read a few blog posts, and then convert the XML files to data structures suited for what you're interested in.
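
A rough sketch of that approach, streaming pages out of the decompressed pages-articles XML with lxml (the dump path and the export-schema version in the namespace are assumptions):

    from lxml import etree  # pip install lxml

    DUMP = "enwiki-pages-articles.xml"  # placeholder path to the decompressed dump
    # MediaWiki export namespace; the schema version varies between dump dates.
    MW = "{http://www.mediawiki.org/xml/export-0.10/}"

    # iterparse streams <page> elements so the multi-GB file never sits in memory.
    for _, page in etree.iterparse(DUMP, events=("end",), tag=MW + "page"):
        title = page.findtext(MW + "title")
        text = page.findtext(f"{MW}revision/{MW}text") or ""
        # ... convert title/text into whatever data structure you need ...
        page.clear()
        # Drop references lxml keeps to already-processed siblings.
        while page.getprevious() is not None:
            del page.getparent()[0]
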
aboutruby about 6 years ago
You could also use Special:Export, depending on your use case: https://en.wikipedia.org/wiki/Special:Export
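
For one-off experiments this is probably the quickest route; Special:Export returns the same export XML format the dumps use, and you can fetch it with a plain HTTP request (the page title here is just an example):

    import requests  # pip install requests

    # Export XML for a single page; the Special:Export form also accepts batches of titles.
    url = "https://en.wikipedia.org/wiki/Special:Export/Python_(programming_language)"
    resp = requests.get(url, headers={"User-Agent": "wiki-dump-experiments/0.1"})
    resp.raise_for_status()
    print(resp.text[:500])  # MediaWiki export XML containing the page's wikitext
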
thetermsheet about 6 years ago
This may not be the most helpful reply, but I remember having to use some "importing tool". Wikipedia provides you with standard SQL dumps, yet simply importing them into the DB is not going to cut it. The community has created import scripts which simplify the process to a degree.
zepearl about 6 years ago
I used Python to load the contents of the articles into a DB (potentially wrong extract of veeery old code - I have something like 20 different versions lying around, therefore I'm not 100% sure that this did work well):

    from lxml import etree

    sInputFileName = "/my/input/wiki_file.xml"

    # Stream <doc> elements so the whole file is never held in memory.
    context = etree.iterparse(sInputFileName, events=('end',), tag='doc')
    for event, elem in context:
        iThisArticleCharLength = len(elem.text or "")
        sPageURL = elem.get("url")[0:4000]
        sPageTitle = elem.get("title")[0:4000]
        sPageContents = elem.text
        # do what you want with these vars...
        elem.clear()  # free the processed element
StrangeDoctor about 6 years ago
I built tools to parse the compressed XML dumps. My computer was pretty underpowered at the time (a MacBook Air), so I had to be very careful to make everything a streaming algorithm. Looking back, I basically recreated a shitty map-reduce in Python.
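
In the same spirit, a minimal constant-memory pass over the compressed dump using nothing but the standard library (the dump path is a placeholder and the aggregate computed is arbitrary):

    import bz2
    import xml.etree.ElementTree as ET

    DUMP = "enwiki-pages-articles.xml.bz2"  # placeholder path
    MW = "{http://www.mediawiki.org/xml/export-0.10/}"  # schema version may differ

    pages = 0
    chars = 0
    # bz2.open decompresses lazily and iterparse streams, so the working set
    # stays small no matter how large the dump is.
    with bz2.open(DUMP, "rb") as f:
        for _, elem in ET.iterparse(f, events=("end",)):
            if elem.tag == MW + "page":
                text = elem.findtext(f"{MW}revision/{MW}text") or ""
                pages += 1
                chars += len(text)
                elem.clear()  # free the finished <page> element

    print(pages, "pages,", chars, "characters of wikitext")
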
mooss about 6 years ago
I've had some success using this tutorial: https://www.kdnuggets.com/2017/11/building-wikipedia-text-corpus-nlp.html

And I've changed it a little bit to extract only the first n characters, which might be of some use since the Wikipedia dumps are supposed to be pretty large: https://github.com/mooss/ruskea/blob/master/make_wiki_corpus.py
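
That tutorial is built around gensim's WikiCorpus wrapper; stripped down, the idea is roughly this (the dump path is a placeholder, and constructor arguments and token types differ a bit between gensim versions):

    from gensim.corpora import WikiCorpus  # pip install gensim

    DUMP = "enwiki-pages-articles.xml.bz2"  # placeholder path

    # WikiCorpus handles decompression, XML parsing and markup stripping;
    # passing an empty dict skips building a vocabulary up front.
    wiki = WikiCorpus(DUMP, dictionary={})

    with open("wiki_corpus.txt", "w", encoding="utf-8") as out:
        for i, tokens in enumerate(wiki.get_texts()):  # one token list per article
            # Older gensim versions yield bytes tokens, newer ones yield str.
            words = (t.decode("utf-8") if isinstance(t, bytes) else t for t in tokens)
            out.write(" ".join(words)[:500] + "\n")  # keep only the first n characters
            if i >= 1000:  # stop early while experimenting
                break
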
kldavis4 about 6 years ago
I wrote a simple parser in Node to import the article dump into an Elasticsearch instance as part of a hands-on tutorial: https://github.com/kldavis4/kuali-days-2017-elasticsearch/blob/master/wikipedia/index.js. At the time, on the full dump, it took quite a while to ingest (days, as I recall).
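
Not their code, but the same idea in Python using the official client's bulk helper (the index name, document shape and sample data are all stand-ins):

    from elasticsearch import Elasticsearch, helpers  # pip install elasticsearch

    es = Elasticsearch("http://localhost:9200")  # assumes a local instance

    def actions(pages):
        # pages: any iterable of (title, text) pairs parsed from the dump.
        for title, text in pages:
            yield {"_index": "wikipedia", "_source": {"title": title, "text": text}}

    sample_pages = [("Python (programming language)", "Python is a programming language ...")]
    # Bulk-index in batches; on the full dump expect this to run for a long time.
    helpers.bulk(es, actions(sample_pages), chunk_size=500)
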
usgroup about 6 years ago
Depending on what you're doing, consider using Wikidata instead. It has a SPARQL interface that's easy to query.
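
For example, the public endpoint at query.wikidata.org answers plain HTTP requests; the query below just lists a few items that are instances of "programming language":

    import requests  # pip install requests

    # P31 = "instance of", Q9143 = "programming language" on Wikidata.
    query = """
    SELECT ?item ?itemLabel WHERE {
      ?item wdt:P31 wd:Q9143 .
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }
    LIMIT 10
    """

    resp = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": query, "format": "json"},
        headers={"User-Agent": "wiki-dump-experiments/0.1"},
    )
    resp.raise_for_status()
    for row in resp.json()["results"]["bindings"]:
        print(row["itemLabel"]["value"], row["item"]["value"])
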