Parsing Wikipedia Articles with Node.js and jQuery

33 points by BenjaminCoe, over 12 years ago

6 comments

mkl, over 12 years ago

There are lots of attempts to write new Wikipedia parsers that just do "the useful stuff", like getting the text. They all fail, for the simple reason that some of the text comes from MediaWiki templates. E.g.

    about {{convert|55|km|0|abbr=on}} east of

will turn into

    about 55 km (34 mi) east of

and

    {{As of|2010|7|5}}

will turn into

    As of 5 July 2010

and so on (there are thousands of relevant templates). It's simply not possible to get the full plain text without processing the templates, and the only system that can correctly and completely parse the templates is MediaWiki itself.

Yes, it's a huge system entirely written in PHP, but you can make a simple command-line parser with it pretty easily (though it took me quite a while to figure out how). The key points are to put something like

    $IP = strval(getenv('MW_INSTALL_PATH')) !== '' ? getenv('MW_INSTALL_PATH') : '/usr/share/mediawiki';
    require_once("$IP/maintenance/commandLine.inc");

at the start of it, and then use the Parser class. You get HTML out, but it's simple and well-formed (to get text, start with the top-level p tags).

To get it to process templates, get a Wikipedia dump, extract the templates, and use the mwdumper tool to import them into your local MediaWiki database.

I don't know if this is the best or "right" way to do it, but it's the only way I've found that actually works.

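If setting up a local MediaWiki is too heavy, the hosted API can at least do the template expansion server-side. Below is a minimal Node.js sketch of that route; it is not mkl's setup, and the expandtemplates parameters and response shape are assumptions based on the modern MediaWiki API, so check them against the live api.php documentation.

    // Sketch (not mkl's local-MediaWiki approach): let Wikipedia's own MediaWiki
    // instance expand templates via the API's expandtemplates action.
    const https = require('https');
    const { URLSearchParams } = require('url');

    function expandTemplates(wikitext, callback) {
      const params = new URLSearchParams({
        action: 'expandtemplates',
        prop: 'wikitext',   // modern API; older versions returned the text under "*"
        text: wikitext,
        format: 'json',
      });
      https.get('https://en.wikipedia.org/w/api.php?' + params.toString(), (res) => {
        let body = '';
        res.on('data', (chunk) => { body += chunk; });
        res.on('end', () => {
          // Expanded wikitext, with constructs like {{convert|...}} resolved.
          callback(null, JSON.parse(body).expandtemplates.wikitext);
        });
      }).on('error', callback);
    }

    expandTemplates('about {{convert|55|km|0|abbr=on}} east of', (err, text) => {
      if (err) throw err;
      console.log(text); // expected: something like "about 55 km (34 mi) east of"
    });

This only expands templates into plain wikitext; turning the result into clean HTML still needs either the API's parse action or the local Parser approach mkl describes.
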
tillk, over 12 years ago

This is interesting, but why not use their API? http://en.wikipedia.org/w/api.php

It's part of mediawiki and available for each and every wikipedia subsite – as far as I can tell. We are using this as well to autocomplete data, and it works really well.

I prefer this method over 'scraping' the content.

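For reference, the autocomplete use tillk mentions maps onto the API's opensearch action. The sketch below is a hypothetical Node.js version (the parameter names and the array-shaped response are assumptions about the public API, not tillk's code):

    // Hypothetical title-autocomplete helper using the MediaWiki opensearch action.
    const https = require('https');
    const { URLSearchParams } = require('url');

    function suggestTitles(prefix, callback) {
      const params = new URLSearchParams({
        action: 'opensearch',
        search: prefix,
        limit: '10',
        format: 'json',
      });
      https.get('https://en.wikipedia.org/w/api.php?' + params.toString(), (res) => {
        let body = '';
        res.on('data', (chunk) => { body += chunk; });
        res.on('end', () => {
          // opensearch responses look like [query, [titles], [descriptions], [urls]]
          const [, titles] = JSON.parse(body);
          callback(null, titles);
        });
      }).on('error', callback);
    }

    suggestTitles('Node.j', (err, titles) => {
      if (err) throw err;
      console.log(titles); // e.g. [ 'Node.js', ... ]
    });
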
decad, over 12 years ago

This is very interesting, although anyone aiming to crawl Wikipedia should make sure they read this section on the Database download page: http://en.wikipedia.org/wiki/Wikipedia:Database_download#Why_not_just_retrieve_data_from_wikipedia.org_at_runtime.3F

Everything should be fine as long as you respect their 1 request per second rule and their robots.txt.

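One simple way to stay inside that one-request-per-second guideline from Node.js is to schedule requests at least a second apart. The helper below is a hypothetical sketch, not anything Wikipedia itself ships:

    // Hypothetical throttle: space requests at least MIN_INTERVAL_MS apart,
    // even when several are queued at once.
    const https = require('https');

    const MIN_INTERVAL_MS = 1000; // the suggested one request per second
    let nextSlot = 0;             // earliest time the next request may start

    function throttledGet(url) {
      const startAt = Math.max(Date.now(), nextSlot);
      nextSlot = startAt + MIN_INTERVAL_MS;
      return new Promise((resolve, reject) => {
        setTimeout(() => {
          https.get(url, (res) => {
            let body = '';
            res.on('data', (chunk) => { body += chunk; });
            res.on('end', () => resolve(body));
          }).on('error', reject);
        }, startAt - Date.now());
      });
    }

    // Usage: these two fetches go out roughly a second apart.
    throttledGet('https://en.wikipedia.org/wiki/Node.js').then((html) => console.log(html.length));
    throttledGet('https://en.wikipedia.org/wiki/JavaScript').then((html) => console.log(html.length));
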
taliesinb, over 12 years ago

For anyone who might find it useful, I wrote this really simple spidering tool in Go, which is handy when you just want a small subgraph of Wikipedia.

https://github.com/taliesinb/wikispider

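The sketch below is not taliesinb's Go tool; it is just a hypothetical Node.js version of the step such a spider repeats, fetching one page's outgoing article links through the API (the parameter names are assumptions, and long pages need the plcontinue continuation that this sketch ignores):

    // Hypothetical: fetch one page's outgoing article links, the building block
    // a Wikipedia spider keeps applying breadth-first.
    const https = require('https');
    const { URLSearchParams } = require('url');

    function getLinks(title, callback) {
      const params = new URLSearchParams({
        action: 'query',
        titles: title,
        prop: 'links',
        plnamespace: '0', // article namespace only
        pllimit: 'max',
        format: 'json',
      });
      https.get('https://en.wikipedia.org/w/api.php?' + params.toString(), (res) => {
        let body = '';
        res.on('data', (chunk) => { body += chunk; });
        res.on('end', () => {
          const pages = JSON.parse(body).query.pages;
          const page = pages[Object.keys(pages)[0]];
          callback(null, (page.links || []).map((link) => link.title));
        });
      }).on('error', callback);
    }

    getLinks('Node.js', (err, titles) => {
      if (err) throw err;
      console.log(titles.slice(0, 10));
    });
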
kenshiro_o, over 12 years ago

That looks really good and neat! I am currently working on a project that uses information from Wikipedia articles, and having a parser such as yours would make things a lot easier. I am on vacation for the next two weeks, but I'd like to fork your project when I get back. Let me know if there is anything you need help with (bug fixes or new features).

bsb, over 12 years ago

There are also the DBpedia interfaces, including SPARQL access. NLP, meet SemWeb; SemWeb, meet NLP. Or have you met already?

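To make that pointer concrete, here is a hedged Node.js sketch against what is assumed to be the public DBpedia SPARQL endpoint (https://dbpedia.org/sparql), asking for an article abstract; the resource URI and the JSON results format are illustrative assumptions, not anything from the comment:

    // Hypothetical DBpedia SPARQL query: fetch the English abstract of one resource.
    const https = require('https');
    const { URLSearchParams } = require('url');

    const query = `
      PREFIX dbo: <http://dbpedia.org/ontology/>
      SELECT ?abstract WHERE {
        <http://dbpedia.org/resource/Node.js> dbo:abstract ?abstract .
        FILTER (lang(?abstract) = "en")
      }`;

    const params = new URLSearchParams({
      query: query,
      format: 'application/sparql-results+json', // standard SPARQL JSON results
    });

    https.get('https://dbpedia.org/sparql?' + params.toString(), (res) => {
      let body = '';
      res.on('data', (chunk) => { body += chunk; });
      res.on('end', () => {
        const bindings = JSON.parse(body).results.bindings;
        if (bindings.length > 0) console.log(bindings[0].abstract.value);
      });
    }).on('error', console.error);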