Show HN: Easy to use Wikipedia API for Python

218 points by jgoldsmith over 11 years ago

10 comments

cjbarber over 11 years ago
Fantastic job, Johnathon. To HN readers: I just want to point out that this awesome library was created by a high school student (who is extremely motivated and builds a bunch of stuff)!

This is Hacker News at its best: highlighting creation.

Interesting to note that this was submitted a few days ago and received only 3 points. I mention this because this is the sort of thing that should have made the front page the first time it was posted!

I'm thinking of writing a weekly post that highlights the things people created and posted to Hacker News to deaf ears (thankfully this is not the case with this post!).

In fact, I'm going to go and write it now (and create!).

Edit 1: If anyone is on Medium, here's my draft.

https://medium.com/p/e394f6d917d3?kme=collabEmail.clicked&km_postIds=e394f6d917d3

Edit 2: And while I have the chance for something not to fall on deaf ears, here's something I just wrote that would be interesting to anyone who's annoyed with recruiters and would rather work at SpaceX than Snapchat.

https://medium.com/p/de5c73174a4e?kme=collabEmail.clicked&km_postIds=de5c73174a4e
nevermore over 11 years ago
Please note that this API does not make any specific attempt to follow the MediaWiki API etiquette (http://www.mediawiki.org/wiki/API:Etiquette). This sort of API is easy and clean for something like a command-line script, but if you're going to do further automation or crawling, I strongly recommend the pywikipediabot library (http://www.mediawiki.org/wiki/Manual:Pywikipediabot), which includes a very full API, has tunable throttling, and makes a more direct attempt to require a user agent string that is in line with the API etiquette.

If you just want a bash script to look things up on Wikipedia, you can always use something like

function wp { curl "http://en.wikipedia.org/wiki/$(echo "$@" | tr ' ' '_')" | gunzip | html2text; }

which will work for basic queries (it still needs URL encoding, and words have to be properly capitalized).

A full API reference is here (http://en.wikipedia.org/w/api.php).
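For a script that sits somewhere between a one-off lookup and a full bot, the two points raised above (an identifying user agent string and throttling) can be sketched with the standard library alone. This is a minimal sketch, not what the library under discussion or pywikipediabot does: the `USER_AGENT` value, the one-second interval, and the names `build_request`, `Throttle`, and `fetch` are all assumptions made for illustration.

```python
import time
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"
# Hypothetical identity: replace with your own tool name and contact details.
USER_AGENT = "ExampleWikiScript/0.1 (https://example.org/contact; you@example.org)"


def build_request(params):
    """Build (but do not send) an API request carrying a descriptive User-Agent."""
    query = urllib.parse.urlencode({**params, "format": "json"})
    return urllib.request.Request(
        f"{API_URL}?{query}", headers={"User-Agent": USER_AGENT}
    )


class Throttle:
    """Enforce a minimum delay between consecutive calls made one after another."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first one

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Assumed interval of one second; tune it to whatever the etiquette page asks for.
throttle = Throttle(min_interval=1.0)


def fetch(params):
    """Send one throttled request and return the raw response body as bytes."""
    throttle.wait()
    with urllib.request.urlopen(build_request(params)) as response:
        return response.read()
```

Calling `fetch({"action": "query", "titles": "Python"})` in a loop then issues requests serially, at most one per second, each identifying the script that sent it.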
languagehacker over 11 years ago
Nice work on this. It's always good to see people giving more visibility to MediaWiki's capabilities.

To engineers who would like to use this library, I would give a caveat that there are way more API actions than the ones enumerated here. So if you're looking to make some contributions to a project, this one is rife with possible pull requests.

In terms of article access and analysis, I'd recommend looking at Pattern (https://github.com/clips/pattern) before starting with this library. Not only do you get access to the rest of Pattern's IR/text analysis capabilities, but the approach in Pattern is written to support any site built off of MediaWiki, and not just English Wikipedia. This means not only foreign-language Wikipedia instances, but all 350k wikis on Wikia, and Project Gutenberg (which, interestingly enough, runs MW 1.13).
echohack over 11 years ago
Nice. I can see lots of room for improvement. Great start; using Requests in particular will streamline things.

I'll work in some changes tonight. Let's start with PEP8, shall we? :)
DenisM over 11 years ago
Any advice on stripping wiki markup to obtain plain text from the Wikipedia dump? A friend is doing linguistic research and could benefit from large bodies of text in different languages. Ideally this would be a C# library, but a simple command line tool in any other language would do as well: accept .xml.bz2, strip the wiki markup, return something that's easily processable by further tools in a single file. Thanks in advance.
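One rough way to approach the pipeline described above (read a .xml.bz2 dump, strip markup, emit plain text) is to stream the XML and apply a handful of regular expressions. The sketch below is in Python rather than C#, and it is deliberately crude: `iter_page_texts` and `strip_markup` are hypothetical helper names, and the regexes only handle comments, templates, simple links, bold/italic quotes, and leftover HTML-style tags. Tables, file links with several pipes, and parser functions will leave residue, so a dedicated wikitext parser is likely a better fit for serious research use.

```python
import bz2
import re
import xml.etree.ElementTree as ET


def iter_page_texts(dump):
    """Yield the raw wikitext of every <text> element in a .xml.bz2 dump.

    `dump` may be a path or a binary file object. The XML is parsed
    incrementally, so the whole dump is never held in memory at once.
    """
    with bz2.open(dump, "rb") as stream:
        for _event, elem in ET.iterparse(stream, events=("end",)):
            # Tags look like "{namespace}text"; compare only the local name.
            if elem.tag.rsplit("}", 1)[-1] == "text" and elem.text:
                yield elem.text
            elem.clear()  # drop content we no longer need


def strip_markup(wikitext):
    """Very rough wikitext-to-plain-text conversion (see caveats above)."""
    text = re.sub(r"<!--.*?-->", "", wikitext, flags=re.DOTALL)  # comments
    # Remove innermost {{templates}} repeatedly so nested ones disappear too.
    previous = None
    while previous != text:
        previous = text
        text = re.sub(r"\{\{[^{}]*\}\}", "", text)
    # [[target|label]] -> label, [[target]] -> target
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)
    text = re.sub(r"'{2,}", "", text)    # bold and italic quote runs
    text = re.sub(r"<[^>]+>", "", text)  # leftover HTML-style tags
    return text
```

Writing `strip_markup(t)` for each `t` from `iter_page_texts(path)` to a single output file gives the "one file for further tools" shape asked for.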
toyg over 11 years ago
I've used (and patched) this alternative: https://github.com/richardasaurus/wiki-api
harlowja over 11 years ago
You might really want to not cache everything that comes back in a never-ending Python dictionary. Look up "memory leak" on Wikipedia ;)
level09 over 11 years ago
Excellent!

Are there any APIs for other languages? I tried querying with Unicode strings and it worked, but I only got English content.
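Each language edition of Wikipedia lives on its own subdomain (fr.wikipedia.org, ru.wikipedia.org, and so on), so a query sent to the English endpoint returns English content no matter what script the title is in. A minimal sketch of building a per-language API URL follows; the helper name `api_url` and the choice of query parameters are assumptions for illustration, not part of the library being discussed.

```python
import re
import urllib.parse


def api_url(lang, title):
    """Build a MediaWiki API URL for the given Wikipedia language edition."""
    # Accept only plausible language codes so the host name cannot be tampered with.
    if not re.fullmatch(r"[a-z][a-z-]{1,11}", lang):
        raise ValueError(f"unexpected language code: {lang!r}")
    params = {
        "action": "query",
        "prop": "extracts",   # plain-text extracts of the page
        "explaintext": 1,
        "titles": title,
        "format": "json",
    }
    # urlencode percent-encodes non-ASCII titles as UTF-8.
    return f"https://{lang}.wikipedia.org/w/api.php?" + urllib.parse.urlencode(params)
```

For example, `api_url("ru", "Москва")` targets the Russian edition, with the Cyrillic title percent-encoded in the query string.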
ksrm over 11 years ago
Is there an API for extracting data from infoboxes?
leoplct over 11 years ago
Great! I'm looking forward to a Ruby version!