TechEcho

10 comments

cjbarber over 11 years ago

Fantastic job Johnathon. To HN readers, I just want to point out that this awesome library was created by a high school student (who is extremely motivated and builds a bunch of stuff)!

This is Hacker News at its best. Highlighting creation.

Interesting to note that this was submitted a few days ago and received only 3 points. I mention this because this is the sort of thing that should have made the front page the first time it was posted!

I'm thinking of writing a weekly post that highlights the things people created and posted to Hacker News that fell on deaf ears (thankfully this is not the case with this post!).

In fact I'm going to go and write it now (and create!).

Edit 1: If anyone is on Medium, here's my draft. https://medium.com/p/e394f6d917d3?kme=collabEmail.clicked&km_postIds=e394f6d917d3

Edit 2: And while I have the chance for something not to fall on deaf ears, here's something I just wrote that would be interesting to anyone who's annoyed with recruiters and would rather work at SpaceX than Snapchat. https://medium.com/p/de5c73174a4e?kme=collabEmail.clicked&km_postIds=de5c73174a4e

nevermore over 11 years ago

Please note that this API does not make any specific attempt to follow the MediaWiki API etiquette (http://www.mediawiki.org/wiki/API:Etiquette). This sort of API is easy and clean for something like a command-line script, but if you're going to do further automation or crawling I strongly recommend using the pywikipediabot library (http://www.mediawiki.org/wiki/Manual:Pywikipediabot), which includes a very full API, has tunable throttling, and makes a more direct attempt to require a user agent string that is in line with the API etiquette.

If you just want a bash script to look things up on Wikipedia, you can always use something like

    function wp {
        curl "http://en.wikipedia.org/wiki/$(echo "$@" | tr ' ' '_')" | gunzip | html2text
    }

which will work for basic queries (it still needs URL encoding, and words must be properly capitalized).

A full API reference is here (http://en.wikipedia.org/w/api.php).
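
For readers who want to see what the etiquette points above (an identifying user agent and throttling) might look like in plain Python, here is a minimal sketch using only the standard library. It is not how the submitted library or pywikipediabot works. The `USER_AGENT` string, the `Throttle` class, and the helper names are hypothetical; `maxlag` and `prop=extracts` are real parameters on Wikipedia's API, but the values chosen here are only illustrative.

```python
import json
import time
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"

# Hypothetical identifying string: replace with your own tool name and contact.
USER_AGENT = "ExampleWikiScript/0.1 (https://example.org/contact; you@example.org)"


class Throttle:
    """Enforce a minimum delay between successive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def build_query_url(title, api_url=API_URL):
    """Build a URL-encoded query for the plain-text extract of one page."""
    params = {
        "action": "query",
        "prop": "extracts",
        "explaintext": 1,
        "titles": title,
        "format": "json",
        "maxlag": 5,  # ask the server to refuse the request when replicas lag
    }
    return api_url + "?" + urllib.parse.urlencode(params)


def build_headers(user_agent=USER_AGENT):
    return {"User-Agent": user_agent}


def fetch(title, throttle):
    """Perform one throttled request and return the decoded JSON response."""
    throttle.wait()
    request = urllib.request.Request(build_query_url(title), headers=build_headers())
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)
```

Usage would be something like `fetch("Hacker News", Throttle(1.0))`, reusing one `Throttle` across calls so that requests are spaced at least a second apart. The clock and sleep functions are injectable only so the delay logic can be checked without actually waiting.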

languagehacker over 11 years ago

Nice work on this. It's always good to see people giving more visibility to MediaWiki's capabilities.

To engineers who would like to use this library, I would give a caveat that there are way more API actions than the ones enumerated here. So if you're looking to make some contributions to a project, this one is rife with possible pull requests.

In terms of article access and analysis, I'd recommend looking at Pattern (https://github.com/clips/pattern) before starting with this library. Not only do you get access to the rest of Pattern's IR/text analysis capabilities, but the approach in Pattern is written to support any site built off of MediaWiki, and not just English Wikipedia. This means not only foreign-language Wikipedia instances, but all 350k wikis on Wikia, and Project Gutenberg (which, interestingly enough, runs MW 1.13).

echohack over 11 years ago

Nice. I can see lots of room for improvement. Great start, especially using Requests will streamline things.

I'll work in some changes tonight. Let's start with PEP8, shall we? :)

DenisM over 11 years ago

Any advice on stripping wiki markup to obtain plain text from the wikipedia dump? A friend is doing linguistic research and could benefit from large bodies of text in different languages. Ideally this would be a C# library, but a simple command line tool in any other language would do as well - accept .xml.bz2, strip the wiki markup, return something that's easily processable by further tools in a single file. Thanks in advance.
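
As a rough illustration of the kind of pipeline being asked about (read a `.xml.bz2` dump, strip markup, emit plain text), here is a deliberately crude Python sketch. It is an assumption-laden starting point, not a recommendation of a specific tool: the function names are hypothetical, and the regular expressions only handle templates, simple `[[links]]`, and bold/italic quotes. Tables, references, HTML tags, and nested file links are left as they are.

```python
import bz2
import re
import xml.etree.ElementTree as ET

_TEMPLATE = re.compile(r"\{\{[^{}]*\}\}")              # innermost {{...}} only
_LINK = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]|]*)\]\]")  # [[target|label]] or [[target]]
_QUOTES = re.compile(r"'{2,}")                          # ''italic'' and '''bold'''


def strip_markup(text):
    """Crudely remove templates, link brackets, and bold/italic quote marks."""
    previous = None
    while previous != text:  # repeat so nested templates collapse inside-out
        previous = text
        text = _TEMPLATE.sub("", text)
    text = _LINK.sub(r"\1", text)
    return _QUOTES.sub("", text)


def iter_page_texts(source):
    """Yield the stripped contents of each <text> element in a bz2-compressed dump.

    `source` may be a file path or a binary file object.
    """
    with bz2.open(source, "rb") as stream:
        for _event, elem in ET.iterparse(stream):
            # Dump elements are namespaced, so compare only the local tag name.
            if elem.tag.rsplit("}", 1)[-1] == "text":
                yield strip_markup(elem.text or "")
            elem.clear()  # drop content that has already been processed
```

Writing each yielded string to a single output file, one article per block, would give the "easily processable" form described above. For serious linguistic work the regex stripping would likely need to be replaced with a real wikitext parser.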

toyg over 11 years ago

I've used (and patched) this alternative: https://github.com/richardasaurus/wiki-api

harlowja over 11 years ago

You might really want to avoid caching everything that comes back in a never-ending Python dictionary. Look up "memory leak" on Wikipedia ;)
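
One common way to address this concern is to cap the cache size and evict the least recently used entry. The sketch below shows that idea with `collections.OrderedDict`; the `BoundedCache` class and its method names are hypothetical and are not taken from the library under discussion.

```python
from collections import OrderedDict


class BoundedCache:
    """A small least-recently-used cache that never holds more than max_entries items."""

    def __init__(self, max_entries=128):
        if max_entries < 1:
            raise ValueError("max_entries must be at least 1")
        self.max_entries = max_entries
        self._data = OrderedDict()

    def get(self, key, default=None):
        if key not in self._data:
            return default
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        while len(self._data) > self.max_entries:
            self._data.popitem(last=False)  # evict the least recently used entry

    def __len__(self):
        return len(self._data)
```

When the cached values are simply the results of a function call, the standard library's `functools.lru_cache(maxsize=...)` decorator gives the same bounded behavior with less code.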

level09 over 11 years ago

Excellent! Are there any APIs for other languages? I tried to query using Unicode strings and it worked, but I only got English content.
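
For context, each language edition of Wikipedia is served from its own subdomain (for example `fr.wikipedia.org`), and each has its own `api.php` endpoint, so querying `en.wikipedia.org` returns English content regardless of the characters in the query. A minimal sketch of selecting the endpoint by language code follows. The helper names are hypothetical, and whether the submitted library exposes a language setting is not stated in this thread.

```python
import re
import urllib.parse

# Accept only lowercase labels such as "fr", "simple", or "zh-min-nan".
_LANG_CODE = re.compile(r"[a-z]+(-[a-z0-9]+)*")


def api_endpoint(lang="en"):
    """Return the API endpoint for one Wikipedia language edition."""
    if not _LANG_CODE.fullmatch(lang):
        raise ValueError("unexpected language code: %r" % (lang,))
    return "https://%s.wikipedia.org/w/api.php" % lang


def query_url(title, lang="en"):
    """Build a URL-encoded query for a page title on the chosen edition."""
    params = {"action": "query", "titles": title, "format": "json"}
    return api_endpoint(lang) + "?" + urllib.parse.urlencode(params)
```

For instance, `query_url("München", "de")` targets the German edition and percent-encodes the non-ASCII title as UTF-8.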

ksrm over 11 years ago

Is there an API for extracting data from infoboxes?

leoplct over 11 years ago

Great! I'm looking forward to a Ruby version!

Show HN: Easy to use Wikipedia API for Python
