
Wikipedia data dumps and stats

34 points by kola about 12 years ago

2 comments

fauigerzigerk about 12 years ago
Sadly, they don't publish up-to-date HTML dumps and there is no reliable way of reproducing them short of installing the entire wikipedia system locally, including the database. I know there are quite a few projects that claim to do it but they're all abandoned, incomplete or unsuitable in various other ways (as far as I know).
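A minimal sketch of one partial workaround, assuming per-article access is enough: the public MediaWiki API's action=parse endpoint returns the rendered HTML for a single page. It is no substitute for a bulk HTML dump, and the page title used below is only an example.

```python
# Fetch rendered HTML for one article via the MediaWiki API (action=parse).
# Requires the third-party "requests" package; the title is an example only.
import requests

def fetch_article_html(title):
    resp = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "parse",   # render the page server-side
            "page": title,
            "prop": "text",      # return the HTML body
            "format": "json",
        },
        headers={"User-Agent": "html-dump-example/0.1"},
        timeout=30,
    )
    resp.raise_for_status()
    # In the legacy JSON format the HTML sits under parse.text["*"].
    return resp.json()["parse"]["text"]["*"]

html = fetch_article_html("Hacker News")
print(html[:200])
```

Fetching page by page like this is slow and rate-limited, which is exactly the gap an up-to-date HTML dump would fill.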
wikiburner about 12 years ago
Hey everybody, fauigerzigerk sort of gets into this, but I just downloaded the dump yesterday expecting there to be a relatively straightforward way to parse and search it with Python and extract and process articles of interest w/ NLTK.

I'm not sure what I was expecting exactly, but it sure wasn't a single 40gb XML file that I can't even open in Notepad++.

Is my only real option (for parsing and data mining this thing) to basically set up a clone of wikipedia's system, and then screen scrape localhost?
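A minimal sketch of reading the dump without a full MediaWiki install, assuming the pages-articles XML file sits on local disk: Python's xml.etree iterparse streams one &lt;page&gt; element at a time, so the 40 GB file never has to fit in memory. The filename and schema namespace below are placeholders; check the dump's own &lt;mediawiki xmlns=...&gt; header and adjust.

```python
# Stream the enwiki pages-articles XML dump page by page, constant memory.
import xml.etree.ElementTree as ET

DUMP = "enwiki-latest-pages-articles.xml"          # hypothetical local filename
NS = "{http://www.mediawiki.org/xml/export-0.8/}"  # schema version varies by dump

def iter_articles(path):
    """Yield (title, wikitext) pairs, one page at a time."""
    for event, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == NS + "page":
            title = elem.findtext(NS + "title")
            text = elem.findtext(f"{NS}revision/{NS}text") or ""
            yield title, text
            elem.clear()  # drop the parsed subtree so memory stays flat

for title, text in iter_articles(DUMP):
    # Toy filter standing in for whatever NLTK processing you have in mind.
    if "natural language processing" in text.lower():
        print(title)
```

What this yields is raw wikitext, not rendered articles, so it still needs a separate wikitext parser or NLTK-side cleanup; that remaining gap is the point the comment above about missing HTML dumps is making.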