
Wikipedia data dumps and stats

34 points by kola about 12 years ago

2 comments

fauigerzigerk about 12 years ago
Sadly, they don't publish up-to-date HTML dumps and there is no reliable way of reproducing them short of installing the entire wikipedia system locally, including the database. I know there are quite a few projects that claim to do it but they're all abandoned, incomplete or unsuitable in various other ways (as far as I know).
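A minimal sketch of one partial workaround, assuming per-article access is enough: the public MediaWiki API's action=parse endpoint returns the rendered HTML for a single page. It is no substitute for a bulk HTML dump, and the page title used below is only an example.

```python
# Fetch rendered HTML for one article via the MediaWiki API (action=parse).
# Requires the third-party "requests" package; the title is an example only.
import requests

def fetch_article_html(title):
    resp = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "parse",   # render the page server-side
            "page": title,
            "prop": "text",      # return the HTML body
            "format": "json",
        },
        headers={"User-Agent": "html-dump-example/0.1"},
        timeout=30,
    )
    resp.raise_for_status()
    # In the legacy JSON format the HTML sits under parse.text["*"].
    return resp.json()["parse"]["text"]["*"]

html = fetch_article_html("Hacker News")
print(html[:200])
```

Fetching page by page like this is slow and rate-limited, which is exactly the gap an up-to-date HTML dump would fill.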
wikiburner about 12 years ago
Hey everybody, fauigerzigerk sort of gets into this, but I just downloaded the dump yesterday expecting there to be a relatively straightforward way to parse and search it with Python and extract and process articles of interest w/ NLTK.

I'm not sure what I was expecting exactly, but it sure wasn't a single 40gb XML file that I can't even open in Notepad++.

Is my only real option (for parsing and data mining this thing) to basically set up a clone of wikipedia's system, and then screen scrape localhost?
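A minimal sketch of reading the dump without a full MediaWiki install, assuming the pages-articles XML file sits on local disk: Python's xml.etree iterparse streams one &lt;page&gt; element at a time, so the 40 GB file never has to fit in memory. The filename and schema namespace below are placeholders; check the dump's own &lt;mediawiki xmlns=...&gt; header and adjust.

```python
# Stream the enwiki pages-articles XML dump page by page, constant memory.
import xml.etree.ElementTree as ET

DUMP = "enwiki-latest-pages-articles.xml"          # hypothetical local filename
NS = "{http://www.mediawiki.org/xml/export-0.8/}"  # schema version varies by dump

def iter_articles(path):
    """Yield (title, wikitext) pairs, one page at a time."""
    for event, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == NS + "page":
            title = elem.findtext(NS + "title")
            text = elem.findtext(f"{NS}revision/{NS}text") or ""
            yield title, text
            elem.clear()  # drop the parsed subtree so memory stays flat

for title, text in iter_articles(DUMP):
    # Toy filter standing in for whatever NLTK processing you have in mind.
    if "natural language processing" in text.lower():
        print(title)
```

What this yields is raw wikitext, not rendered articles, so it still needs a separate wikitext parser or NLTK-side cleanup; that remaining gap is the point the comment above about missing HTML dumps is making.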