
Full-history English Wikipedia dump produced: 5.6TB uncompressed, 32GB 7z'd

58 points by chl about 15 years ago

8 comments

philwelch about 15 years ago
Doesn't include deleted articles, so no hope if you want to recover one of them. This is a pity since Wikipedia deletes too many articles.
MikeCapone about 15 years ago
> 5.6 Tb uncompressed, 280 Gb in bz2 compression format, 32 Gb in 7z compression format

Wow, I didn't know 7z was this much better than bz2. Is this the expected result, or is there something special with Wikipedia that plays to the strengths of 7z?
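A plausible explanation, offered as an aside rather than anything stated in the article: bzip2 compresses independent blocks of at most 900 KB, while 7z's default LZMA codec keeps a dictionary of tens of megabytes, so the heavy redundancy between successive revisions of the same article, which can sit megabytes apart in a full-history dump, is invisible to bzip2 but cheap for LZMA. A minimal sketch using Python's standard bz2 and lzma modules (lzma being the codec family behind 7z) shows the effect on synthetic data:

    import bz2
    import lzma
    import os

    # Crude model of a full-history dump: the same 500 KB "article text" recurs
    # five times, but copies are separated by 1 MB of unrelated data, so the
    # repeats fall outside bzip2's 900 KB blocks but well inside LZMA's
    # multi-megabyte dictionary.
    article = os.urandom(500_000)
    dump = b"".join(article + os.urandom(1_000_000) for _ in range(5))

    bz2_size = len(bz2.compress(dump, compresslevel=9))
    lzma_size = len(lzma.compress(dump, preset=9))

    print(f"original : {len(dump):>9,} bytes")
    print(f"bz2      : {bz2_size:>9,} bytes")   # each block looks like fresh data
    print(f"lzma/7z  : {lzma_size:>9,} bytes")  # repeated article copies are nearly free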
jonknee about 15 years ago
The article says Tb not TB, but in reality it appears to be TB. That's quite a difference. Still seems heavy for text, but I assume the full text of every revision is in it, not just diffs.
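For what it's worth, the bit/byte mix-up is an eightfold difference:

    # Tb (terabits) vs TB (terabytes): 8 bits per byte.
    if_terabits = 5.6 / 8
    print(f"5.6 Tb would be only {if_terabits} TB, versus the 5.6 TB it actually is")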
anigbrowl about 15 years ago
Impressive... I wonder how big a content snapshot is, i.e. no article histories and no meta-material like talk pages or WP:xxx pages, just the user-facing content.

I was also sort of hoping to see from the stats what proportion of content was public-facing vs. devoted to arguments between wikipedians... if you look at the stats for 'most edited articles' (accessible from the top link), it's interesting that of the top 50 most edited articles, only one, 'George W. Bush', is user-facing - and I suspect that only made it in because of persistent vandalism.

Still, with history and all included, there is some fabulous data-mining potential here, with scope for some really innovative work. I'd hazard a guess that the size of Wikipedia already exceeds that of existing language corpora like the US Code...

/retreats into corner muttering about semantic engines and link-free concepts of total hypertext as necessary AI boot conditions
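Estimating that content snapshot is mostly a matter of streaming the dump, skipping non-article pages, and counting only the newest revision of each remaining page. A rough sketch with Python's standard library; the filename, the prefix list, and the assumption that revisions appear oldest-first within each page are illustrative, not taken from the dump's documentation:

    import bz2
    import xml.etree.ElementTree as ET

    # Hypothetical path; the real full-history dump is split across many files.
    DUMP_PATH = "enwiki-pages-meta-history.xml.bz2"

    # A few non-article title prefixes (illustrative; the real namespace list is
    # longer and is declared in the dump's <siteinfo> header).
    META_PREFIXES = ("Talk:", "User:", "User talk:", "Wikipedia:",
                     "Wikipedia talk:", "File:", "Template:", "Category:",
                     "Help:", "Portal:", "MediaWiki:")

    articles = 0
    snapshot_bytes = 0

    with bz2.open(DUMP_PATH, "rb") as dump:
        for _, elem in ET.iterparse(dump, events=("end",)):
            if not elem.tag.endswith("}page"):
                continue
            title = elem.findtext("{*}title") or ""
            if not title.startswith(META_PREFIXES):
                # Keep only the newest revision: that is the "content snapshot",
                # with no edit histories and no meta pages.
                revisions = elem.findall("{*}revision")
                if revisions:
                    text = revisions[-1].findtext("{*}text") or ""
                    articles += 1
                    snapshot_bytes += len(text.encode("utf-8"))
            elem.clear()  # drop the processed page so memory stays bounded

    print(f"{articles:,} articles, ~{snapshot_bytes / 1e9:.1f} GB of current wikitext")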
bprater about 15 years ago
One wonders if this will be the first file fed into something approximating machine consciousness. I'm not sure where else you can easily get such a high quantity of fairly consistent human-interest data.

Quick question: what do "bot-edited" entries refer to?
baddox about 15 years ago
I like the quick fix the site designer used to switch from a static layout to a fluid one.
kez about 15 years ago
Interesting, but I somehow doubt that many people have the setup to handle this amount of data.
helwr about 15 years ago
40 + 15 days to compress? How long would it take to decompress this thing?
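For what it's worth, LZMA (the codec behind 7z) is strongly asymmetric: high-preset compression is very slow, but decompression usually runs orders of magnitude faster, so reading the archive back should not take anywhere near those 40+15 days. A small timing sketch on synthetic data, purely illustrative:

    import lzma
    import os
    import time

    # A few MB of mixed compressible and incompressible data; the real dump is
    # roughly a million times larger.
    data = (b"wiki markup, revision after revision. " * 100_000) + os.urandom(2_000_000)

    t0 = time.perf_counter()
    packed = lzma.compress(data, preset=9)
    t1 = time.perf_counter()
    assert lzma.decompress(packed) == data
    t2 = time.perf_counter()

    print(f"compress   : {t1 - t0:.2f} s")
    print(f"decompress : {t2 - t1:.2f} s  (typically a small fraction of the compress time)")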