Full-history English Wikipedia dump produced: 5.6TB uncompressed, 32GB 7z'd

58 points by chl about 15 years ago

8 comments

philwelch about 15 years ago
Doesn't include deleted articles, so no hope if you want to recover one of them. This is a pity since Wikipedia deletes too many articles.
MikeCapone about 15 years ago
> 5.6 Tb uncompressed, 280 Gb in bz2 compression format, 32 Gb in 7z compression format

Wow, I didn't know 7z was this much better than bz2. Is this the expected result, or is there something special about Wikipedia that plays to the strengths of 7z?
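The gap is plausible for a full-history dump: consecutive revisions of an article are near-duplicates, and 7z's LZMA uses a dictionary of tens of megabytes (64 MB at the highest presets), so matches can reach back across many earlier revisions, while bzip2 only sees redundancy inside its ~900 KB blocks. A minimal sketch of the effect using Python's bz2 and lzma modules on synthetic data (random bytes stand in for article text, so the only compressible structure is the repetition between "revisions"):

```python
import bz2
import lzma
import os

# Fake a revision history: the same ~1 MB "article" repeated with a tiny edit
# appended each time. Random bytes mean neither compressor can exploit anything
# except the repetition across revisions.
article = os.urandom(1_000_000)
history = b"".join(article + b"edit %d\n" % i for i in range(20))

bz2_size = len(bz2.compress(history, compresslevel=9))   # 900 KB blocks
xz_size = len(lzma.compress(history, preset=9))          # 64 MB dictionary

print(f"raw:  {len(history):,} bytes")
print(f"bz2:  {bz2_size:,} bytes")
print(f"lzma: {xz_size:,} bytes")
```

On this kind of input bzip2 gains essentially nothing, while LZMA collapses the twenty copies to roughly the size of one; the same mechanism is at work in the 280 GB vs 32 GB figures above.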
jonknee about 15 years ago
The article says Tb not TB, but in reality it appears to be TB. That's quite a difference. Still seems heavy for text, but I assume the full text of every revision is in it, not just diffs.
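Just to make the bit/byte mix-up concrete (a factor of eight, on top of the compression question):

```python
TB = 5.6        # terabytes, which is what the dump size appears to mean
Tb = TB * 8     # the same figure expressed in terabits
print(f"{TB} TB = {Tb:.1f} Tb")   # 5.6 TB = 44.8 Tb
```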
anigbrowl about 15 years ago
Impressive... I wonder how big a content snapshot is, i.e. no article histories and no meta-material like talk pages or WP:xxx pages, just the user-facing content.

I was also sort of hoping to see from the stats what proportion of content is public-facing versus devoted to arguments between Wikipedians. If you look at the stats for 'most edited articles' (accessible from the top link), it's interesting that of the top 50 most edited articles only one, 'George W. Bush', is user-facing, and I suspect that only made it in because of persistent vandalism.

Still, with history and all included, there is some fabulous data-mining potential here, and the chance to do some really innovative work with it. I'd hazard a guess that the size of Wikipedia already exceeds that of existing language corpora like the US Code...

/retreats into corner muttering about semantic engines and link-free concepts of total hypertext as necessary AI boot conditions
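For what it's worth, the user-facing vs meta split is the kind of thing you can estimate by streaming the dump and bucketing page titles by namespace prefix. A rough sketch, assuming a local copy of the XML dump; the filename and prefix list are illustrative, and older dump schemas don't carry an explicit <ns> element, hence the title-prefix heuristic:

```python
import bz2
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical filename; any pages-meta-history / pages-articles XML dump parses the same way.
DUMP = "enwiki-pages-meta-history.xml.bz2"

# Title prefixes marking non-article (meta) namespaces -- illustrative, not exhaustive.
META_PREFIXES = ("Talk:", "User:", "User talk:", "Wikipedia:", "Wikipedia talk:",
                 "Template:", "Template talk:", "Category:", "Category talk:", "File:")

counts = Counter()
with bz2.open(DUMP, "rb") as f:
    # iterparse streams the XML, so the multi-terabyte file never has to fit in memory.
    for _, elem in ET.iterparse(f):
        tag = elem.tag.rsplit("}", 1)[-1]   # strip the XML namespace, if any
        if tag == "title":
            title = elem.text or ""
            counts["meta" if title.startswith(META_PREFIXES) else "article"] += 1
        elem.clear()  # drop parsed elements to keep memory roughly flat

print(counts)
```

Each <page> carries one <title> no matter how many revisions it has, so this counts pages rather than edits; counting edits per namespace would mean also watching for <revision> elements.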
bprater about 15 years ago
One wonders if this will be the first file fed into something approximating machine consciousness. I'm not sure where else you can easily get such a high quantity of fairly consistent human-interest data.

Quick question: what do "bot-edited" entries refer to?
baddox about 15 years ago
I like the quick fix the site designer used to switch from a static layout to a fluid one.
kez about 15 years ago
Interesting, but I somehow doubt that many people have the setup to handle this amount of data.
helwr about 15 years ago
40 + 15 days to compress? How long would it take to decompress this thing?
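LZMA is very asymmetric: decompression is typically an order of magnitude faster than compression, so unpacking should take nothing like the 55 days spent compressing. A quick illustration with Python's lzma module on synthetic data (not the dump itself):

```python
import lzma
import time

# Synthetic, mildly repetitive payload -- only meant to show the compress/decompress asymmetry.
data = ("The quick brown fox jumps over the lazy dog. " * 8).encode() * 50_000

t0 = time.perf_counter()
packed = lzma.compress(data, preset=9)
t1 = time.perf_counter()
lzma.decompress(packed)
t2 = time.perf_counter()

print(f"compress:   {t1 - t0:.2f} s")
print(f"decompress: {t2 - t1:.2f} s")
```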