> 5.6 Tb uncompressed, 280 Gb in bz2 compression format, 32 Gb in 7z compression format

Wow, I didn't know 7z was this much better than bz2. Is this the expected result, or is there something special about Wikipedia that plays to the strengths of 7z?
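Partly the latter, I suspect: the full-history dump is mostly near-duplicate revisions of the same articles, and 7z's default codec (LZMA) keeps a dictionary tens of MB deep, so a repeated revision can be encoded as a reference back to the previous one, while bzip2 compresses independent ~900 KB blocks and never sees that redundancy. Here's a minimal sketch using Python's standard bz2 and lzma modules that reproduces the effect on synthetic data; the random bytes are only a stand-in for revision text, and the numbers are illustrative, not the dump's actual ratios.

```python
import bz2
import lzma
import os

# One "article revision": ~1 MiB of data that is unique within a single
# revision but repeated almost verbatim across revisions, which is roughly
# the shape of a full-history dump.
base = os.urandom(1024 * 1024)

revisions = []
for i in range(6):
    # Each revision is the previous text plus a small edit at the end.
    revisions.append(base + b" edit %d" % i)
data = b"".join(revisions)

# bzip2 compresses independent ~900 KB blocks, so it never notices that one
# revision repeats the previous one; LZMA (7z's default codec) keeps a
# dictionary of up to 64 MiB at preset 9 and encodes each repeat as a match.
bz2_size = len(bz2.compress(data, compresslevel=9))
lzma_size = len(lzma.compress(data, preset=9))

print(f"uncompressed: {len(data):>12,} bytes")
print(f"bz2 -9:       {bz2_size:>12,} bytes")
print(f"lzma -9:      {lzma_size:>12,} bytes")
```

On data like this, bz2 comes out barely smaller than the input while LZMA collapses the repeats to roughly the size of a single revision, which looks like the same effect behind the 280 GB vs 32 GB gap quoted above.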
The article says Tb, not TB, but in reality it appears to be TB (terabytes). That's an 8x difference. Still seems heavy for text, but I assume it contains the full text of every revision, not just diffs.
Impressive... I wonder how big a content-only snapshot is, i.e., no article histories and no meta-material like talk pages or WP:xxx pages, just the user-facing content.

I was also sort of hoping to see from the stats what proportion of content is public-facing vs. devoted to arguments between wikipedians. If you look at the stats for 'most edited articles' (accessible from the top link), it's interesting that of the top 50 most-edited articles, only one, 'George W. Bush', is user-facing, and I suspect that only made it in because of persistent vandalism.

Still, with history and all included, there is some fabulous data-mining potential here, and room to do some really innovative work. I'd hazard a guess that Wikipedia already exceeds existing language corpora like the US Code in size...

*/retreats into corner muttering about semantic engines and link-free concepts of total hypertext as necessary AI boot conditions*
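On the proportion question, one rough way to get at it is to stream a current-revisions dump and tally pages and wikitext size by MediaWiki namespace (0 is the article space, 1 is Talk:, 4 is Wikipedia:, and so on). A minimal Python sketch, assuming the current dump schema where each <page> carries an <ns> element (older dumps would need the namespace inferred from the title prefix); the filename is just a hypothetical placeholder:

```python
import bz2
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical local filename for a current-revisions dump (no history).
DUMP_PATH = "enwiki-pages-meta-current.xml.bz2"

def local_name(tag):
    """Drop the XML namespace URI, leaving just the element name."""
    return tag.rsplit("}", 1)[-1]

pages = Counter()   # page count per MediaWiki namespace
chars = Counter()   # wikitext size per namespace, in characters

with bz2.open(DUMP_PATH, "rb") as f:
    # Stream the dump so the whole file never has to fit in memory.
    for _, elem in ET.iterparse(f, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        ns, size = "?", 0
        for child in elem:
            name = local_name(child.tag)
            if name == "ns":
                ns = child.text
            elif name == "revision":
                text = next((c for c in child if local_name(c.tag) == "text"), None)
                if text is not None and text.text:
                    size = len(text.text)
        pages[ns] += 1
        chars[ns] += size
        elem.clear()  # free the subtree; a full-size run should also clear the root

# A rough content-vs-meta split, by page count and by wikitext volume.
print("pages:", pages.most_common(10))
print("chars:", chars.most_common(10))
```

Counting characters of wikitext rather than pages matters here, since talk and project pages tend to be far more numerous but much shorter than articles.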
One wonders if this will be the first file fed into something approximating machine consciousness. I'm not sure where else you can easily get such a high quantity of fairly consistent human-interest data.

Quick question: what do "bot-edited" entries refer to?