Full-history English Wikipedia dump produced: 5.6TB uncompressed, 32GB 7z'd

58 points by chl about 15 years ago

8 comments

philwelch about 15 years ago
Doesn't include deleted articles, so no hope if you want to recover one of them. This is a pity since Wikipedia deletes too many articles.
MikeCapone about 15 years ago
> 5.6 Tb uncompressed, 280 Gb in bz2 compression format, 32 Gb in 7z compression format

Wow, I didn't know 7z was this much better than bz2. Is this the expected result, or is there something special about Wikipedia that plays to the strengths of 7z?
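The gap is plausible for a full-history dump: consecutive revisions of an article are near-duplicates, and 7z's LZMA uses a dictionary of tens of megabytes (64 MB at the highest presets), so matches can reach back across many earlier revisions, while bzip2 only sees redundancy inside its ~900 KB blocks. A minimal sketch of the effect using Python's bz2 and lzma modules on synthetic data (random bytes stand in for article text, so the only compressible structure is the repetition between "revisions"):

```python
import bz2
import lzma
import os

# Fake a revision history: the same ~1 MB "article" repeated with a tiny edit
# appended each time. Random bytes mean neither compressor can exploit anything
# except the repetition across revisions.
article = os.urandom(1_000_000)
history = b"".join(article + b"edit %d\n" % i for i in range(20))

bz2_size = len(bz2.compress(history, compresslevel=9))   # 900 KB blocks
xz_size = len(lzma.compress(history, preset=9))          # 64 MB dictionary

print(f"raw:  {len(history):,} bytes")
print(f"bz2:  {bz2_size:,} bytes")
print(f"lzma: {xz_size:,} bytes")
```

On this kind of input bzip2 gains essentially nothing, while LZMA collapses the twenty copies to roughly the size of one; the same mechanism is at work in the 280 GB vs 32 GB figures above.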
jonknee about 15 years ago
The article says Tb not TB, but in reality it appears to be TB. That's quite a difference. Still seems heavy for text, but I assume the full text of every revision is in it, not just diffs.
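Just to make the bit/byte mix-up concrete (a factor of eight, on top of the compression question):

```python
TB = 5.6        # terabytes, which is what the dump size appears to mean
Tb = TB * 8     # the same figure expressed in terabits
print(f"{TB} TB = {Tb:.1f} Tb")   # 5.6 TB = 44.8 Tb
```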
anigbrowl about 15 years ago
Impressive... I wonder how big a content snapshot is, i.e. no article histories and no meta-material like talk pages or WP:xxx pages, just the user-facing content.

I was also sort of hoping to see from the stats what proportion of content is public-facing versus devoted to arguments between Wikipedians. If you look at the stats for 'most edited articles' (accessible from the top link), it's interesting that of the top 50 most edited articles only one, 'George W. Bush', is user-facing, and I suspect that only made it in because of persistent vandalism.

Still, with history and all included, there is some fabulous data-mining potential here, and the chance to do some really innovative work with it. I'd hazard a guess that the size of Wikipedia already exceeds that of existing language corpora like the US Code...

/retreats into corner muttering about semantic engines and link-free concepts of total hypertext as necessary AI boot conditions
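For what it's worth, the user-facing vs meta split is the kind of thing you can estimate by streaming the dump and bucketing page titles by namespace prefix. A rough sketch, assuming a local copy of the XML dump; the filename and prefix list are illustrative, and older dump schemas don't carry an explicit <ns> element, hence the title-prefix heuristic:

```python
import bz2
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical filename; any pages-meta-history / pages-articles XML dump parses the same way.
DUMP = "enwiki-pages-meta-history.xml.bz2"

# Title prefixes marking non-article (meta) namespaces -- illustrative, not exhaustive.
META_PREFIXES = ("Talk:", "User:", "User talk:", "Wikipedia:", "Wikipedia talk:",
                 "Template:", "Template talk:", "Category:", "Category talk:", "File:")

counts = Counter()
with bz2.open(DUMP, "rb") as f:
    # iterparse streams the XML, so the multi-terabyte file never has to fit in memory.
    for _, elem in ET.iterparse(f):
        tag = elem.tag.rsplit("}", 1)[-1]   # strip the XML namespace, if any
        if tag == "title":
            title = elem.text or ""
            counts["meta" if title.startswith(META_PREFIXES) else "article"] += 1
        elem.clear()  # drop parsed elements to keep memory roughly flat

print(counts)
```

Each <page> carries one <title> no matter how many revisions it has, so this counts pages rather than edits; counting edits per namespace would mean also watching for <revision> elements.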
bprater about 15 years ago
One wonders if this will be the first file fed into something approximating machine consciousness. I'm not sure where else you can easily get such a high quantity of fairly consistent human-interest data.

Quick question: what do "bot-edited" entries refer to?
baddox about 15 years ago
I like the quick fix the site designer used to switch from a static layout to a fluid one.
kez about 15 years ago
Interesting, but I somehow doubt that many people have the setup to handle this amount of data.
helwr about 15 years ago
40 + 15 days to compress? How long would it take to decompress this thing?
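LZMA is very asymmetric: decompression is typically an order of magnitude faster than compression, so unpacking should take nothing like the 55 days spent compressing. A quick illustration with Python's lzma module on synthetic data (not the dump itself):

```python
import lzma
import time

# Synthetic, mildly repetitive payload -- only meant to show the compress/decompress asymmetry.
data = ("The quick brown fox jumps over the lazy dog. " * 8).encode() * 50_000

t0 = time.perf_counter()
packed = lzma.compress(data, preset=9)
t1 = time.perf_counter()
lzma.decompress(packed)
t2 = time.perf_counter()

print(f"compress:   {t1 - t0:.2f} s")
print(f"decompress: {t2 - t1:.2f} s")
```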