I really wanted to love the Common Crawl corpus. I needed an excuse to play with EC2, I had a project idea that would benefit an open source project (Mozilla's pdf.js), and I had an AWS gift card with $100 of value on it. But when I actually got to work, I found the choice of Hadoop sequence files containing JSON documents for the crawl metadata absolutely maddening and slammed headfirst into an undocumented gotcha that ultimately killed the project: the documents in the corpus are truncated at ~512 kilobytes.<p>It looks like they've fixed the first problem by switching to gzipped WARC files, but I can't find any information about whether or not they're still truncating documents in the archive. I guess I'll have to give it another look and see...
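If anyone wants to check the truncation question for themselves, here's a rough sketch (the local filename is made up, and it assumes you've already pulled down a single gzipped WARC segment) that counts response bodies clustering just under the 512 KiB mark, which would hint at truncation:

```python
# Rough truncation check over one locally downloaded WARC segment.
# The filename is a placeholder. Requires the warcio library: pip install warcio
from warcio.archiveiterator import ArchiveIterator

near_limit = 0
total = 0

with open('CC-MAIN-example.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != 'response':
            continue
        total += 1
        body = record.content_stream().read()
        # Bodies bunched just under 512 KiB would suggest a hard cutoff.
        if 500_000 <= len(body) <= 512 * 1024:
            near_limit += 1

print(f"{near_limit} of {total} responses sit just under the 512 KiB mark")
```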
I love Common Crawl, but as I commented before, I still want to see a subset available for download, something like the top million sites. Perhaps a few tiers of data, say 50GB, 100GB, and 200GB.<p>I really think a subset like this would increase the value, as it would allow people writing search engines (for fun or profit) to suck a copy down locally and work away. It's something I would like to do for sure.
Very cool... though I have to say, CC is a constant reminder that whatever you put on the Internet will basically remain in the public eye for the perpetuity of electronic communication. There are ways to remove your (owned) content from archive.org and Google... but once some other independent scraper catches it, you can't really do much about it.
It would be great if Common Crawl (or anyone else) also released a document-term index for its data. If you had an index, you could do a lot more with this data.
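For anyone unfamiliar, a document-term (inverted) index just maps each term to the set of documents that contain it. A toy sketch of the idea (the tokenizer and sample documents here are placeholders, not anything Common Crawl provides):

```python
# Toy inverted index: term -> set of document IDs containing that term.
# Sample documents and tokenizer are placeholders for illustration only.
import re
from collections import defaultdict

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

docs = {
    "doc1": "Common Crawl releases web crawl data",
    "doc2": "An inverted index maps terms to documents",
}
index = build_index(docs)
print(sorted(index["crawl"]))  # ['doc1']
```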
Anyone have a good understanding of the difference between this and <a href="http://www.dotnetdotcom.org/" rel="nofollow">http://www.dotnetdotcom.org/</a>? I've seen Dotbot in my access logs more than CommonCrawl, so I'm more inclined to believe they have a wider - but not deeper - spread.
Anybody want to take a guess at what percentage of the total surface web (at least) these 2B pages represent? I can't find reliable figures; the numbers are all over the place. 5 percent?
Can anyone give me a quick rundown on how exactly one gains access to all of this data?<p>I have heard about this project numerous times, and am always dissuaded by the lack of download links/torrents/information on their homepage.<p>Perhaps I just don't know what I'm looking at?
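As far as I understand it, the crawl is published as a public dataset on S3, so anything that speaks the S3 API can list and fetch the files. A minimal listing sketch using boto3 with anonymous (unsigned) requests; the bucket name and prefix here are my assumptions, so check their site for the authoritative paths:

```python
# Sketch: list a handful of crawl files from a public S3 bucket using
# anonymous access. The bucket name and prefix are assumptions; see
# commoncrawl.org for the real layout.
import boto3
from botocore import UNSIGNED
from botocore.config import Config

s3 = boto3.client('s3', config=Config(signature_version=UNSIGNED))
resp = s3.list_objects_v2(Bucket='commoncrawl', Prefix='crawl-data/', MaxKeys=10)

for obj in resp.get('Contents', []):
    print(obj['Key'], obj['Size'])
```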
I've yet to find an excuse to download some of this data to play with. I have a feeling my ISP will send around a bunch of suits to collect the bill in person if I ever go over my 500GB monthly limit by downloading 102TB of data, haha. I would still like to download a subset of the data; from what I've read, that kind of idea is already in the works. I just can't think of what I would do with it, perhaps a machine-learning-based project.
<i>We have switched the metadata files from JSON to WAT files. The JSON format did not allow specifying the multiple offsets to files necessary for the WARC upgrade and WAT files provide more detail.</i><p>Where can I read more about this?
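For what it's worth, WAT files are themselves WARC-format containers whose metadata records carry a JSON envelope describing each crawled response, which is how they can hold the multiple offsets mentioned above. A rough reading sketch (the filename is made up and the envelope key names are from memory, so verify against a real file):

```python
# Sketch: iterate WAT metadata records and print the URI each one
# describes. Filename and envelope key names are assumptions.
import json
from warcio.archiveiterator import ArchiveIterator

with open('CC-MAIN-example.warc.wat.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != 'metadata':
            continue
        envelope = json.loads(record.content_stream().read())
        headers = envelope.get('Envelope', {}).get('WARC-Header-Metadata', {})
        print(headers.get('WARC-Target-URI'))
```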
Is there a way we can access the data (a small subset, say 30-40GB) without having an AWS account? It requires a credit card, and as a student I don't have one.