[lightly modified version of a comment I put on the article, as I love HN for discussion!]

Great article -- we're excited there's so much interest in the web as a dataset! I'm part of the team at Common Crawl and thought I'd clarify some points in the article.

The most important point is that you can download all the data Common Crawl provides completely for free, without paying S3 transfer fees or processing it only on an EC2 cluster. You don't even need an Amazon account! Our crawl archive blog posts give full details for downloading[1]. The main challenge then is storing it, as the full dataset is really quite large, but a number of universities have pulled down a significant portion onto their local clusters.

Also, we're performing the crawl once a month now. The monthly crawl archives are between 35 and 70 terabytes compressed. In total, we've crawled and stored over a quarter petabyte compressed, or 1.3 petabytes uncompressed, so far in 2014. (The archives go back to 2008.)

Comparing directly against the Internet Archive datasets is a bit like comparing apples to oranges. They store images and other types of binary content as well, whilst Common Crawl aims primarily for HTML, which compresses better. Also, the numbers quoted for the Internet Archive cover all of the crawls they've ever done, while ours were for a single month's crawl.

We're excited to see Martin use one of our crawl archives in his work -- seeing these experiments come to life is the best part of working at Common Crawl! I can confirm that optimizations will help you lower that EC2 figure. We can process a fairly intensive MR job over a standard crawl archive in an afternoon for about $30. Big data on a small budget is a top priority for us!

[1]: http://blog.commoncrawl.org/2014/11/october-2014-crawl-archive-available/
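
To make the free-download point concrete, here is a minimal Python sketch of pulling one archive over plain HTTPS. The host and paths-file location (https://data.commoncrawl.org/ and crawl-data/CC-MAIN-2014-42/warc.paths.gz) are my assumptions for illustration; the crawl announcement post linked above has the authoritative URLs.

    # Minimal sketch: list the WARC files for one crawl and download the first
    # one over plain HTTPS -- no AWS account needed.
    # The host and paths-file location below are assumptions; check the crawl
    # announcement post for the authoritative URLs.
    import gzip
    import urllib.request

    BASE = "https://data.commoncrawl.org/"                    # assumed public HTTPS mirror
    PATHS = BASE + "crawl-data/CC-MAIN-2014-42/warc.paths.gz"  # October 2014 crawl (assumed path)

    # The paths file is a gzipped text file listing every WARC file in the crawl.
    with urllib.request.urlopen(PATHS) as resp:
        warc_paths = gzip.decompress(resp.read()).decode().splitlines()

    print(len(warc_paths), "WARC files in this crawl")

    # Fetch the first archive (roughly 1 GB compressed) straight to disk.
    urllib.request.urlretrieve(BASE + warc_paths[0], "example.warc.gz")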
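
And a rough sketch of the per-record work inside a job like that $30 MR run, iterating over a single downloaded WARC file locally. It assumes the third-party warcio library, which is my choice for illustration rather than anything Common Crawl requires; the same loop body is what you would drop into a Hadoop/EMR mapper.

    # Sketch of the map step: count response records per domain in one WARC file.
    # Assumes `pip install warcio` -- the library is an assumption for illustration.
    from collections import Counter
    from urllib.parse import urlparse

    from warcio.archiveiterator import ArchiveIterator

    counts = Counter()
    with open("example.warc.gz", "rb") as stream:
        for record in ArchiveIterator(stream):    # handles the gzip transparently
            if record.rec_type != "response":
                continue                          # skip request/metadata records
            uri = record.rec_headers.get_header("WARC-Target-URI")
            if uri:
                counts[urlparse(uri).netloc] += 1

    for domain, n in counts.most_common(10):
        print(n, domain)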