While I know that some of the pages of my home page are in the crawl, they do not show up with the following query:
<a href="http://urlsearch.commoncrawl.org/?q=nl.utwente.cs.wwwhome%2F~hiemstra" rel="nofollow">http://urlsearch.commoncrawl.org/?q=nl.utwente.cs.wwwhome%2F...</a>
nor with:
<a href="http://urlsearch.commoncrawl.org/?q=nl.utwente.cs.wwwhome%2F%7Ehiemstra" rel="nofollow">http://urlsearch.commoncrawl.org/?q=nl.utwente.cs.wwwhome%2F...</a>
(no, this is not only an ego search problem ;-) )
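For anyone who wants to check that the two queries above really ask for the same key, here is a minimal sketch using only the Python standard library. It shows what standard percent-decoding does to each query string; how the URL Search server itself decodes or normalizes the `q` parameter is not something this can tell you.

```python
from urllib.parse import urlsplit, parse_qs

# The two query URLs from the comment above, copied verbatim.
QUERY_URLS = [
    "http://urlsearch.commoncrawl.org/?q=nl.utwente.cs.wwwhome%2F~hiemstra",
    "http://urlsearch.commoncrawl.org/?q=nl.utwente.cs.wwwhome%2F%7Ehiemstra",
]

def decoded_query(url):
    """Return the percent-decoded value of the `q` parameter."""
    return parse_qs(urlsplit(url).query)["q"][0]

decoded = [decoded_query(u) for u in QUERY_URLS]
for value in decoded:
    print(value)
```

Both lines decode to the same string, so if the lookup fails for both, the `~` versus `%7E` spelling is probably not the cause on the client side.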
I hope that some of you who use or play around with the Common Crawl data will try using the JSON files from the URL Search and then share your code.<p>If you didn't see the details in the blog post, Common Crawl is giving out $100 in AWS credit to the first five people who share code that incorporates a JSON file from the URL Search.
From @djoerd
Why does @CommonCrawl URL search (<a href="http://urlsearch.commoncrawl.org/" rel="nofollow">http://urlsearch.commoncrawl.org/</a>) need 'tld.domain' format rather than 'domain.tld'? Read Google's BigTable paper.
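To make the 'tld.domain' point concrete: the BigTable paper stores pages under reversed hostnames so that pages from the same domain end up next to each other in sorted key order. Below is a rough sketch of that idea in Python. The function names `reversed_host_key` and `url_search_query` are made up for illustration, the example page URL is an assumption inferred from the query strings earlier in the thread, and the exact key format the URL Search index uses internally is not confirmed here.

```python
from urllib.parse import urlsplit, quote

def reversed_host_key(url):
    """Build a 'tld.domain' style key: reversed hostname labels plus the path.

    Hypothetical helper; the real index may normalize URLs differently.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    reversed_host = ".".join(reversed(host.split(".")))
    return reversed_host + parts.path

def url_search_query(url, base="http://urlsearch.commoncrawl.org/"):
    """Build a query URL in the same shape as the ones posted in this thread."""
    # "/" is deliberately left out of `safe` so it is encoded as %2F.
    return base + "?q=" + quote(reversed_host_key(url), safe="~.")

# Assumed page URL, reconstructed from the query strings above.
page = "http://wwwhome.cs.utwente.nl/~hiemstra"
key = reversed_host_key(page)
query = url_search_query(page)

# Reversed keys keep hosts of one domain adjacent when sorted.
keys = sorted(
    reversed_host_key(u)
    for u in [
        "http://www.utwente.nl/",
        "http://urlsearch.commoncrawl.org/",
        "http://wwwhome.cs.utwente.nl/~hiemstra",
    ]
)
print(key)
print(query)
print(keys)
```

With plain 'domain.tld' keys, `www.utwente.nl` and `wwwhome.cs.utwente.nl` would be separated by every other host that sorts between them; with reversed keys they share the `nl.utwente.` prefix and sit together, which is what makes prefix lookups per domain cheap.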