TechEcho

7 comments

mark_l_watson almost 12 years ago

Check out the winning projects from the Common Crawl contest on the linked page. There is some very good work there, and it is a good source of ideas and techniques: http://commoncrawl.org/the-winners-of-the-norvig-web-data-science-award/ Some good stuff!

Aloisius almost 12 years ago

Link to the PDF mentioned: https://docs.google.com/file/d/1_9698uglerxB9nAglvaHkEgU-iZNm1TvVGuCW7245-WGvZq47teNpb_uL5N9/edit

sylvinus almost 12 years ago

Common Crawl is awesome. I wonder how complex it would be to run a Google-like frontend on top of it, and how good the results would be after a couple days of hacking...

rgrieselhuber almost 12 years ago

Is there something, other than funding, preventing a more regular, open-sourced crawl of the web?

danso almost 12 years ago

The tables of TLD frequency on page 4 of the stats report are interesting, though they leave me somewhat confused about how the crawler actually crawls and when it stops: https://docs.google.com/file/d/1_9698uglerxB9nAglvaHkEgU-iZNm1TvVGuCW7245-WGvZq47teNpb_uL5N9/view?sle=true

Table 2a purports to show the frequency of SLDs:

 1  youtube.com             95,866,041  0.0250
 2  blogspot.com            45,738,134  0.0119
 3  tumblr.com              30,135,714  0.0079
 4  flickr.com               9,942,237  0.0026
 5  amazon.com               6,470,283  0.0017
 6  google.com               2,782,762  0.0007
 7  thefreedictionary.com    2,183,753  0.0006
 8  tripod.com               1,874,452  0.0005
 9  hotels.com               1,733,778  0.0005
10  flightaware.com          1,280,875  0.0003

If I'm reading this correctly, it seems that the crawler managed to hit a huge number of YouTube video pages, but only a fraction of them. I couldn't find a total YouTube video count, but YouTube's own stats page says that 200 million videos alone have been tagged with Content ID (identified as belonging to movie/TV studios).

In any case, it's surprising not to see Wikipedia on there. English Wikipedia has 4+ million articles, so it should be ahead of thefreedictionary.com.
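
As a rough sanity check on the figures quoted above, each row's count can be divided by its fraction to estimate the total that the fractions are measured against. This is only a minimal Python sketch: it assumes the last column is each domain's share of the whole crawl, which the quoted excerpt does not state explicitly, and the names `ROWS` and `implied_total` are made up for illustration.

```python
# Rows copied from Table 2a as quoted in the comment: (domain, count, fraction).
# Assumption: "fraction" means count / total for the whole crawl.
ROWS = [
    ("youtube.com", 95_866_041, 0.0250),
    ("blogspot.com", 45_738_134, 0.0119),
    ("tumblr.com", 30_135_714, 0.0079),
    ("flickr.com", 9_942_237, 0.0026),
    ("amazon.com", 6_470_283, 0.0017),
    ("google.com", 2_782_762, 0.0007),
    ("thefreedictionary.com", 2_183_753, 0.0006),
    ("tripod.com", 1_874_452, 0.0005),
    ("hotels.com", 1_733_778, 0.0005),
    ("flightaware.com", 1_280_875, 0.0003),
]


def implied_total(count: int, fraction: float) -> float:
    """Estimate the total that `fraction` is measured against."""
    return count / fraction


if __name__ == "__main__":
    for domain, count, fraction in ROWS:
        total_billions = implied_total(count, fraction) / 1e9
        print(f"{domain:<24}{count:>12,}  ~{total_billions:.2f} billion")
```

Under that assumption, the five largest rows all imply a total of roughly 3.8 billion. The smaller rows scatter more widely because their fractions are rounded to a single significant digit.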

spimmy almost 12 years ago

What do you mean by "open"? Can the data be used for startups and other commercial purposes?

natch almost 12 years ago

How does one get set up to access the s3:// links that their blog posts reference? I realize these point to Amazon S3 buckets, but how do I get at them?

A Look Inside Our 210TB 2012 Web Corpus
