科技回声

7 comments

mark_l_watson, nearly 12 years ago

Check out the Common Crawl contest-winning projects from the linked page. There is some very good work there, and it is a good source of ideas and techniques: http://commoncrawl.org/the-winners-of-the-norvig-web-data-science-award/

Some good stuff!

Aloisius, nearly 12 years ago

Link to the PDF mentioned: https://docs.google.com/file/d/1_9698uglerxB9nAglvaHkEgU-iZNm1TvVGuCW7245-WGvZq47teNpb_uL5N9/edit

sylvinus, nearly 12 years ago

Common Crawl is awesome. I wonder how complex it would be to run a Google-like frontend on top of it, and how good the results would be after a couple of days of hacking...

rgrieselhuber, nearly 12 years ago

Is there something, other than funding, preventing a more regular, open-sourced crawl of the web?

danso, nearly 12 years ago

The tables of TLD frequency on page 4 of the stats report are interesting, though they leave me somewhat confused about how the crawler actually crawls and when it stops: https://docs.google.com/file/d/1_9698uglerxB9nAglvaHkEgU-iZNm1TvVGuCW7245-WGvZq47teNpb_uL5N9/view?sle=true

Table 2a purports to show the frequency of SLDs:

     1  youtube.com             95,866,041  0.0250
     2  blogspot.com            45,738,134  0.0119
     3  tumblr.com              30,135,714  0.0079
     4  flickr.com               9,942,237  0.0026
     5  amazon.com               6,470,283  0.0017
     6  google.com               2,782,762  0.0007
     7  thefreedictionary.com    2,183,753  0.0006
     8  tripod.com               1,874,452  0.0005
     9  hotels.com               1,733,778  0.0005
    10  flightaware.com          1,280,875  0.0003

If I'm reading this correctly, the crawler managed to hit a huge number of YouTube video pages, but still only a fraction of them. I couldn't find a total YouTube video count, but YouTube's own stats page says that 200 million videos alone have been tagged with Content-ID (identified as belonging to movie/TV studios).

In any case, it's surprising not to see Wikipedia on there. English Wikipedia has 4+ million articles, so it should be ahead of thefreedictionary.com.
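One way to sanity-check the Table 2a figures quoted above is to divide each count by its listed fraction, which gives a rough estimate of the total the fractions were computed against. This is only a sketch: it assumes the last column is each domain's share of all crawled URLs (the report excerpt quoted here does not say so explicitly), and the fractions are rounded to four decimal places, so the estimates are approximate.

```python
# Rows copied from the Table 2a excerpt quoted in the comment above:
# (domain, page count, listed fraction).
rows = [
    ("youtube.com", 95_866_041, 0.0250),
    ("blogspot.com", 45_738_134, 0.0119),
    ("tumblr.com", 30_135_714, 0.0079),
]


def implied_total(count: int, fraction: float) -> float:
    """Estimate the corpus size implied by one row.

    Assumption (not confirmed by the source): fraction = count / total.
    """
    return count / fraction


estimates = {domain: implied_total(count, frac) for domain, count, frac in rows}

for domain, total in estimates.items():
    print(f"{domain}: implied total of about {total / 1e9:.2f} billion URLs")
```

Under that assumption the three rows agree on a total in the neighborhood of 3.8 billion URLs, which is at least consistent with the fractions being shares of the whole crawl.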

spimmy, nearly 12 years ago

What do you mean by "open"? Can the data be used for startups and other commercial purposes?

natch, nearly 12 years ago

How does one get set up to access the s3:// links their blog posts reference? I do realize these point to Amazon S3 buckets, but how do you get at them?
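As a rough illustration of what an s3:// link encodes: the part after `s3://` is a bucket name, and the rest is an object key. For buckets that allow anonymous reads, the same object is usually reachable over plain HTTPS at the virtual-hosted-style address `https://<bucket>.s3.amazonaws.com/<key>`. The sketch below only does that URL rewrite. The helper name `s3_to_https` and the bucket and key in the usage line are hypothetical, and whether a given Common Crawl bucket permits anonymous access is an assumption not confirmed in this thread; otherwise an AWS account and an S3 client would be needed.

```python
from urllib.parse import quote, urlparse


def s3_to_https(s3_url: str) -> str:
    """Rewrite s3://bucket/key as a virtual-hosted-style HTTPS URL.

    Hypothetical helper for illustration. It assumes the bucket allows
    anonymous reads and that the key contains no '?' or '#' characters.
    """
    parts = urlparse(s3_url)
    if parts.scheme != "s3" or not parts.netloc:
        raise ValueError(f"not an s3:// URL: {s3_url!r}")
    bucket = parts.netloc
    key = parts.path.lstrip("/")
    # Percent-encode the key but keep '/' so the path structure survives.
    return f"https://{bucket}.s3.amazonaws.com/{quote(key)}"


# Hypothetical bucket and key, used only to show the shape of the result.
print(s3_to_https("s3://example-bucket/some/prefix/file.gz"))
```

The resulting URL can be fetched with any HTTP client if the object is public. Bucket names that contain dots may fail TLS hostname checks in this form, which is one reason a dedicated S3 client is often the more dependable route.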

A Look Inside Our 210TB 2012 Web Corpus
