TechEcho
A Look Inside Our 210TB 2012 Web Corpus

102 points by LisaG almost 12 years ago

7 comments

mark_l_watson almost 12 years ago
Check out the Common Crawl contest-winning projects from the linked page. There is some very good work there, and it is a good source of ideas and techniques: http://commoncrawl.org/the-winners-of-the-norvig-web-data-science-award/

Some good stuff!
Aloisius almost 12 years ago
Link to the PDF mentioned: https://docs.google.com/file/d/1_9698uglerxB9nAglvaHkEgU-iZNm1TvVGuCW7245-WGvZq47teNpb_uL5N9/edit
sylvinus almost 12 years ago
Common Crawl is awesome. I wonder how complex it would be to run a Google-like frontend on top of it, and how good the results would be after a couple days of hacking...
rgrieselhuber almost 12 years ago
Is there something, other than funding, preventing a more regular, open-sourced crawl of the web?
danso almost 12 years ago
The tables of TLD frequency on page 4 of the stats report are interesting, though they leave me confused about how the crawler actually crawls and when it stops: https://docs.google.com/file/d/1_9698uglerxB9nAglvaHkEgU-iZNm1TvVGuCW7245-WGvZq47teNpb_uL5N9/view?sle=true

Table 2a purports to show the frequency of SLDs:

 1  youtube.com            95,866,041  0.0250
 2  blogspot.com           45,738,134  0.0119
 3  tumblr.com             30,135,714  0.0079
 4  flickr.com              9,942,237  0.0026
 5  amazon.com              6,470,283  0.0017
 6  google.com              2,782,762  0.0007
 7  thefreedictionary.com   2,183,753  0.0006
 8  tripod.com              1,874,452  0.0005
 9  hotels.com              1,733,778  0.0005
10  flightaware.com         1,280,875  0.0003

If I'm reading this correctly, it seems that the crawler managed to hit a huge number of YouTube video pages, but only a fraction of them. I couldn't find a total YouTube video count, but YouTube's own stats page says 200 million videos alone have been tagged with Content-ID (identified as belonging to movie/TV studios).

In any case, it's surprising not to see Wikipedia on there. English Wikipedia has 4+ million articles, so it should be ahead of thefreedictionary.com.
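
For readers following the arithmetic in this comment, here is a rough sketch in Python using only the figures quoted above. It rests on two assumptions that the thread does not confirm: that the fourth column of Table 2a is each domain's share of all crawled pages, and that each crawled youtube.com page corresponds to one video.

```python
# Figures copied from the Table 2a row and the Content-ID number quoted above.
youtube_pages = 95_866_041
youtube_share = 0.0250           # ASSUMPTION: fourth column = share of all crawled pages
content_id_videos = 200_000_000  # videos tagged with Content-ID, per the comment

# If the share column is a fraction of the whole crawl, dividing the page
# count by it gives an estimate of the total corpus size. The share is
# rounded to four decimals, so this is only a ballpark figure.
implied_total = youtube_pages / youtube_share
print(f"implied corpus size: {implied_total:,.0f} pages")

# ASSUMPTION: one crawled page per video. Compare the crawl against the
# Content-ID subset alone, which is itself smaller than all of YouTube.
coverage = youtube_pages / content_id_videos
print(f"crawled pages vs. Content-ID videos: {coverage:.0%}")
```

Under those assumptions the implied corpus is roughly 3.8 billion pages, and the crawled youtube.com pages amount to less than half of the Content-ID subset alone, which is consistent with the "only a fraction" reading.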
spimmy almost 12 years ago
What do you mean by "open"? Can the data be used for startups and other commercial purposes?
natch almost 12 years ago
How does one get set up to access the s3:// links their blog posts reference? I do realize these point to Amazon S3 buckets, but how do I get at them?
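
One possible approach, sketched here and not taken from the thread: if a bucket permits anonymous reads, an s3:// URI can be rewritten as an ordinary HTTPS URL in S3's virtual-hosted style and fetched with any HTTP client. The helper name `s3_uri_to_https` and the bucket and key in the example are hypothetical placeholders, and whether a given bucket actually allows public access is an assumption you would need to check.

```python
from urllib.parse import quote, urlsplit


def s3_uri_to_https(uri: str) -> str:
    """Rewrite s3://bucket/key as a virtual-hosted-style HTTPS URL.

    ASSUMPTION: the bucket allows anonymous reads; private buckets
    need signed requests through an AWS SDK or CLI instead.
    """
    parts = urlsplit(uri)
    if parts.scheme != "s3" or not parts.netloc:
        raise ValueError(f"not an s3:// URI: {uri!r}")
    bucket = parts.netloc
    key = parts.path.lstrip("/")
    # Percent-encode the key but keep "/" so the path structure survives.
    return f"https://{bucket}.s3.amazonaws.com/{quote(key)}"


# Hypothetical bucket and key, for illustration only.
example_url = s3_uri_to_https("s3://example-bucket/path/to/file.txt")
print(example_url)
```

The resulting URL can be opened in a browser or downloaded with a tool such as curl. Bucket names containing dots can cause TLS certificate mismatches with this URL style, so treat the sketch as a starting point.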