A Look Inside Our 210TB 2012 Web Corpus

102 points by LisaG, almost 12 years ago

7 comments

mark_l_watson, almost 12 years ago
Check out the Common Crawl contest winning projects from the linked page. There is some very good work there, and it is a good source of ideas and techniques: http://commoncrawl.org/the-winners-of-the-norvig-web-data-science-award/

Some good stuff!
Aloisius, almost 12 years ago
Link to the PDF mentioned: https://docs.google.com/file/d/1_9698uglerxB9nAglvaHkEgU-iZNm1TvVGuCW7245-WGvZq47teNpb_uL5N9/edit
sylvinus, almost 12 years ago
Common Crawl is awesome. I wonder how complex it would be to run a Google-like frontend on top of it, and how good the results would be after a couple of days of hacking...
rgrieselhuber, almost 12 years ago
Is there something, other than funding, preventing a more regular, open-sourced crawl of the web?
danso, almost 12 years ago
The tables of TLD frequency on page 4 of the stats report are interesting, though they leave me confused about how the crawler actually crawls and when it stops: https://docs.google.com/file/d/1_9698uglerxB9nAglvaHkEgU-iZNm1TvVGuCW7245-WGvZq47teNpb_uL5N9/view?sle=true

Table 2a purports to show the frequency of SLDs:

1   youtube.com            95,866,041   0.0250
2   blogspot.com           45,738,134   0.0119
3   tumblr.com             30,135,714   0.0079
4   flickr.com              9,942,237   0.0026
5   amazon.com              6,470,283   0.0017
6   google.com              2,782,762   0.0007
7   thefreedictionary.com   2,183,753   0.0006
8   tripod.com              1,874,452   0.0005
9   hotels.com              1,733,778   0.0005
10  flightaware.com         1,280,875   0.0003

If I'm reading this correctly, it seems that the crawler managed to hit a huge number of YouTube video pages, but still only a fraction of them. I couldn't find a total YouTube video count, but YouTube's own stats page says 200 million videos alone have been tagged with Content-ID (identified as belonging to movie/TV studios).

In any case, it's surprising not to see Wikipedia on there. English Wikipedia has 4+ million articles, so it should be ahead of thefreedictionary.com.
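
One way to sanity-check the reading of Table 2a is to divide each page count by its fraction and see whether the rows agree on a corpus size. The sketch below does that for the top three rows quoted above. It assumes the last column is each domain's share of all crawled pages, which the comment does not state explicitly, and the function name `implied_total` is made up for illustration.

```python
# Rows copied from Table 2a as quoted above: (domain, page count, fraction).
# Assumption: the last column is each domain's share of all crawled pages.
ROWS = [
    ("youtube.com", 95_866_041, 0.0250),
    ("blogspot.com", 45_738_134, 0.0119),
    ("tumblr.com", 30_135_714, 0.0079),
]


def implied_total(count: int, fraction: float) -> float:
    """Estimate the corpus size for which `count` pages make up `fraction` of it."""
    return count / fraction


# One estimate per row; the fractions are rounded, so the estimates differ slightly.
estimates = {domain: implied_total(count, frac) for domain, count, frac in ROWS}

if __name__ == "__main__":
    for domain, total in estimates.items():
        print(f"{domain:15s} implies about {total / 1e9:.2f} billion pages")
```

Under that assumption, all three rows point to a corpus of roughly 3.8 billion pages, which suggests the column is a share of the whole crawl rather than of any one site. The lower rows are less useful for this check because their fractions are rounded to one significant digit.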
spimmy, almost 12 years ago
What do you mean by "open"? Can the data be used for startups and other commercial purposes?
natch, almost 12 years ago
How does one get set up to access the s3:// links their blog posts reference? I do realize these point to Amazon S3 buckets, but how do I get at them?
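
As a rough illustration of what an s3:// link encodes: the part after `s3://` is the bucket name, and the rest is the object key. For objects that allow anonymous reads, the same object can usually be fetched over HTTPS using S3's virtual-hosted-style URL. The sketch below only does the string mapping. The bucket and key are hypothetical placeholders, and whether the buckets in the blog posts allow anonymous access is not stated in this thread; private or requester-pays buckets need signed requests through an AWS client instead.

```python
from urllib.parse import urlparse


def s3_uri_to_https(uri: str) -> str:
    """Map s3://bucket/key to the virtual-hosted-style HTTPS URL for the same object."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an s3:// URI: {uri!r}")
    bucket = parsed.netloc          # first component after s3://
    key = parsed.path.lstrip("/")   # everything after the bucket name
    return f"https://{bucket}.s3.amazonaws.com/{key}"


# Hypothetical bucket and key, for illustration only.
example = s3_uri_to_https("s3://example-bucket/path/to/segment.arc.gz")

if __name__ == "__main__":
    print(example)
```

The resulting URL can be tried with any HTTP client. If it returns an access-denied error, the object is not publicly readable and AWS credentials are required.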