
102TB of New Crawl Data Available

237 points by LisaG, over 11 years ago

15 comments

rwg, over 11 years ago
I really wanted to love the Common Crawl corpus. I needed an excuse to play with EC2, I had a project idea that would benefit an open source project (Mozilla's pdf.js), and I had an AWS gift card with $100 of value on it. But when I actually got to work, I found the choice of Hadoop sequence files containing JSON documents for the crawl metadata absolutely maddening, and I slammed headfirst into an undocumented gotcha that ultimately killed the project: the documents in the corpus are truncated at ~512 kilobytes.

It looks like they've fixed the first problem by switching to gzipped WARC files, but I can't find any information about whether or not they're still truncating documents in the archive. I guess I'll have to give it another look and see...
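For anyone who wants to check the truncation question directly, here is a minimal sketch (not official Common Crawl tooling) that streams a gzipped WARC file with the warcio library and flags records that carry a WARC-Truncated header or sit at the ~512 KB mark; the file name is a placeholder, not a real crawl path.

```python
# Minimal sketch: scan a gzipped WARC file and report possibly truncated records.
# Assumes the `warcio` library; "example.warc.gz" is a hypothetical local file.
from warcio.archiveiterator import ArchiveIterator

TRUNCATION_GUESS = 512 * 1024  # the ~512 KB cutoff described above

with open("example.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != "response":
            continue
        body = record.content_stream().read()
        # The WARC spec defines a WARC-Truncated header; check it and the body size.
        truncated = record.rec_headers.get_header("WARC-Truncated")
        if truncated or len(body) >= TRUNCATION_GUESS:
            print(record.rec_headers.get_header("WARC-Target-URI"), len(body), truncated)
```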
boyter, over 11 years ago
I love Common Crawl, but as I commented before, I still want to see a subset available for download, something like the top million sites. Certainly a few tiers of data, say 50GB, 100GB, and 200GB.

I really think a subset like this would increase the value, as it would allow people writing search engines (for fun or profit) to suck a copy down locally and work away. It's something I would like to do for sure.
kohanz, over 11 years ago
I'm curious to hear how people are using Common Crawl data.
danso, over 11 years ago
Very cool... though I have to say, CC is a constant reminder that whatever you put on the Internet will basically remain in the public eye for the perpetuity of electronic communication. There exist ways to remove your (owned) content from archive.org and Google... but once some other independent scraper catches it, you can't really do much about it.
rb2k_, over 11 years ago
Is there an easy way to grab JUST a list of unique domains?

That would be a great starter for all sorts of fun little weekend experiments.
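One rough way to get there, sketched below under the assumption that you already have a crawl segment on disk (the file name is a placeholder): collect the WARC-Target-URI of each record and keep the unique hostnames. A complete list would mean running this over every segment, or over a URL index if one is available.

```python
# Rough sketch: extract unique domains from a single WARC/WAT file.
# Assumes the `warcio` library; "example.warc.gz" is a placeholder path.
from urllib.parse import urlparse
from warcio.archiveiterator import ArchiveIterator

domains = set()
with open("example.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        uri = record.rec_headers.get_header("WARC-Target-URI")
        if uri:
            domains.add(urlparse(uri).netloc.lower())

for domain in sorted(domains):
    print(domain)
```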
ma2rten, over 11 years ago
It would be great if Common Crawl (or anyone else) would also release a document-term index for its data. If you had an index, you could do a lot more things with this data.
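As a toy illustration of what such an index would enable, the sketch below builds a tiny in-memory document-term (inverted) index over a couple of hypothetical pages; an index over the full corpus would of course need something like Lucene or a Hadoop job rather than a Python dict.

```python
# Toy document-term (inverted) index: term -> set of document URLs.
# The documents here are hypothetical, purely for illustration.
import re
from collections import defaultdict

def build_index(docs):
    """docs: dict mapping URL -> plain text; returns term -> set of URLs."""
    index = defaultdict(set)
    for url, text in docs.items():
        for term in set(re.findall(r"[a-z0-9]+", text.lower())):
            index[term].add(url)
    return index

docs = {
    "http://example.com/a": "common crawl releases new data",
    "http://example.com/b": "new WARC files replace the old format",
}
index = build_index(docs)
print(sorted(index["new"]))  # URLs whose text contains the term "new"
```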
ecaron, over 11 years ago
Anyone have a good understanding of the difference between this and http://www.dotnetdotcom.org/? I've seen Dotbot in my access logs more than CommonCrawl, so I'm more inclined to believe they have a wider (but not deeper) spread.
recuter, over 11 years ago
Anybody want to take a guess at what percentage these 2B pages represent of the total surface web, at least? I can't find reliable figures; the numbers are all over the place. 5 percent?
GigabyteCoin, over 11 years ago
Can anyone give me a quick rundown on how exactly one gains access to all of this data?

I have heard about this project numerous times, and am always dissuaded by the lack of download links/torrents/information on their homepage.

Perhaps I just don't know what I'm looking at?
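The crawl is published in a public S3 bucket, so one hedged sketch of browsing it is shown below; the bucket name "commoncrawl" and the "crawl-data/" prefix reflect the current public layout and are assumptions, not details from this announcement. Anonymous (unsigned) requests work, so listing and downloading doesn't require an AWS bill.

```python
# Minimal sketch: list the top-level crawl prefixes anonymously with boto3.
# Bucket name and prefix are assumptions about the current public layout.
import boto3
from botocore import UNSIGNED
from botocore.config import Config

s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
response = s3.list_objects_v2(Bucket="commoncrawl", Prefix="crawl-data/", Delimiter="/")
for prefix in response.get("CommonPrefixes", []):
    print(prefix["Prefix"])  # one entry per crawl release
```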
DigitalSea, over 11 years ago
I've yet to find an excuse to download some of this data to play with. I have a feeling my ISP will personally send around a bunch of suits to collect the bill payment in person if I were ever to go over my 500GB monthly limit by downloading 102TB of data, haha. I would still like to download a subset of the data; from what I've read, that kind of idea is apparently already in the works. I just can't possibly think of what I would do, perhaps a machine-learning-based project.
sirsar, over 11 years ago
"We have switched the metadata files from JSON to WAT files. The JSON format did not allow specifying the multiple offsets to files necessary for the WARC upgrade, and WAT files provide more detail."

Where can I read more about this?
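Short of official documentation, one way to see what a WAT record actually contains is to open one and inspect its JSON payload; a small sketch follows, assuming the warcio library and a placeholder file name.

```python
# Small sketch: peek inside WAT records, whose payloads are JSON metadata
# describing the corresponding WARC records. File name is a placeholder.
import json
from warcio.archiveiterator import ArchiveIterator

with open("example.warc.wat.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != "metadata":
            continue
        envelope = json.loads(record.content_stream().read())
        print(list(envelope.keys()))  # inspect the top-level JSON structure
        break
```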
iamtechaddict, over 11 years ago
Is there a way to access the data (a small subset, say 30-40GB) without having an AWS account (it requires a credit card, and as a student I don't have one)?
kordless, over 11 years ago
Ah, distributed crawling. What a great idea. :)
csmuk, over 11 years ago
Well that would take 3.5 years to download on my Internet connection!
manismku, over 11 years ago
That's great and cool stuff.