102TB of New Crawl Data Available

237 points by LisaG over 11 years ago

15 comments

rwg over 11 years ago
I really wanted to love the Common Crawl corpus. I needed an excuse to play with EC2, I had a project idea that would benefit an open source project (Mozilla's pdf.js), and I had an AWS gift card with $100 of value on it. But when I actually got to work, I found the choice of Hadoop sequence files containing JSON documents for the crawl metadata absolutely maddening and slammed headfirst into an undocumented gotcha that ultimately killed the project: the documents in the corpus are truncated at ~512 kilobytes.

It looks like they've fixed the first problem by switching to gzipped WARC files, but I can't find any information about whether or not they're still truncating documents in the archive. I guess I'll have to give it another look and see...
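A minimal sketch of how one might check for the ~512 KB truncation described above in the new gzipped WARC files. It assumes the third-party warcio library and a placeholder local file name (sample.warc.gz); neither comes from the thread itself.

```python
# Sketch: scan HTTP response records in a gzipped WARC file and flag payloads
# that sit at or above the ~512 KB mark, plus any records carrying the
# standard WARC-Truncated header. `sample.warc.gz` is a placeholder path.
from warcio.archiveiterator import ArchiveIterator

TRUNCATION_GUESS = 512 * 1024  # the ~512 KB cutoff described in the comment

with open("sample.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != "response":
            continue
        payload = record.content_stream().read()
        truncated_flag = record.rec_headers.get_header("WARC-Truncated")
        if truncated_flag or len(payload) >= TRUNCATION_GUESS:
            uri = record.rec_headers.get_header("WARC-Target-URI")
            print(f"possibly truncated ({len(payload)} bytes, "
                  f"WARC-Truncated={truncated_flag}): {uri}")
```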
boyter over 11 years ago
I love Common Crawl, but as I commented before, I still want to see a subset available for download, something like the top million sites. Certainly a few steps of data, say 50GB, 100GB, and 200GB.

I really think a subset like this would increase the value, as it would allow people writing search engines (for fun or profit) to suck a copy down locally and work away. It's something I would like to do for sure.
kohanz over 11 years ago
I'm curious to hear how people are using Common Crawl data.
danso over 11 years ago
Very cool... though I have to say, CC is a constant reminder that whatever you put on the Internet will basically remain in the public eye for the perpetuity of electronic communication. There are ways to remove your (owned) content from archive.org and Google... but once some other independent scraper catches it, you can't really do much about it.
rb2k_ over 11 years ago
Is there an easy way to grab JUST a list of unique domains?

That would be a great starter for all sorts of fun little weekend experiments.
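A quick sketch of one way to pull just the unique hostnames out of a single crawl file, again assuming the warcio library and a placeholder local file; the full corpus would require running this across many files.

```python
# Sketch: collect unique hostnames from the WARC-Target-URI header of each
# record in one crawl file. `sample.warc.gz` is a placeholder path.
from urllib.parse import urlparse

from warcio.archiveiterator import ArchiveIterator

domains = set()
with open("sample.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        uri = record.rec_headers.get_header("WARC-Target-URI")
        if uri:
            host = urlparse(uri).hostname
            if host:
                domains.add(host.lower())

for domain in sorted(domains):
    print(domain)
```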
ma2rten over 11 years ago
It would be great if Common Crawl (or anyone else) would also release a document-term index for its data. If you had an index, you could do a lot more things with this data.
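For illustration, the document-term (inverted) index being asked for boils down to mapping each term to the documents that contain it. The toy version below is purely hypothetical and in-memory; an index over the full corpus would need a distributed build rather than a single dict.

```python
# Toy inverted index: term -> set of document IDs containing that term.
import re
from collections import defaultdict

def build_index(docs):
    """docs: mapping of doc_id -> raw text."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in set(re.findall(r"[a-z0-9]+", text.lower())):
            index[term].add(doc_id)
    return index

docs = {
    "doc1": "Common Crawl releases new WARC data",
    "doc2": "building a search engine from crawl data",
}
index = build_index(docs)
print(sorted(index["crawl"]))  # -> ['doc1', 'doc2']
```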
ecaron over 11 years ago
Anyone have a good understanding of the difference between this and http://www.dotnetdotcom.org/? I've seen Dotbot in my access logs more than CommonCrawl, so I'm more inclined to believe they have a wider - but not deeper - spread.
recuter over 11 years ago
Anybody want to take a guess at what percentage these 2B pages represent of the total surface web, at least? I can't find reliable figures; the numbers are all over the place. 5 percent?
GigabyteCoin over 11 years ago
Can anyone give me a quick rundown on how exactly one gains access to all of this data?

I have heard about this project numerous times, and am always dissuaded by the lack of download links/torrents/information on their homepage.

Perhaps I just don't know what I'm looking at?
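For what it's worth, the data is served from a public S3 bucket that allows anonymous (unsigned) requests. A hedged sketch using boto3 follows; the bucket name and prefix reflect the present-day public layout rather than anything stated in this thread, so treat them as assumptions and check the project's own Get Started page for current paths.

```python
# Sketch: list a handful of objects from the public Common Crawl bucket
# without AWS credentials. Bucket and prefix names are assumptions based on
# the current public layout, not taken from this thread.
import boto3
from botocore import UNSIGNED
from botocore.config import Config

s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
resp = s3.list_objects_v2(Bucket="commoncrawl", Prefix="crawl-data/", MaxKeys=10)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```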
DigitalSea over 11 years ago
I've yet to find an excuse to download some of this data to play with. I have a feeling my ISP will personally send around a bunch of suits to collect the bill payment in person if I ever go over my 500GB monthly limit by downloading 102TB of data, haha. I would still like to download a subset of the data; from what I've read, that kind of idea is already in the works. I just can't think of what I would do with it, though. Perhaps a machine-learning-based project.
sirsar over 11 years ago
"We have switched the metadata files from JSON to WAT files. The JSON format did not allow specifying the multiple offsets to files necessary for the WARC upgrade and WAT files provide more detail."

Where can I read more about this?
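WAT files are themselves WARC-structured, with each metadata record carrying a JSON payload, so they can be inspected with the same tooling. A minimal sketch, assuming the warcio library and a placeholder local WAT file name:

```python
# Sketch: read one metadata record from a WAT file and show the top level of
# its JSON payload, without assuming a particular deep schema.
import json

from warcio.archiveiterator import ArchiveIterator

with open("sample.warc.wat.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != "metadata":
            continue
        meta = json.loads(record.content_stream().read())
        print(record.rec_headers.get_header("WARC-Target-URI"), list(meta.keys()))
        break
```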
iamtechaddict over 11 years ago
Is there a way we can access the data (a small subset, say 30-40GB) without having an AWS account? It requires a credit card, and as a student I don't have one.
kordless over 11 years ago
Ah, distributed crawling. What a great idea. :)
csmuk over 11 years ago
Well, that would take 3.5 years to download on my Internet connection!
manismku over 11 years ago
That's great and cool stuff.