科技回声

15 条评论

rwg超过 11 年前

I really wanted to love the Common Crawl corpus. I needed an excuse to play with EC2, I had a project idea that would benefit an open source project (Mozilla's pdf.js), and I had an AWS gift card with $100 of value on it. But when I actually got to work, I found the choice of Hadoop sequence files containing JSON documents for the crawl metadata absolutely maddening and slammed headfirst into an undocumented gotcha that ultimately killed the project: the documents in the corpus are truncated at ~512 kilobytes.It looks like they've fixed the first problem by switching to gzipped WARC files, but I can't find any information about whether or not they're still truncating documents in the archive. I guess I'll have to give it another look and see...

评论 #6812420 未加载

boyter超过 11 年前

I love common crawl, but as I commented before I still want to see a subset available for download, something like the top million sites or something like that. Certainly a few steps of data, say 50GB 100GB and 200GB.I really think a subset like this will increase the value as it would allow people writing search engines (for fun or profit) to suck a copy down locally and work away. Its something I would like to do for sure.

评论 #6811843 未加载

kohanz超过 11 年前

I'm curious to hear how people are using Common Crawl data.

danso超过 11 年前

Very cool...though I have to say, CC is a constant reminder that whatever you put on the Internet will basically remain in the public eye for the perpetuity of electronic communication. There exists ways to remove your (owned) content from archive.org and Google...but once some other independent scraper catches it, you can't really do much about it

评论 #6812026 未加载

rb2k_超过 11 年前

Is there an easy way to grab JUST a list of uniq domains?That would be a great starter for all sorts of fun little weekend experiments.

ma2rten超过 11 年前

I would be great if common crawl (or anyone else) would also release a document-term index for it's data. If you had an index, you could do a lot more things with this data.

ecaron超过 11 年前

Anyone have a good understanding of the difference between this and <a href="http://www.dotnetdotcom.org/" rel="nofollow">http://www.dotnetdotcom.org/</a>? I've seen Dotbot in my access logs more than CommonCrawl, so I'm more inclined to believe they have a wider - but not deeper - spread.

recuter超过 11 年前

Anybody want to take a guess at what percentage these 2B pages represent out of the total surface web at least? I can't find reliable figures, numbers all over the place. 5 percent?

GigabyteCoin超过 11 年前

Can anyone give me a quick rundown on how exactly one gains access to all of this data?I have heard about this project numerous times, and am always dissuaded by the lack of download links/torrents/information on their homepage.Perhaps I just don't know what I'm looking at?

评论 #6816245 未加载

DigitalSea超过 11 年前

I've yet to find an excuse to download some of this data to play with. I have a feeling my ISP will personally send around a bunch of suits to collect the bill payment in person if I were to ever go over my 500gb monthly limit by downloading 102tb of data, haha. I would still like to download a subset of the data, from what I've read apparently that kind of idea is already in the works. I just can't possibly think of what I would do, perhaps a machine learning based project.

评论 #6812660 未加载

评论 #6812656 未加载

sirsar超过 11 年前

We have switched the metadata files from JSON to WAT files. The JSON format did not allow specifying the multiple offsets to files necessary for the WARC upgrade and WAT files provide more detail.Where can I read more about this?

评论 #6815880 未加载

iamtechaddict超过 11 年前

Is there a way we can access the data(small subet say 30-40GB's) without having an AWS account(as it requires a credit card, I'm a student i don't have any) ?

评论 #6815811 未加载

kordless超过 11 年前

Ah, distributed crawling. What a great idea. :)

csmuk超过 11 年前

Well that would take 3.5 years to download on my Internet connection!

manismku超过 11 年前

That's great and cool stuff.

15 条评论

rwg超过 11 年前

评论 #6812420 未加载

boyter超过 11 年前

评论 #6811843 未加载

kohanz超过 11 年前

I'm curious to hear how people are using Common Crawl data.

danso超过 11 年前

评论 #6812026 未加载

rb2k_超过 11 年前

Is there an easy way to grab JUST a list of uniq domains?That would be a great starter for all sorts of fun little weekend experiments.

ma2rten超过 11 年前

I would be great if common crawl (or anyone else) would also release a document-term index for it's data. If you had an index, you could do a lot more things with this data.

ecaron超过 11 年前

recuter超过 11 年前

Anybody want to take a guess at what percentage these 2B pages represent out of the total surface web at least? I can't find reliable figures, numbers all over the place. 5 percent?

GigabyteCoin超过 11 年前

评论 #6816245 未加载

DigitalSea超过 11 年前

评论 #6812660 未加载

评论 #6812656 未加载

sirsar超过 11 年前

评论 #6815880 未加载

iamtechaddict超过 11 年前

Is there a way we can access the data(small subet say 30-40GB's) without having an AWS account(as it requires a credit card, I'm a student i don't have any) ?

评论 #6815811 未加载

kordless超过 11 年前

Ah, distributed crawling. What a great idea. :)

csmuk超过 11 年前

Well that would take 3.5 years to download on my Internet connection!

manismku超过 11 年前

That's great and cool stuff.

102TB of New Crawl Data Available

15 条评论

102TB of New Crawl Data Available

15 条评论