This was written in 2012; it's even easier these days using SQS and CloudFormation. 250 million is a small number. You are better off first going through Common Crawl and then using data from those crawls to build a better seed list.

Common Crawl now contains repeated crawls conducted every few months, as well as URLs donated by blekko.

https://groups.google.com/forum/m/#!msg/common-crawl/zexccXgwg4w/oV8qeJnawJUJ
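A rough sketch of that approach, not from the comment: pull already-discovered URLs from the Common Crawl CDX index and push them into an SQS queue as the crawl frontier. The crawl collection ID and queue URL below are placeholders; check index.commoncrawl.org for current crawl names.

```python
# Sketch: seed an SQS-backed crawl frontier from the Common Crawl URL index.
# CDX_API's crawl ID and QUEUE_URL are assumed placeholders, not real values.
import json
import boto3
import requests

CDX_API = "https://index.commoncrawl.org/CC-MAIN-2023-50-index"  # assumed crawl ID
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/crawl-frontier"  # hypothetical

def seed_from_common_crawl(domain: str, limit: int = 100) -> None:
    # Ask the index for URLs captured under this domain.
    resp = requests.get(
        CDX_API,
        params={"url": f"{domain}/*", "output": "json", "limit": limit},
    )
    resp.raise_for_status()
    records = [json.loads(line) for line in resp.text.splitlines() if line.strip()]

    sqs = boto3.client("sqs")
    entries = [{"Id": str(i), "MessageBody": rec["url"]} for i, rec in enumerate(records)]
    # SQS accepts at most 10 messages per batch call.
    for start in range(0, len(entries), 10):
        sqs.send_message_batch(QueueUrl=QUEUE_URL, Entries=entries[start:start + 10])

seed_from_common_crawl("example.com")
```

Workers then pull URLs off the queue, fetch them, and push newly discovered links back in, which is roughly what the "easier these days" setup amounts to.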
Web crawling is just like most things: 80% of the results for 20% of the work. It's always the last mile that takes the most significant cost and engineering effort.

As your index and scale grow you bump into the really difficult problems:

1. How do you handle so many DNS requests/sec without overloading upstream servers? (see the caching sketch below)

2. How do you discover and determine the quality of new links? It's only a matter of time until your crawler hits a treasure trove of spam domains.

3. How do you store, update, and access an index that's exponentially growing?

Just some ideas.
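For point 1, a minimal illustration of the usual mitigation: cache resolutions in-process and rate-limit cache misses so the upstream resolver isn't hammered. The TTL and lookup rate here are illustrative numbers, not tuned values.

```python
# Sketch: a naive in-process DNS cache plus a crude global rate limit on misses.
import socket
import time
import threading

class CachingResolver:
    def __init__(self, ttl_seconds: float = 300.0, max_lookups_per_sec: float = 50.0):
        self.ttl = ttl_seconds
        self.min_interval = 1.0 / max_lookups_per_sec
        self.cache: dict[str, tuple[str, float]] = {}  # host -> (ip, expiry time)
        self.last_lookup = 0.0
        self.lock = threading.Lock()

    def resolve(self, host: str) -> str:
        now = time.time()
        with self.lock:
            hit = self.cache.get(host)
            if hit and hit[1] > now:
                return hit[0]  # still fresh, no upstream query needed
            # Space out cache misses to protect the upstream resolver.
            wait = self.min_interval - (now - self.last_lookup)
            if wait > 0:
                time.sleep(wait)
            ip = socket.getaddrinfo(host, 80)[0][4][0]
            self.last_lookup = time.time()
            self.cache[host] = (ip, self.last_lookup + self.ttl)
            return ip

resolver = CachingResolver()
print(resolver.resolve("example.com"))
```

At real scale you would run dedicated caching resolvers (e.g. Unbound) instead, but the idea is the same: resolve each host once per TTL, not once per fetch.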
One topic no one ever talks about when it comes to web crawling: how do you avoid all the "bad" sites, as in really bad shit? The stuff your ISP could use as evidence against you, when in fact it was just your code running and it happened to come across one of those sites. How do you deal with all that? That is the only thing stopping me from experimenting with web crawling.
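One common mitigation (not suggested by the commenter) is to filter the frontier against a domain blocklist before fetching anything. The `blocklist.txt` file here is hypothetical, e.g. compiled from public abuse/malware domain lists.

```python
# Sketch: drop any URL whose host is on (or under) a blocked domain.
from urllib.parse import urlparse

def load_blocklist(path: str = "blocklist.txt") -> set[str]:
    # Hypothetical file: one banned domain per line, '#' for comments.
    with open(path) as f:
        return {line.strip().lower() for line in f
                if line.strip() and not line.startswith("#")}

def is_blocked(url: str, blocked: set[str]) -> bool:
    host = (urlparse(url).hostname or "").lower()
    # Block the domain itself and any of its subdomains.
    return any(host == d or host.endswith("." + d) for d in blocked)

blocked = load_blocklist()
candidates = ["https://example.com/page", "https://sub.bad-domain.example/x"]
frontier = [u for u in candidates if not is_blocked(u, blocked)]
```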
old HN comments (3 years ago): https://news.ycombinator.com/item?id=4367933
I feel like more companies are building their businesses around web crawling and parsing data. There are lots of players in the eCommerce space that monitor pricing, search relevance, and product integrity. Each of these companies has to build some sort of templating system for defining crawl jobs, a set of parsing rules to extract the data, and a monitoring system to alert when the underlying HTML of a site has changed and no longer matches their predefined rules. I'm interested in these aspects. Building a distributed crawler is easier than ever.
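A minimal sketch of what that rules-plus-monitoring setup could look like; the site name and CSS selectors are made up for illustration, and a real system would load rules from a template store and wire the "alert" into actual monitoring.

```python
# Sketch: declarative parse rules per site plus a trivial "rule drift" check.
from bs4 import BeautifulSoup

RULES = {
    "example-shop.com": {          # hypothetical site and selectors
        "title": "h1.product-title",
        "price": "span.price",
    },
}

def parse(site: str, html: str) -> tuple[dict, list[str]]:
    soup = BeautifulSoup(html, "html.parser")
    extracted, missing = {}, []
    for field, selector in RULES[site].items():
        node = soup.select_one(selector)
        if node is None:
            missing.append(field)  # selector no longer matches: likely a site redesign
        else:
            extracted[field] = node.get_text(strip=True)
    return extracted, missing

data, missing = parse("example-shop.com", "<h1 class='product-title'>Widget</h1>")
if missing:
    print(f"ALERT: parse rules broke for fields {missing}")  # hook into real alerting
```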
This isn't particularly difficult anymore. The most interesting challenges in web crawling are around turning a diaspora of web content into usable data. E.g., how do you get prices from 10 million product listings across 1,000 different e-retailers?
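One partial answer, offered as an illustration rather than anyone's actual pipeline: many retailers embed schema.org Product markup as JSON-LD, so you can try structured data first and fall back to per-site rules only when it's missing.

```python
# Sketch: best-effort price extraction from embedded schema.org JSON-LD.
import json
from bs4 import BeautifulSoup

def extract_price(html: str):
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all("script", type="application/ld+json"):
        try:
            data = json.loads(tag.string or "")
        except json.JSONDecodeError:
            continue
        items = data if isinstance(data, list) else [data]
        for item in items:
            if isinstance(item, dict) and item.get("@type") == "Product":
                offer = item.get("offers") or {}
                if isinstance(offer, list):
                    offer = offer[0] if offer else {}
                if "price" in offer:
                    return offer.get("price"), offer.get("priceCurrency")
    return None  # no structured data: fall back to per-site parsing rules

html = '<script type="application/ld+json">{"@type": "Product", "offers": {"price": "19.99", "priceCurrency": "USD"}}</script>'
print(extract_price(html))  # ('19.99', 'USD')
```

The hard part the comment is pointing at is the long tail of sites with no structured data at all, which is where the per-retailer engineering effort goes.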
I don't understand his hesitancy in releasing his crawler code. I imagine there are already plenty of crawlers for people to access and alter for malicious use if they desired, so why is releasing his such a big deal?