This was written in 2012; it's even easier these days using SQS and CloudFormation. 250 million is a small number. You are better off first going through Common Crawl and then using data from those crawls to build a better seed list.

Common Crawl now contains repeated crawls conducted every few months, as well as URLs donated by blekko.

https://groups.google.com/forum/m/#!msg/common-crawl/zexccXgwg4w/oV8qeJnawJUJ
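A rough sketch of that approach, not from the comment: pull already-discovered URLs from the Common Crawl CDX index and push them into an SQS queue as the crawl frontier. The crawl collection ID and queue URL below are placeholders; check index.commoncrawl.org for current crawl names.

```python
# Sketch: seed an SQS-backed crawl frontier from the Common Crawl URL index.
# CDX_API's crawl ID and QUEUE_URL are assumed placeholders, not real values.
import json
import boto3
import requests

CDX_API = "https://index.commoncrawl.org/CC-MAIN-2023-50-index"  # assumed crawl ID
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/crawl-frontier"  # hypothetical

def seed_from_common_crawl(domain: str, limit: int = 100) -> None:
    # Ask the index for URLs captured under this domain.
    resp = requests.get(
        CDX_API,
        params={"url": f"{domain}/*", "output": "json", "limit": limit},
    )
    resp.raise_for_status()
    records = [json.loads(line) for line in resp.text.splitlines() if line.strip()]

    sqs = boto3.client("sqs")
    entries = [{"Id": str(i), "MessageBody": rec["url"]} for i, rec in enumerate(records)]
    # SQS accepts at most 10 messages per batch call.
    for start in range(0, len(entries), 10):
        sqs.send_message_batch(QueueUrl=QUEUE_URL, Entries=entries[start:start + 10])

seed_from_common_crawl("example.com")
```

Workers then pull URLs off the queue, fetch them, and push newly discovered links back in, which is roughly what the "easier these days" setup amounts to.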
Web crawling is just like most things: 80% of the results for 20% of the work. It's always the last mile that takes the most significant cost and engineering effort.

As your index and scale grow you bump into the really difficult problems:

1. How do you handle so many DNS requests/sec without overloading upstream servers? (see the caching sketch below)

2. How do you discover and determine the quality of new links? It's only a matter of time until your crawler hits a treasure trove of spam domains.

3. How do you store, update, and access an index that's exponentially growing?

Just some ideas.
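For point 1, a minimal illustration of the usual mitigation: cache resolutions in-process and rate-limit cache misses so the upstream resolver isn't hammered. The TTL and lookup rate here are illustrative numbers, not tuned values.

```python
# Sketch: a naive in-process DNS cache plus a crude global rate limit on misses.
import socket
import time
import threading

class CachingResolver:
    def __init__(self, ttl_seconds: float = 300.0, max_lookups_per_sec: float = 50.0):
        self.ttl = ttl_seconds
        self.min_interval = 1.0 / max_lookups_per_sec
        self.cache: dict[str, tuple[str, float]] = {}  # host -> (ip, expiry time)
        self.last_lookup = 0.0
        self.lock = threading.Lock()

    def resolve(self, host: str) -> str:
        now = time.time()
        with self.lock:
            hit = self.cache.get(host)
            if hit and hit[1] > now:
                return hit[0]  # still fresh, no upstream query needed
            # Space out cache misses to protect the upstream resolver.
            wait = self.min_interval - (now - self.last_lookup)
            if wait > 0:
                time.sleep(wait)
            ip = socket.getaddrinfo(host, 80)[0][4][0]
            self.last_lookup = time.time()
            self.cache[host] = (ip, self.last_lookup + self.ttl)
            return ip

resolver = CachingResolver()
print(resolver.resolve("example.com"))
```

At real scale you would run dedicated caching resolvers (e.g. Unbound) instead, but the idea is the same: resolve each host once per TTL, not once per fetch.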
One topic no one ever talks about when it comes to web crawling: how do you avoid all the "bad" sites, as in really bad shit? The stuff your ISP could use as evidence against you, when in fact it was just your code running and it happened to come across one of those sites. How do you deal with all that? That is the only thing stopping me from experimenting with web crawling.
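One common mitigation (not suggested by the commenter) is to filter the frontier against a domain blocklist before fetching anything. The `blocklist.txt` file here is hypothetical, e.g. compiled from public abuse/malware domain lists.

```python
# Sketch: drop any URL whose host is on (or under) a blocked domain.
from urllib.parse import urlparse

def load_blocklist(path: str = "blocklist.txt") -> set[str]:
    # Hypothetical file: one banned domain per line, '#' for comments.
    with open(path) as f:
        return {line.strip().lower() for line in f
                if line.strip() and not line.startswith("#")}

def is_blocked(url: str, blocked: set[str]) -> bool:
    host = (urlparse(url).hostname or "").lower()
    # Block the domain itself and any of its subdomains.
    return any(host == d or host.endswith("." + d) for d in blocked)

blocked = load_blocklist()
candidates = ["https://example.com/page", "https://sub.bad-domain.example/x"]
frontier = [u for u in candidates if not is_blocked(u, blocked)]
```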
old HN comments (3 years ago): https://news.ycombinator.com/item?id=4367933
I feel like more companies are building their businesses around web crawling and parsing data. There are lots of players in the eCommerce space that monitor pricing, search relevance, and product integrity. Each of these companies has to build some sort of templating system for defining crawl jobs, a set of parsing rules to extract the data, and a monitoring system to alert when the underlying HTML of a site has changed and no longer matches their predefined rules. I'm interested in these aspects. Building a distributed crawler is easier than ever.
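A minimal sketch of what that rules-plus-monitoring setup could look like; the site name and CSS selectors are made up for illustration, and a real system would load rules from a template store and wire the "alert" into actual monitoring.

```python
# Sketch: declarative parse rules per site plus a trivial "rule drift" check.
from bs4 import BeautifulSoup

RULES = {
    "example-shop.com": {          # hypothetical site and selectors
        "title": "h1.product-title",
        "price": "span.price",
    },
}

def parse(site: str, html: str) -> tuple[dict, list[str]]:
    soup = BeautifulSoup(html, "html.parser")
    extracted, missing = {}, []
    for field, selector in RULES[site].items():
        node = soup.select_one(selector)
        if node is None:
            missing.append(field)  # selector no longer matches: likely a site redesign
        else:
            extracted[field] = node.get_text(strip=True)
    return extracted, missing

data, missing = parse("example-shop.com", "<h1 class='product-title'>Widget</h1>")
if missing:
    print(f"ALERT: parse rules broke for fields {missing}")  # hook into real alerting
```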
This isn't particularly difficult anymore. The most interesting challenges in web crawling are around turning a diaspora of web content into usable data. E.g., how do you get prices from 10 million product listings across 1,000 different e-retailers?
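One partial answer, offered as an illustration rather than anyone's actual pipeline: many retailers embed schema.org Product markup as JSON-LD, so you can try structured data first and fall back to per-site rules only when it's missing.

```python
# Sketch: best-effort price extraction from embedded schema.org JSON-LD.
import json
from bs4 import BeautifulSoup

def extract_price(html: str):
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all("script", type="application/ld+json"):
        try:
            data = json.loads(tag.string or "")
        except json.JSONDecodeError:
            continue
        items = data if isinstance(data, list) else [data]
        for item in items:
            if isinstance(item, dict) and item.get("@type") == "Product":
                offer = item.get("offers") or {}
                if isinstance(offer, list):
                    offer = offer[0] if offer else {}
                if "price" in offer:
                    return offer.get("price"), offer.get("priceCurrency")
    return None  # no structured data: fall back to per-site parsing rules

html = '<script type="application/ld+json">{"@type": "Product", "offers": {"price": "19.99", "priceCurrency": "USD"}}</script>'
print(extract_price(html))  # ('19.99', 'USD')
```

The hard part the comment is pointing at is the long tail of sites with no structured data at all, which is where the per-retailer engineering effort goes.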
I don't understand his hesitancy in releasing his crawler code. I imagine there are already plenty of crawlers for people to access and alter for malicious use if they desired, so why is releasing his such a big deal?