In my experience, one of the hardest parts of writing a web crawler is URL selection:<p>After crawling your list of seed URLs, where do you go next? How do you make sure you don't crawl the same content multiple times because it appears under a slightly different URL? How do you avoid getting stuck on unimportant spam sites with autogenerated content?<p>Because the author only crawled domains from a limited set, and only for a short time, he did not need to worry about that part. Nonetheless, it's a great article that shows many of the pitfalls of writing a web crawler.
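To make the "slightly different URL" problem concrete, here is a minimal canonicalization sketch in Python; the normalization rules shown are illustrative only, and real crawlers apply many more (stripping tracking parameters, www-prefix handling, and so on):

<pre><code>
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def canonicalize(url):
    # Lowercase the host, drop the default port, sort query parameters and
    # drop fragments so trivially different URLs map to the same frontier key.
    scheme, netloc, path, query, _fragment = urlsplit(url)
    netloc = netloc.lower()
    if scheme == 'http' and netloc.endswith(':80'):
        netloc = netloc[:-3]
    path = path or '/'
    query = urlencode(sorted(parse_qsl(query)))
    return urlunsplit((scheme, netloc, path, query, ''))

# 'HTTP://Example.com:80/a?b=2&a=1#top' and 'http://example.com/a?a=1&b=2'
# now collapse to the same key.
print(canonicalize('HTTP://Example.com:80/a?b=2&a=1#top'))
</code></pre>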
> I used a Bloom filter to keep track of which urls had already been seen and added to the url frontier. This enabled a very fast check of whether or not a new candidate url should be added to the url frontier, with only a low probability of erroneously adding a url that had already been added.<p>Other way round? A Bloom filter gives a low probability of erroneously believing a URL has already been added when it has not, and zero probability of believing a URL has not been added when in fact it has.<p>Using a Bloom filter this way guarantees you won't ever hit a page twice, but you'll have a non-zero rate of pages you think you've downloaded when you actually haven't, depending on how you tune it.
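A toy Bloom filter makes the asymmetry explicit; the sizes and hash construction below are arbitrary, not what the article used:

<pre><code>
import hashlib

class BloomFilter:
    """Toy Bloom filter: 'seen?' can wrongly answer yes (a URL gets skipped
    although it was never fetched), but it never wrongly answers no, so no
    URL is ever fetched twice."""

    def __init__(self, size_bits=1 << 24, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

seen = BloomFilter()
url = "http://example.com/page"
if url not in seen:   # a false positive here means the page is silently skipped
    seen.add(url)     # after this, "url in seen" is guaranteed to be True
</code></pre>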
> Code: Originally I intended to make the crawler code available under an open source license at GitHub. However, as I better understood the cost that crawlers impose on websites, I began to have reservations. My crawler is designed to be polite and impose relatively little burden on any single website, but could (like many crawlers) easily be modified by thoughtless or malicious people to impose a heavy burden on sites. Because of this I’ve decided to postpone (possibly indefinitely) releasing the code.<p>That is not a good reason. There are many crawlers out there. Anyone can easily modify the string "robots.txt" in the wget binary to "xobots.txt".<p>Release your code so that others can learn. Stop worrying that you are giving some special tool to bad people - you aren't.
This is a great article. On a side note, if you want to do this all day and get paid for it, let me know :-) Crawls are the first step of a search engine. Greg Lindahl (CTO at blekko.com) has been writing up a variety of technologies used in our search engine work at High Scalability [1].<p>One of the most interesting things for me is that the 'frothiest' web pages (those that change every day or several times a day) have become a pretty significant chunk of the web compared to even 5 years ago. I don't see that trend abating much.<p>[1] <a href="http://highscalability.com/" rel="nofollow">http://highscalability.com/</a>
Thank you very much for the post. I have written a distributed crawler at my startup Semantics3* - we track price and metadata fields from all the major ecommerce sites.<p>Our crawler is written in Perl. It uses an evented architecture (built on the AnyEvent library). We use Redis to store state: which URLs have already been crawled (using a hash) and which URLs to crawl next (using sorted sets).<p>Instead of a bloom filter, we use sorted sets to dedupe URLs and to pick the highest-priority URLs to crawl next (a sort of priority queue).<p>For the actual distribution of crawling (the 'map reduce' part) we use the excellent Gearman work distribution server.<p>One major optimization I can suggest is caching DNS lookups (and doing them asynchronously). You can save a lot of time and resources, especially at that scale, simply by caching DNS requests. Another optimization is to keep the socket connection open and download all the pages from the same domain over it asynchronously.<p>*Shameless plug: We just launched our private beta. Please sign up and use our API using this link:<p><a href="https://www.semantics3.com/signup?code=ramanujan" rel="nofollow">https://www.semantics3.com/signup?code=ramanujan</a>
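Our code is Perl, but the sorted-set frontier described above looks roughly like this in Python with redis-py; the key names and priority handling are simplified for illustration:

<pre><code>
import redis

r = redis.Redis()

def enqueue(url, priority):
    # Skip URLs we have already crawled; ZADD itself is idempotent per member,
    # so re-adding a URL that is still queued only updates its score.
    if not r.sismember('crawled', url):
        r.zadd('frontier', {url: priority})

def next_url():
    # Take the lowest-score (highest-priority) member off the sorted set.
    head = r.zrange('frontier', 0, 0)
    if not head:
        return None
    url = head[0]
    r.zrem('frontier', url)
    r.sadd('crawled', url)
    return url.decode()
</code></pre>

With several workers you would want to make the pop atomic (a small Lua script, or ZPOPMIN on newer Redis) so two workers cannot grab the same URL.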
Being the CEO of a firm that offers web-crawling services, I found this post very interesting. On 80legs, the cost for a similar crawl would be $500, so it's nice to know we're competitive on cost.
My Master's degree project was a web crawler. If you're already reading this, the thesis [0] might be a somewhat entertaining read.<p>I had somewhat different constraints (only hitting the front page, CMS/webserver/... fingerprinting, a backend that has to support ad-hoc queries over site features), but it's nice to see that the process is always roughly the same.<p>One of the most interesting things I experienced was that link crawling works pretty well for a while, but once you have visited a large number of pages, bloom filters are pretty much the only memory-efficient way to protect against duplication.<p>I switched to a hybrid model: I still follow links, but to limit the required depth I also use pinboard/twitter/reddit to find new domains. For bootstrapping, you can get your hands on zone files from places on the internet (e.g. premiumdrops.com), which keeps you from having to crawl very deep.<p>These days, I run a combination of a worker approach with redis as a queue/cache and straight elasticsearch as the backend (see the sketch after the links); I'm pretty happy with the easy scalability.<p>Web crawlers are a great weekend project: they let you fiddle with evented architectures (github sample [1]), scale a database, and watch the bottlenecks jump from place to place within your architecture. I can only recommend writing one :)<p>[0] <a href="http://blog.marc-seeger.de/2010/12/09/my-thesis-building-blocks-of-a-scalable-webcrawler/" rel="nofollow">http://blog.marc-seeger.de/2010/12/09/my-thesis-building-blo...</a>
[1] <a href="https://github.com/rb2k/em-crawler-sample" rel="nofollow">https://github.com/rb2k/em-crawler-sample</a>
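The worker loop mentioned above is roughly this shape; this is a Python sketch rather than my actual code, and the queue name, index name, and stored fields are invented:

<pre><code>
import redis
import requests

r = redis.Redis()

def worker():
    while True:
        # BRPOP blocks until another process LPUSHes a new domain onto the queue.
        _queue, raw = r.brpop('domains')
        domain = raw.decode()
        try:
            resp = requests.get(f'http://{domain}/', timeout=10)
        except requests.RequestException:
            continue
        # Index the front page via Elasticsearch's HTTP API so the backend
        # can answer ad-hoc questions about site features later.
        requests.post('http://localhost:9200/frontpages/_doc',
                      json={'domain': domain,
                            'status': resp.status_code,
                            'server': resp.headers.get('Server'),
                            'body': resp.text[:100000]},
                      timeout=10)
</code></pre>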
Not bad at all. A few months ago I built (not publicly released yet, though I plan to) a crawler using NodeJS to take advantage of its evented architecture. I managed to crawl and store (in Mongo) more than 300k movies from IMDB in just a few hours, using only a laptop and 8 processes, each with a specified number of concurrent connections (it was based on the nodejs cluster module and the kue lib by LearnBoost). For HTML parsing I used jsdom or cheerio (faster but incomplete), and extracting and storing the data was very fast (probably less than 10 ms per page). Kue is similar to Ruby's resque or Python's pyres, so the advantage was that every request was basically an independent job, with Redis as the pub/sub backend.<p>Even though your implementation is a lot more complex and very well documented, IMO non-blocking I/O is a much better fit, because crawling is very I/O intensive and most of the time is spent waiting on the connection (request + response time). With that many machines and processes, the total time should be much shorter with node.
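The parent's crawler is NodeJS, but the non-blocking argument holds in any evented stack; here is a rough equivalent sketch using Python's asyncio and aiohttp (the URLs and concurrency cap are arbitrary):

<pre><code>
import asyncio
import aiohttp

CONCURRENCY = 50   # cap simultaneous connections so the crawler stays polite

async def fetch(session, sem, url):
    async with sem:
        async with session.get(url) as resp:
            return url, await resp.text()

async def crawl(urls):
    sem = asyncio.Semaphore(CONCURRENCY)
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, sem, u) for u in urls]
        # While one response is in flight the event loop services the others,
        # so wall-clock time is bounded by the slowest batch, not the sum of
        # all request + response times.
        return await asyncio.gather(*tasks, return_exceptions=True)

urls = [f'http://example.com/page/{i}' for i in range(200)]
results = asyncio.run(crawl(urls))
</code></pre>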
Wow, that was informative. I appreciated the author's sense of responsibility the most. Rather than turning this into a daring adventure or a fanciful notion, Nielsen approached the activity with a genuine interest in creating something awesome, not just from an angle of power. Great post.
This is not how to crawl web pages. He started with the Alexa list. Those are not necessarily domain names of servers serving web pages. I would guess that some of the requests to cease crawling came from some of these listings. Working from the Alexa list, he would have been crawling some of the darkest underbelly of the web: ad servers and bulk email services.<p>His question, "Who gets to crawl the web?", is an interesting one though.<p>Do not assume that Googlebot is a smart crawler. Or smarter than all others. The author of Linkers and Loaders posted recently on CircleID about how dumb Googlebot can be.<p>There is no such thing as a smart crawler. All crawlers are stupid. Googlebot resorts to brute force more often than not.<p>Theoretically no one should have to crawl the web. The information should be organised when it is entered into the index.<p>Do you have to "crawl" the Yellow Pages? Are listings arranged by an "algorithm"? PageRank? 80/20 rules?<p>Nothing wrong with those metrics, except of course that they can be trivially gamed, as experiments with Google Scholar have shown. But building a business around this type of ranking? C'mon.<p>If the telephone directories abandoned alphabetical and subject organisation for "popularity" as a means of organisation, it would be total chaos. Which is why "organising the world's information" is an amusing mission statement when your entire business is built around enabling continued chaos and promoting competition for ranking.<p>Even worse are companies like Yelp. It's blackmail.<p>If the information was organised, e.g., alphabetically and regionally, it would be a lot easier to find stuff. Instead, search engines need to spy on users to figure out what they should be letting users choose for themselves. Where "user interfaces" are concerned, it is a fine line between "intuitive" and "manipulative".<p>The people who run search engines and directory sites are not objective. They can be bought. They want to be bought.<p>This brings quality down. As it always has for traditional media as well. But it's much worse with search engines.
You can make fabric execute commands in parallel. The reliability will be about as good as what you'd get with chef. I've spent ages dealing with edge cases in both fabric and chef setup systems.<p><a href="http://morgangoose.com/blog/2010/10/08/parallel-execution-with-fabric/" rel="nofollow">http://morgangoose.com/blog/2010/10/08/parallel-execution-wi...</a>
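For reference, parallel execution in Fabric 1.x is just a decorator; the host names and command below are placeholders:

<pre><code>
# fabfile.py -- run with:  fab restart_crawler
from fabric.api import env, parallel, sudo

env.hosts = ['crawler1.example.com', 'crawler2.example.com', 'crawler3.example.com']
env.user = 'ubuntu'

@parallel(pool_size=10)   # execute on up to 10 hosts concurrently
def restart_crawler():
    sudo('supervisorctl restart crawler')
</code></pre>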