I guess this fits in here:

Once upon a time I wrote my thesis on building a web crawler.
The (tiny) blog post with an embedded preview:

http://blog.marc-seeger.de/2010/12/09/my-thesis-building-blocks-of-a-scalable-webcrawler/

The PDF itself:

http://blog.marc-seeger.de/assets/papers/thesis_seeger-building_blocks_of_a_scalable_webcrawler.pdf

It's mostly a "this is what I learned and the things I had to take into consideration" write-up, with a few "this is how you identify a CMS" bits sprinkled in. These days I would probably change a thing or two, but people have told me it's still an entertaining read.
(Not a native speaker though, so the English might have some stylistic kinks)
Part 2 is already there:
http://engineering.bloomreach.com/crawling-billions-of-pages-building-large-scale-crawling-cluster-part-2/
> The Windows operating system can dispatch different events to different window handlers so you can handle all asynchronous HTTP calls efficiently. For a very long time, people weren't able to do this on Linux-based operating systems since the underlying socket library contained a potential bottleneck.

What? select()'s biggest problem is that it scales poorly when you have lots of idle connections, which shouldn't be an issue when crawling (you can keep issuing new requests while waiting for responses). And epoll has been available since 2003. What bottleneck is this referring to?
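For context, here is a minimal sketch of the epoll-style approach on Linux, using Python's selectors module (which picks epoll when available). The hostnames and the request are placeholders, and error handling is omitted; it's only meant to show that many in-flight HTTP requests can be multiplexed on one thread without select()'s scaling problem.

    # Sketch: event-driven HTTP fetches multiplexed with selectors (epoll on Linux).
    import selectors
    import socket

    HOSTS = ["example.com", "example.org"]  # placeholder crawl targets
    sel = selectors.DefaultSelector()       # epoll-backed on Linux

    def start_request(host):
        sock = socket.socket()
        sock.setblocking(False)
        sock.connect_ex((host, 80))         # non-blocking connect
        req = f"GET / HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"
        # Register for writability first: the socket becomes writable once connected.
        sel.register(sock, selectors.EVENT_WRITE,
                     {"host": host, "req": req.encode(), "buf": b""})

    for h in HOSTS:
        start_request(h)

    while sel.get_map():
        events = sel.select(timeout=5)
        if not events:
            break                           # give up on anything still pending
        for key, mask in events:
            sock, data = key.fileobj, key.data
            if mask & selectors.EVENT_WRITE:
                sock.send(data["req"])      # assumes the small request fits in one send
                sel.modify(sock, selectors.EVENT_READ, data)
            elif mask & selectors.EVENT_READ:
                chunk = sock.recv(4096)
                if chunk:
                    data["buf"] += chunk
                else:                       # peer closed: response is complete
                    print(data["host"], len(data["buf"]), "bytes")
                    sel.unregister(sock)
                    sock.close()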
The challenges of crawling at large scale still persist, as is evident from BloomReach and many other companies building custom solutions because the available open source tools cannot handle the scale their products need. SQLBot aims to solve this problem.
The product is a few weeks from launch. If anyone is interested: http://www.amisalabs.com/AmisaSQLBot.html
From part 2 of their article:

> Currently, more than 60 percent of global internet traffic consists of requests from crawlers or some type of automated Web discovery system.

Where does this number come from, and how accurate can such an estimate be?
I wish there were more articles about determining how often a given page should be re-crawled. Some pages never change, some change multiple times per minute, and we do not want to crawl them all equally often.
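One common heuristic (not from the article, just a sketch) is to adapt each page's revisit interval multiplicatively, shrinking it when the content changed since the last fetch and growing it when it didn't. The bounds, factors, and starting interval below are arbitrary placeholders.

    # Sketch of an adaptive revisit scheduler based on content-change detection.
    import hashlib
    from dataclasses import dataclass

    MIN_INTERVAL = 60            # seconds; placeholder lower bound
    MAX_INTERVAL = 30 * 86400    # placeholder upper bound (30 days)

    @dataclass
    class PageState:
        interval: float = 3600.0   # start by revisiting hourly
        last_hash: str = ""

    def update_schedule(state: PageState, body: bytes) -> float:
        """Return the next revisit interval after fetching `body`."""
        digest = hashlib.sha256(body).hexdigest()
        if digest != state.last_hash:
            # Page changed: poll more aggressively.
            state.interval = max(MIN_INTERVAL, state.interval / 2)
        else:
            # Page unchanged: back off.
            state.interval = min(MAX_INTERVAL, state.interval * 1.5)
        state.last_hash = digest
        return state.interval

    state = PageState()
    print(update_schedule(state, b"<html>v1</html>"))  # changed   -> 1800.0
    print(update_schedule(state, b"<html>v1</html>"))  # unchanged -> 2700.0

More sophisticated schemes estimate a per-page change rate (e.g. a Poisson model over observed changes) instead of a simple multiplicative rule, but the backoff sketch above already captures the basic idea.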