Crawling can be broken down into:

1) fetching resources
2) finding out what new resources to fetch
1) is a network-bound problem; 2) is mostly disk/CPU-bound. Recognizing the difference between these two and separating them is the key to building a good crawler.

Depending on how you discover what resources to fetch (parsing static documents vs. dynamic JS analysis with dependencies on other resources such as included JS), "good-enough" crawlers are mostly network-bound.

I've seen people running one crawl per process on their back-end, with some manager saying "we need to crawl faster, add more threads per crawl", when one crawl cycle spends 10x more time waiting on the network than it does parsing a document.
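A rough illustration of that separation, as a minimal Python sketch using only the standard library. The seed URL, the regex-based link extraction, and the worker counts are placeholder assumptions, and graceful termination is deliberately simplified; the point is just that fetchers and parsers scale independently:

    import re
    import queue
    import threading
    import urllib.request

    FETCH_WORKERS = 50   # many threads: they mostly sleep on the network
    PARSE_WORKERS = 2    # few threads: parsing is CPU/disk bound

    to_fetch = queue.Queue()
    to_parse = queue.Queue()
    seen = set()
    seen_lock = threading.Lock()

    def fetch_worker():
        while True:
            url = to_fetch.get()
            try:
                with urllib.request.urlopen(url, timeout=10) as resp:
                    to_parse.put((url, resp.read()))
            except Exception:
                pass  # a real crawler would retry / log here
            finally:
                to_fetch.task_done()

    def parse_worker():
        while True:
            url, body = to_parse.get()
            # crude link extraction; stands in for real HTML/JS parsing
            for link in re.findall(rb'href="(http[^"]+)"', body):
                link = link.decode("utf-8", "ignore")
                with seen_lock:
                    if link not in seen:
                        seen.add(link)
                        to_fetch.put(link)
            to_parse.task_done()

    if __name__ == "__main__":
        for _ in range(FETCH_WORKERS):
            threading.Thread(target=fetch_worker, daemon=True).start()
        for _ in range(PARSE_WORKERS):
            threading.Thread(target=parse_worker, daemon=True).start()
        to_fetch.put("https://example.com/")  # hypothetical seed
        to_fetch.join()   # proper termination detection omitted for brevity
        to_parse.join()

If fetching is the bottleneck, you raise FETCH_WORKERS (or switch to async I/O) without touching the parsing side; if parsing becomes the bottleneck, you add parse workers or move them to separate processes.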