Writing a web crawler 5-6 years ago (in PHP no less) was what really turned me into a programmer ... Up to that point I had just been an HTML/CSS dude who would hack at PHP as needed, but kept trying to do more and more with it.<p>I learned <i>so</i> much about the interwebs ... HTTP, URLs (and how they're constructed), HTML markup and why things work the way they do ... learned about threading, using queues, and finally ... really grokked OOP.<p>Best of all, I gained a newfound respect and understanding of Googlebot and web browsers in general ... dealing with people's crazy-ass HTML code is not. easy.<p>If I ever teach a class on programming ... it's something I'd love to have my students attempt as a semester-long (background) project.<p>Good times.
It is worth emphasizing that what the author is talking about is more akin to screen scraping than web crawling. Both tasks have their challenges, but screen scraping has several that are inherently difficult to overcome.<p>In particular, with screen scraping you are trying to extract structured data from a markup language (in this case HTML) that simply doesn't guarantee the structure you're looking for. With web crawling you only need the structural guarantees offered by the HTML markup itself (and with forgiving parsers such as TagSoup or Neko, not even that).<p>Now, that isn't to say web crawling doesn't have its own challenges (URL canonicalization, anyone?).
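To make the canonicalization point concrete: without it, your frontier fills up with duplicates that differ only in case, default ports, query-parameter order, or fragments. Here is a minimal sketch in Python using only the standard library; the particular normalization policy (lowercase host, drop default ports and fragments, sort query parameters) is one reasonable choice among many, not <i>the</i> canonical algorithm.<p>
    # A minimal URL canonicalization sketch; the rules below are
    # one reasonable policy, not a standard.
    from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

    def canonicalize(url):
        parts = urlsplit(url.strip())
        scheme = parts.scheme.lower()
        host = (parts.hostname or "").lower()
        # Keep the port only when it is not the default for the scheme.
        if parts.port and parts.port != {"http": 80, "https": 443}.get(scheme):
            host = "%s:%d" % (host, parts.port)
        path = parts.path or "/"
        # Sort query parameters so equivalent URLs compare equal.
        query = urlencode(sorted(parse_qsl(parts.query)))
        # Drop the fragment entirely; it never reaches the server.
        return urlunsplit((scheme, host, path, query, ""))

    print(canonicalize("HTTP://Example.COM:80/a?b=2&a=1#frag"))
    # -> http://example.com/a?a=1&b=2
<p>Even this leaves open questions (trailing slashes, %-encoding case, session IDs in the query string), which is exactly why canonicalization is a harder problem than it first looks.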
If the conclusion is not to write a web crawler yourself, the next step is a hosted service like <a href="http://www.80legs.com/" rel="nofollow">http://www.80legs.com/</a>.
"Sites will ban you"<p>Wish it would happen here on HN. Seems like it's a weekly occurrence where someone announces a pet project that involves crawling the whole site. I can't help but to think this is connected to the 30+ second page loads I get here often.