IIRC, sriramk from around here (<a href="http://news.ycombinator.com/user?id=sriramk" rel="nofollow">http://news.ycombinator.com/user?id=sriramk</a>) had also 'rolled his own' web-crawler as a project in college about 5-6 (?) years back. He blogged about it fairly actively back then, and I really enjoyed following his journey (esp. when after months of dev and testing, he finally 'slipped it into the wild'). Tried to dredge up those posts, but he seems to have taken them down :( A shame really - they were quite a fascinating look at the early-stage evolution of a programmer!<p>Sriram, you around? ;)
I like Ted Dziuba's solution:<p><a href="http://teddziuba.com/2010/10/taco-bell-programming.html" rel="nofollow">http://teddziuba.com/2010/10/taco-bell-programming.html</a><p>Full-stack programmer at work!
A good read and very timely from my perspective. We created a crawler in Python a couple of years ago for RSS feeds, but we ran into a number of issues with it, so we put it on hold while we concentrated on work that made money :) We picked the project up again last week and have been weighing rolling our own against frameworks like Scrapy. The main thing for us is being able to scale. I'd welcome advice from anyone with experience building a distributed crawler in Python.<p>Thanks again. Really good post.
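For what it's worth, the core of any crawler, distributed or not, is the same pair of structures: a frontier queue of URLs to visit and a seen-set for deduplication. Here's a minimal offline sketch of that loop (the URLs and the in-memory link graph are made up so it runs without network; a real crawler would fetch and parse pages instead, and a distributed one would typically move the frontier and seen-set into shared storage such as Redis so many workers can pull from the same queue):

```python
from collections import deque

# Hypothetical in-memory link graph standing in for real HTTP fetches,
# so this sketch runs offline. In practice fetch() would do an HTTP GET
# and extract links from the HTML.
LINKS = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/c"],
    "http://example.com/b": [],
    "http://example.com/c": ["http://example.com/"],
}

def fetch(url):
    """Return the outbound links of a page (stub for fetch + parse)."""
    return LINKS.get(url, [])

def crawl(seed, max_depth=2):
    """Breadth-first crawl: frontier queue + seen-set for dedup."""
    seen = {seed}                       # every URL ever enqueued
    frontier = deque([(seed, 0)])       # (url, depth) pairs still to visit
    order = []
    while frontier:
        url, depth = frontier.popleft()
        order.append(url)
        if depth >= max_depth:
            continue                    # don't expand past the depth limit
        for link in fetch(url):
            if link not in seen:        # skip already-seen URLs
                seen.add(link)
                frontier.append((link, depth + 1))
    return order

print(crawl("http://example.com/"))
# → ['http://example.com/', 'http://example.com/a',
#    'http://example.com/b', 'http://example.com/c']
```

Scaling is then mostly about sharding that frontier across worker processes and being polite per-host (rate limits, robots.txt), which is roughly what Scrapy plus something like scrapy-redis gives you out of the box.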