I'm developing a web crawler on top of a large distributed system. As part of the testing process, I want to keep a background job running that crawls the web over and over. I was wondering if anyone had ideas for a general seed list from which the crawler could reach a wide variety of links. Ideally, the links it traverses would be a good representation of the Internet as a whole, taking into account content variety, frequency of updates, and other variables.
Wikipedia provides dumps of its link tables: <a href="http://download.wikimedia.org/backup-index.html" rel="nofollow">http://download.wikimedia.org/backup-index.html</a>
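If you go the dump route, something like the rough sketch below could turn a dump into a seed list. The file name and the assumption that URLs show up as quoted 'http...' literals inside the INSERT statements of the externallinks dump are mine, not guaranteed by the dump format, so treat this as a starting point rather than a finished parser:

```python
import gzip
import re

# Hypothetical local copy of Wikipedia's externallinks SQL dump, downloaded
# from the dump index linked above; the exact file name is an assumption.
DUMP_PATH = "enwiki-latest-externallinks.sql.gz"

# Assumes URLs appear as single-quoted http(s) literals inside the INSERT rows.
URL_RE = re.compile(rb"'(https?://[^']+)'")


def seed_urls(path, limit=10000):
    """Yield up to `limit` distinct external URLs found in the SQL dump."""
    seen = set()
    with gzip.open(path, "rb") as dump:
        for line in dump:
            for match in URL_RE.finditer(line):
                url = match.group(1).decode("utf-8", errors="replace")
                if url not in seen:
                    seen.add(url)
                    yield url
                    if len(seen) >= limit:
                        return


if __name__ == "__main__":
    for url in seed_urls(DUMP_PATH, limit=20):
        print(url)
```

The nice thing about seeding from external links rather than the internal page-to-page table is that you land on the wider web immediately instead of crawling around inside Wikipedia first.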
I've never done something like this myself, but what about using something like <a href="http://www.dmoz.org/" rel="nofollow">http://www.dmoz.org/</a>?
DMOZ, Wikipedia, and the Yahoo Directory are the classic broad starting points. You could also begin with the top 100, 500, 1000, etc. sites from a ranking service (like Alexa), or the top N results from major search engines for queries of special interest.

Depending on how you order discovered URLs and sites for crawling, it may not make much difference where you start a truly web-wide crawl: you'll quickly reach the major hubs, and everything else, after a short period. Then it's a matter of where the crawler chooses to spend its attention: which paths, and how deep (see the frontier sketch below).

If you keep crawling 'over and over', you may want to pick what you revisit based on your own followup analysis, not the seeds of your first crawl(s).
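To make the "where the crawler spends its attention" point concrete, here is a minimal toy frontier. The scoring rule (count of discovered in-links) and the fixed revisit interval are placeholders I made up; the point is only that ordering and revisiting are driven by what the crawl itself discovers, not by the original seed list:

```python
import heapq
import time


class Frontier:
    """Toy crawl frontier for illustration only.

    URLs are ordered by (next due time, -in-link count): everything due now
    comes out most-linked first, and a fetched URL is pushed back with a
    later due time so it gets revisited. A real crawler would replace the
    score and the fixed revisit interval with its own followup analysis
    (observed change frequency, site weight, politeness limits, ...).
    """

    def __init__(self, revisit_interval=3600.0):
        self._heap = []       # entries: (due_time, -score, url)
        self._inlinks = {}    # url -> number of discovered links pointing at it
        self.revisit_interval = revisit_interval

    def add(self, url, now=None):
        """Record one discovered link to `url` and (re)queue it."""
        now = time.time() if now is None else now
        self._inlinks[url] = self._inlinks.get(url, 0) + 1
        heapq.heappush(self._heap, (now, -self._inlinks[url], url))

    def pop(self, now=None):
        """Return the next due URL and schedule its revisit, or None."""
        now = time.time() if now is None else now
        while self._heap:
            due, neg_score, url = self._heap[0]
            if due > now:
                return None                      # nothing is due yet
            heapq.heappop(self._heap)
            if -neg_score != self._inlinks[url]:
                continue                         # stale entry; skip it
            heapq.heappush(
                self._heap,
                (now + self.revisit_interval, -self._inlinks[url], url),
            )
            return url
        return None


if __name__ == "__main__":
    t0 = time.time()
    frontier = Frontier(revisit_interval=600.0)
    for seed in ["http://www.dmoz.org/", "http://en.wikipedia.org/"]:
        frontier.add(seed, now=t0)
    frontier.add("http://en.wikipedia.org/", now=t0)  # a second discovered link
    print(frontier.pop(now=t0))  # the more-linked seed comes out first
```

Swap the in-link count for whatever your followup analysis actually measures (change rate, link quality, content type) and the same structure still works.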