Ask HN: What's a good general seed list for a web crawler?

8 points by jdrock over 16 years ago
I'm developing a web crawler on top of a large distributed computer. As part of the testing process, I want to keep a background job running to keep crawling the web over and over. I was wondering if anyone had some ideas of a general seed list from which the crawler could reach a wide variety of links. It would be great if the links it traversed were a good representation of the Internet as a whole, taking into account content variety, frequency of updates, and other variables.

6 comments

soult over 16 years ago
Wikipedia provides dumps of its link table: http://download.wikimedia.org/backup-index.html
alex_c over 16 years ago
I've never done something like this myself, but what about using something like http://www.dmoz.org/?
gojomo over 16 years ago
DMOZ, Wikipedia, Yahoo Directory are the classic broad starting points. You could also begin with the top 100, 500, 1000, etc. sites from some ranking service (like Alexa), or top N results from major search engines on queries of special interest.

Depending on how you order discovered URLs and sites for crawling, it may not make too much of a difference where you start a truly web-wide crawl: you'll quickly reach major hubs, and everything else, after a short period. Then it's a matter of where the crawler chooses to spend its attention: which paths, how deep.

If you keep crawling 'over and over' you may want to pick what you revisit based on your own followup analysis, not the seeds of your first crawl(s).
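To make the "combine several broad sources" idea concrete, here is a minimal sketch, not anyone's actual crawler code. It assumes you have already extracted plain URL lists from each source (a directory, an encyclopedia link dump, a ranking list); the helper names `host_of` and `merge_seeds` and the sample URLs are hypothetical. It interleaves the lists round-robin and keeps one seed per host, so no single source dominates the start of the crawl.

```python
from itertools import zip_longest
from urllib.parse import urlsplit


def host_of(url):
    """Return the lowercased host of a URL, without a leading 'www.'."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def merge_seeds(*sources, limit=None):
    """Interleave seed lists round-robin, keeping the first URL seen per host."""
    seen = set()
    merged = []
    for batch in zip_longest(*sources):
        for url in batch:
            if url is None:  # this source is shorter than the others
                continue
            host = host_of(url)
            if not host or host in seen:
                continue
            seen.add(host)
            merged.append(url)
            if limit is not None and len(merged) >= limit:
                return merged
    return merged


# Placeholder inputs; real lists would come from your own extraction step.
directory_seeds = ["http://www.dmoz.org/", "http://dir.yahoo.com/"]
encyclopedia_seeds = ["http://en.wikipedia.org/", "http://www.dmoz.org/Science/"]
ranking_seeds = ["http://www.example.com/", "http://example.org/"]

seeds = merge_seeds(directory_seeds, encyclopedia_seeds, ranking_seeds)
for url in seeds:
    print(url)
```

Deduplicating by host is only one possible choice; if you want several entry points per large site, you could key on the full URL instead.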
fizx over 16 years ago
dmoz
okeumeni over 16 years ago
Yahoo directory is a good start.
xenophanes over 16 years ago
could you google each word in the dictionary and use the top 10 results for each? i don't know if this is a decent idea or not. maybe someone will tell me :)