
Ask HN: Designing a crawler to extract all the links from a website (site map)?

3 points by jurgenwerk about 7 years ago
How would you build a bot which receives a website address as input, then extracts and visits all the (sub)pages that can be found on that website? It would use the gathered data to save the status of each page (response code, title, description, load time...) and store the pages in a data structure from which a tree map of the pages can be built (like a folder structure in a file browser). This data structure needs weekly snapshots so comparisons can be made over time.

I'm thinking about the two main aspects of this bot: the crawling strategy, and the data structure that stores the results so they can be queried efficiently. Regarding the crawling algorithm, probably the easiest would be:

- Visit the page (level 1)
- Extract all the internal links
- Visit the first link, save its data
- Go to step 2 (uncover the next level of links)

Obviously, there are some critical problems with this strategy. How do we know when we are done? How do we prevent cycles? What problems can arise when crawls are performed concurrently?

The second question is the database for storing these links. The data should have the following properties:

- Associated with a specific website crawl at a point in time (so it can be compared with crawls from other times)
- Links within each crawl need to point to each other, so a website tree can be constructed

This perhaps calls for a graph database, but that's expensive (learning it + maintenance cost). What about a traditional RDBMS (Postgres)? A "links" table, referenced by "crawls" and "websites" tables, where links are uniquely identified by URL and can point to other links - for example the parent link (the previous level).

Can you point me to some good algorithms and strategies?
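The loop sketched above is essentially breadth-first search: keep a frontier queue and a "seen" set, and you are done when the queue is empty; the seen set is also what prevents cycles. A minimal stdlib-only sketch - the `fetch` callable and the `example.com` URLs are stand-ins for a real HTTP request + HTML parse, not part of any particular library:

```python
from collections import deque
from urllib.parse import urljoin, urldefrag

def crawl(start_url, fetch, max_pages=10_000):
    """Breadth-first crawl of one site.

    `fetch(url)` must return (status, title, links) for a page; here it
    is a placeholder for an HTTP GET plus HTML link extraction.  Returns
    a dict mapping each visited URL to its record, including the parent
    URL that first linked to it -- enough to rebuild the site tree.
    """
    seen = {start_url}                 # cycle prevention: never enqueue a URL twice
    queue = deque([(start_url, None)])
    pages = {}
    while queue and len(pages) < max_pages:   # done when the frontier is empty
        url, parent = queue.popleft()
        status, title, links = fetch(url)
        pages[url] = {"status": status, "title": title, "parent": parent}
        for href in links:
            # normalise: resolve relative links, drop #fragments
            child = urldefrag(urljoin(url, href))[0]
            if child.startswith(start_url) and child not in seen:  # internal only
                seen.add(child)
                queue.append((child, url))
    return pages

# Fake three-page site with a cycle (/a links back to the home page):
site = {
    "https://example.com/":  (200, "Home", ["/a", "/b"]),
    "https://example.com/a": (200, "A",    ["/", "/b"]),
    "https://example.com/b": (404, "B",    []),
}
pages = crawl("https://example.com/", lambda url: site[url])
```

For concurrent crawls, the same structure works if the seen set and queue are shared behind a lock (or replaced by a work queue service); the check-and-add on `seen` must be atomic, otherwise two workers can fetch the same URL.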

2 comments

Piskvorrr about 7 years ago
https://www.gnu.org/software/wget/manual/html_node/Recursive-Retrieval-Options.html
tedmiston about 7 years ago
Check out https://scrapy.org/ to start