
Caterpillar - A PHP Web Crawler using parallel requests

1 point by jqueryin almost 15 years ago

1 comment

jqueryin almost 15 years ago
I created this library a while back to be called from the CLI as a cron job. Its sole purpose is to crawl your entire domain and build a database of all pages, inbound link counts, last modified times, etc. That data can then be used by a separate script for statistical reporting or for generating a sitemap XML file. The inbound link counts let you assign priorities in your sitemap file, and the last modified times give a fairly accurate picture of when each page's content last changed, which is also useful for the sitemap. This is a huge step up for sites with dynamic data that don't carry any form of modified timestamp.

The largest site I tested this on had ~500 pages. I would recommend setting memory_limit to at least 32MB in php.ini, as the crawler can be fairly memory intensive when it spawns 5 parallel processes for crawling. I did some fairly extensive optimizations to keep memory usage down; if you spot anything that could be improved upon, please let me know.
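For readers unfamiliar with how parallel crawling works in PHP, here is a minimal sketch of the general approach the comment describes: fetch a batch of pages concurrently with curl_multi, record each page's last-modified time, and tally inbound links so a separate script could map them to sitemap priorities. The function names are illustrative assumptions, not Caterpillar's actual API.

```php
<?php
// Illustrative sketch only; not Caterpillar's real interface.
// Fetches a batch of URLs in parallel and collects data a sitemap
// generator could use (last-modified times, inbound link counts).

function fetch_parallel(array $urls, int $concurrency = 5): array
{
    $results = [];
    $multi   = curl_multi_init();
    $handles = [];

    foreach (array_slice($urls, 0, $concurrency) as $url) {
        $ch = curl_init($url);
        curl_setopt_array($ch, [
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_FOLLOWLOCATION => true,
            CURLOPT_FILETIME       => true,   // ask curl for the remote modified time
            CURLOPT_TIMEOUT        => 10,
        ]);
        curl_multi_add_handle($multi, $ch);
        $handles[$url] = $ch;
    }

    // Drive all transfers until every handle has finished.
    do {
        $status = curl_multi_exec($multi, $active);
        if ($active) {
            curl_multi_select($multi);
        }
    } while ($active && $status === CURLM_OK);

    foreach ($handles as $url => $ch) {
        $results[$url] = [
            'body'          => curl_multi_getcontent($ch),
            'last_modified' => curl_getinfo($ch, CURLINFO_FILETIME), // -1 if unknown
        ];
        curl_multi_remove_handle($multi, $ch);
        curl_close($ch);
    }
    curl_multi_close($multi);

    return $results;
}

// Count inbound links: every <a href> found in a fetched page increments
// the target URL's counter; a sitemap script could translate these counts
// into <priority> values and the modified times into <lastmod>.
function tally_inbound_links(array $pages): array
{
    $inbound = [];
    foreach ($pages as $page) {
        if (preg_match_all('/<a\s[^>]*href="([^"#]+)"/i', $page['body'], $m)) {
            foreach ($m[1] as $target) {
                $inbound[$target] = ($inbound[$target] ?? 0) + 1;
            }
        }
    }
    return $inbound;
}
```

The memory_limit advice from the comment can also be applied per run without editing php.ini, e.g. `php -d memory_limit=32M crawl.php` (script name hypothetical).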