
Caterpillar - A PHP Web Crawler using parallel requests

1 point by jqueryin almost 15 years ago

1 comment

jqueryin almost 15 years ago
I created this library a while back to be called from the CLI as a cron job. Its sole purpose is to crawl your entire domain and build a database of all pages, inbound link counts, last modified times, etc. That data can then be used by a separate script for statistical reporting or for generating a sitemap XML file. The inbound link counts let you assign priorities in your sitemap file, and the last modified times give a fairly accurate picture of when each page's content last changed, which is also useful for the sitemap. This is a huge step up for sites with dynamic data that don't carry any form of modified timestamp.

The largest site I tested this on had ~500 pages. I would recommend setting memory_limit to at least 32MB in php.ini, as the crawler can be fairly memory intensive when it spawns 5 parallel processes for crawling. I did some fairly extensive optimizations to keep memory usage down; if you spot anything that could be improved upon, please let me know.
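For readers unfamiliar with how parallel crawling works in PHP, here is a minimal sketch of the general approach the comment describes: fetch a batch of pages concurrently with curl_multi, record each page's last-modified time, and tally inbound links so a separate script could map them to sitemap priorities. The function names are illustrative assumptions, not Caterpillar's actual API.

```php
<?php
// Illustrative sketch only; not Caterpillar's real interface.
// Fetches a batch of URLs in parallel and collects data a sitemap
// generator could use (last-modified times, inbound link counts).

function fetch_parallel(array $urls, int $concurrency = 5): array
{
    $results = [];
    $multi   = curl_multi_init();
    $handles = [];

    foreach (array_slice($urls, 0, $concurrency) as $url) {
        $ch = curl_init($url);
        curl_setopt_array($ch, [
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_FOLLOWLOCATION => true,
            CURLOPT_FILETIME       => true,   // ask curl for the remote modified time
            CURLOPT_TIMEOUT        => 10,
        ]);
        curl_multi_add_handle($multi, $ch);
        $handles[$url] = $ch;
    }

    // Drive all transfers until every handle has finished.
    do {
        $status = curl_multi_exec($multi, $active);
        if ($active) {
            curl_multi_select($multi);
        }
    } while ($active && $status === CURLM_OK);

    foreach ($handles as $url => $ch) {
        $results[$url] = [
            'body'          => curl_multi_getcontent($ch),
            'last_modified' => curl_getinfo($ch, CURLINFO_FILETIME), // -1 if unknown
        ];
        curl_multi_remove_handle($multi, $ch);
        curl_close($ch);
    }
    curl_multi_close($multi);

    return $results;
}

// Count inbound links: every <a href> found in a fetched page increments
// the target URL's counter; a sitemap script could translate these counts
// into <priority> values and the modified times into <lastmod>.
function tally_inbound_links(array $pages): array
{
    $inbound = [];
    foreach ($pages as $page) {
        if (preg_match_all('/<a\s[^>]*href="([^"#]+)"/i', $page['body'], $m)) {
            foreach ($m[1] as $target) {
                $inbound[$target] = ($inbound[$target] ?? 0) + 1;
            }
        }
    }
    return $inbound;
}
```

The memory_limit advice from the comment can also be applied per run without editing php.ini, e.g. `php -d memory_limit=32M crawl.php` (script name hypothetical).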