
Tiny, dirty, iffy, good enough, basic multi-threaded web crawler in Python

14 points, posted by rangeva almost 10 years ago

5 comments

tokenizerrr, almost 10 years ago
The regex will break on

    <a href='actualLink' _href='spoofedLink'>

and will return spoofedLink instead of actualLink, while browsers will follow actualLink. This is why you shouldn't be trying to parse xml/html with regexes.
(Comment #10047643 not loaded)
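To illustrate the failure mode the comment describes, here is a small sketch using only Python's standard library (not the article's actual code): a naive `href=` regex also matches the decoy `_href` attribute, while a real HTML parser only sees the genuine `href`.

```python
import re
from html.parser import HTMLParser

html = "<a href='actualLink' _href='spoofedLink'>click</a>"

# Naive extraction: "_href='...'" also contains "href='...'",
# so the spoofed link leaks into the results.
naive = re.findall(r"href='([^']*)'", html)
print(naive)  # ['actualLink', 'spoofedLink']

# A real parser tokenizes attributes, so "_href" is a distinct
# attribute name and only the true href survives.
class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href")

parser = LinkExtractor()
parser.feed(html)
print(parser.links)  # ['actualLink']
```

The regex could be patched (e.g. with a `\b` word boundary), but each patch handles one spoof while a parser handles the grammar.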
sebcat, almost 10 years ago
Crawling can be broken down into:

    1) fetching resources
    2) finding out what new resources to fetch

1) is a network-bound problem; 2) is mostly disk/CPU bound. Realizing the difference between these two things and separating them is the key to building a good crawler.

Depending on how you find out what resources to fetch (parsing static documents vs. dynamic JS analysis with multiple dependencies on other resources, such as included JS), "good-enough" crawlers are mostly bound to the network.

I've seen people running one crawl per process on their back-end and some management guy saying "we need to crawl faster, add more threads per crawl" when one crawl cycle spends 10x more time waiting on the network than it does parsing a document.
(Comment #10047542 not loaded)
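The separation the comment argues for can be sketched in a few lines: fan the network-bound fetches out to a thread pool, then do the CPU-bound link extraction in a single thread. This is a minimal illustration, with a hypothetical in-memory page graph standing in for real HTTP requests so it runs offline; in practice `fetch` would wrap `urllib` or similar.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical link graph standing in for real pages.
PAGES = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": [],
}

def fetch(url):
    # Step 1: network-bound. In a real crawler this blocks on I/O,
    # which is why many threads (or async I/O) pay off here.
    return PAGES.get(url, [])

def crawl(seeds, max_workers=8):
    seen = set(seeds)
    frontier = list(seeds)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while frontier:
            # Fetch the whole frontier concurrently.
            results = list(pool.map(fetch, frontier))
            # Step 2: CPU-bound link extraction, done single-threaded.
            frontier = []
            for links in results:
                for link in links:
                    if link not in seen:
                        seen.add(link)
                        frontier.append(link)
    return seen

print(sorted(crawl(["a"])))  # ['a', 'b', 'c']
```

Adding threads to step 2 would do little here, which is the comment's point: when a cycle spends most of its time waiting on the network, concurrency belongs in the fetch stage.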
anc84, almost 10 years ago
I highly recommend you check out https://github.com/chfoo/wpull
roma1n, almost 10 years ago
Nice to see a tiny, useful code example.
emilssolmanis, almost 10 years ago
Also happens to parse XML with regexes. Lovely.