
Tiny, dirty, iffy, good enough, basic multi-threaded web crawler in Python

14 points, posted by rangeva almost 10 years ago

5 comments

tokenizerrr, almost 10 years ago
The regex will break on

    <a href='actualLink' _href='spoofedLink'>

and will return spoofedLink instead of actualLink, while browsers will follow actualLink. This is why you shouldn't be trying to parse xml/html with regexes.
(Comment #10047643 not loaded)
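To illustrate the failure mode the comment describes, here is a small sketch using only Python's standard library (not the article's actual code): a naive `href=` regex also matches the decoy `_href` attribute, while a real HTML parser only sees the genuine `href`.

```python
import re
from html.parser import HTMLParser

html = "<a href='actualLink' _href='spoofedLink'>click</a>"

# Naive extraction: "_href='...'" also contains "href='...'",
# so the spoofed link leaks into the results.
naive = re.findall(r"href='([^']*)'", html)
print(naive)  # ['actualLink', 'spoofedLink']

# A real parser tokenizes attributes, so "_href" is a distinct
# attribute name and only the true href survives.
class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href")

parser = LinkExtractor()
parser.feed(html)
print(parser.links)  # ['actualLink']
```

The regex could be patched (e.g. with a `\b` word boundary), but each patch handles one spoof while a parser handles the grammar.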
sebcat, almost 10 years ago
Crawling can be broken down into:

    1) fetching resources
    2) finding out what new resources to fetch

1) is a network-bound problem; 2) is mostly disk/CPU bound. Realizing the difference between these two things and separating them is the key to building a good crawler.

Depending on how you find out what resources to fetch (parsing static documents vs. dynamic JS analysis with multiple dependencies on other resources, such as included JS), "good-enough" crawlers are mostly bound to the network.

I've seen people running one crawl per process on their back-end and some management guy saying "we need to crawl faster, add more threads per crawl" when one crawl cycle spends 10x more time waiting on the network than it does parsing a document.
(Comment #10047542 not loaded)
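The separation the comment argues for can be sketched in a few lines: fan the network-bound fetches out to a thread pool, then do the CPU-bound link extraction in a single thread. This is a minimal illustration, with a hypothetical in-memory page graph standing in for real HTTP requests so it runs offline; in practice `fetch` would wrap `urllib` or similar.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical link graph standing in for real pages.
PAGES = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": [],
}

def fetch(url):
    # Step 1: network-bound. In a real crawler this blocks on I/O,
    # which is why many threads (or async I/O) pay off here.
    return PAGES.get(url, [])

def crawl(seeds, max_workers=8):
    seen = set(seeds)
    frontier = list(seeds)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while frontier:
            # Fetch the whole frontier concurrently.
            results = list(pool.map(fetch, frontier))
            # Step 2: CPU-bound link extraction, done single-threaded.
            frontier = []
            for links in results:
                for link in links:
                    if link not in seen:
                        seen.add(link)
                        frontier.append(link)
    return seen

print(sorted(crawl(["a"])))  # ['a', 'b', 'c']
```

Adding threads to step 2 would do little here, which is the comment's point: when a cycle spends most of its time waiting on the network, concurrency belongs in the fetch stage.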
anc84, almost 10 years ago
I highly recommend you check out https://github.com/chfoo/wpull
roma1n, almost 10 years ago
Nice to see a tiny, useful code example.
emilssolmanis, almost 10 years ago
Also happens to parse XML with regexes. Lovely.