
A Web Crawler with Asyncio Coroutines

95 points, by nickpresta, over 9 years ago

7 comments

theVirginian, over 9 years ago
Great tutorial. I would love to see this rewritten with the new async/await syntax in Python 3.5.
Comment #10236545 not loaded
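For a sense of what that rewrite might look like, here is a minimal sketch (not the article's code) of a fetch coroutine using Python 3.5's async/await syntax. It assumes aiohttp for the HTTP layer, which may differ from the article's own approach, and the function names are illustrative only.

```python
import asyncio
import aiohttp

# Pre-3.5 style, roughly what generator-based asyncio code looked like:
#
#   @asyncio.coroutine
#   def fetch(session, url):
#       response = yield from session.get(url)
#       body = yield from response.text()
#       return body

async def fetch(session, url):
    # 'async def' / 'await' replace @asyncio.coroutine / 'yield from'
    async with session.get(url) as response:
        return await response.text()

async def main(urls):
    async with aiohttp.ClientSession() as session:
        # Run all fetches concurrently on the event loop
        return await asyncio.gather(*(fetch(session, u) for u in urls))

if __name__ == "__main__":
    loop = asyncio.get_event_loop()
    pages = loop.run_until_complete(main(["https://example.com"]))
    print(len(pages[0]))
```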
potatosareok, over 9 years ago
One question I have about this - and I might have missed it in the article - is: I'm all for using asyncio to make HTTP requests, but I see they apparently also use asyncio for "parse_links". Since parse_links should be a CPU-bound operation, would it make sense to use fibers to download links and pass them into a thread pool to actually parse them / add them to the queue?

I'm messing around with the ParallelUniverse Java fiber implementation, and what I do is spawn fibers to download pages and send the String response over a channel to another fiber that maintains a thread pool to parse response bodies as they come in / create new fibers to read these links.

I'm really just doing this to get more familiar with async programming and specifically the ParallelUniverse Java libs, but one thing I'm struggling with is how best to make it well behaved (e.g. right now there's no bound on the number of outstanding HTTP requests).
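A rough illustration of the split this comment describes, translated to Python rather than Java fibers: network I/O stays on the asyncio event loop, CPU-bound parsing is pushed to a pool via run_in_executor, and a semaphore bounds the number of outstanding requests. This is a hedged sketch, not the article's crawler; aiohttp is assumed, and parse_links here is only a stand-in for real HTML parsing.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

import aiohttp

MAX_IN_FLIGHT = 10  # bound on the number of outstanding HTTP requests


def parse_links(body):
    # Placeholder for real HTML parsing; this is the CPU-bound step
    return [line for line in body.splitlines() if "href=" in line]


async def fetch_and_parse(session, url, semaphore, executor, loop):
    # The semaphore keeps the number of in-flight requests bounded
    async with semaphore:
        async with session.get(url) as response:
            body = await response.text()
    # Run CPU-bound parsing in the pool so it never blocks the event loop
    # (a ProcessPoolExecutor would sidestep the GIL for heavier parsing)
    return await loop.run_in_executor(executor, parse_links, body)


async def crawl(urls):
    loop = asyncio.get_event_loop()
    semaphore = asyncio.Semaphore(MAX_IN_FLIGHT)
    with ThreadPoolExecutor(max_workers=4) as executor:
        async with aiohttp.ClientSession() as session:
            tasks = [fetch_and_parse(session, u, semaphore, executor, loop)
                     for u in urls]
            return await asyncio.gather(*tasks)


if __name__ == "__main__":
    seeds = ["https://example.com", "https://www.python.org"]
    results = asyncio.get_event_loop().run_until_complete(crawl(seeds))
    print([len(links) for links in results])
```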
Schwolop, over 9 years ago
This article is way more important than the web crawler example used to motivate it. It's easily the single best thing I've ever read on asyncio, and I've been using it in anger for a year now. I've passed it around my team, and will be recommending it far and wide!
fabiandesimone, over 9 years ago
I'm working on a project that involves lots of web crawling. I'm not technical at all (I'm hiring freelancers).

While I do have access to great general technology-related advice, this post is bound to bring in people well versed in crawling.

My question is: in terms of crawling speed (and I know this depends on several factors), what's a decent number of pages a good crawler could do per day?

The crawler I built is doing about 120K pages per day, which for our initial needs is not bad at all, but I wonder if in the crawling world this is peanuts or a decent chunk of pages.
Comment #10222942 not loaded
Comment #10224721 not loaded
Comment #10222946 not loaded
Comment #10224921 not loaded
Comment #10222881 not loaded
Comment #10223982 not loaded
Comment #10224846 not loaded
Comment #10223700 not loaded
Comment #10223615 not loaded
Comment #10223149 not loaded
Comment #10223102 not loaded
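The replies to that throughput question were not loaded above, so as a neutral back-of-envelope sketch rather than an answer from the thread: 120K pages per day averages out to roughly 1.4 pages per second, and the practical ceiling depends mostly on how many requests the crawler keeps in flight and on per-host politeness limits.

```python
# Back-of-envelope conversion (illustrative numbers, not from the thread)
pages_per_day = 120000
seconds_per_day = 24 * 60 * 60             # 86,400
print(pages_per_day / seconds_per_day)     # ~1.39 pages per second on average

# For comparison: holding 10 requests in flight, each taking about one
# second, would sustain roughly 10 * 86,400 = 864,000 pages per day.
print(10 * seconds_per_day)
```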
Animats, over 9 years ago
It would be interesting to compare this Python approach with a Go goroutine approach. The main question is whether Go's libraries handle massive numbers of connections well. Since Google wrote Go to be used internally, they probably do.
Comment #10223651 not loaded
juddlyon, over 9 years ago
Node is well-suited for this type of thing and there are numerous libraries to help.
rgacote, over 9 years ago
Appreciate the in-depth description. Look forward to working through this in detail.