
A Web Crawler with Asyncio Coroutines

95 points, by nickpresta, over 9 years ago

7 comments

theVirginian, over 9 years ago
Great tutorial. I would love to see this rewritten with the new async/await syntax in Python 3.5.
Comment #10236545 not loaded
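For a sense of what that rewrite might look like, here is a minimal sketch (not the article's code) of a fetch coroutine using Python 3.5's async/await syntax. It assumes aiohttp for the HTTP layer, which may differ from the article's own approach, and the function names are illustrative only.

```python
import asyncio
import aiohttp

# Pre-3.5 style, roughly what generator-based asyncio code looked like:
#
#   @asyncio.coroutine
#   def fetch(session, url):
#       response = yield from session.get(url)
#       body = yield from response.text()
#       return body

async def fetch(session, url):
    # 'async def' / 'await' replace @asyncio.coroutine / 'yield from'
    async with session.get(url) as response:
        return await response.text()

async def main(urls):
    async with aiohttp.ClientSession() as session:
        # Run all fetches concurrently on the event loop
        return await asyncio.gather(*(fetch(session, u) for u in urls))

if __name__ == "__main__":
    loop = asyncio.get_event_loop()
    pages = loop.run_until_complete(main(["https://example.com"]))
    print(len(pages[0]))
```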
potatosareok, over 9 years ago
One question I have about this - and I might have missed it in the article - is: I'm all for using asyncio to make HTTP requests, but I see they apparently also use asyncio for "parse_links". Since parse_links should be a CPU-bound operation, would it make sense to use fibers to download links and pass them into a thread pool to actually parse them / add them to the queue?

I'm messing around with the ParallelUniverse Java fiber implementation, and what I do is spawn fibers to download pages and send the String response over a channel to another fiber that maintains a thread pool to parse response bodies as they come in / create new fibers to read these links.

I'm really just doing this to get more familiar with async programming and specifically the ParallelUniverse Java libs, but one thing I'm struggling with is how best to make it well behaved (e.g. right now there's no bound on the number of outstanding HTTP requests).
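A rough illustration of the split this comment describes, translated to Python rather than Java fibers: network I/O stays on the asyncio event loop, CPU-bound parsing is pushed to a pool via run_in_executor, and a semaphore bounds the number of outstanding requests. This is a hedged sketch, not the article's crawler; aiohttp is assumed, and parse_links here is only a stand-in for real HTML parsing.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

import aiohttp

MAX_IN_FLIGHT = 10  # bound on the number of outstanding HTTP requests


def parse_links(body):
    # Placeholder for real HTML parsing; this is the CPU-bound step
    return [line for line in body.splitlines() if "href=" in line]


async def fetch_and_parse(session, url, semaphore, executor, loop):
    # The semaphore keeps the number of in-flight requests bounded
    async with semaphore:
        async with session.get(url) as response:
            body = await response.text()
    # Run CPU-bound parsing in the pool so it never blocks the event loop
    # (a ProcessPoolExecutor would sidestep the GIL for heavier parsing)
    return await loop.run_in_executor(executor, parse_links, body)


async def crawl(urls):
    loop = asyncio.get_event_loop()
    semaphore = asyncio.Semaphore(MAX_IN_FLIGHT)
    with ThreadPoolExecutor(max_workers=4) as executor:
        async with aiohttp.ClientSession() as session:
            tasks = [fetch_and_parse(session, u, semaphore, executor, loop)
                     for u in urls]
            return await asyncio.gather(*tasks)


if __name__ == "__main__":
    seeds = ["https://example.com", "https://www.python.org"]
    results = asyncio.get_event_loop().run_until_complete(crawl(seeds))
    print([len(links) for links in results])
```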
Schwolop, over 9 years ago
This article is way more important than the web crawler example used to motivate it. It's easily the single best thing I've ever read on asyncio, and I've been using it in anger for a year now. I've passed it around my team, and will be recommending it far and wide!
fabiandesimone, over 9 years ago
I'm working on a project that involves lots of web crawling. I'm not technical at all (I'm hiring freelancers).

While I do have access to great general technology-related advice, this post is bound to bring in people well versed in crawling.

My question is: in terms of crawling speed (and I know this depends on several factors), what's a decent number of pages a good crawler could do per day?

The crawler I built is doing about 120K pages per day, which for our initial needs is not bad at all, but I wonder if in the crawling world this is peanuts or a decent chunk of pages.
Comment #10222942 not loaded
Comment #10224721 not loaded
Comment #10222946 not loaded
Comment #10224921 not loaded
Comment #10222881 not loaded
Comment #10223982 not loaded
Comment #10224846 not loaded
Comment #10223700 not loaded
Comment #10223615 not loaded
Comment #10223149 not loaded
Comment #10223102 not loaded
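The replies to that throughput question were not loaded above, so as a neutral back-of-envelope sketch rather than an answer from the thread: 120K pages per day averages out to roughly 1.4 pages per second, and the practical ceiling depends mostly on how many requests the crawler keeps in flight and on per-host politeness limits.

```python
# Back-of-envelope conversion (illustrative numbers, not from the thread)
pages_per_day = 120000
seconds_per_day = 24 * 60 * 60             # 86,400
print(pages_per_day / seconds_per_day)     # ~1.39 pages per second on average

# For comparison: holding 10 requests in flight, each taking about one
# second, would sustain roughly 10 * 86,400 = 864,000 pages per day.
print(10 * seconds_per_day)
```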
Animats, over 9 years ago
It would be interesting to compare this Python approach with a Go goroutine approach. The main question is whether Go's libraries handle massive numbers of connections well. Since Google wrote Go to be used internally, they probably do.
Comment #10223651 not loaded
juddlyon, over 9 years ago
Node is well-suited for this type of thing and there are numerous libraries to help.
rgacote, over 9 years ago
Appreciate the in-depth description. Look forward to working through this in detail.