Show HN: Python 3.5 Async Web Crawler Example

28 points by mehmetkose about 9 years ago

6 comments

pixelmonkey about 9 years ago

Guido van Rossum, the creator of Python, wrote a web crawler as a motivating example for asyncio. You can find the code for it here:

https://github.com/aosabook/500lines/tree/master/crawler

And a detailed post about its design, co-written with A. Jesse Jiryu Davis, here:

http://aosabook.org/en/500L/a-web-crawler-with-asyncio-coroutines.html
kmike84 about 9 years ago

I was investigating how to add asyncio / async def support to Scrapy (see https://github.com/scrapy/scrapy/issues/1144#issuecomment-141843616). Small examples like the one at the link look neat, but it is not all roses as you go further. The problems are not specific to Scrapy; I think any advanced `async def` based crawler will face them.

There are 2 challenges with async def I don't know how to solve elegantly:

1. how to integrate coroutine-based scraping code with on-disk persistent request queues;

2. how to deallocate resources without boilerplate in coroutine-based scraping code.

(1) is easier with callbacks-as-methods because this way state is passed explicitly (it is not in local variables), so Scrapy can choose to save it to disk.

An example of (2) is this code:

    async def parse(self, response):
        resp = await self.fetch(url)
        # ... find another URL to follow
        # Here we have the problem: the response object is kept in
        # memory until the second response is fully received. This is
        # a problem if 10s and 100s of requests are processed in
        # parallel and responses are large. Because of refcounting,
        # with callbacks the response would have been kept in memory
        # only until the second request starts - callbacks+refcounting
        # provide an elegant way for resource deallocation.
        resp = await self.fetch(url2)

If anyone has suggestions please comment on https://github.com/scrapy/scrapy/issues/1144#issuecomment-141843616.
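One hypothetical workaround (not from the comment above) is to drop the reference explicitly before the coroutine suspends again, so CPython's refcounting can free the first response while the second request is in flight. The names `self.fetch`, `url`, and `url2` are the same assumed names used in the snippet above:

    async def parse(self, response):
        resp = await self.fetch(url)
        # ... find another URL to follow ...
        # Drop the only remaining reference before awaiting again, so the
        # first response can be garbage-collected while this coroutine is
        # suspended waiting on the second request.
        del resp
        resp = await self.fetch(url2)

This still requires boilerplate in every coroutine, which is exactly the kind of thing the comment argues callbacks plus refcounting handled for free.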
zedpm about 9 years ago

This example isn't really making use of asyncio. asyncio.run_until_complete() is a blocking method (note that you don't use await when calling it, as it's not a coroutine). You'd want to use something like asyncio.wait() with multiple futures to achieve some concurrency.
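A minimal sketch of that approach, assuming aiohttp is used for the HTTP requests (the URLs and function names here are placeholders, not from the submission):

    import asyncio
    import aiohttp

    async def fetch(session, url):
        # Fetch one page and return its body text.
        async with session.get(url) as resp:
            return await resp.text()

    async def crawl(urls):
        async with aiohttp.ClientSession() as session:
            # Schedule every fetch up front so they run concurrently,
            # then wait for the whole batch to finish.
            tasks = [asyncio.ensure_future(fetch(session, u)) for u in urls]
            done, _pending = await asyncio.wait(tasks)
            return [t.result() for t in done]

    loop = asyncio.get_event_loop()
    pages = loop.run_until_complete(crawl([
        "https://example.com",
        "https://example.org",
    ]))

The single run_until_complete() call still blocks, but because the futures are awaited together inside the event loop, the requests themselves overlap instead of running one after another.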
takeda about 9 years ago

While you're using asyncio, your requests are still done serially due to using loop.run_until_complete().
dham about 9 years ago

What is the advantage of this over, say, using threads? Web scraping is pretty much all IO, so you get big wins using threads in Python and Ruby.
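For comparison, a rough thread-based version of the same idea, using only concurrent.futures and the requests library (hypothetical URLs, not from the submission):

    import concurrent.futures
    import requests

    def fetch(url):
        # Blocking fetch; the GIL is released while waiting on the socket,
        # so many of these can overlap in a thread pool.
        return requests.get(url).text

    urls = ["https://example.com", "https://example.org"]
    with concurrent.futures.ThreadPoolExecutor(max_workers=10) as pool:
        pages = list(pool.map(fetch, urls))

The usual argument for asyncio is that coroutines are much cheaper than OS threads when you have thousands of concurrent connections, not that IO-bound scraping is impossible with threads.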
aaront about 9 years ago

Here's a proper example written for Python 3.4+: https://gist.github.com/madjar/9312452