Guido van Rossum, the creator of Python, wrote a web crawler as a motivating example for asyncio. You can find the code for it here:<p><a href="https://github.com/aosabook/500lines/tree/master/crawler" rel="nofollow">https://github.com/aosabook/500lines/tree/master/crawler</a><p>And a detailed post about its design, co-written with A. Jesse Jiryu Davis, here:<p><a href="http://aosabook.org/en/500L/a-web-crawler-with-asyncio-coroutines.html" rel="nofollow">http://aosabook.org/en/500L/a-web-crawler-with-asyncio-corou...</a>
I was investigating how to add asyncio / async def support to Scrapy (see <a href="https://github.com/scrapy/scrapy/issues/1144#issuecomment-141843616" rel="nofollow">https://github.com/scrapy/scrapy/issues/1144#issuecomment-14...</a>). Small examples like the one at the link look neat, but it is not all roses once you go further. The problems are not specific to Scrapy; I think any advanced `async def` based crawler will face them.<p>There are two challenges with async def that I don't know how to solve elegantly:<p>1. how to integrate coroutine-based scraping code with on-disk persistent request queues;<p>2. how to deallocate resources without boilerplate in coroutine-based scraping code.<p>(1) is easier with callbacks-as-methods because state is passed explicitly rather than hidden in local variables, so Scrapy can choose to save it to disk (a sketch of what that looks like follows the example below).<p>An example of (2) is this code:<p><pre><code> async def parse(self, response):
     # url is some URL extracted from response
     resp = await self.fetch(url)
     # ... find another URL (url2) to follow
     # Here is the problem: the response objects received
     # so far are still referenced by this coroutine frame,
     # so they stay in memory until the second response is
     # fully received. That matters when tens or hundreds
     # of requests are processed in parallel and responses
     # are large.
     # With callbacks, refcounting would have freed the
     # response as soon as the second request started -
     # callbacks + refcounting provide an elegant way
     # to deallocate resources.
     resp = await self.fetch(url2)
</code></pre>
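For contrast, here is a rough sketch of the same flow with callbacks (these are methods on a scrapy.Spider subclass; find_next_url() and parse_next() are made-up names, but the Request/callback mechanics are standard Scrapy). As soon as parse() returns, its frame and the response it received can be garbage-collected; only the small Request object lives on in the scheduler queue:<p><pre><code> def parse(self, response):
     url2 = find_next_url(response)  # made-up helper
     # Returning ends this frame, so refcounting frees
     # `response` while the new request is still in flight.
     return [scrapy.Request(url2, callback=self.parse_next)]

 def parse_next(self, response2):
     ...  # the first response was freed long before this runs
 </code></pre>
And to make (1) concrete: in a callback-based spider, everything a request needs in order to be resumed later travels explicitly on the Request object - the URL, the callback (referenced by name), and a picklable meta dict - which is what lets Scrapy serialize its request queue to disk (the JOBDIR feature). The locals of a suspended coroutine frame can't be pickled that way. Again a sketch; the selector expression and meta contents are made up:<p><pre><code> def parse(self, response):
     for href in response.css("a::attr(href)").extract():
         yield scrapy.Request(
             response.urljoin(href),
             callback=self.parse_item,       # stored by name, picklable
             meta={"source": response.url},  # explicit, picklable state
         )

 def parse_item(self, response):
     yield {"url": response.url, "source": response.meta["source"]}
 </code></pre>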
If anyone has suggestions please comment on <a href="https://github.com/scrapy/scrapy/issues/1144#issuecomment-141843616" rel="nofollow">https://github.com/scrapy/scrapy/issues/1144#issuecomment-14...</a>.
This example isn't really making use of asyncio. loop.run_until_complete() is a blocking method (note that you don't use await when calling it, as it's not a coroutine). You'd want to use something like asyncio.wait() with multiple futures to achieve some concurrency.
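A minimal sketch of that pattern, using async/await syntax and aiohttp as the HTTP client (the client choice and the URLs are just illustrative, not something from this thread):<p><pre><code> import asyncio
 import aiohttp  # illustrative choice of async HTTP client

 async def fetch(session, url):
     async with session.get(url) as resp:
         return await resp.text()

 async def crawl(urls):
     async with aiohttp.ClientSession() as session:
         # ensure_future() wraps each coroutine in a Task so the
         # fetches run concurrently on the event loop
         tasks = [asyncio.ensure_future(fetch(session, u)) for u in urls]
         done, pending = await asyncio.wait(tasks)
         return [t.result() for t in done]

 loop = asyncio.get_event_loop()
 pages = loop.run_until_complete(
     crawl(["http://example.com/a", "http://example.com/b"]))
 </code></pre>
run_until_complete() still blocks, but only once at the top level; the concurrency happens inside it via asyncio.wait().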
Here's a proper example written for 3.4+: <a href="https://gist.github.com/madjar/9312452" rel="nofollow">https://gist.github.com/madjar/9312452</a>