Show HN: Python 3.5 Async Web Crawler Example

28 points by mehmetkose about 9 years ago

6 comments

pixelmonkey about 9 years ago

Guido van Rossum, the creator of Python, wrote a web crawler as a motivating example for asyncio. You can find the code for it here:

https://github.com/aosabook/500lines/tree/master/crawler

And a detailed post about its design, co-written with A. Jesse Jiryu Davis, here:

http://aosabook.org/en/500L/a-web-crawler-with-asyncio-coroutines.html
kmike84 about 9 years ago

I was investigating how to add asyncio / async def support to Scrapy (see https://github.com/scrapy/scrapy/issues/1144#issuecomment-141843616). Small examples like the one at the link look neat, but it is not all roses as you go further. The problems are not specific to Scrapy; I think any advanced `async def` based crawler will face them.

There are 2 challenges with async def I don't know how to solve elegantly:

1. how to integrate coroutine-based scraping code with on-disk persistent request queues;

2. how to deallocate resources without boilerplate in coroutine-based scraping code.

(1) is easier with callbacks-as-methods because this way state is passed explicitly (it is not in local variables), so Scrapy can choose to save it to disk.

An example of (2) is this code:

    async def parse(self, response):
        resp = await self.fetch(url)
        # ... find another URL to follow
        # Here we have the problem: the response object is kept in
        # memory until the second response is fully received. This is
        # a problem if 10s and 100s of requests are processed in
        # parallel and responses are large. Because of refcounting,
        # with callbacks the response would have been kept in memory
        # only until the second request starts - callbacks+refcounting
        # provide an elegant way for resource deallocation.
        resp = await self.fetch(url2)

If anyone has suggestions please comment on https://github.com/scrapy/scrapy/issues/1144#issuecomment-141843616.
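One hypothetical workaround (not from the comment above) is to drop the reference explicitly before the coroutine suspends again, so CPython's refcounting can free the first response while the second request is in flight. The names `self.fetch`, `url`, and `url2` are the same assumed names used in the snippet above:

    async def parse(self, response):
        resp = await self.fetch(url)
        # ... find another URL to follow ...
        # Drop the only remaining reference before awaiting again, so the
        # first response can be garbage-collected while this coroutine is
        # suspended waiting on the second request.
        del resp
        resp = await self.fetch(url2)

This still requires boilerplate in every coroutine, which is exactly the kind of thing the comment argues callbacks plus refcounting handled for free.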
zedpm about 9 years ago

This example isn't really making use of asyncio. asyncio.run_until_complete() is a blocking method (note that you don't use await when calling it, as it's not a coroutine). You'd want to use something like asyncio.wait() with multiple futures to achieve some concurrency.
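A minimal sketch of that approach, assuming aiohttp is used for the HTTP requests (the URLs and function names here are placeholders, not from the submission):

    import asyncio
    import aiohttp

    async def fetch(session, url):
        # Fetch one page and return its body text.
        async with session.get(url) as resp:
            return await resp.text()

    async def crawl(urls):
        async with aiohttp.ClientSession() as session:
            # Schedule every fetch up front so they run concurrently,
            # then wait for the whole batch to finish.
            tasks = [asyncio.ensure_future(fetch(session, u)) for u in urls]
            done, _pending = await asyncio.wait(tasks)
            return [t.result() for t in done]

    loop = asyncio.get_event_loop()
    pages = loop.run_until_complete(crawl([
        "https://example.com",
        "https://example.org",
    ]))

The single run_until_complete() call still blocks, but because the futures are awaited together inside the event loop, the requests themselves overlap instead of running one after another.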
takeda about 9 years ago

While you're using asyncio, your requests are still done serially due to using loop.run_until_complete().
dham about 9 years ago

What is the advantage of this over, say, using threads? Web scraping is pretty much all IO, so you get big wins using threads in Python and Ruby.
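For comparison, a rough thread-based version of the same idea, using only concurrent.futures and the requests library (hypothetical URLs, not from the submission):

    import concurrent.futures
    import requests

    def fetch(url):
        # Blocking fetch; the GIL is released while waiting on the socket,
        # so many of these can overlap in a thread pool.
        return requests.get(url).text

    urls = ["https://example.com", "https://example.org"]
    with concurrent.futures.ThreadPoolExecutor(max_workers=10) as pool:
        pages = list(pool.map(fetch, urls))

The usual argument for asyncio is that coroutines are much cheaper than OS threads when you have thousands of concurrent connections, not that IO-bound scraping is impossible with threads.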
aaront about 9 years ago

Here's a proper example written for Python 3.4+: https://gist.github.com/madjar/9312452