
Making 1M requests with Python-aiohttp

123 points · by dante9999 · about 9 years ago

11 comments

terom · about 9 years ago
Re the EADDRNOTAVAIL from socket.connect():

If you're connecting to 127.0.0.1:8080, then each connection from 127.0.0.1 is going to be assigned an ephemeral TCP source port. There are only a finite number of such ports available, on the order of ~30-50k, which limits the number of connections from a single address to a specific endpoint.

If you're doing 100k TCP connections with 1k concurrent connections, it's feasible that you'll run into those limits, with TCP connections hanging around in TIME_WAIT state after close().

Not that this is a documented errno for connect(), but it's the interpretation that makes sense.

http://www.toptip.ca/2010/02/linux-eaddrnotavail-address-not.html
http://lxr.free-electrons.com/source/net/ipv4/inet_hashtables.c?v=4.4#L572
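A back-of-the-envelope sketch of that limit (the range below is the common Linux default; the actual range on a given host is in /proc/sys/net/ipv4/ip_local_port_range):

    def max_ephemeral_connections(lo=32768, hi=60999):
        """Upper bound on concurrent connections from one source IP to a
        single destination (ip, port) pair: each one consumes an ephemeral
        source port."""
        return hi - lo + 1

    # Roughly 28k ports with the default range. Since close()d connections
    # linger in TIME_WAIT (typically 60s), a fast connect loop can exhaust
    # them well before completing 100k total connections.
    print(max_ephemeral_connections())  # 28232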
tooker · about 9 years ago
I have a library for doing coordinated async IO in Python that addresses some of the scheduling and resource-contention issues hinted at in the later part of this post. It's called cellulario, in reference to containing async IO mechanics inside a cell wall:

    https://github.com/mayfield/cellulario

And an example of using it to manage a multi-tiered scheme, where a first layer of IO requests seeds another layer and then you finally reduce all the responses:

    https://github.com/mayfield/ecmcli/blob/master/ecmcli/api.py#L456
sandGorgon · about 9 years ago
I really keep wishing that there were benchmark comparisons of asyncio/aiohttp with gevent/Python 2. Performance would be a killer reason to migrate immediately to Py3.

What I suspect, though, is that asyncio is not all that much better than gevent. Can someone correct me on this?
velox_io · about 9 years ago
The 1 million in the title is misleading (1M per hour is nothing to write home about, only 278/sec). There are frameworks that are able to hit 1M per minute plus (16,666/sec).
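The arithmetic behind those figures:

    per_hour = 1_000_000 / 3600    # 1M requests spread over an hour
    per_minute = 1_000_000 / 60    # 1M requests spread over a minute
    print(round(per_hour), round(per_minute))  # 278 16667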
ben_jones · about 9 years ago
Does anyone enjoy doing async work in Python? I've done a few hobby projects, and honestly I was yearning for JavaScript + an async lib after a while. As great as Python is, maybe we should *yield* async programming to the languages designed for it?
philippb · about 9 years ago
I'm the CTO at KeepSafe. We open-sourced aiohttp.

We wrote aiohttp for our production system. We build everything on aiohttp. In our production systems we constantly run more requests than in the benchmark, with business logic on each request.

The main reason we like aiohttp so much is that we can write asynchronous code that reads like synchronous code and does not have callbacks.
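A minimal sketch of that style, assuming a recent aiohttp (the URL is a placeholder):

    import asyncio
    import aiohttp

    async def fetch(url):
        # Reads top to bottom like blocking code, no callbacks: each `await`
        # suspends this coroutine and lets the event loop serve other work.
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as resp:
                return await resp.text()

    # asyncio.get_event_loop().run_until_complete(fetch("http://example.com/"))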
takeda · about 9 years ago
IMO you should place all requests within a single ClientSession().

This will provide two benefits:

1. You won't need to use a semaphore. To limit connections, create a TCPConnector() object with its limit set to the value you used in the semaphore and pass it to the ClientSession(); aiohttp will not make more connections than that limit (the default behavior is an unlimited number of connections).

2. With a single ClientSession(), aiohttp will make use of keep-alive (i.e. it will reuse the same connections for subsequent requests, but it will keep at most the limit of connections you set in the TCPConnector() object).

This should improve performance further, and (given a sane limit) it'll also solve the issue with the "Cannot assign requested address" error.

BTW: Even without a limit set, aiohttp will try to reduce the number of open connections, so it might still fix the connection error issue as long as individual requests don't take long. It's still a good idea to set a limit, just to be nice to the remote server.
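A sketch of that setup, assuming a recent aiohttp (the URL list and limit are illustrative):

    import asyncio
    import aiohttp

    async def crawl(urls, limit=1000):
        # The connector's `limit` replaces the semaphore: aiohttp opens at
        # most `limit` concurrent connections and reuses them via keep-alive.
        connector = aiohttp.TCPConnector(limit=limit)
        async with aiohttp.ClientSession(connector=connector) as session:

            async def fetch(url):
                async with session.get(url) as resp:
                    return await resp.read()

            return await asyncio.gather(*(fetch(u) for u in urls))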
nbadg · about 9 years ago
First off, awesome to see more benchmarks (even if it's just personal experimentation) for synchronous vs asyncio performance. I think the real argument for asyncio *right now* is that it makes it very easy for you to write extremely efficient code, even for hobbyist projects. Even though your experiment is only handling 320 req/s, that you were able to do that so quickly and with very, very little optimization is, I think, a testament to the potential of asyncio.

Some pointers:

The event loop is still a single thread and therefore subject to the GIL. That means that at any given time, only one coroutine is running in the loop. This is important for several reasons, but probably the most relevant are that:

1. within any given coroutine, execution flow will always be consistent between yield/await statements.

2. synchronous calls within coroutines will *block the entire event loop*.

3. most of asyncio was not written with thread safety in mind.

That second one is really important. When you're doing file access, e.g. where you're doing "with open('frank.html', 'rb')", that's something you may want to consider moving into a run_in_executor call. That *will* block the coroutine, but it will return control to the event loop, allowing other connections to proceed.

Also, more likely than not, the "too many open files" error is a result of you opening frank.html, not of sockets. I haven't run your code with asyncio in debug mode[1] to verify that, but that would be my intuition. You would probably handle more requests if you changed that; I would do the file access in a run_in_executor with a maximum of 1000 executor workers.
If you want to surpass that, use a process pool instead of a thread pool, and you should be ready to go, though it's worth mentioning that disk IO is hardly ever CPU-bound, so I wouldn't expect much of a performance boost otherwise.

Also, the placement of your semaphore acquisition doesn't make any sense to me. I would create a dedicated coroutine like this:

    async def bounded_fetch(sem, url):
        async with sem:
            return (await fetch(url))

and modify the parent function like this:

    for i in range(r):
        task = asyncio.ensure_future(bounded_fetch(sem, url.format(i)))
        tasks.append(task)

That being said, it also doesn't make any sense to me to have the semaphore in the client code, since the error is in the server code.

[1] https://docs.python.org/3/library/asyncio-dev.html#debug-mode-of-asyncio
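A sketch of moving that blocking open() into an executor (function names are illustrative, not from the post):

    import asyncio
    from concurrent.futures import ThreadPoolExecutor

    executor = ThreadPoolExecutor(max_workers=1000)

    def read_file(path):
        # Plain blocking read; runs in a pool thread, not on the event loop.
        with open(path, 'rb') as f:
            return f.read()

    async def serve_file(path):
        loop = asyncio.get_running_loop()
        # The coroutine suspends here; the event loop keeps serving other
        # connections while a worker thread performs the disk IO.
        return await loop.run_in_executor(executor, read_file, path)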
henryw · about 9 years ago
Looks pretty interesting to do async in Python. I once did something similar in Node (async by default) with a few lines of code. I think I scraped 12 or 20 million real URLs in 8 hours on a $5 cloud VM. It was limited by network bandwidth.
azinman2 · about 9 years ago
"Everyone knows that asynchronous code performs better when applied to network operations"

Umm, that seems a bit far-reaching.
imaginenore · about 9 years ago
1,000,000 requests in 52 minutes is just 320 req/sec.

Am I missing something? What's so amazing about this?

I just deployed some production feed that serves 1955 requests/second on a cheap VPS in freaking PHP, one of the slowest languages out there.