The article is a fine example of incremental optimization of some Python, replacing constructs that are expensive for the standard Python interpreter to execute with others that trigger fewer of those same overheads.<p>The title isn't quite right, though. Boxing, method lookup, etc. come under "interpreter" too.<p>There's a continuum of implementations between a naive interpreter and a full-blown JIT compiler, rather than a binary distinction. All interpreters beyond line-oriented things like shells convert source code into a different format more suitable for execution. Once that format is in memory, it can be processed to different degrees to achieve different levels of performance.<p>For example, if looking up "str" is an issue, an interpreter could cache the last function it got for "str", conditional on some invalidation token (e.g. "global_identifier_change_count"), so it doesn't in fact need to look it up every time.<p>Boxing can be eliminated by using type-specific representations of the operations and choosing different code paths depending on a check of the actual type. Hoisting those type checks outside of loops is then a big win. Add inlining, and the hoisting logic can see more loops to move things out of. Inlining also gets rid of your argument-passing overhead.<p>None of this requires targeting machine code directly; you can optimize interpreter code, and in fact that's what the back end of a retargetable optimizer looks like - intermediate representation is an interpretable format.<p>Of course things get super-complex super-quickly, but that's the trade-off.
I've found it often doesn't matter how fast or slow Python is if the bottleneck is outside of Python's control.<p>For example, I wrote an lcov alternative in Python called fastcov[0]. In a nutshell, it leverages gcov 9's ability to send JSON reports to stdout to generate a coverage report in parallel (utilizing all cores).<p>Recently someone emailed me and advised that if I truly wanted speed, I needed to abandon Python for a compiled language. I had to explain, however, that as far as I can tell, the current bottleneck isn't the Python interpreter, but GCC's gcov. Python 3's JSON parser is fast enough that fastcov can parse and process a gcov JSON report <i>before</i> gcov can serialize the next one.<p>So really, if I rewrote it in C++ using the most blisteringly fast JSON library I could find, it would just mean the program would spend more time blocking on gcov's output.<p>In summary: profile your code to see where the bottlenecks are, and then fix them. Python is "slow", yes, but often the bottlenecks are outside of Python, so it doesn't matter anyway.<p>[0] <a href="https://github.com/RPGillespie6/fastcov" rel="nofollow">https://github.com/RPGillespie6/fastcov</a>
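The pipelining pattern described above can be sketched roughly like this (not fastcov's actual code; a trivial Python subprocess stands in for gcov):

```python
import json
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

def run_and_parse(cmd):
    """Run one external command and parse its JSON stdout.
    While Python parses one report, the OS is already running the
    other commands, so the external tool sets the pace, not Python."""
    out = subprocess.run(cmd, capture_output=True, check=True).stdout
    return json.loads(out)

# Stand-in for gcov: each subprocess just prints a small JSON report.
cmds = [
    [sys.executable, "-c", f"import json; print(json.dumps({{'file': {i}}}))"]
    for i in range(4)
]

with ThreadPoolExecutor() as pool:
    reports = list(pool.map(run_and_parse, cmds))

assert [r["file"] for r in reports] == [0, 1, 2, 3]
```

Threads are enough here because the workers spend their time blocked on subprocess I/O, where the GIL is released.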
><i>impressively, PyPy [PyPy3 v7.3.1] only takes 0.31s to run the original version of the benchmark. So not only do they get rid of most of the overheads, but they are significantly faster at unicode conversion as well.</i><p>It is super unfortunate that PyPy struggles for funding every quarter. Funding PyPy with a million dollars from the top 3 Python shops (Google, Dropbox, Instagram?) should be a rounding error for these guys... and it has the potential to pay off in at least hundreds of millions (given the overall infrastructure spend).
This is a fun benchmark in C++, where you can see that GCC has a more restrictive small string optimization. On my desktop the main python example runs in 3.1s. Then this code<p><pre><code> #include &lt;cstdint&gt;  // int64_t
 #include &lt;string&gt;   // std::to_string

 void example() {
for (int64_t i = 1; i <= 20; ++i) {
for (int64_t j = 1; j <= 1'000'000; ++j) {
std::to_string(j);
}
}
}
</code></pre>
runs with GCC in 2.0s and with clang in 133ms, so 15x faster.<p>I've also benchmarked it in Julia:<p><pre><code> function example()
for j = 1:20, i = 1:1_000_000
string(i)
end
end
</code></pre>
which runs in 592ms. Julia has no small string optimization and does have proper unicode support by default.<p>None of the compilers can see that the loop can be optimized out.
This article seems to be using a very specific definition of interpreter, which is perhaps not what most people think of when they hear "interpreter"?<p>If I understand correctly, they call the module generating Python opcodes from Python code the "interpreter", and everything else the "runtime". But Python opcodes are highly specific to CPython, and they are themselves interpreted, right? Calling the former the "interpreter" and the latter something else seems like an artificial distinction.<p>Not only is this definition of "interpreter" strange, but their definition of "runtime" also seems strange; in other languages, the runtime typically refers to code that assists in very specific operations (for example, garbage collection), not code that executes dynamically generated code.
> And impressively, PyPy [PyPy3 v7.3.1] only takes 0.31s to run the original version of the benchmark. So not only do they get rid of most of the overheads, but they are significantly faster at unicode conversion as well.<p>Wow, that's pretty impressive. I never really got to use PyPy though, as it seems that for most programs either performance doesn't really matter (within a couple of orders of magnitude), or numpy/pandas is used, in which case the optimization in calling C outweighs any others.<p>Can anyone share use cases for PyPy?
I did similar studies about a decade ago for Perl, with similar results.<p>But what he's missing are two much more important things.<p>1. Smaller data structures.
They are way overblown, both the ops and the data. Compress them; trade for smaller data and simpler ops.
In my latest VM I use 32-bit words for each op and each datum.<p>2. Inlining. A telltale sign is when the calling convention (arg copying) is your biggest contributor in the profiler.<p>Python's bytecode and optimizer are now much better than Perl's, but it's still 2x slower than Perl. Python has by far the slowest VM. All the object and method hooks are insane. There's still no unboxing, and no optimizing refcounting away, which is what brought PHP ahead of the game.
When it says that "argument passing" was responsible for 31% of the time, do I understand right that we're talking about this line in the inner loop?<p><pre><code> str(i)
</code></pre>
...and the time is spent packing i into a tuple (i,) and then unpacking it again?<p>Are keyword args faster? Or do they do the same thing, just via a dict instead of a tuple, I guess?
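One way to see the per-call overhead without reading the interpreter source is to compare an explicit loop with map(), which drives the per-element calls from C and skips most of the bytecode-level argument packing on each iteration (a rough measurement sketch; map also materializes its results via list(), so it's not a perfectly isolated comparison):

```python
import timeit

# Explicit loop: each iteration executes bytecode that packs i into
# the call and dispatches str() from scratch.
loop = timeit.timeit("for i in range(1000000): str(i)", number=1)

# map() performs the same million str() calls from C, avoiding the
# per-iteration bytecode and much of the argument-passing machinery.
mapped = timeit.timeit("list(map(str, range(1000000)))", number=1)

print(f"explicit loop: {loop:.3f}s  map: {mapped:.3f}s")
```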
I wanted to play with variations of the code. For that it is useful to make it output a "summary" so you know the variation you tried is computationally equivalent.<p>For the first benchmark, I added a combined string length calculation:<p><pre><code> def main():
r = 0
for j in range(20):
for i in range(1000000):
r += len(str(i))
print(r)
main()
</code></pre>
When I execute it:<p><pre><code> time python3 test.py
</code></pre>
I get 8.3s execution time.<p>The PHP equivalent:<p><pre><code> <?php
function main() {
$r = 0;
for ($j=0;$j<20;$j++)
for ($i=0;$i<1000000;$i++)
$r += strlen($i);
print("$r\n");
}
main();
</code></pre>
When I execute it:<p><pre><code> time php test.php
</code></pre>
Finishes in 1.4s here. So about 6x faster.<p>Executing the Python version via PyPy:<p><pre><code> time pypy test.py
</code></pre>
Gives me 0.49s. Wow!<p>For better control, I did all runs inside a Docker container. Outside the container, all runs are about 20% faster. Which I also find interesting.<p>Would like to see how the code performs in some more languages like Javascript, Ruby and Java.
I think he is making the distinction between 3 different categories that readers are in general lumping into the 'interpreter':<p>1) Python is executed by an interpreter (necessary overhead)<p>2) Python as a language is so dynamic/flexible/ergonomic that it has to do things that have overhead (necessary complexity, unless you change the language)<p>3) the specific implementation of the interpreter achieves 1 and 2 in ways that can be significantly slower than necessary<p>Seems he is pointing out that a lot of performance issues that are generally thought to be due to 1 and 2 are really due to 3
Static subset of <language><p>While many here correctly observe that the "too much dynamism" can become a performance bottleneck, one has to analyze the common subset of python that most people use and see how much dynamism is intrinsic to those use cases.<p>Other languages like JS have tried a static subset (Static typescript is a good example), that can be compiled to other languages - usually C.<p>Python has had RPython, but no one uses it outside of the compiler community.<p>The argument here is that python doesn't have to be one language. It could be 2-3 with similar syntax catering to different use cases and having different linters.<p>* A highly dynamic language that caters to "do this task in 10 mins of coding" use case. This could be used by data scientists and other data exploration use cases.<p>* A static subset where performance is at a premium. Typically compiled down to another language. Strict typing is necessary. Performance sensitive and a large code base that lives for many years.<p>* Some combination of the two (say for a template engine type use case).<p>A problem with the static use case is that the typing system in python is incomplete. It doesn't have pattern matching and other tools needed to support algebraic data types. Newer languages such as swift, rust and kotlin are more competitive in this space.
Kevin knows much more than I do about optimising Python, but aren't lots of the things listed as 'not interpreter overhead' only slow because they're being interpreted? For example you only need integer boxing in this loop because it's running in an interpreter. If it was compiled that would go away. So shouldn't we blame most of these things on 'being interpreted'?
> The benchmark is converting numbers to strings, which in Python is remarkably expensive for reasons we'll get into.<p>I was a bit disappointed that converting numbers to strings was the only thing he didn't actually discuss. I've discovered that the conversion function is unnecessarily slow, basically O(n^2) in the number of digits. This is despite it being based on an algorithm from Knuth.
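A rough way to observe the scaling (timings are machine-dependent; if the conversion is quadratic, doubling the digit count should roughly quadruple the time; note that recent CPython versions cap str() of ints at 4300 digits by default, so the inputs here stay below that limit):

```python
import timeit

# Two ints, one with twice as many digits as the other.
small, big = 10**1000, 10**2000  # 1001 and 2001 digits

t_small = timeit.timeit(lambda: str(small), number=2000)
t_big = timeit.timeit(lambda: str(big), number=2000)

# With O(n^2) conversion the ratio should trend toward ~4x, not ~2x.
print(f"1001 digits: {t_small:.3f}s  2001 digits: {t_big:.3f}s  "
      f"ratio: {t_big / t_small:.1f}x")

assert len(str(small)) == 1001 and len(str(big)) == 2001
```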
I'll never understand why there are so many fast, alternative python interpreters.<p>Is the language correctness of the official interpreter causing lower performance? Does it prevent using some modules? What's at stake here?<p>I'm planning to use python as a game scripting language but I hear so much about performance issues that it scares me to try learning how to use it in a project. I love python though.
> In this post I hope to show that while the interpreter adds overhead, it is not the dominant factor for even a small microbenchmark. Instead we'll see that dynamic features -- particularly, features inside the runtime -- are to blame.<p>I'm probably uneducated here, but I don't understand the distinction between the runtime and the interpreter for an interpreted language? Isn't the interpreter the same as the runtime? What are the distinct responsibilities of the interpreter and the runtime? Is the interpreter just the C program that runs the loop while the runtime is the libpython stuff (or whatever it's called)?
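One way to make the split concrete is the dis module: the "interpreter" is the C loop that fetches and dispatches the opcodes below one at a time, while the "runtime" is everything each opcode then calls into (dict lookups behind LOAD_GLOBAL, object allocation and unicode conversion behind the CALL* opcodes):

```python
import dis

def inner():
    for i in range(1000):
        str(i)  # one line of Python, several opcodes

# Print the bytecode that the interpreter loop dispatches for inner().
dis.dis(inner)

# Opcode names vary slightly across CPython versions (CALL_FUNCTION
# before 3.11, CALL/PRECALL after), but the shape is the same.
names = {ins.opname for ins in dis.Bytecode(inner)}
print(sorted(names))
```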
The article is quite good but the Node.js example is wrong. OP is measuring dead code elimination and general overhead time.<p>No strings get harmed in the process, run Node.js with --trace-opt to see what's happening.
A potential issue with benchmarks like this is that there are instances where the initial findings don't scale.<p>I would be interested to see how it does over an operation that takes 1 minute, 5 minutes, 10 minutes.
Since people are talking about speed and JIT here, it's worth mentioning Numba (<a href="http://numba.pydata.org" rel="nofollow">http://numba.pydata.org</a>). Being in the quant field myself, it's often been a lifesaver - you can implement a c-like algorithm in a notebook in a matter of seconds, parallelise it if needed and get full numpy api for free. Often times if you're doing something very specific, you can beat pandas/numpy versions of the same thing by an order of magnitude.
For reference, here's a pure Java version through JMH that takes 0.38 seconds on my machine. This uses a parallel stream, so it's multithreaded.<p>Single threaded it takes 0.71 seconds. Removing the blackhole to allow dead code elimination takes single threaded down to 0.41 seconds. This is close to PyPy, which I assume is dead-code-eliminating the string conversion as well.<p><pre><code> package org.example;
import org.eclipse.collections.api.block.procedure.primitive.IntProcedure;
import org.eclipse.collections.impl.list.Interval;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.infra.Blackhole;
public class MyBenchmark {
@Benchmark
public void testMethod(final Blackhole blackhole) {
Interval.oneTo(20)
.parallelStream().forEach((IntProcedure)
outer ->
Interval.oneTo(1000000)
.forEach(
(IntProcedure)
inner ->
// prevent dead code elimination
blackhole.consume(Integer.toString(inner))));
}
}</code></pre>
Can anyone explain to me why this is a lot faster than j.toString() or String(j)<p><pre><code> for (let i = 0; i < 20; i++) {
for (let j = 0; j < 1000000; j++) {
`${j}`
}
}
</code></pre>
I got<p><pre><code> Executed in 177.12 millis fish external
usr time 159.12 millis 91.00 micros 159.03 millis
sys time 17.96 millis 443.00 micros 17.52 millis</code></pre>
I know this is supposed to be about Python optimization. However, the post switches over to C near the beginning of the process. Hence it really is about how to optimize Python applications by rewriting them in C (or other fast languages).<p>Which, IMO, isn't about optimizing Python, apart from tangential API/library usage.<p>I've been under the impression that I'll get the best performance out of a language when I write code that leverages its best idiomatic features and natural aspects. If I have to resort to another language along the way to get the needed performance, I guess the question is, isn't this a strong signal that I should be looking at different languages... specifically fast ones?<p>Most of the fast elements of Python are interfaces to C/Fortran code (numpy, pandas, ...). What is the rationale for using a slow language as glue versus using a faster language for processing?
I'm really sick of this kind of benchmark. I've never seen a real-world Python program that doesn't depend on any IO or doesn't have some C code behind wrapped calls. If your code is heavy with computation, there are the numpy/scipy libs, which are very good at this. These optimizations bring < 10% of speed to a real project/program, but will require a lot of developer time to support. If performance is the key feature and very critical, then Python is likely not the right choice, because Python is more about flexibility and the ability to maintain and write solid, easy-to-read code.
Site broken, "Hug of Death"?<p>> This page isn’t working<p>> blog.kevmod.com is currently unable to handle this request.<p>> HTTP ERROR 500
On my somewhat old Linux machine his main() takes 5.22 seconds. Meanwhile, this Julia code<p><pre><code> @time map(string, 1:1000000);
</code></pre>
reports an execution time of 0.18 seconds. But that includes compilation time; using BenchmarkTools, which runs the code repeatedly, I get 88.6 milliseconds.
This is an area Nim could have come in and improved on, i.e. if Nim had the ability to port your Python code and have it running 98% on Nim, most Python users would already be there.
[EDIT - posted in haste. I should RTFA]<p>Ctrl+F javascript - nothing.<p>At first glance this seems to be "dynamism==slow", but surely you need to explain why Python is slower than JavaScript, and why it has for many years resisted a lot of effort to match the performance of V8 and its cousins?