The article is a fine example of incremental optimization of some Python, replacing constructs that are expensive for the standard Python interpreter to execute with others that trigger fewer of those same overheads.<p>The title isn't quite right, though. Boxing, method lookup, etc. come under "interpreter" too.<p>There's a continuum of implementations between a naive interpreter and a full-blown JIT compiler, rather than a binary distinction. All interpreters beyond line-oriented things like shells convert source code into a different format more suitable for execution. Once that format is in memory, it can be processed to different degrees to achieve different levels of performance.<p>For example, if looking up "str" is an issue, an interpreter could cache the last function it got for "str", conditional on some invalidation token (e.g. "global_identifier_change_count"), so it doesn't in fact need to look it up every time.<p>Boxing can be eliminated by using type-specific representations of the operations and choosing different code paths depending on a check of the actual type. Hoisting those type checks outside of loops is then a big win. Add inlining, and the hoisting logic can see more loops to move things out of. Inlining also gets rid of your argument-passing overhead.<p>None of this requires targeting machine code directly; you can optimize interpreter code, and in fact that's what the back end of a retargetable optimizer looks like - intermediate representation is an interpretable format.<p>Of course things get super-complex super-quickly, but that's the trade-off.
I've found it often doesn't matter how fast or slow Python is if the bottleneck is outside of Python's control.<p>For example, I wrote an lcov alternative in Python called fastcov[0]. In a nutshell, it leverages gcov 9's ability to send JSON reports to stdout to generate a coverage report in parallel (utilizing all cores).<p>Recently someone emailed me and advised that if I truly wanted speed, I needed to abandon Python for a compiled language. I had to explain, however, that as far as I can tell, the current bottleneck isn't the Python interpreter, but GCC's gcov. Python 3's JSON parser is fast enough that fastcov can parse and process a gcov JSON report <i>before</i> gcov can serialize the next one.<p>So really, if I rewrote it in C++ using the most blisteringly fast JSON library I could find, it would just mean the program would spend more time blocking on gcov's output.<p>In summary: profile your code to see where the bottlenecks are, and then fix them. Python is "slow", yes, but often the bottlenecks are outside of Python, so it doesn't matter anyway.<p>[0] <a href="https://github.com/RPGillespie6/fastcov" rel="nofollow">https://github.com/RPGillespie6/fastcov</a>
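The pipelining pattern described above can be sketched roughly like this (not fastcov's actual code; a trivial Python subprocess stands in for gcov):

```python
import json
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

def run_and_parse(cmd):
    """Run one external command and parse its JSON stdout.
    While Python parses one report, the OS is already running the
    other commands, so the external tool sets the pace, not Python."""
    out = subprocess.run(cmd, capture_output=True, check=True).stdout
    return json.loads(out)

# Stand-in for gcov: each subprocess just prints a small JSON report.
cmds = [
    [sys.executable, "-c", f"import json; print(json.dumps({{'file': {i}}}))"]
    for i in range(4)
]

with ThreadPoolExecutor() as pool:
    reports = list(pool.map(run_and_parse, cmds))

assert [r["file"] for r in reports] == [0, 1, 2, 3]
```

Threads are enough here because the workers spend their time blocked on subprocess I/O, where the GIL is released.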
><i>impressively, PyPy [PyPy3 v7.3.1] only takes 0.31s to run the original version of the benchmark. So not only do they get rid of most of the overheads, but they are significantly faster at unicode conversion as well.</i><p>It is super unfortunate that PyPy struggles for funding every quarter. Funding PyPy with a million dollars from the top 3 Python shops (Google, Dropbox, Instagram?) should be a rounding error for these guys... and it has the potential to pay off in at least hundreds of millions (given the overall infrastructure spend).
This is a fun benchmark in C++, where you can see that GCC has a more restrictive small string optimization. On my desktop the main python example runs in 3.1s. Then this code<p><pre><code> #include &lt;cstdint&gt;  // int64_t
 #include &lt;string&gt;   // std::to_string

 void example() {
for (int64_t i = 1; i <= 20; ++i) {
for (int64_t j = 1; j <= 1'000'000; ++j) {
std::to_string(j);
}
}
}
</code></pre>
runs with GCC in 2.0s and with clang in 133ms, so 15x faster.<p>I've also benchmarked it in Julia:<p><pre><code> function example()
for j = 1:20, i = 1:1_000_000
string(i)
end
end
</code></pre>
which runs in 592ms. Julia has no small string optimization and does have proper unicode support by default.<p>None of the compilers can see that the loop can be optimized out.
This article seems to be using a very specific definition of interpreter, which is perhaps not what most people think of when they hear "interpreter"?<p>If I understand correctly, they call the module generating Python opcodes from Python code the "interpreter", and everything else the "runtime". But Python opcodes are highly specific to CPython, and they are themselves interpreted, right? Calling the former the "interpreter" and the latter something else seems like an artificial distinction.<p>Not only is this definition of "interpreter" strange, but their definition of "runtime" also seems strange; in other languages, the runtime typically refers to code that assists in very specific operations (for example, garbage collection), not code that executes dynamically generated code.
> And impressively, PyPy [PyPy3 v7.3.1] only takes 0.31s to run the original version of the benchmark. So not only do they get rid of most of the overheads, but they are significantly faster at unicode conversion as well.<p>Wow, that's pretty impressive. I never really got to use PyPy though, as it seems that for most programs either performance doesn't really matter (within a couple of orders of magnitude), or numpy/pandas is used, in which case the optimization in calling C outweighs any others.<p>Can anyone share use cases for PyPy?
I did similar studies about a decade ago for Perl, with similar results.<p>But what he's missing are two much more important things.<p>1. Smaller data structures.
They are way overblown, both the ops and the data. Compress them; trade for smaller data and simpler ops.
In my latest VM I use 32-bit words for each op and each datum.<p>2. Inlining. A telltale sign is when the calling convention (arg copying) is your biggest contributor in the profiler.<p>Python's bytecode and optimizer are now much better than Perl's, but it's still 2x slower than Perl. Python has by far the slowest VM. All the object and method hooks are insane. There's still no unboxing, and no optimizing refcounting away, which is what brought PHP ahead of the game.
When it says that "argument passing" was responsible for 31% of the time, do I understand right that we're talking about this line in the inner loop?<p><pre><code> str(i)
</code></pre>
...and the time is spent packing i into a tuple (i,) and then unpacking it again?<p>Are keyword args faster? Or do they do the same thing, just via a dict instead of a tuple, I guess?
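One way to see the per-call overhead without reading the interpreter source is to compare an explicit loop with map(), which drives the per-element calls from C and skips most of the bytecode-level argument packing on each iteration (a rough measurement sketch; map also materializes its results via list(), so it's not a perfectly isolated comparison):

```python
import timeit

# Explicit loop: each iteration executes bytecode that packs i into
# the call and dispatches str() from scratch.
loop = timeit.timeit("for i in range(1000000): str(i)", number=1)

# map() performs the same million str() calls from C, avoiding the
# per-iteration bytecode and much of the argument-passing machinery.
mapped = timeit.timeit("list(map(str, range(1000000)))", number=1)

print(f"explicit loop: {loop:.3f}s  map: {mapped:.3f}s")
```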
I wanted to play with variations of the code. For that it is useful to make it output a "summary" so you know the variation you tried is computationally equivalent.<p>For the first benchmark, I added a combined string length calculation:<p><pre><code> def main():
r = 0
for j in range(20):
for i in range(1000000):
r += len(str(i))
print(r)
main()
</code></pre>
When I execute it:<p><pre><code> time python3 test.py
</code></pre>
I get 8.3s execution time.<p>The PHP equivalent:<p><pre><code> <?php
function main() {
$r = 0;
for ($j=0;$j<20;$j++)
for ($i=0;$i<1000000;$i++)
$r += strlen($i);
print("$r\n");
}
main();
</code></pre>
When I execute it:<p><pre><code> time php test.php
</code></pre>
Finishes in 1.4s here. So about 6x faster.<p>Executing the Python version via PyPy:<p><pre><code> time pypy test.py
</code></pre>
Gives me 0.49s. Wow!<p>For better control, I did all runs inside a Docker container. Outside the container, all runs are about 20% faster. Which I also find interesting.<p>Would like to see how the code performs in some more languages like Javascript, Ruby and Java.
I think he is making the distinction between 3 different categories that readers are in general lumping into the 'interpreter':<p>1) Python is executed by an interpreter (necessary overhead)<p>2) Python as a language is so dynamic/flexible/ergonomic that it has to do things that have overhead (necessary complexity, unless you change the language)<p>3) the specific implementation of the interpreter achieves 1 and 2 in ways that can be significantly slower than necessary<p>Seems he is pointing out that a lot of performance issues that are generally thought to be due to 1 and 2 are really due to 3
Static subset of <language><p>While many here correctly observe that the "too much dynamism" can become a performance bottleneck, one has to analyze the common subset of python that most people use and see how much dynamism is intrinsic to those use cases.<p>Other languages like JS have tried a static subset (Static typescript is a good example), that can be compiled to other languages - usually C.<p>Python has had RPython, but no one uses it outside of the compiler community.<p>The argument here is that python doesn't have to be one language. It could be 2-3 with similar syntax catering to different use cases and having different linters.<p>* A highly dynamic language that caters to "do this task in 10 mins of coding" use case. This could be used by data scientists and other data exploration use cases.<p>* A static subset where performance is at a premium. Typically compiled down to another language. Strict typing is necessary. Performance sensitive and a large code base that lives for many years.<p>* Some combination of the two (say for a template engine type use case).<p>A problem with the static use case is that the typing system in python is incomplete. It doesn't have pattern matching and other tools needed to support algebraic data types. Newer languages such as swift, rust and kotlin are more competitive in this space.
Kevin knows much more than I do about optimising Python, but aren't lots of the things listed as 'not interpreter overhead' only slow because they're being interpreted? For example you only need integer boxing in this loop because it's running in an interpreter. If it was compiled that would go away. So shouldn't we blame most of these things on 'being interpreted'?
> The benchmark is converting numbers to strings, which in Python is remarkably expensive for reasons we'll get into.<p>I was a bit disappointed that converting numbers to strings was the only thing he didn't actually discuss. I've discovered that the conversion function is unnecessarily slow, basically O(n^2) in the number of digits. This is despite it being based on an algorithm from Knuth.
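A rough way to observe the scaling (timings are machine-dependent; if the conversion is quadratic, doubling the digit count should roughly quadruple the time; note that recent CPython versions cap str() of ints at 4300 digits by default, so the inputs here stay below that limit):

```python
import timeit

# Two ints, one with twice as many digits as the other.
small, big = 10**1000, 10**2000  # 1001 and 2001 digits

t_small = timeit.timeit(lambda: str(small), number=2000)
t_big = timeit.timeit(lambda: str(big), number=2000)

# With O(n^2) conversion the ratio should trend toward ~4x, not ~2x.
print(f"1001 digits: {t_small:.3f}s  2001 digits: {t_big:.3f}s  "
      f"ratio: {t_big / t_small:.1f}x")

assert len(str(small)) == 1001 and len(str(big)) == 2001
```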
I'll never understand why there are so many fast, alternative python interpreters.<p>Is the language correctness of the official interpreter causing lower performance? Does it prevent using some modules? What's at stake here?<p>I'm planning to use python as a game scripting language but I hear so much about performance issues that it scares me to try learning how to use it in a project. I love python though.
> In this post I hope to show that while the interpreter adds overhead, it is not the dominant factor for even a small microbenchmark. Instead we'll see that dynamic features -- particularly, features inside the runtime -- are to blame.<p>I'm probably uneducated here, but I don't understand the distinction between the runtime and the interpreter for an interpreted language? Isn't the interpreter the same as the runtime? What are the distinct responsibilities of the interpreter and the runtime? Is the interpreter just the C program that runs the loop while the runtime is the libpython stuff (or whatever it's called)?
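One way to make the split concrete is the dis module: the "interpreter" is the C loop that fetches and dispatches the opcodes below one at a time, while the "runtime" is everything each opcode then calls into (dict lookups behind LOAD_GLOBAL, object allocation and unicode conversion behind the CALL* opcodes):

```python
import dis

def inner():
    for i in range(1000):
        str(i)  # one line of Python, several opcodes

# Print the bytecode that the interpreter loop dispatches for inner().
dis.dis(inner)

# Opcode names vary slightly across CPython versions (CALL_FUNCTION
# before 3.11, CALL/PRECALL after), but the shape is the same.
names = {ins.opname for ins in dis.Bytecode(inner)}
print(sorted(names))
```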
The article is quite good but the Node.js example is wrong. OP is measuring dead code elimination and general overhead time.<p>No strings get harmed in the process, run Node.js with --trace-opt to see what's happening.
A potential issue with benchmarks like this is that there are instances where the initial findings don't scale.<p>I would be interested to see how it does over an operation that takes 1 minute, 5 minutes, 10 minutes.
Since people are talking about speed and JIT here, it's worth mentioning Numba (<a href="http://numba.pydata.org" rel="nofollow">http://numba.pydata.org</a>). Being in the quant field myself, it's often been a lifesaver - you can implement a c-like algorithm in a notebook in a matter of seconds, parallelise it if needed and get full numpy api for free. Often times if you're doing something very specific, you can beat pandas/numpy versions of the same thing by an order of magnitude.
For reference, here's a pure Java version through JMH that takes 0.38 seconds on my machine. This uses a parallel stream, so it's multithreaded.<p>Single threaded it takes 0.71 seconds. Removing the blackhole to allow dead code elimination takes single threaded down to 0.41 seconds. This is close to PyPy, which I assume is dead-code-eliminating the string conversion as well.<p><pre><code> package org.example;
import org.eclipse.collections.api.block.procedure.primitive.IntProcedure;
import org.eclipse.collections.impl.list.Interval;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.infra.Blackhole;
public class MyBenchmark {
@Benchmark
public void testMethod(final Blackhole blackhole) {
Interval.oneTo(20)
.parallelStream().forEach((IntProcedure)
outer ->
Interval.oneTo(1000000)
.forEach(
(IntProcedure)
inner ->
// prevent dead code elimination
blackhole.consume(Integer.toString(inner))));
}
}</code></pre>
Can anyone explain to me why this is a lot faster than j.toString() or String(j)<p><pre><code> for (let i = 0; i < 20; i++) {
for (let j = 0; j < 1000000; j++) {
`${j}`
}
}
</code></pre>
I got<p><pre><code> Executed in 177.12 millis fish external
usr time 159.12 millis 91.00 micros 159.03 millis
sys time 17.96 millis 443.00 micros 17.52 millis</code></pre>
I know this is supposed to be about Python optimization. However, the post switches over to C near the beginning of the process. Hence it really is about how to optimize Python applications by rewriting them in C (or other fast languages).<p>Which, IMO, isn't about optimizing Python, apart from tangential API/library usage.<p>I've been under the impression that I'll get the best performance out of a language when I write code that leverages its best idiomatic features and natural aspects. If I have to resort to another language along the way to get the needed performance, I guess the question is, isn't this a strong signal that I should be looking at different languages... specifically fast ones?<p>Most of the fast elements of Python are interfaces to C/Fortran code (numpy, pandas, ...). What is the rationale for using a slow language as glue versus using a faster language for processing?
I'm really sick of this kind of benchmark. I've never seen a real-world Python program that doesn't depend on any IO or doesn't have some C code behind wrapped calls. If your code is heavy with computation, there are the numpy/scipy libs, which are very good at this. These optimizations bring < 10% of speed to a real project/program, but will require a lot of developer time to support. If performance is the key feature and very critical, then Python is likely not the right choice, because Python is more about flexibility and the ability to maintain and write solid, easy-to-read code.
Site broken, "Hug of Death"?<p>> This page isn’t working<p>> blog.kevmod.com is currently unable to handle this request.<p>> HTTP ERROR 500
On my somewhat old Linux machine his main() takes 5.22 seconds. Meanwhile, this Julia code<p><pre><code> @time map(string, 1:1000000);
</code></pre>
reports an execution time of 0.18 seconds. But that includes compilation time; using BenchmarkTools, which runs the code repeatedly, I get 88.6 milliseconds.
This is an area Nim could have come in and improved on, i.e. if Nim had the ability to port your Python code and have it running 98% on Nim, most Python users would already be there.
[EDIT - posted in haste. I should RTFA]<p>Ctrl+F javascript - nothing.<p>At first glance this seems to be "dynamism==slow", but surely you need to explain why Python is slower than JavaScript, and why it has for many years resisted a lot of effort to match the performance of V8 and its cousins?