科技回声

4 条评论

srean将近 11 年前

Places where this bites you is using numerical optimization algorithms in Python/Numpy, particularly when using gradient based methods. They make a HUGE boat load of calls to evaluate the objective function value and objective function gradient. Typically both are callback functions. Worse they do it in a very tight and CPU bound loop.This is one place where the much repeated "write your most expensive part in C or Fortran" does not give enough benefit unless you encapsulate the entire thing in C. Here I have assumed that the objective function and the gradient function has been exposed to Python's runtime through a C module.Although the time spent evaluating the function and evaluating the gradient can be reduced hugely by switching to C explicitly, or implicitly by using Numpy vectorized expressions, there is still this large overhead of making that many calls to a Python function. Same issue with JNI.Cython helps some here by encapsulating the entire thing in C, however if you have to call out to Python's or Numpy's C-API you are back to square one. This is important because if you want to write readable Cython (meaning something that looks like Numpy) that is free of low low level loops, it is very hard to avoid those API calls.I would assume that a JIT with a good inliner would help a lot here.In fact C itself does not fair well for this use case. This is one place where, ... wait for it.... a smart enough compiler for a high level language like Lisp can give a lot of speedup, example Stalin, and Stalingrad. You can do a lot with C++ and D's compile time metaprogramming machinery. C++ metaprogramming is quite painful if you are rolling your own. It was never designed for this use case. D is great but code-generation part of their toolchain arent that mature yet. Although it can be as fast as C++, one often has to desugar manually. Hopefully it will get better at it soon.Fortran is also supposed to beat C routinely here, but I am not quite sure why/how it does it. If you know why it turns out to be faster I would be eager to know. I think Fortran semantics restricts possible data dependency paths enough that the compiler can (a) do a lot of SIMD vectorization and (b) avoid reloading data from memory and work with cached values. I dont know whether Fortran semantics helps in inlining away function calls.

评论 #7887124 未加载

评论 #7887171 未加载

nkurz将近 11 年前

There are some issues that make the actual C overhead somewhat greater than you measured. It's still far in the lead, but not by quite as much as your numbers imply.The biggest problem is that modern GCC is smart enough to optimize away much of your recursion. If you compile with full optimization as you did and then inspect the 'recursor' function in your executable with 'objdump -d', you'll see that it's essentially unrolled your function 3 times. Instead of actually calling the function, it's just doing a comparison two times out of 3.The best way to defeat this and level the playing ground is probably to add "-fno-inline-functions" to the command line. When you do this, the program may well segfault rather than run because it's suddenly using 3 times the stack space. At least, that's what happened when I tried it. Depending on how your system is set up, you might be able to solve this with 'ulimit -s unlimited'.<pre><code> # Linux 3.11.0 x86_64 on Intel i7-4770 $ ulimit -s unlimited $ gcc-4.8 -march=native -Wall -O3 test-function-call-overhead.c -o overhead-original $ overhead-original test-function-call-overhead.c: 1917046018.4902 calls per second $ gcc-4.8 -fno-inline-functions -march=native -Wall -O3 test-function-call-overhead.c -o overhead-no-inline $ overhead-no-inline test-function-call-overhead.c: 402246812.8280 calls per second </code></pre> This would imply that our actual loop overhead for C is 5x greater than your numbers. But even this number isn't really accurate. If you run 'perf stat overhead-no-inline', you see that it is executing well under one instruction per cycle:<pre><code> $ perf stat overhead-no-inline 6,517,046,940 instructions # 0.62 insns per cycle </code></pre> There are a number of things that can cause this, but the most likely is that the run time is being limited by memory latency rather what one would think of as normal function overhead. What memory you might ask? Remember that bit about needing to adjust the stack size because we were exceeding the default? It turns out that the default 8 MB is about equal to the L3 cache on modern processors, and thus the stack is overflowing to the much slower RAM.To get a better measurement, we need to fiddle with your numbers for N and K. Each stack frame is probably about 32 bytes (I fiddled with your code to figure out what was happening and don't know the exact original), so if you want the stack to fit in the 32 KB L1, you'd be limited to about 1000 levels of recursion. But this is too fast for gettimeofday() to measure accurately in microseconds -- you'd want to switch to clock_gettime() which goes down to nanoseconds.But that was too much for me tonight, so I'll stop here. The final answer will depend on the exact structure of your loop, but should be somewhere in between the two numbers at about about 1 billion calls per second, which is to say about 3-4 cycles per function call, with the much of the overhead coming from the loop rather than the call overhead.

nostrademons将近 11 年前

This doesn't surprise me, but take a look at the absolute numbers. In ANSI C on whatever processor he used, you can make over a billion function calls per second. Even in Python, you can do about 2 million.Take a look at a profile of your program sometime. The vast majority of your functions get called maybe 20 times or less, with many of them being called only once (at startup) or once per operation (eg. setting up a request). Then there are obvious hot spots, eg. in many profiles I've run malloc() is by far the biggest single offender.The Python strategy is to use Python as a scripting & glue language, and drop down into C for the critical parts of your program. You can get really far just writing initialization code in Python, keeping your data structures in C, and writing the inner loop of your algorithm or the critical part of your request processing in C.

评论 #7886437 未加载

zatkin将近 11 年前

Source files:<pre><code> http://ciar.org/ttk/public/vs/call/test-function-call-overhead.c http://ciar.org/ttk/public/vs/call/test-function-call-overhead.pl http://ciar.org/ttk/public/vs/call/test-function-call-overhead.py</code></pre>

4 条评论

srean将近 11 年前

评论 #7887124 未加载

评论 #7887171 未加载

nkurz将近 11 年前

nostrademons将近 11 年前

评论 #7886437 未加载

zatkin将近 11 年前

ANSI-C function call performance is about 530x that of Python

4 条评论

ANSI-C function call performance is about 530x that of Python

4 条评论