Here's how I would write this in Cython, using "pure C" arrays:<p><a href="https://gist.github.com/syllog1sm/3dd24cc8b0ad925325e1" rel="nofollow">https://gist.github.com/syllog1sm/3dd24cc8b0ad925325e1</a><p>It runs at about 18,000 steps/second, in the same ballpark as your C code.<p>I prefer to "write C in Cython" because I find it easier to read than the numpy code. That may just be my bias, though: I've been writing almost nothing but Cython for about two years now.<p>By the way, if anyone's interested, "cymem" is a small library I have on pip. It ties memory to a Python object's lifetime. All it does is remember which addresses it gave out; when your Pool is garbage collected, it frees that memory.<p>Edit: GitHub fork, with code to compile and run the Cython version: <a href="https://github.com/syllog1sm/python-numpy-c-extension-examples" rel="nofollow">https://github.com/syllog1sm/python-numpy-c-extension-exampl...</a> . I hacked his script quickly.
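<p>To make the Pool idea concrete, here is a rough pure-Python sketch of the same pattern. It is not cymem's actual code: the class name SketchPool and its close() method are made up for illustration, and it assumes CPython, going through ctypes to reach PyMem_Malloc and PyMem_Free.

```python
import ctypes

# Assumption: CPython, whose C API exports PyMem_Malloc and PyMem_Free.
# ctypes.pythonapi keeps the GIL held during the call, which these need.
_malloc = ctypes.pythonapi.PyMem_Malloc
_malloc.argtypes = [ctypes.c_size_t]
_malloc.restype = ctypes.c_void_p
_free = ctypes.pythonapi.PyMem_Free
_free.argtypes = [ctypes.c_void_p]
_free.restype = None


class SketchPool:
    """Hypothetical illustration of the pattern, not cymem's implementation."""

    def __init__(self):
        self.addresses = set()   # every address this pool has handed out
        self._free = _free       # keep a reference so __del__ can still use it

    def alloc(self, number, elem_size):
        """Return the address of a zeroed block of number * elem_size bytes."""
        size = number * elem_size
        addr = _malloc(size)
        if not addr:
            raise MemoryError("could not allocate %d bytes" % size)
        ctypes.memset(addr, 0, size)
        self.addresses.add(addr)
        return addr

    def close(self):
        """Free everything the pool handed out; safe to call more than once."""
        for addr in self.addresses:
            self._free(addr)
        self.addresses.clear()

    def __del__(self):
        # When the pool object is collected, its memory goes with it.
        self.close()
```

<p>Usage would look something like pool = SketchPool(); addr = pool.alloc(4, ctypes.sizeof(ctypes.c_double)), then viewing the block with (ctypes.c_double * 4).from_address(addr). Once the pool is collected, any such view is dangling, so keep the pool alive for as long as you use the memory.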