> our faster function could take advantage of no more than 8 cores; beyond that it started slowing down. Perhaps it started hitting some bottleneck other than computation, like memory bandwidth.

@itamarst Yes, this is interesting; you should profile it and get to the bottom of the issue! In my experience, being limited by hyperthreading or instruction-level parallelism is relatively rare; much more often it’s cache or memory access patterns, implicit synchronization, or contention for a hardware resource. There’s a good chance you’ll learn something useful by figuring it out. Maybe it’s memory bus contention, maybe it’s cache, maybe numba compiled in something you aren’t expecting.

Worth noting that using 20 threads on the fast test isn’t much slower than using 8. A good first guess/proxy for the number of threads to use is the number of cores, and in this case that pays off compared to using too few.

Out of curiosity, do you know whether your images are stored row-major or column-major? I see the outer loop over shape[0] and the inner loop over shape[1]. Is the compiled code stepping through memory one pixel at a time, or by a whole column? If your stride is a column, you may be thrashing the cache (see the sketch below for what I mean by the two loop orders).

I’d also be curious to hear how the speed of this compiled code compares to a numpy or PIL image threshold operation, if you happen to know.
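To illustrate the loop-order question, here’s a minimal sketch with a hypothetical threshold kernel (not your actual code), assuming a C-contiguous (row-major) uint8 numpy array. The only difference between the two functions is which axis the inner loop walks:

```python
import numpy as np
from numba import njit

@njit
def threshold_row_major(img, t, out):
    # numpy arrays are C-contiguous (row-major) by default:
    # the inner loop over axis 1 steps through memory 1 pixel at a time
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = img[i, j] > t

@njit
def threshold_col_major(img, t, out):
    # inner loop over axis 0 strides by a whole row per step,
    # which thrashes the cache on images larger than cache
    for j in range(img.shape[1]):
        for i in range(img.shape[0]):
            out[i, j] = img[i, j] > t

img = np.random.randint(0, 256, (4096, 4096), dtype=np.uint8)
out = np.empty_like(img)
threshold_row_major(img, 128, out)  # cache-friendly on row-major data
threshold_col_major(img, 128, out)  # same result, much worse locality
```

If your arrays are row-major, the shape[0]-outer/shape[1]-inner order you described is the cache-friendly one; if they’re column-major (e.g. transposed views), it’s the opposite.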
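And as a baseline for that comparison, the vectorized versions are one-liners; this is just the obvious numpy/PIL approach, assuming the same uint8 image and a threshold of 128:

```python
import numpy as np
from PIL import Image

img = np.random.randint(0, 256, (4096, 4096), dtype=np.uint8)

# numpy: vectorized threshold, a single pass over the array
out_np = (img > 128).astype(np.uint8) * 255

# PIL: Image.point applies a per-pixel mapping (built as a lookup table)
out_pil = Image.fromarray(img).point(lambda p: 255 if p > 128 else 0)
```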