> our faster function could take advantage of no more than 8 cores; beyond that it started slowing down. Perhaps it started hitting some bottleneck other than computation, like memory bandwidth.

@itamarst Yes, this is interesting; you should profile it and get to the bottom of the issue! In my experience, being limited by hyperthreading or instruction-level parallelism is relatively rare; much more often it’s cache or memory access patterns, implicit synchronization, or contention for a hardware resource. There’s a good chance you’ll learn something useful by figuring it out. Maybe it’s memory bus contention, maybe it’s cache, maybe numba compiled in something you aren’t expecting.

Worth noting that using 20 threads on the fast test isn’t much slower than using 8. A good first guess/proxy for the number of threads to use is the number of cores, and in this case that pays off compared to using too few.

Out of curiosity, do you know whether your images are stored row-major or column-major? I see the outer loop over shape[0] and the inner loop over shape[1]. Is the compiled code stepping through memory one pixel at a time, or by a whole column? If your stride is a column, you may be thrashing the cache (see the sketch below for what I mean by the two loop orders).

I’d also be curious to hear how the speed of this compiled code compares to a numpy or PIL image threshold operation, if you happen to know.
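To illustrate the loop-order question, here’s a minimal sketch with a hypothetical threshold kernel (not your actual code), assuming a C-contiguous (row-major) uint8 numpy array. The only difference between the two functions is which axis the inner loop walks:

```python
import numpy as np
from numba import njit

@njit
def threshold_row_major(img, t, out):
    # numpy arrays are C-contiguous (row-major) by default:
    # the inner loop over axis 1 steps through memory 1 pixel at a time
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = img[i, j] > t

@njit
def threshold_col_major(img, t, out):
    # inner loop over axis 0 strides by a whole row per step,
    # which thrashes the cache on images larger than cache
    for j in range(img.shape[1]):
        for i in range(img.shape[0]):
            out[i, j] = img[i, j] > t

img = np.random.randint(0, 256, (4096, 4096), dtype=np.uint8)
out = np.empty_like(img)
threshold_row_major(img, 128, out)  # cache-friendly on row-major data
threshold_col_major(img, 128, out)  # same result, much worse locality
```

If your arrays are row-major, the shape[0]-outer/shape[1]-inner order you described is the cache-friendly one; if they’re column-major (e.g. transposed views), it’s the opposite.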
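And as a baseline for that comparison, the vectorized versions are one-liners; this is just the obvious numpy/PIL approach, assuming the same uint8 image and a threshold of 128:

```python
import numpy as np
from PIL import Image

img = np.random.randint(0, 256, (4096, 4096), dtype=np.uint8)

# numpy: vectorized threshold, a single pass over the array
out_np = (img > 128).astype(np.uint8) * 255

# PIL: Image.point applies a per-pixel mapping (built as a lookup table)
out_pil = Image.fromarray(img).point(lambda p: 255 if p > 128 else 0)
```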