A good mental model is to compare the number of floats being processed vs. the number of primitive computations. Matrix multiplication does n^3 computation on n^2 data. Multiplication of large matrices is therefore special in its potential for "data re-use" (each float is used against all the columns or rows of the other matrix), so systems are designed to have much higher FLOPS throughput than memory bandwidth. A dot product is at the other extreme, where each float is used only once (loosely).

Roofline plots [1] are a framework for visualizing system design from this perspective.

[1] https://en.wikipedia.org/wiki/Roofline_model
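To make the data-reuse point concrete, here is a back-of-envelope sketch of arithmetic intensity (FLOPs per byte). The numbers are my own assumptions, not the article's: 4-byte floats, each operand touching memory exactly once, and n = 4096 as an arbitrary size:

    #include <cstdio>

    int main() {
        const double n = 4096;                       // example matrix dimension / vector length
        const double matmul_flops = 2 * n * n * n;   // n^3 multiply-adds = 2*n^3 FLOPs
        const double matmul_bytes = 3 * n * n * 4;   // read A and B, write C, once each
        const double dot_flops    = 2 * n;           // n multiplies + n adds
        const double dot_bytes    = 2 * n * 4;       // read x and y once
        std::printf("matmul: %.0f FLOPs/byte\n", matmul_flops / matmul_bytes);  // grows with n
        std::printf("dot:    %.2f FLOPs/byte\n", dot_flops / dot_bytes);        // 0.25, independent of n
        return 0;
    }

Comparing those ratios against a machine's FLOPS-to-bandwidth ratio is exactly what a roofline plot visualizes.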
> Each version is severely memory bandwidth bottlenecked, the CUDA version suffers the most with its practical 11.8 GB/s device-to-host bandwidth due to its PCI-Express 3.0 x16 interface.

PCIe 3.0? What?

https://cowfreedom.de/#appendix/computer_specs/

> GeForce GTX 1050 Ti with Max-Q Design (PCIe 3.0 x16) (2016)

> Intel Core i5-8300H (2020)

This is a low-priced 8-year-old GPU and a 4-year-old CPU, and he seems to be including the time to load the data onto the GPU. Newer cards have wide PCIe 5.0 or some faster interconnect, like Nvidia Grace Hopper.

Also, he is comparing against his own CUDA implementation. He should use one of the many implementations available in cuBLAS/CUTLASS. Writing a good CUDA GEMM is a very difficult art and very hardware-specific.
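For the dot product in the article specifically (as opposed to GEMM), cuBLAS already ships a tuned routine. A minimal sketch of calling it, with made-up sizes and no error handling:

    // Build with e.g.: nvcc dot_cublas.cu -lcublas
    #include <cublas_v2.h>
    #include <cuda_runtime.h>
    #include <cstdio>
    #include <vector>

    int main() {
        const int n = 1 << 24;                      // ~16M floats per vector (arbitrary)
        std::vector<float> x(n, 1.0f), y(n, 2.0f);  // host data

        float *d_x, *d_y;
        cudaMalloc((void **)&d_x, n * sizeof(float));
        cudaMalloc((void **)&d_y, n * sizeof(float));
        cudaMemcpy(d_x, x.data(), n * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(d_y, y.data(), n * sizeof(float), cudaMemcpyHostToDevice);

        cublasHandle_t handle;
        cublasCreate(&handle);

        float result = 0.0f;
        // Library-tuned dot product; blocks until the scalar result is back on the host.
        cublasSdot(handle, n, d_x, 1, d_y, 1, &result);
        std::printf("dot = %f\n", result);

        cublasDestroy(handle);
        cudaFree(d_x);
        cudaFree(d_y);
        return 0;
    }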
There is the compute vs. communicate ratio.

For problems like matrix multiplication, it costs N^2 to communicate the problem (the matrix entries) but N^3 operations to calculate.

For problems like the dot product, it costs N to communicate but only N operations to calculate.

Compute must be substantially larger than communication cost if you hope to see any benefit. Asymptotic differences obviously help, but even a linear-factor gap can matter.

You'd never transfer N data to perform a log(N) binary search, for example. At that point communication dominates.
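Putting rough numbers on the dot-product case makes the same point. This is a sketch with assumed figures, not measurements: the article's ~12 GB/s effective PCIe bandwidth, plus a notional ~100 GB/s of GPU DRAM bandwidth and ~2 TFLOP/s of FP32 for a 1050 Ti-class card:

    #include <cstdio>

    int main() {
        const double n         = 1e8;          // floats per input vector (assumed)
        const double bytes     = 2.0 * n * 4;  // two vectors of 4-byte floats
        const double flops     = 2.0 * n;      // n multiplies + n adds
        const double pcie_bw   = 12e9;         // bytes/s over PCIe 3.0 x16 (from the article)
        const double dram_bw   = 100e9;        // bytes/s inside the GPU (assumed)
        const double fp32_rate = 2e12;         // FLOP/s (assumed)

        std::printf("PCIe transfer: %.1f ms\n", 1e3 * bytes / pcie_bw);   // ~66.7 ms
        std::printf("GPU DRAM read: %.1f ms\n", 1e3 * bytes / dram_bw);   // ~8.0 ms
        std::printf("raw FP32 math: %.1f ms\n", 1e3 * flops / fp32_rate); // ~0.1 ms
        return 0;
    }

Whatever the exact card, the transfer term dominates by an order of magnitude or more, which is why communication rather than compute decides the winner here.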
Comparing multicore wide AVX to CUDA is a bit of an unnecessary nuance for most folks. These comparisons make sense, but miss the forest for the trees:

- Either way, you're writing 'CUDA-style' fine-grained data-parallel code that looks and works very differently from regular multithreaded code. You are now in a different software universe.

- You now also have to think about throughput, latency hiding, etc. Nvidia has been commoditizing throughput-oriented hardware a lot better than others, and while AMD is catching up on some workloads, Nvidia is already advancing. This is where we think about bandwidth between network/disk => compute unit. My best analogy here, when looking at things like GPU Direct Storage/Network, is that CPU paths feel like a long twisty straw, while GPU paths are fat pipes. Big compute typically needs both compute + IO, and hardware specs tell you the bandwidth ceiling.

To a large extent, ideas are cross-pollinating -- CPUs are looking more like GPUs, and GPUs are getting the flexibility of CPUs -- but either way, you're in a different universe of how code & hardware work than 1990s and early-2000s Intel.
TBH, I'm finding that people underestimate the usefulness of the CPU in both inference and fine-tuning. PEFT with access to 64 GB+ of RAM and lots of cores can sometimes be cost-effective.
For simple operations like the dot product (which also map extremely well to SIMD), yes, the CPU is often better, as there is not much actual "computation" being done. More complex computations, where the data does not need to keep moving between host and device, amortize that transfer cost across multiple operations, and the balance can quickly tip in favor of the GPU.
It's about the ratio of transfer overhead to the amount of compute. For one single operation, sure, the transfer overhead dominates. Add multiple compute steps (operations), however, and experiments will show the GPU coming out ahead, because the transfer cost is fixed.
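A minimal sketch of that pattern, using a hypothetical chain of cuBLAS calls: the host-to-device copy is paid once, the data stays resident, and only a scalar comes back:

    // Build with e.g.: nvcc chain_cublas.cu -lcublas
    #include <cublas_v2.h>
    #include <cuda_runtime.h>
    #include <cstdio>
    #include <vector>

    int main() {
        const int n = 1 << 24;
        std::vector<float> x(n, 1.0f), y(n, 2.0f);

        float *d_x, *d_y;
        cudaMalloc((void **)&d_x, n * sizeof(float));
        cudaMalloc((void **)&d_y, n * sizeof(float));

        // Pay the PCIe transfer cost once...
        cudaMemcpy(d_x, x.data(), n * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(d_y, y.data(), n * sizeof(float), cudaMemcpyHostToDevice);

        cublasHandle_t handle;
        cublasCreate(&handle);

        // ...then chain many operations on data that stays resident on the device.
        const float alpha = 0.5f;
        for (int i = 0; i < 100; ++i)
            cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1);  // y = alpha*x + y

        float result = 0.0f;
        cublasSdot(handle, n, d_x, 1, d_y, 1, &result);      // only a scalar returns to the host
        std::printf("dot = %f\n", result);

        cublasDestroy(handle);
        cudaFree(d_x);
        cudaFree(d_y);
        return 0;
    }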
Would be interesting to see what a unified memory setup can do, say an Apple M-series chip, since this is the argument for unified memory: zero-copy memory access between CPU and GPU.
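On the CUDA side, the closest analogue is managed (unified) memory. A minimal sketch follows; note that on a discrete card the driver still migrates pages over PCIe behind the scenes, so this removes the copies from the code but not necessarily from the hardware path, whereas an M-series SoC shares one physical DRAM pool so the copy genuinely disappears:

    // Build with e.g.: nvcc dot_managed.cu -lcublas
    #include <cublas_v2.h>
    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
        const int n = 1 << 24;
        float *x, *y;
        // One allocation visible to both CPU and GPU; no explicit cudaMemcpy anywhere.
        cudaMallocManaged((void **)&x, n * sizeof(float));
        cudaMallocManaged((void **)&y, n * sizeof(float));
        for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }  // written by the CPU

        cublasHandle_t handle;
        cublasCreate(&handle);
        float result = 0.0f;
        cublasSdot(handle, n, x, 1, y, 1, &result);  // read by the GPU
        cudaDeviceSynchronize();
        std::printf("dot = %f\n", result);

        cublasDestroy(handle);
        cudaFree(x);
        cudaFree(y);
        return 0;
    }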
Question from someone who doesn't know enough about GPUs:
Recently a friend mentioned his workstation has 384 cores across 4 processors. This is starting to approach the core counts of earlier GPUs.

Is there a possibility that in the not-too-distant future GPUs and CPUs will just converge? Or are the tasks done by GPUs too specialized?
To save some time for people: the title is absolutely incorrect. The GPU is significantly faster for this test, but the author is measuring transfers over a PCIe generation from roughly a decade ago. If they had used PCIe 5.0 instead, the bandwidth would be quadrupled.
How can the multicore AVX implementation do a dot product (for arrays much larger than cache) at 340 GB/s on a system with RAM bandwidth < 50 GB/s?