A good mental model is to compare the number of floats being processed vs. the number of primitive computations. Matrix multiplication does n^3 computation on n^2 data. Multiplication of large matrices is therefore special in its potential for "data re-use" (each float is used against all the columns or rows of the other matrix), so systems are designed to have much higher FLOPS throughput than memory bandwidth. A dot product is at the other extreme, where each float is used only once (loosely).

Roofline plots [1] are a framework for visualizing system design from this perspective.

[1] https://en.wikipedia.org/wiki/Roofline_model
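To make the data-reuse point concrete, here is a back-of-envelope sketch of arithmetic intensity (FLOPs per byte). The numbers are my own assumptions, not the article's: 4-byte floats, each operand touching memory exactly once, and n = 4096 as an arbitrary size:

    #include <cstdio>

    int main() {
        const double n = 4096;                       // example matrix dimension / vector length
        const double matmul_flops = 2 * n * n * n;   // n^3 multiply-adds = 2*n^3 FLOPs
        const double matmul_bytes = 3 * n * n * 4;   // read A and B, write C, once each
        const double dot_flops    = 2 * n;           // n multiplies + n adds
        const double dot_bytes    = 2 * n * 4;       // read x and y once
        std::printf("matmul: %.0f FLOPs/byte\n", matmul_flops / matmul_bytes);  // grows with n
        std::printf("dot:    %.2f FLOPs/byte\n", dot_flops / dot_bytes);        // 0.25, independent of n
        return 0;
    }

Comparing those ratios against a machine's FLOPS-to-bandwidth ratio is exactly what a roofline plot visualizes.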
> Each version is severely memory bandwidth bottlenecked, the CUDA version suffers the most with its practical 11.8 GB/s device-to-host bandwidth due to its PCI-Express 3.0 x16 interface.

PCIe 3.0? What?

https://cowfreedom.de/#appendix/computer_specs/

> GeForce GTX 1050 Ti with Max-Q Design (PCIe 3.0 x16) (2016)

> Intel Core i5-8300H (2020)

This is a low-priced 8-year-old GPU and a 4-year-old CPU, and he seems to be including the time to load the data onto the GPU. Newer cards have wide PCIe 5.0 or some faster interconnect, like Nvidia Grace Hopper.

Also, he is comparing against his own CUDA implementation. He should use one of the many implementations available in cuBLAS/CUTLASS. Writing a good CUDA GEMM is a very difficult art and very hardware-specific.
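For the dot product in the article specifically (as opposed to GEMM), cuBLAS already ships a tuned routine. A minimal sketch of calling it, with made-up sizes and no error handling:

    // Build with e.g.: nvcc dot_cublas.cu -lcublas
    #include <cublas_v2.h>
    #include <cuda_runtime.h>
    #include <cstdio>
    #include <vector>

    int main() {
        const int n = 1 << 24;                      // ~16M floats per vector (arbitrary)
        std::vector<float> x(n, 1.0f), y(n, 2.0f);  // host data

        float *d_x, *d_y;
        cudaMalloc((void **)&d_x, n * sizeof(float));
        cudaMalloc((void **)&d_y, n * sizeof(float));
        cudaMemcpy(d_x, x.data(), n * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(d_y, y.data(), n * sizeof(float), cudaMemcpyHostToDevice);

        cublasHandle_t handle;
        cublasCreate(&handle);

        float result = 0.0f;
        // Library-tuned dot product; blocks until the scalar result is back on the host.
        cublasSdot(handle, n, d_x, 1, d_y, 1, &result);
        std::printf("dot = %f\n", result);

        cublasDestroy(handle);
        cudaFree(d_x);
        cudaFree(d_y);
        return 0;
    }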
There is the compute vs. communicate ratio.

For problems like matrix multiplication, it costs N^2 to communicate the problem (the matrix entries) but N^3 operations to calculate.

For problems like the dot product, it costs N to communicate but only N operations to calculate.

Compute must be substantially larger than communication cost if you hope to see any benefit. Asymptotic differences obviously help, but even a linear-factor gap can matter.

You'd never transfer N data to perform a log(N) binary search, for example. At that point communication dominates.
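Putting rough numbers on the dot-product case makes the same point. This is a sketch with assumed figures, not measurements: the article's ~12 GB/s effective PCIe bandwidth, plus a notional ~100 GB/s of GPU DRAM bandwidth and ~2 TFLOP/s of FP32 for a 1050 Ti-class card:

    #include <cstdio>

    int main() {
        const double n         = 1e8;          // floats per input vector (assumed)
        const double bytes     = 2.0 * n * 4;  // two vectors of 4-byte floats
        const double flops     = 2.0 * n;      // n multiplies + n adds
        const double pcie_bw   = 12e9;         // bytes/s over PCIe 3.0 x16 (from the article)
        const double dram_bw   = 100e9;        // bytes/s inside the GPU (assumed)
        const double fp32_rate = 2e12;         // FLOP/s (assumed)

        std::printf("PCIe transfer: %.1f ms\n", 1e3 * bytes / pcie_bw);   // ~66.7 ms
        std::printf("GPU DRAM read: %.1f ms\n", 1e3 * bytes / dram_bw);   // ~8.0 ms
        std::printf("raw FP32 math: %.1f ms\n", 1e3 * flops / fp32_rate); // ~0.1 ms
        return 0;
    }

Whatever the exact card, the transfer term dominates by an order of magnitude or more, which is why communication rather than compute decides the winner here.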
Comparing multicore wide AVX to CUDA is a bit of an unnecessary nuance for most folks. These comparisons make sense, but miss the forest for the trees:

- Either way, you're writing 'CUDA-style' fine-grained data-parallel code that looks and works very differently from regular multithreaded code. You are now in a different software universe.

- You now also have to think about throughput, latency hiding, etc. Nvidia has been commoditizing throughput-oriented hardware a lot better than others, and while AMD is catching up on some workloads, Nvidia is already advancing. This is where we think about bandwidth between network/disk => compute unit. My best analogy here, when looking at things like GPU Direct Storage/Network, is that CPU paths feel like a long twisty straw, while GPU paths are fat pipes. Big compute typically needs both compute + IO, and hardware specs tell you the bandwidth ceiling.

To a large extent, ideas are cross-pollinating -- CPUs are looking more like GPUs, and GPUs are getting the flexibility of CPUs -- but either way, you're in a different universe of how code & hardware work than 1990s and early-2000s Intel.
TBH, I'm finding that people underestimate the usefulness of the CPU in both inference and fine-tuning. PEFT with access to 64 GB+ of RAM and lots of cores can sometimes be cost-effective.
For simple operations like the dot product (which also map extremely well to SIMD), yes, the CPU is often better, as there is not much actual "computation" being done. More complex computations, where the data does not need to keep moving between host and device, amortize that transfer cost across multiple operations, and the balance can quickly tip in favor of the GPU.
It's about the ratio of transfer overhead to the amount of compute. For one single operation, sure, the transfer overhead dominates. Add multiple compute steps (operations), however, and experiments will show the GPU coming out ahead, because the transfer cost is fixed.
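A minimal sketch of that pattern, using a hypothetical chain of cuBLAS calls: the host-to-device copy is paid once, the data stays resident, and only a scalar comes back:

    // Build with e.g.: nvcc chain_cublas.cu -lcublas
    #include <cublas_v2.h>
    #include <cuda_runtime.h>
    #include <cstdio>
    #include <vector>

    int main() {
        const int n = 1 << 24;
        std::vector<float> x(n, 1.0f), y(n, 2.0f);

        float *d_x, *d_y;
        cudaMalloc((void **)&d_x, n * sizeof(float));
        cudaMalloc((void **)&d_y, n * sizeof(float));

        // Pay the PCIe transfer cost once...
        cudaMemcpy(d_x, x.data(), n * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(d_y, y.data(), n * sizeof(float), cudaMemcpyHostToDevice);

        cublasHandle_t handle;
        cublasCreate(&handle);

        // ...then chain many operations on data that stays resident on the device.
        const float alpha = 0.5f;
        for (int i = 0; i < 100; ++i)
            cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1);  // y = alpha*x + y

        float result = 0.0f;
        cublasSdot(handle, n, d_x, 1, d_y, 1, &result);      // only a scalar returns to the host
        std::printf("dot = %f\n", result);

        cublasDestroy(handle);
        cudaFree(d_x);
        cudaFree(d_y);
        return 0;
    }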
Would be interesting to see what a unified memory setup can do, say an Apple M-series chip, since this is the argument for unified memory: zero-copy memory access between CPU and GPU.
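On the CUDA side, the closest analogue is managed (unified) memory. A minimal sketch follows; note that on a discrete card the driver still migrates pages over PCIe behind the scenes, so this removes the copies from the code but not necessarily from the hardware path, whereas an M-series SoC shares one physical DRAM pool so the copy genuinely disappears:

    // Build with e.g.: nvcc dot_managed.cu -lcublas
    #include <cublas_v2.h>
    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
        const int n = 1 << 24;
        float *x, *y;
        // One allocation visible to both CPU and GPU; no explicit cudaMemcpy anywhere.
        cudaMallocManaged((void **)&x, n * sizeof(float));
        cudaMallocManaged((void **)&y, n * sizeof(float));
        for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }  // written by the CPU

        cublasHandle_t handle;
        cublasCreate(&handle);
        float result = 0.0f;
        cublasSdot(handle, n, x, 1, y, 1, &result);  // read by the GPU
        cudaDeviceSynchronize();
        std::printf("dot = %f\n", result);

        cublasDestroy(handle);
        cudaFree(x);
        cudaFree(y);
        return 0;
    }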
Question from someone who doesn't know enough about GPUs:
Recently a friend mentioned his workstation has 384 cores across 4 processors. This is starting to approach the core counts of earlier GPUs.

Is there a possibility that in the not-too-distant future GPUs and CPUs will just converge? Or are the tasks done by GPUs too specialized?
To save some time for people: the title is absolutely incorrect. The GPU is significantly faster for this test, but the author is measuring transfers over a PCIe generation from roughly a decade ago. If they had used PCIe 5.0 instead, the bandwidth would be quadrupled.
How can the multicore AVX implementation do a dot product (for arrays much larger than cache) at 340 GB/s on a system with RAM bandwidth < 50 GB/s?