I'm no expert on PCIe, but it's been described to me as a network.<p>PCIe has switches, addresses, and so forth — much like IP addresses, except PCIe operates at far lower latency and higher speed.<p>At its lowest level, PCIe x1 is a single "lane": a serial stream of zeros and ones (with framing / error correction layered on top). PCIe x2, x4, x8, and x16 are simply 2, 4, 8, or 16 lanes running in parallel, with data striped across them.<p>-------<p>PCIe is a very large and complex protocol, however. This serial protocol gets abstracted into memory-mapped I/O: instead of programming at the "packet" level, most PCIe operations appear to software as plain reads and writes to RAM.<p>> even virtual memory<p>So you understand virtual memory? PCIe abstractions go up to and include the virtual memory system. When your OS maps some virtual memory to a PCIe device, programs read/write those memory addresses and the OS (and PCIe bridge) translates those RAM reads/writes into PCIe messages.<p>--------<p>I'll now handwave a few details and note: GPUs do the same thing on their end. GPUs also have a "virtual memory" that they read/write to, which likewise translates into PCIe messages.<p>This leads to a system called "Shared Virtual Memory" (SVM), which has become very popular in GPGPU programming circles. When the CPU (or GPU) reads/writes a memory address, the data is automatically copied over to the other device as needed. Caching layers are added on top for efficiency: some SVM lives on the CPU side, so the GPU fetches the data into its own local memory / caches but treats the CPU as the "main owner" of the data. The reverse, GPU-side shared memory, also exists, where the CPU communicates with the GPU as the owner.<p>To coordinate access to this shared memory properly, atomic operations (AtomicOps: fetch-and-add, swap, compare-and-swap) and ordering rules were added starting with PCIe 3.0.
So you can perform a "compare-and-swap" on shared virtual memory, and read/write these virtual memory locations in a standardized way across all PCIe devices.<p>PCIe 4.0 and PCIe 5.0 add more and more features, making PCIe feel increasingly like a "shared memory system", akin to the cache-coherence strategies that multi-socket CPUs use to share RAM with each other. In the long term, I expect future PCIe standards to push the interface even further toward this "like a dual-CPU-socket" memory-sharing paradigm.<p>This is great because you can have 2 CPUs + 4 GPUs in one system, and when GPU#2 writes to Address#0xF1235122, the shared-virtual-memory system automatically translates that to its "physical" location (wherever it is), and the lower-level protocols move the data to the correct place without any assistance from the programmer.<p>This means a GPU can do things like traverse a linked list (or tree), even if the nodes are spread across the memory of CPU#1, CPU#2, GPU#4, and GPU#1. The shared-virtual-memory paradigm handwaves all of that away and lets the PCIe 3.0 / 4.0 / 5.0 protocols handle the data movement automatically.