For a way more detailed look at memory architectures and implementation, check out Ulrich Drepper's classic paper "What Every Programmer Should Know About Memory"[1]<p>[1] <a href="http://www.akkadia.org/drepper/cpumemory.pdf" rel="nofollow">http://www.akkadia.org/drepper/cpumemory.pdf</a>
A simple thought experiment suffices here. What shape holds the most physical bits while minimizing the overall latency for random access? A sphere. Each bit occupies a spot packed within that sphere, and the radius of the sphere is the distance light must traverse, so it corresponds to latency.<p>"Slow" elements of the memory hierarchy sit on the outside of the sphere, while faster elements (cache, registers, etc.) are layered on the inside, like an onion. Since those inner spheres are smaller, they must by definition hold fewer bits, but they are by definition faster.<p>The total number of bits you can store within a given latency is a function of the volume of the sphere; the number of bits reachable at exactly that latency is a function of the surface area at that radius.<p>The volume of a sphere is (4/3)*pi*r^3. Because latency is a function of the radius (how long it takes light to travel to the edge of the sphere and back), latency <i>must</i> rise as at least the cube root of the number of bits you want to store. That is the best possible bound.<p>This implies that no algorithm is ever O(1) time for an asymptotically large number of elements accessed randomly--not even hash tables or pointer dereferences. They're at best O(n^(1/3)).
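A minimal derivation of that cube-root claim, assuming uniform bit density rho and speed-of-light signalling (both idealizations):

    N = \rho \cdot \tfrac{4}{3}\pi r^{3},
    \qquad
    t_{\mathrm{latency}} \approx \frac{2r}{c}
    \quad\Longrightarrow\quad
    t_{\mathrm{latency}} \approx \frac{2}{c}\left(\frac{3N}{4\pi\rho}\right)^{1/3}
    = \Theta\!\left(N^{1/3}\right)

Under those assumptions, random-access latency grows at least as the cube root of capacity; real memories sit well above this bound, but nothing can sit below it.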
Has anyone done a study on the optimal number of registers to have?<p>The website answers the register question well, but leads to a further question: if registers are so great, why stick with just 16/32/64/n registers? Why not have more? After all, x86-64 and ARM64 decided that having more suited them.<p>In the end it must come down to a compromise, with the downsides of having more registers possibly being some of the following:<p>* Increased instruction size (having to encode a larger register number in the bit pattern of each instruction)<p>* Increased latency for interrupts? e.g. if your CPU has 1000 registers and an interrupt occurs, you're going to end up having to save all 1000 of them somewhere. There could be some HW assist, but you'll pay the price somewhere.<p>* Extra cost for saving registers in functions. Sure, it depends upon the ABI, as some registers will be 'scratch' and not preserved between function calls, but if you've got more registers you'll end up wanting to save more of them.<p>* Algorithms might not need all the registers. I wonder what algorithm uses 20 live variables? 50? 100? At some point, those extra registers would just sit unused.<p>* Registers still need to be 'spilled' to memory. In an extreme case, you could imagine compiling a small program where every variable maps to a unique register. Ultimate speed! But aside from that optimal case, you'll still end up writing registers back to memory. It makes no difference having 100 registers if you store the results of every computation anyway.<p>Anyway, that's all speculation. I was wondering if someone had done a study. You could construct a virtual, bespoke CPU with n registers, make gcc compile some SPEC benchmarks for that ISA, and simulate it to see how much each extra register buys you. You could graph registers vs simulated runtime and see where the sweet spot is.
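Not a study, but here is a toy C illustration of the register-pressure point (the program and its constant are made up purely for illustration): it keeps 20 scalar values live across a loop, so any target with fewer than about 20 allocatable integer registers has to spill some of them to the stack, however good its allocator is. Compiling with gcc -O2 -S on x86-64 (16 GPRs) versus AArch64 (31 GPRs) and comparing the stack traffic gives a small taste of the experiment described above.

    /* toy_pressure.c -- hypothetical example, not a benchmark.
       Twenty scalars stay live across the whole loop body, so a target
       with fewer allocatable integer registers than that must spill
       some of them to the stack. */
    #include <stdio.h>

    #define MIX(x, y) ((x) * 2654435761UL + (y))

    unsigned long churn(unsigned long s) {
        unsigned long a = s + 1,  b = s + 2,  c = s + 3,  d = s + 4,  e = s + 5;
        unsigned long f = s + 6,  g = s + 7,  h = s + 8,  i = s + 9,  j = s + 10;
        unsigned long k = s + 11, l = s + 12, m = s + 13, n = s + 14, o = s + 15;
        unsigned long p = s + 16, q = s + 17, r = s + 18, t = s + 19, u = s + 20;

        for (int round = 0; round < 1000; round++) {
            /* each value feeds the next, so all twenty stay live */
            a = MIX(a, u); b = MIX(b, a); c = MIX(c, b); d = MIX(d, c);
            e = MIX(e, d); f = MIX(f, e); g = MIX(g, f); h = MIX(h, g);
            i = MIX(i, h); j = MIX(j, i); k = MIX(k, j); l = MIX(l, k);
            m = MIX(m, l); n = MIX(n, m); o = MIX(o, n); p = MIX(p, o);
            q = MIX(q, p); r = MIX(r, q); t = MIX(t, r); u = MIX(u, t);
        }
        return a ^ b ^ c ^ d ^ e ^ f ^ g ^ h ^ i ^ j
             ^ k ^ l ^ m ^ n ^ o ^ p ^ q ^ r ^ t ^ u;
    }

    int main(void) {
        printf("%lx\n", churn(42));
        return 0;
    }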
Perhaps an analogy would clarify things:<p>Let's say Bubba's watching the Super Bowl. The table in front of him is his registers, the fridge is cache, and the corner shop a quick walk away is memory.<p>He looks and sees he doesn't have any beer on the table. So he goes to the fridge, gets what he wants, and comes back to the couch. Later, Bubba runs out of beer (useful data). This is a cache miss, so he has to go down to the corner store and get some beer. Instead of just getting what he wants, maybe he also grabs some Hungry-Man frozen dinners, in case he wants them later. He goes back, puts the beer and TV dinners in the fridge, and brings some beers with him to the table. Next time he runs out of beer, he goes to the corner store, but they're all out of beer.
So he buys some seed, tills the fields, and grows his own barley. This is disk access.<p><a href="http://ucb.class.cs61c.narkive.com/pKzt4z6G/the-doe-library-cache-metaphor" rel="nofollow">http://ucb.class.cs61c.narkive.com/pKzt4z6G/the-doe-library-...</a>
There's something in between, which you will find on microcontrollers: SRAM. If you use a simple architecture like AVR, you also get completely deterministic timings for a load from SRAM (e.g. 2 cycles for AVR).<p>Edit: Chill, everyone. Yes, it's an "implementation detail of the substrate", but it is a <i>very important implementation detail</i> given that it is directly exposed to the programmer as memory, not hidden in some automagically managed cache.
Do any current ARM implementations do register renaming over a physical register set larger than the architected set?<p>Obviously Intel has been doing this for a while: Haswell has something like 168 integer registers, while the x86-64 ISA only exposes 16.<p>EDIT: Some Googling tells me that at least the Cortex-A9 mapped 32 architectural registers to 56 physical: <a href="http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0388e/CHDIBEGC.html" rel="nofollow">http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc....</a>
That article did a lot of simplifying, but probably the simplifying that was needed for the person who asked the question.<p>An interesting thing about Apple's take on AArch64 in particular, which some people have been speculating about, is how Apple's Cyclone core's memory subsystem works. ARM cores usually use the virtual (post-MMU) address of data to determine where in the cache the data lives, but if you stick with a page size as big as or bigger than the L1 size, you can start your L1 lookup at the same time as your TLB lookup and save a lot of latency. Apple's control of the OS is what lets them force 64K page sizes.
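A back-of-the-envelope version of that argument (the cache geometry below is an assumed example; the real Cyclone parameters aren't public): the set index comes from the address bits that select a byte within one way, and if those bits all fall inside the page offset they are identical before and after translation, so the cache index lookup doesn't have to wait for the TLB.

    /* vipt.c -- sketch of the "big pages let you index L1 before translation" point.
       The L1 size and associativity here are assumed example values, not
       Apple's documented design. */
    #include <stdio.h>

    static int log2u(unsigned long x) {
        int n = 0;
        while (x > 1) { x >>= 1; n++; }
        return n;
    }

    int main(void) {
        unsigned long l1_size = 64 * 1024;   /* assumed 64 KB L1 data cache */
        unsigned long ways    = 2;           /* assumed associativity       */
        unsigned long pages[] = { 4096, 16384, 65536 };

        /* bits needed to pick the set and the byte within one way */
        int untranslated_bits_needed = log2u(l1_size / ways);

        for (int i = 0; i < 3; i++) {
            int page_offset_bits = log2u(pages[i]);
            printf("page %3lu KB: need %d untranslated bits, page offset supplies %d -> %s\n",
                   pages[i] / 1024, untranslated_bits_needed, page_offset_bits,
                   untranslated_bits_needed <= page_offset_bits
                       ? "index in parallel with TLB"
                       : "must wait for TLB (or handle aliases)");
        }
        return 0;
    }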
This part stood out: "The ideal of 3-4 instructions per clock cycle can only be achieved in code that has a lot of independent instructions."<p>And a bit later: "3. Modern CPUs do crazy things internally and will happily execute your instruction stream in an order that's wildly different from how it appears in the code."<p>This may explain why a smaller executable isn't necessarily faster when execut<i>ing</i>. I guess a lot of compiler gymnastics are devoted to breaking work into independent instructions to take advantage of this.
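A small C illustration of the "lots of independent instructions" point (it assumes the compiler doesn't reassociate the floating-point sums, i.e. no -ffast-math): both functions do the same additions, but the first is one long dependency chain, while the second gives an out-of-order core four independent chains it can keep in flight at once.

    /* ilp.c -- same arithmetic, different instruction-level parallelism. */
    #include <stdio.h>

    #define N (1 << 20)
    static double data[N];

    /* one accumulator: every add waits on the previous add */
    static double sum_serial(void) {
        double s = 0.0;
        for (int i = 0; i < N; i++)
            s += data[i];
        return s;
    }

    /* four accumulators: four independent dependency chains */
    static double sum_parallel_chains(void) {
        double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        for (int i = 0; i < N; i += 4) {
            s0 += data[i];
            s1 += data[i + 1];
            s2 += data[i + 2];
            s3 += data[i + 3];
        }
        return s0 + s1 + s2 + s3;
    }

    int main(void) {
        for (int i = 0; i < N; i++)
            data[i] = 1.0 / (i + 1);
        printf("%f %f\n", sum_serial(), sum_parallel_chains());
        return 0;
    }

Timing the two (with a profiler or just a stopwatch around each call) typically shows the multi-accumulator version retiring noticeably more instructions per cycle, even though it is slightly more code.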
As a kid I had a TI 99/4a. The TMS9900 processor didn't have any on-chip general-purpose registers; it had a "workspace pointer" which let you treat a block of RAM as your registers. This was slow, but in theory allowed for convenient context switches, as you'd just load a new workspace pointer.<p>Do any modern CPUs still use an approach like this?
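Not an answer, but here's a rough C sketch of the idea for anyone who hasn't seen it (a made-up model, not real TMS9900 semantics): the "registers" are just a struct in ordinary memory, and a context switch amounts to repointing the workspace pointer rather than saving and restoring each register.

    /* workspace.c -- toy model of a TMS9900-style workspace pointer. */
    #include <stdio.h>
    #include <stdint.h>

    typedef struct {
        uint16_t r[16];                 /* the workspace: 16 "registers" in RAM */
    } workspace_t;

    static workspace_t task_a, task_b;
    static workspace_t *wp = &task_a;   /* the workspace pointer */

    static void do_work(void) {
        wp->r[0] += 1;                  /* every register access is a memory access */
        wp->r[1] = (uint16_t)(wp->r[0] * 2);
    }

    int main(void) {
        do_work();                      /* runs against task A's workspace */
        wp = &task_b;                   /* "context switch": just repoint wp */
        do_work();                      /* same code, task B's registers */
        printf("A: r1=%u  B: r1=%u\n", task_a.r[1], task_b.r[1]);
        return 0;
    }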
It's funny reading this and then remembering that on top of all this, there's paging (i.e. fetching from the hard drive).<p>It's like registers are the refrigerator, RAM is the grocery store around the corner, and page faults are trips to a grocery store in a neighboring town.<p>woooooo memory!
While I personally love this answer, I have to admit a basic physical metaphor works. If you remember an answer, recalling it is practically immediate. The further back in your records you have to go to find something, the slower it will be.<p>We have faster ways of recalling notes today than we did in the past, you might say? Well, yeah. In many respects our RAM is faster than the registers of early computers, too. But the fact that everything has gotten faster doesn't change that the things which were faster are still faster. (I'd be delighted to hear of examples where this radically changed.)
Jebus christ, it's because they're close. Like IN the CPU. Not nuzzled up against it, not ON it, not NEXT TO it.<p>Hell, if you optimize for registers and don't know why they're fast, you should be shot. Otherwise, you're using a language that doesn't really give you control over registers, so why do you care?<p>Okay okay, I like reading about the Blackbird, and I know that I know nothing about how it really works other than "lots of fuel". Still. Okay, I'm a hypocrite.
But electrical signals don't propagate at the speed of light in a vacuum (I didn't read past that point). The signals travel at about 2/3 the speed of light. This is very significant when you look at path lengths.
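To put a rough number on it (taking ~0.66c propagation and a 3 GHz clock as round figures):

    d_{\text{per cycle}} \approx 0.66 \cdot \frac{3 \times 10^{8}\,\mathrm{m/s}}{3 \times 10^{9}\,\mathrm{Hz}} \approx 6.6\,\mathrm{cm}

One clock cycle buys only a few centimetres of one-way travel, so the physical distance to DRAM shows up directly in the latency budget.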
I'll just leave this here.
<a href="https://www.usenix.org/system/files/1309_14-17_mickens.pdf" rel="nofollow">https://www.usenix.org/system/files/1309_14-17_mickens.pdf</a>
So what's the state of research in breaking out of the Von Neumann approach and going with a RAM-free architecture where the CPU has m(b)illions of registers you just do everything in? Of course it's expensive, but let's say you have effectively infinite dollars, is this a good idea?
Cites distance and cost/power as the reasons why "RAM is slow, registers are fast", without a mention of the differences between SRAM and (S)DRAM.<p>Not worth reading.