For a way more detailed look at memory architectures and implementation, check out Ulrich Drepper's classic paper "What Every Programmer Should Know About Memory"[1]<p>[1] <a href="http://www.akkadia.org/drepper/cpumemory.pdf" rel="nofollow">http://www.akkadia.org/drepper/cpumemory.pdf</a>
A simple thought experiment suffices here. What shape holds the most physical bits while minimizing the overall latency for random access? A sphere. Each bit occupies a spot packed within that sphere, and the radius of the sphere is the distance light must traverse, so it corresponds to latency.<p>"Slow" elements of the memory hierarchy sit on the outside of the sphere, while faster elements (cache, registers, etc.) are layered on the inside, like an onion. Since those inner spheres are smaller, they must by definition hold fewer bits, but they are by definition faster.<p>The total number of bits you can store within a given latency is a function of the volume of the sphere; the number of bits reachable at exactly that latency is a function of the surface area at that radius.<p>The volume of a sphere is (4/3)*pi*r^3. Because latency is a function of the radius (how long it takes light to travel to the edge of the sphere and back), latency <i>must</i> rise as at least the cube root of the number of bits you want to store. That is the best possible bound.<p>This implies that no algorithm is ever O(1) time for an asymptotically large number of elements accessed randomly--not even hash tables or pointer dereferences. They're at best O(n^(1/3)).
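A minimal derivation of that cube-root claim, assuming uniform bit density rho and speed-of-light signalling (both idealizations):

    N = \rho \cdot \tfrac{4}{3}\pi r^{3},
    \qquad
    t_{\mathrm{latency}} \approx \frac{2r}{c}
    \quad\Longrightarrow\quad
    t_{\mathrm{latency}} \approx \frac{2}{c}\left(\frac{3N}{4\pi\rho}\right)^{1/3}
    = \Theta\!\left(N^{1/3}\right)

Under those assumptions, random-access latency grows at least as the cube root of capacity; real memories sit well above this bound, but nothing can sit below it.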
Has anyone done a study on the optimal number of registers to have?<p>The website answers the register question well, but leads to a further question: if registers are so great, why stick with just 16/32/64/n registers? Why not have more? After all, x86-64 and ARM64 decided that having more suited them.<p>In the end it must come down to a compromise, with the downsides of having more registers possibly being some of the following:<p>* Increased instruction size (having to encode a larger register number in the bit pattern of each instruction)<p>* Increased latency for interrupts? e.g. if your CPU has 1000 registers and an interrupt occurs, you're going to end up having to save all 1000 of them somewhere. There could be some HW assist, but you'll pay the price somewhere.<p>* Extra cost for saving registers in functions. Sure, it depends upon the ABI, as some registers will be 'scratch' and not preserved between function calls, but if you've got more registers you'll end up wanting to save more of them.<p>* Algorithms might not need all the registers. I wonder what algorithm uses 20 live variables? 50? 100? At some point, those extra registers would just sit unused.<p>* Registers still need to be 'spilled' to memory. In an extreme case, you could imagine compiling a small program where every variable maps to a unique register. Ultimate speed! But aside from that optimal case, you'll still end up writing registers back to memory. It makes no difference having 100 registers if you store the results of every computation anyway.<p>Anyway, that's all speculation. I was wondering if someone had done a study. You could construct a virtual, bespoke CPU with n registers, make gcc compile some SPEC benchmarks for that ISA, and simulate it to see how much each extra register buys you. You could graph registers vs simulated runtime and see where the sweet spot is.
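Not a study, but here is a toy C illustration of the register-pressure point (the program and its constant are made up purely for illustration): it keeps 20 scalar values live across a loop, so any target with fewer than about 20 allocatable integer registers has to spill some of them to the stack, however good its allocator is. Compiling with gcc -O2 -S on x86-64 (16 GPRs) versus AArch64 (31 GPRs) and comparing the stack traffic gives a small taste of the experiment described above.

    /* toy_pressure.c -- hypothetical example, not a benchmark.
       Twenty scalars stay live across the whole loop body, so a target
       with fewer allocatable integer registers than that must spill
       some of them to the stack. */
    #include <stdio.h>

    #define MIX(x, y) ((x) * 2654435761UL + (y))

    unsigned long churn(unsigned long s) {
        unsigned long a = s + 1,  b = s + 2,  c = s + 3,  d = s + 4,  e = s + 5;
        unsigned long f = s + 6,  g = s + 7,  h = s + 8,  i = s + 9,  j = s + 10;
        unsigned long k = s + 11, l = s + 12, m = s + 13, n = s + 14, o = s + 15;
        unsigned long p = s + 16, q = s + 17, r = s + 18, t = s + 19, u = s + 20;

        for (int round = 0; round < 1000; round++) {
            /* each value feeds the next, so all twenty stay live */
            a = MIX(a, u); b = MIX(b, a); c = MIX(c, b); d = MIX(d, c);
            e = MIX(e, d); f = MIX(f, e); g = MIX(g, f); h = MIX(h, g);
            i = MIX(i, h); j = MIX(j, i); k = MIX(k, j); l = MIX(l, k);
            m = MIX(m, l); n = MIX(n, m); o = MIX(o, n); p = MIX(p, o);
            q = MIX(q, p); r = MIX(r, q); t = MIX(t, r); u = MIX(u, t);
        }
        return a ^ b ^ c ^ d ^ e ^ f ^ g ^ h ^ i ^ j
             ^ k ^ l ^ m ^ n ^ o ^ p ^ q ^ r ^ t ^ u;
    }

    int main(void) {
        printf("%lx\n", churn(42));
        return 0;
    }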
Perhaps an analogy would clarify things:<p>Let's say Bubba's watching the Super Bowl. The table in front of him is his registers, the fridge is cache, and the corner shop a quick walk away is memory.<p>He looks and sees he doesn't have any beer on the table. So he goes to the fridge, gets what he wants, and comes back to the couch. Later, Bubba runs out of beer (useful data). This is a cache miss, so he has to go down to the corner store and get some beer. Instead of just getting what he wants, maybe he also grabs some Hungry-Man frozen dinners, in case he wants them later. He goes back, puts the beer and TV dinners in the fridge, and brings some beers with him to the table. Next time he runs out of beer, he goes to the corner store, but they're all out of beer.
So he buys some seed, tills the fields, and grows his own barley. This is disk access.<p><a href="http://ucb.class.cs61c.narkive.com/pKzt4z6G/the-doe-library-cache-metaphor" rel="nofollow">http://ucb.class.cs61c.narkive.com/pKzt4z6G/the-doe-library-...</a>
There's something in between, which you will find on microcontrollers: SRAM. If you use a simple architecture like AVR, you also get completely deterministic timings for a load from SRAM (e.g. 2 cycles for AVR).<p>Edit: Chill, everyone. Yes, it's an "implementation detail of the substrate", but it is a <i>very important implementation detail</i> given that it is directly exposed to the programmer as memory, not hidden in some automagically managed cache.
Do any current ARM implementations do register renaming over a physical register set larger than the architected set?<p>Obviously Intel has been doing this for a while: Haswell has something like 168 integer registers, while the x86-64 ISA only exposes 16.<p>EDIT: Some Googling tells me that at least the Cortex-A9 mapped 32 architectural registers to 56 physical: <a href="http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0388e/CHDIBEGC.html" rel="nofollow">http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc....</a>
That article did a lot of simplifying, but probably the simplifying that was needed for the person who asked the question.<p>An interesting thing about Apple's take on AArch64 in particular, which some people have been speculating about, is how Apple's Cyclone core's memory subsystem works. ARM cores usually use the virtual (post-MMU) address of data to determine where in the cache the data lives, but if you stick with a page size as big as or bigger than the L1 size, you can start your L1 lookup at the same time as your TLB lookup and save a lot of latency. Apple's control of the OS is what lets them force 64K page sizes.
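A back-of-the-envelope version of that argument (the cache geometry below is an assumed example; the real Cyclone parameters aren't public): the set index comes from the address bits that select a byte within one way, and if those bits all fall inside the page offset they are identical before and after translation, so the cache index lookup doesn't have to wait for the TLB.

    /* vipt.c -- sketch of the "big pages let you index L1 before translation" point.
       The L1 size and associativity here are assumed example values, not
       Apple's documented design. */
    #include <stdio.h>

    static int log2u(unsigned long x) {
        int n = 0;
        while (x > 1) { x >>= 1; n++; }
        return n;
    }

    int main(void) {
        unsigned long l1_size = 64 * 1024;   /* assumed 64 KB L1 data cache */
        unsigned long ways    = 2;           /* assumed associativity       */
        unsigned long pages[] = { 4096, 16384, 65536 };

        /* bits needed to pick the set and the byte within one way */
        int untranslated_bits_needed = log2u(l1_size / ways);

        for (int i = 0; i < 3; i++) {
            int page_offset_bits = log2u(pages[i]);
            printf("page %3lu KB: need %d untranslated bits, page offset supplies %d -> %s\n",
                   pages[i] / 1024, untranslated_bits_needed, page_offset_bits,
                   untranslated_bits_needed <= page_offset_bits
                       ? "index in parallel with TLB"
                       : "must wait for TLB (or handle aliases)");
        }
        return 0;
    }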
This part stood out: "The ideal of 3-4 instructions per clock cycle can only be achieved in code that has a lot of independent instructions."<p>And a bit later: "3. Modern CPUs do crazy things internally and will happily execute your instruction stream in an order that's wildly different from how it appears in the code."<p>This may explain why a smaller executable isn't necessarily faster when execut<i>ing</i>. I guess a lot of compiler gymnastics are devoted to breaking work into independent instructions to take advantage of this.
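A small C illustration of the "lots of independent instructions" point (it assumes the compiler doesn't reassociate the floating-point sums, i.e. no -ffast-math): both functions do the same additions, but the first is one long dependency chain, while the second gives an out-of-order core four independent chains it can keep in flight at once.

    /* ilp.c -- same arithmetic, different instruction-level parallelism. */
    #include <stdio.h>

    #define N (1 << 20)
    static double data[N];

    /* one accumulator: every add waits on the previous add */
    static double sum_serial(void) {
        double s = 0.0;
        for (int i = 0; i < N; i++)
            s += data[i];
        return s;
    }

    /* four accumulators: four independent dependency chains */
    static double sum_parallel_chains(void) {
        double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        for (int i = 0; i < N; i += 4) {
            s0 += data[i];
            s1 += data[i + 1];
            s2 += data[i + 2];
            s3 += data[i + 3];
        }
        return s0 + s1 + s2 + s3;
    }

    int main(void) {
        for (int i = 0; i < N; i++)
            data[i] = 1.0 / (i + 1);
        printf("%f %f\n", sum_serial(), sum_parallel_chains());
        return 0;
    }

Timing the two (with a profiler or just a stopwatch around each call) typically shows the multi-accumulator version retiring noticeably more instructions per cycle, even though it is slightly more code.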
As a kid I had a TI 99/4a. The TMS9900 processor didn't have any on-chip general-purpose registers; it had a "workspace pointer" which let you treat a block of RAM as your registers. This was slow, but in theory allowed for convenient context switches, as you'd just load a new workspace pointer.<p>Do any modern CPUs still use an approach like this?
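Not an answer, but here's a rough C sketch of the idea for anyone who hasn't seen it (a made-up model, not real TMS9900 semantics): the "registers" are just a struct in ordinary memory, and a context switch amounts to repointing the workspace pointer rather than saving and restoring each register.

    /* workspace.c -- toy model of a TMS9900-style workspace pointer. */
    #include <stdio.h>
    #include <stdint.h>

    typedef struct {
        uint16_t r[16];                 /* the workspace: 16 "registers" in RAM */
    } workspace_t;

    static workspace_t task_a, task_b;
    static workspace_t *wp = &task_a;   /* the workspace pointer */

    static void do_work(void) {
        wp->r[0] += 1;                  /* every register access is a memory access */
        wp->r[1] = (uint16_t)(wp->r[0] * 2);
    }

    int main(void) {
        do_work();                      /* runs against task A's workspace */
        wp = &task_b;                   /* "context switch": just repoint wp */
        do_work();                      /* same code, task B's registers */
        printf("A: r1=%u  B: r1=%u\n", task_a.r[1], task_b.r[1]);
        return 0;
    }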
It's funny reading this and then remembering that on top of all this, there's paging (i.e. fetching from the hard drive).<p>It's like registers are the refrigerator, RAM is the grocery store around the corner, and page faults are trips to a grocery store in a neighboring town.<p>woooooo memory!
While I personally love this answer, I have to admit a basic physical metaphor works. If you remember an answer, recalling it is practically immediate. The further back in your records you have to go to find something, the slower it will be.<p>We have faster ways of recalling notes today than we did in the past, you might say? Well, yeah. In many respects our RAM is faster than the registers of early computers, too. But the fact that everything has gotten faster doesn't change that the things which were faster are still faster. (I'd be delighted to hear of examples where this radically changed.)
Jebus christ, it's because they're close. Like IN the CPU. Not nuzzled up against it, not ON it, not NEXT TO it.<p>Hell, if you optimize for registers and don't know why they're fast, you should be shot. Otherwise, you're using a language that doesn't really give you control over registers, so why do you care?<p>Okay okay, I like reading about the Blackbird, and I know that I know nothing about how it really works other than "lots of fuel". Still. Okay, I'm a hypocrite.
But electrical signals don't propagate at the speed of light in a vacuum (I didn't read past that point). The signals travel at about 2/3 the speed of light. This is very significant when you look at path lengths.
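To put a rough number on it (taking ~0.66c propagation and a 3 GHz clock as round figures):

    d_{\text{per cycle}} \approx 0.66 \cdot \frac{3 \times 10^{8}\,\mathrm{m/s}}{3 \times 10^{9}\,\mathrm{Hz}} \approx 6.6\,\mathrm{cm}

One clock cycle buys only a few centimetres of one-way travel, so the physical distance to DRAM shows up directly in the latency budget.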
I'll just leave this here.
<a href="https://www.usenix.org/system/files/1309_14-17_mickens.pdf" rel="nofollow">https://www.usenix.org/system/files/1309_14-17_mickens.pdf</a>
So what's the state of research in breaking out of the Von Neumann approach and going with a RAM-free architecture where the CPU has m(b)illions of registers you just do everything in? Of course it's expensive, but let's say you have effectively infinite dollars, is this a good idea?
Cites distance and cost/power as the reasons why "RAM is slow, registers are fast", without a mention of the differences between SRAM and (S)DRAM.<p>Not worth reading.