A friend of mine is working on a custom memory controller for specialized devices at Google Cloud. The software people are putting tremendous pressure on the hardware people to make sure the memory controller handles all common power-of-two strides well, in addition to random strides.

There are some (very primitive by software standards) "hash functions" implemented in hardware to make that work. IIRC, at the time they treated the relevant bits of an address as a vector over GF(2) and multiplied it by a certain hardcoded invertible matrix. (Of course that matrix needs properties beyond being invertible.) The idea is that if a single bit in the input address changes, every output bit should be equally likely to flip or stay the same (the avalanche effect).
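To make that concrete, here's a minimal sketch of what such a GF(2) hash might look like if you modeled it in software. This assumes GCC/Clang's __builtin_parity, and the matrix constants below are made up; a real design would have to verify the matrix is invertible and has good avalanche behavior:

    #include <stdint.h>

    /* Sketch of a GF(2) address hash: each output bit is the parity (XOR)
     * of the input address bits selected by one row of a hardcoded bit
     * matrix.  The row values here are arbitrary placeholders. */
    static uint16_t hash_index_bits(uint16_t addr_bits)
    {
        /* One 16-bit row per output bit; row i selects which input bits
         * feed output bit i. */
        static const uint16_t matrix[16] = {
            0x9d2f, 0x4b71, 0xe8a3, 0x35c6, 0x7a19, 0xc4e2, 0x5b38, 0xa671,
            0x1f4c, 0x83d5, 0x6e92, 0xd027, 0x49b8, 0xb153, 0x2c7e, 0xf605,
        };
        uint16_t out = 0;
        for (int i = 0; i < 16; i++) {
            /* parity(addr_bits AND row) = dot product over GF(2) */
            out |= (uint16_t)(__builtin_parity(addr_bits & matrix[i]) << i);
        }
        return out;
    }

Each matrix row just says which input bits get XORed together to form one output bit, which is why this is cheap to do in hardware: it's a tree of XOR gates per output bit.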
I can't tell you how many large power-of-two-stride loops I've seen. I'd propose that CPU (or compiler) makers make sure those aren't 10x slower than nearby non-power-of-two strides.
I’m confused by the cache graphs: it looks like the direct-mapped cache has multiple lines drawn from a single cache line to different blocks in main memory.

From what I understand, a single cache line wouldn’t hold data from multiple main-memory blocks at once.
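For reference, the index computation I have in mind for a direct-mapped cache is roughly this (the 64-byte line size and 512-line count are just toy numbers to make the question concrete):

    #include <stdint.h>
    #include <stdio.h>

    /* Toy direct-mapped cache geometry: 64-byte lines, 512 lines.
     * Every memory block whose index bits match lands in the same
     * cache line. */
    #define LINE_SIZE  64
    #define NUM_LINES  512

    static unsigned cache_line_index(uintptr_t addr)
    {
        return (addr / LINE_SIZE) % NUM_LINES;
    }

    int main(void)
    {
        /* Addresses exactly NUM_LINES * LINE_SIZE apart collide:
         * all three of these print the same index. */
        printf("%u %u %u\n",
               cache_line_index(0x10000),
               cache_line_index(0x10000 + NUM_LINES * LINE_SIZE),
               cache_line_index(0x10000 + 2 * NUM_LINES * LINE_SIZE));
        return 0;
    }

So many memory blocks can map to the same line, but only one of them is resident there at a time. Is that many-to-one mapping what the arrows in the graph are supposed to show?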
Old memory from the 1990s: some high-end (for the time) CPU had a large, fast L2 cache...which was merely 2(?)-way associative. But to mitigate the performance hit from cache thrashing, it also had a small-ish 64(?)-way associative "helper" D-cache.

(It might have been HP PA-RISC, from an article in Byte Magazine - but that's digging deep into the old memory murk.)

Might anyone recall more?
I have a(nother) question. If you’re iterating through a long array with a stride larger than the cache line size, then every iteration of the loop causes a cache miss. So whether the stride is a power of two or not shouldn’t matter: every a[i]++ misses either way. What gives?
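Concretely, the loop I'm picturing is roughly this (the function name and the assumption of 64-byte cache lines are just placeholders for the sake of the question):

    #include <stddef.h>

    /* Stride is larger than a 64-byte cache line, so each access
     * touches a new line and should miss regardless of whether the
     * stride is, say, 4096 or 4100. */
    void strided_increment(int *a, size_t len, size_t stride)
    {
        for (size_t i = 0; i < len; i += stride)
            a[i]++;
    }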