Memory bandwidth

170 points by deafcalculus, about 8 years ago

13 comments

nortiero, about 8 years ago
The article is very optimistic about memory availability per cycle; reality is far worse.

As an example, on my MacBook Air 2011 with ~10 GB/s of maximum RAM bandwidth, random access to memory can take 100 times longer than sequential access.

This is in C, with full optimizations and a very low-overhead read loop.

Using the same metrics as the author:

best case: ~3 bytes per cycle (around 6 gigabytes per second of available bandwidth)

worst case: ~0.024 bytes per cycle (every scheduler, prefetcher, and already-open-column trick mostly defeated)

Note that the worst case takes 10 seconds (!) to read and sum, in random order, all the cells of an array of 100,000,000 4-byte integers, exactly once. The main loop is light enough not to influence the test.

That's about 40 megabytes per second out of the 6,000 available.

What can I say... CPU designers are truly wizards!
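A minimal sketch of that kind of benchmark, under assumptions of my own (a shuffled index array and a small xorshift generator; the commenter's actual code isn't shown):

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N 100000000u            /* 100,000,000 4-byte integers, ~400 MB */

    static uint64_t rng_state = 88172645463325252ULL;

    /* xorshift64: portable and good enough for a shuffle */
    static uint32_t next_rand(void)
    {
        rng_state ^= rng_state << 13;
        rng_state ^= rng_state >> 7;
        rng_state ^= rng_state << 17;
        return (uint32_t)rng_state;
    }

    int main(void)
    {
        int32_t  *data = malloc(N * sizeof *data);
        uint32_t *idx  = malloc(N * sizeof *idx);   /* visit order */
        if (!data || !idx) return 1;

        for (uint32_t i = 0; i < N; i++) { data[i] = 1; idx[i] = i; }

        /* Fisher-Yates shuffle: every cell is still read exactly once */
        for (uint32_t i = N - 1; i > 0; i--) {
            uint32_t j = next_rand() % (i + 1);
            uint32_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
        }

        int64_t sum = 0;
        clock_t t0 = clock();
        for (uint32_t i = 0; i < N; i++)
            sum += data[idx[i]];    /* random order defeats the prefetcher */
        double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

        printf("sum=%lld  %.2f s  %.1f MB/s\n",
               (long long)sum, secs, (double)N * sizeof *data / secs / 1e6);
        return 0;
    }

Swapping data[idx[i]] for data[i] gives the sequential baseline; the gap between the two runs is the point of the comment.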
adrianmonk, about 8 years ago
It's good to see some attention paid to the subject, but it's not exactly a *new* revelation.

Probably 15-20 years ago, my computer architecture professor commented that, while CPUs of the past had relatively anemic number-crunching power, all the pipelining, high clock speeds, and other advancements in more recent CPUs had changed that, but corresponding advancements in memory bandwidth had not been made.

Which in turn meant that the way you go about optimizing code needed to change. In the past it had been mostly about finding ways to eliminate instructions or simplify expressions, because the things holding you back were the ALU and the ability to plow through instructions. With all of that sped up massively but much less improvement in RAM, it became very important to start thinking about memory access patterns and caches.

We even had a homework assignment to optimize a matrix multiply, and the lesson learned was that the dominating factor wasn't what the code in the innermost loop looks like; it was which direction you proceed through the matrices (row by row vs. column by column), because that determines the memory access patterns.
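To illustrate with a sketch of such an assignment (my reconstruction, not the actual homework): in row-major C, reordering the textbook triple loop from i-j-k to i-k-j leaves the arithmetic untouched but turns the strided column walk over B into sequential row traversals.

    /* C = A * B for n x n row-major matrices. */

    /* Textbook i-j-k order: the inner loop walks B down a column,
     * touching a new cache line on every iteration (stride n). */
    void matmul_ijk(int n, const double *A, const double *B, double *C) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double s = 0.0;
                for (int k = 0; k < n; k++)
                    s += A[i*n + k] * B[k*n + j];
                C[i*n + j] = s;
            }
    }

    /* i-k-j order: B and C are now both accessed row by row, so every
     * access is stride-1 and each cache line is fully used. */
    void matmul_ikj(int n, const double *A, const double *B, double *C) {
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++)
                C[i*n + j] = 0.0;
            for (int k = 0; k < n; k++) {
                double a = A[i*n + k];
                for (int j = 0; j < n; j++)
                    C[i*n + j] += a * B[k*n + j];
            }
        }
    }

Same flop count, wildly different runtimes once the matrices stop fitting in cache.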
luckydude, about 8 years ago
What a refreshing article. This guy gets it and puts things into perspective in a way that you (or at least I) don't see very often.

Worth a read if you are just scanning the comments.
socmag, about 8 years ago
Really great article and comments :)

Just as a point of reference, I currently see around 1.25-1.5 ops/cycle on carefully crafted, highly parallel, lock-free, stall-free code, say running on 8 threads. Code that has 0.01% branch misprediction.

Unfortunately, in my case, as others mention, access to the I/O ports and memory latency is a real limiting factor. The CPU is just... waiting.

Getting to the Holy Grail that Ryg talks about of 3 instructions per cycle is *really* hard with non-vectorizable workloads, like screwing around with hash tables that have no chance of fitting in L1/L3, where you can't really make much use of SIMD even if you are paying attention to cache lines.

Most apps barely scrape by at 0.5 instructions/cycle or worse and spend most of their time bouncing on the kernel for stupid stuff. Not good.

Absolutely <3 performance freaks!
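For anyone wanting to see where their own code lands, hardware counters report it directly; a standard Linux perf invocation (the program name is a placeholder):

    perf stat -e cycles,instructions,branch-misses ./your_program

perf derives the instructions-per-cycle ratio from those counts, which is exactly the figure being traded in this thread.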
white-flame, about 8 years ago
A commonly recurring subject on the 6502 forums is what a "modern" 6502 would look like. It always boils down to not really being able to replicate the 1-cycle access to any byte in RAM. Changes to the memory access model require changing so much that the result is unrecognizable as a 6502 derivative, in programming style.

Of course, you could put 64 KB of SRAM on the CPU die, but the size and power of the RAM would dwarf the processor, and you'd get an old-school 6502, arguably not a modern take on the concept. If you want more memory, you simply can't replicate the 1980s access model at anything approaching today's speeds.
spullara, about 8 years ago
I remember when DRAM "wait states" were 0, 1, or 2. They aren't advertised anymore because they are 1-2 orders of magnitude worse than that now.

https://en.wikipedia.org/wiki/Wait_state
pslam, about 8 years ago
> Note that we're 40 years of Moore's law scaling later and the available memory bandwidth per instruction has gone down substantially.

This is unfair, and (this is me being unfair now) this article is missing the woods for the trees.

There were engineering pressures which resulted in the current ratios. I think it is fairer to say that the current situation, where there's about (hand-waving) 1 byte of bandwidth per instruction per core, reflects the kinds of tasks we expect our machines to be doing. It is very rare to find a task which is memory-speed bound. There's almost always substantial processing to be done with the data.

It's not even that hard to increase memory bandwidth. You "just" double up memory channels. This is of course expensive, which in turn is a back-pressure that results in architectures designed around the current sweet spot.

I'm also puzzled that the author thinks the situation is "worse". Pretty much every desktop-class machine I used from about 1990-2005 was extremely starved of memory bandwidth, and cores did a far worse job of hiding latency (out-of-order execution, register renaming, etc.). What we have today feels fairly comfortable, to me at least, with some outlier tasks where you might want more (and then you obtain specialist hardware).

This is a long-winded way of saying: the current core vs. memory speed ratio is a sweet spot of cost vs. efficiency, and it works well given the tasks and algorithms we execute on these machines. What we had in the olden days was just a case of unoptimized architecture that hadn't converged yet.
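To put rough numbers on that hand-wave (illustrative figures of mine, not from the comment): a 3 GHz core sustaining 3 instructions per cycle retires about 9 billion instructions per second, while dual-channel DDR4-3200 delivers roughly 51.2 GB/s shared across, say, 8 cores:

    51.2 GB/s / 8 cores          ~ 6.4 GB/s per core
    6.4 GB/s / 9 G instr/s       ~ 0.7 bytes per instruction

which lands right around the quoted 1 byte per instruction per core.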
PaulHoule, about 8 years ago
Memory bandwidth holds back a number of "revolutionary" advances in computing: FPGAs, ASICs, and new processor types, not to mention Indium Phosphide parts that clock at 50 GHz and could go to 200 or more.
ant6n, about 8 years ago
"Code that runs OK-ish on that CPU averages around 1 instruction per cycle, well-optimized code around 3 instructions per cycle."

Is unoptimized code really that bad?

I thought that between modern compilers and out-of-order execution supporting 4 ops/cycle, a new Kaby Lake CPU would get more than 1 instruction per cycle overall. Or are the branch delays just killing the performance? (Code is small relative to data, so most of it should be in cache most of the time.)
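Branch mispredictions are one drain, but dependent cache misses are usually the bigger one. A sketch of my own (not from the thread) of the pattern that caps IPC no matter how wide the core issues:

    struct node { struct node *next; long payload; };

    /* Each load's address depends on the previous load's result, so the
     * out-of-order engine cannot overlap the misses. With nodes scattered
     * across a large heap, every step costs a full memory round trip and
     * effective IPC collapses far below the 4-wide issue limit. */
    long sum_list(const struct node *p) {
        long s = 0;
        while (p) {
            s += p->payload;
            p = p->next;   /* serialized, latency-bound dependency chain */
        }
        return s;
    }

Plenty of "OK-ish" code is secretly this shape: linked structures, hash probes, and virtual dispatch all build chains the compiler cannot remove.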
pqr, about 8 years ago
See also the Roofline model: https://en.wikipedia.org/wiki/Roofline_model

and the Memory Wall: https://www.google.co.in/search?q=memory+wall+computer+architecture
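For reference, the roofline bound itself is one line: attainable performance is the lesser of the compute peak and what the memory system can feed at the kernel's arithmetic intensity I (flops per byte moved), given peak bandwidth B:

    P = min(P_peak, B * I)

Kernels whose I sits below the ridge point P_peak / B are memory-bandwidth-bound, which is the regime this whole thread is about.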
KaiserPro, about 8 years ago
One interesting quirk we found with Nvidia cards was that the CPU-to-GPU transfer speed was affected by how many physical CPUs you had.

I assumed it was because, with newer Intel CPUs, the PCIe bridge moved onto the die, away from the motherboard, so there was CPU affinity for the GPU.

The difference in bandwidth was around 15%.
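That affinity can be checked from software. A hedged sketch, assuming a Linux box where the GPU's PCIe root complex sits on NUMA node 0 (the node numbers and benchmark name are placeholders):

    numactl --cpunodebind=0 --membind=0 ./transfer_benchmark
    numactl --cpunodebind=1 --membind=1 ./transfer_benchmark

Comparing the two runs should reproduce a gap like the ~15% above, since the second forces every transfer across the inter-socket link.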
socmag, about 8 years ago
One thing I'd love to see is enough L1 data cache per core for a modest amount of stack space.

Not so critical on GPUs, but it would make a huge difference for CPUs and for languages that can take advantage of it.

Give me a meg or two to play with. It would make a huge difference for data-heavy workloads.

You could even go as far as having a separate cache just for the stack.

I mean, by its very definition it is isolated. It's the "register file" of CISC machines.
xchaotic, about 8 years ago
Should we design the processing units differently, then?

There's been talk of hundreds of cores, each with local memory, but that would only speed up a certain subset of computing problems...