A few years ago I made a (non-interactive) visualization similar to this that made its way to Hacker News [1]. It puts these numbers on a linear scale, so you can get a feel for the baffling difference between your most local caches and your hard drive:

http://i.imgur.com/X1Hi1.gif

Note: this is a huge gif; zoom in to the top.

[1] HN discussion: http://news.ycombinator.com/item?id=702713
In practice, instruction-level parallelism and data access patterns make a big difference. If each memory access takes 100 ns but you can do 100 at a time, most of the time there is no practical difference from each access taking 1 ns with only one in flight at a time; an observer (either a human or another computer over a network) will not notice 100 ns. The numbers listed in this chart are pure latency numbers, but what most people care about when talking about cache/memory speed is bandwidth.

On an Intel Sandy Bridge processor, each L1 access takes 4 cycles, each L2 access takes 12 cycles, each L3 access takes about 30 cycles, and each main memory access takes about 65 ns. Assuming a 3 GHz processor, this would make you think you can do 750 MT/s from L1, 250 MT/s from L2, 100 MT/s from L3, and 15 MT/s from RAM.

Now imagine three different access patterns on a 1 GB array:

1) reading through the array sequentially

2) reading through the array in a random access pattern

3) reading through the array in a pattern where the next index is determined by the value at the current index

If you benchmark these three access patterns (a sketch follows below), you will see that:

1) sequential access can do 3750 MT/s

2) random access can do 75 MT/s

3) data-dependent access can do 15 MT/s

You might guess that sequential access is fast because it is a very predictable access pattern, but it is still 5x faster than the speed indicated by the latency of L1. Maybe you'd think it's prefetching into registers or something? But notice the difference between random access and data-dependent access; this is probably not what you expected at all. Why is random access 5x faster than data-dependent access? Because on Sandy Bridge, each hyper-threaded core can do 5 memory accesses in parallel. The same fact explains why sequential access appears to be 5x faster than L1 latency would allow.

What does this mean in practice? To do anything practical with these latency numbers, you also need to know the parallelism of your processor and the parallelizability of your access patterns. The pure latency number only matters if you are limited to one access at a time.
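As a rough illustration, here is a minimal C sketch of the three-pattern benchmark. The array size, the use of Sattolo's algorithm to build a single-cycle permutation, rand() as the RNG, and the timing code are all illustrative assumptions, not the measurement code behind the numbers above:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N ((size_t)1 << 27)   /* 2^27 8-byte entries = 1 GB */

    static double now(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    int main(void) {
        size_t *a = malloc(N * sizeof *a);     /* the 1 GB array */
        size_t *idx = malloc(N * sizeof *idx); /* random visit order */
        if (!a || !idx) return 1;

        /* Sattolo's algorithm: a random permutation that is a single
           cycle, so the pointer chase below visits every element exactly
           once. rand() is fine for a sketch, not for a real benchmark. */
        for (size_t i = 0; i < N; i++) idx[i] = i;
        for (size_t i = N - 1; i > 0; i--) {
            size_t j = (size_t)rand() % i;
            size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
        }
        for (size_t i = 0; i < N; i++) a[i] = idx[i]; /* a[] is the chain */

        size_t sum = 0, p = 0;
        double t0 = now();
        for (size_t i = 0; i < N; i++) sum += a[i];      /* 1) sequential */
        double t1 = now();
        for (size_t i = 0; i < N; i++) sum += a[idx[i]]; /* 2) random: the
            addresses are independent, so loads can overlap */
        double t2 = now();
        for (size_t i = 0; i < N; i++) p = a[p];         /* 3) data-dependent:
            each load must complete before the next can start */
        double t3 = now();

        printf("sequential:     %7.0f MT/s\n", N / (t1 - t0) / 1e6);
        printf("random:         %7.0f MT/s\n", N / (t2 - t1) / 1e6);
        printf("data-dependent: %7.0f MT/s\n", N / (t3 - t2) / 1e6);
        printf("(checksum: sum=%zu p=%zu)\n", sum, p); /* keep loops alive */
        free(a); free(idx);
        return 0;
    }

In pattern 2 the addresses are random but known up front, so the CPU can keep several loads in flight; in pattern 3 each load depends on the previous one, so they serialize. Compile with something like cc -O2 bench.c and expect the ordering sequential > random > data-dependent, though the exact ratios depend on the machine.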
Also recommended: Jeff Dean's tech talk *Building Software Systems At Google and Lessons Learned*, which references those latency numbers.

Slides: http://research.google.com/people/jeff/Stanford-DL-Nov-2010.pdf

Video: http://www.youtube.com/watch?v=modXC5IWTJI

Also, a previous thread on latency numbers:

http://news.ycombinator.com/item?id=4047623
With timings that differ by several orders of magnitude, I'd just ignore the exact constant factors, as they change too frequently. Also, there is a difference between latency and bandwidth, and the chart is simply inconsistent about which it reports.

CPU cycle ~ 1 time unit: anything you do at all. The cost of doing business.

CPU cache hit ~ 10 time units: something close, in time or location, to something else that was just accessed.

Memory access ~ 100 time units: something that has most likely been accessed recently, but not immediately previously in the code.

Disk access ~ 1,000,000 time units: it's been paged out to disk because it's accessed too infrequently or is too big to fit in memory.

Network access ~ 100,000,000 time units: it's not even here. Damn. Go grab it from that long series of tubes. Roughly the same amount of time it takes to blink your eye.
A couple of nits to pick:

(1) Mutex performance isn't constant across OSes. My own measurements tell me FreeBSD is around 20x slower than Linux at locking and unlocking a mutex in the uncontended case.

(2) Contention makes mutexes a lot slower. If a mutex is already locked, you have to wait until it unlocks before you can proceed; it doesn't matter how fast mutexes are if another thread holds the one you're waiting on for seconds at a time. And even if other threads aren't holding the mutex for long, putting the waiting thread to sleep involves a syscall, which is relatively expensive.
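As a rough sketch of the uncontended measurement (assuming pthreads; the iteration count and timing approach are arbitrary illustrative choices, not my actual test harness):

    #include <pthread.h>
    #include <stdio.h>
    #include <time.h>

    int main(void) {
        enum { ITERS = 10 * 1000 * 1000 };
        pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < ITERS; i++) {
            /* Single thread, so the lock is never contended and only
               the fast path (no syscall) gets measured. */
            pthread_mutex_lock(&m);
            pthread_mutex_unlock(&m);
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9
                  + (t1.tv_nsec - t0.tv_nsec);
        printf("%.1f ns per lock/unlock pair\n", ns / ITERS);
        return 0;
    }

Build with cc -O2 bench.c -lpthread and run it on different OSes to compare. Under contention (point 2) the numbers look nothing like this, because waiters have to go through the kernel.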
What implication for my RoR programming am I supposed to draw from knowing that a mutex lock/unlock takes 17 ns? Especially if the web site is hosted on Amazon? Should every programmer really know this?
It amazes me that something happening locally (in the ~10 ms range, say a database query) is even on the same scale as something traveling halfway around the world. The internet wouldn't really work without this fact, but I'm very thankful for it. Simply incredible.
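For a rough sense of why that works at all: half the Earth's circumference is about 20,000 km, and light in optical fiber travels at roughly 200,000 km/s (about two thirds of c), so the one-way propagation delay alone is on the order of 20,000 km / 200,000 km/s = 0.1 s, i.e. 100 ms. Physics sets that floor; routing and queueing only add to it.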
When we talk about latency, nanoseconds, etc., I am always fond of recalling what Grace Hopper had to say about it to put it into perspective.

https://www.youtube.com/watch?v=JEpsKnWZrJ8
I think one important latency is left out here...

Latency to screen display: approximately 70 ms (between 3 and 4 display frames on a typical LCD).

http://en.wikipedia.org/wiki/Display_lag

Obviously, you only have one display latency in your pipeline, but it's still typically the biggest single latency.
Very nice.

It would be nicer, though, if the source code of the page were not shown in a frame below. That screen space would be better used to show the rest of the actual diagram. You can already view a page's source in your browser, so why waste screen space and add annoying extra scrollbars for that?
I would beware of just extrapolating numbers along a curve. There are physical constraints that make it impossible to improve certain numbers by much, and sometimes logical constraints do the same.

We need somewhat accurate measurements on modern hardware as a base. I would not be surprised if some values have gotten *worse* over the years.