Most of these latencies were measured and written up for a bunch of systems by Carl Staelin and me back in the 1990s. There is a USENIX paper that describes how it was done, and the benchmarks are open source; you can apt-get them.<p><a href="http://www.bitmover.com/lmbench/lmbench-usenix.pdf" rel="nofollow">http://www.bitmover.com/lmbench/lmbench-usenix.pdf</a><p>If you look at the memory latency results carefully, you can easily read off the L1, L2, L3, main memory, and memory + TLB miss latencies.<p>If you look at them harder, you can read off cache sizes and associativity, cache line sizes, and the page size.<p>Here is a 3-D graph that Wayne Scott did at Intel from a tweaked version of the memory latency test.<p><a href="http://www.bitmover.com/lmbench/mem_lat3.pdf" rel="nofollow">http://www.bitmover.com/lmbench/mem_lat3.pdf</a><p>His standard interview question is to show the candidate that graph and say "tell me everything you can about this processor and memory system". It's usually a 2-hour conversation if the candidate is good.
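For the curious, the core of the memory latency test is tiny. Here is a minimal sketch of the idea (not the real lmbench code; the sizes and strides are made up, and the chain here is sequential, so a modern hardware prefetcher will partially hide the latency - lmbench varies the working-set size and stride to expose each level of the hierarchy):<p><pre><code> /* Pointer-chasing latency sketch: build a cyclic chain of pointers
    through a buffer, then time dependent loads. */
 #include &lt;stdio.h&gt;
 #include &lt;stdlib.h&gt;
 #include &lt;time.h&gt;

 int main(void) {
     size_t n = (64u << 20) / sizeof(void *);  /* 64 MB working set */
     size_t stride = 128 / sizeof(void *);     /* skip two cache lines */
     void **buf = malloc(n * sizeof(void *));
     if (!buf) return 1;

     /* Link every stride-th slot into one big cycle (n is a multiple
        of stride, so the chain wraps cleanly back to slot 0). */
     for (size_t i = 0; i < n; i += stride)
         buf[i] = &buf[(i + stride) % n];

     size_t iters = 20u << 20;                 /* ~20M dependent loads */
     void **p = buf;
     struct timespec t0, t1;
     clock_gettime(CLOCK_MONOTONIC, &t0);
     for (size_t i = 0; i < iters; i++)
         p = (void **)*p;                      /* each load feeds the next */
     clock_gettime(CLOCK_MONOTONIC, &t1);

     double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
     printf("%.2f ns/load (%p)\n", ns / iters, (void *)p); /* %p defeats DCE */
     free(buf);
     return 0;
 }
</code></pre>Rerun it across working-set sizes and the L1/L2/L3/DRAM plateaus described above fall right out.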
Honestly, I'd rather programmers know how to _measure_ these numbers than just have them memorized.<p>I mean, if I told them that their machine had an L3 cache now, what would they do to find out how that changes things? (This comment is also a shameless plug for the fantastic CS:APP book out of CMU.)
Anyone who hasn't heard Rear Admiral Grace Murray Hopper describe a nanosecond should check out her classic explanation:<p><a href="http://www.youtube.com/watch?v=JEpsKnWZrJ8" rel="nofollow">http://www.youtube.com/watch?v=JEpsKnWZrJ8</a>
This reminds me of one of the pages that Google has internally that, very roughly, breaks down the cost of various things so you can calculate equivalencies.<p>As an example of what I mean (i.e., these numbers and equivalencies are completely pulled out of thin air and I am not asserting them in any way):<p><pre><code> * 1 Engineer-year = $100,000
* 25T of RAM = 1 Engineer-week
* 1ms of display latency = 1 Engineer-year
</code></pre>
This allows engineers to calculate tradeoffs when they're building things and to optimize their time for business impact. E.g.: it's not worth optimizing memory usage by itself; latency is king; don't waste your time shaving yaks; etc.
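To make the tradeoff concrete with those same made-up rates (a toy calculation, nothing more):<p><pre><code> 1 Engineer-year  = $100,000  ->  1 Engineer-week ~= $2,000
 25T of RAM       = 1 Engineer-week  ->  10T of RAM ~= $800

 Two engineer-weeks (~$4,000) spent to save 10T of RAM (~$800)
 is a net loss. A month (~$8,000) spent shaving 1ms of display
 latency (= 1 Engineer-year = $100,000) pays for itself ~12x over.
</code></pre>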
These numbers by Jeff Dean are still roughly true but need to be refreshed for modern DRAM modules & controllers. Specifically, the main memory latency numbers are more applicable to DDR2 RAM than to the now widely deployed DDR3/DDR4 RAM (more channels = more latency). This has been an industry trend for a while and there's no change on the horizon. Additionally, memory access becomes more expensive because of cross-CPU chatter when validating data loads across caches (cache coherency).<p>A potential pitfall with these numbers is that they give engineers a false sense of security. They serve as a great conceptual aid - network/disk I/O are expensive and memory access is <i>relatively</i> cheap - but engineers take that to an extreme and get lackadaisical about memory access.<p>When utilizing a massive index (a btree), our search engine failed to meet its SLA because of memory access patterns. Our engineers tried things at the system level (NUMA policy) and the application level (different userspace memory managers, etc.).<p>Ultimately, it all came down to improving the efficiency of memory access. We used the Low-Level Data Structures (llds) library to get a 2x improvement in memory latency:<p><a href="https://github.com/johnj/llds" rel="nofollow">https://github.com/johnj/llds</a>
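A hedged illustration of how much the access pattern alone matters (a toy benchmark, not the llds code; numbers will vary by machine). The same bytes are summed in order and then in a shuffled order - the shuffled walk defeats the prefetcher and takes a cache/TLB miss on almost every step, much like chasing pointers through a big tree:<p><pre><code> #include &lt;stdio.h&gt;
 #include &lt;stdlib.h&gt;
 #include &lt;time.h&gt;

 #define N (1u << 24)   /* 16M ints = 64 MB */

 static double now_ns(void) {
     struct timespec t;
     clock_gettime(CLOCK_MONOTONIC, &t);
     return t.tv_sec * 1e9 + t.tv_nsec;
 }

 int main(void) {
     int *data = malloc(N * sizeof *data);
     size_t *order = malloc(N * sizeof *order);
     if (!data || !order) return 1;
     for (size_t i = 0; i < N; i++) { data[i] = 1; order[i] = i; }

     /* Fisher-Yates shuffle of the visit order */
     srand(42);
     for (size_t i = N - 1; i > 0; i--) {
         size_t j = rand() % (i + 1);
         size_t tmp = order[i]; order[i] = order[j]; order[j] = tmp;
     }

     long sum = 0;
     double t0 = now_ns();
     for (size_t i = 0; i < N; i++) sum += data[i];          /* sequential */
     double t1 = now_ns();
     for (size_t i = 0; i < N; i++) sum += data[order[i]];   /* random */
     double t2 = now_ns();

     printf("sequential: %.2f ns/elem, random: %.2f ns/elem (sum=%ld)\n",
            (t1 - t0) / N, (t2 - t1) / N, sum);
     return 0;
 }
</code></pre>On typical hardware the random walk is an order of magnitude slower per element, which is exactly the kind of gap that SLA story comes down to.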
Scaling up to human timeframes, one billion to one:<p><pre><code> Pull the trigger on a drill in your hand           0.5 s
 Pick up a drill from where you put it down           5 s
 Find the right bit in the case                       7 s
 Change bits                                         25 s
 Go get the toolkit from the truck                  100 s
 Go to the store, buy a new tool                  3,000 s
 Work from noon until 5:30                       20,000 s
 Part won't be in for three days                250,000 s
 Part won't be in until next week               500,000 s
 Almost four months                          10,000,000 s
 Eight months                                20,000,000 s
 Five years                                 150,000,000 s
</code></pre>
I believe that this originally comes from Norvig's "Teach Yourself Programming in Ten Years" article: <a href="http://norvig.com/21-days.html" rel="nofollow">http://norvig.com/21-days.html</a>
Took me forever to find this, but:<p><a href="https://plus.google.com/112493031290529814667/posts/LvhVwngPqSC" rel="nofollow">https://plus.google.com/112493031290529814667/posts/LvhVwngP...</a>
Is a single-text-file GitHub gist the best way to disseminate this piece of knowledge (originally by Peter Norvig, BTW)?<p>What about a comprehensive explanation as to why those numbers actually matter?<p>Meh.
These are good rules of thumb, but need more context. Plugging an article I wrote about this & other things a couple of years ago for FB engineering: <a href="https://www.facebook.com/note.php?note_id=461505383919" rel="nofollow">https://www.facebook.com/note.php?note_id=461505383919</a><p>The "DELETE FROM some_table" example is bogus, but the rest is still valid.
John Carmack recently used a camera to measure that it takes longer to paint the screen in response to user input than to send a packet across the Atlantic:
<a href="http://superuser.com/questions/419070/transatlantic-ping-faster-than-sending-a-pixel-to-the-screen/419167#419167" rel="nofollow">http://superuser.com/questions/419070/transatlantic-ping-fas...</a><p>I came across the post when I was looking for USB HID latency (8ms).
Considering that so many programmers are currently enthralled with JavaScript, Ruby, Python and other very very high level languages, the top half of this chart must look very mysterious and unattainable.
One of my favorite writeups on this topic is Gustavo Duarte's "What Your Computer Does While You Wait"<p><a href="http://duartes.org/gustavo/blog/post/what-your-computer-does-while-you-wait" rel="nofollow">http://duartes.org/gustavo/blog/post/what-your-computer-does...</a>
The thing here that is eye-opening to me, and relevant to any web programmer, is that accessing something from memory on another box in the same datacenter is about 25x as fast as accessing something from disk locally. I would not have guessed that!
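A back-of-the-envelope from the list itself (exact ratio depends on the hardware, and this ignores the time to actually read the data):<p><pre><code> Round trip within same datacenter:    500,000 ns  (0.5 ms)
 Disk seek:                         10,000,000 ns  (10 ms)

 10,000,000 / 500,000 = 20x
</code></pre>Which is why a memcached box across the rack can beat the local spinning disk.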
Such numbers are important to web developers, who can use them to justify looking at one technology over another - or at least to benchmark, and to use the numbers as guides when configuring and setting up services. It comes to mind why Redis can be better than MongoDB, and in which configurations.<p>They can be of help in discussions about this and that, too.<p>Adding misaligned-memory penalties, such as on word and page boundaries, would enhance such a document. This might be a good cheatsheet if one were inclined to research and make one.
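For the misaligned-access point, a rough sketch of how one might measure it (the 64-byte line size and the offsets are assumptions; a page-boundary test would use offsets near 4096 instead):<p><pre><code> #include &lt;stdio.h&gt;
 #include &lt;stdint.h&gt;
 #include &lt;string.h&gt;
 #include &lt;time.h&gt;

 #define BUF   (1u << 20)        /* 1 MB, so we mostly hit cache */
 #define ITERS 50000000UL

 static unsigned char buf[BUF + 64];
 static volatile uint64_t sink;  /* keeps the loads from being optimized out */

 static double bench(size_t offset) {
     struct timespec t0, t1;
     uint64_t sum = 0, v;
     clock_gettime(CLOCK_MONOTONIC, &t0);
     for (unsigned long i = 0; i < ITERS; i++) {
         /* memcpy keeps the unaligned read well-defined C; it
            compiles to a plain load */
         memcpy(&v, buf + offset + ((i * 64) & (BUF - 1)), sizeof v);
         sum += v;
     }
     clock_gettime(CLOCK_MONOTONIC, &t1);
     sink = sum;
     return ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / ITERS;
 }

 int main(void) {
     printf("aligned:        %.2f ns/load\n", bench(0));
     printf("straddles line: %.2f ns/load\n", bench(60)); /* crosses 64B line */
     return 0;
 }
</code></pre>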
I present to you Grace Hopper handing people nanoseconds out:<p><a href="http://www.youtube.com/watch?v=JEpsKnWZrJ8" rel="nofollow">http://www.youtube.com/watch?v=JEpsKnWZrJ8</a><p>:)
Also, these numbers don't mean much on their own. E.g. the L2 cache is faster than main memory, but that doesn't help you if you don't know how big your L2 cache is. Same for main memory vs. disk.<p>E.g. I optimized a computer vision algorithm to use the L2 and L3 caches properly (trying to reuse images or parts of images still in the caches). Started off with an Intel Xeon: 256KB L2 cache, 12MB L3 cache. Moved on to an AMD Opteron: 512KB L2 cache (yay), 6MB L3 cache (damn).<p>Also, the concept of the L2 cache has changed. Before multi-core it was bigger and was the last-level cache. Now it has become smaller, and the L3 cache is the last-level cache, with some extra issues due to sharing with the other cores.<p>The important concepts every programmer should know are the memory hierarchy and network latency. The individual numbers can be looked up on a case-by-case basis.
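Looking the sizes up case-by-case is easy, at least on Linux/glibc (these _SC_LEVEL* names are glibc extensions, not POSIX, so this is a sketch rather than portable code; other systems have /sys/devices/system/cpu/cpu0/cache/ or CPUID):<p><pre><code> #include &lt;stdio.h&gt;
 #include &lt;unistd.h&gt;

 int main(void) {
     printf("L1d: %ld KB (line %ld B)\n",
            sysconf(_SC_LEVEL1_DCACHE_SIZE) / 1024,
            sysconf(_SC_LEVEL1_DCACHE_LINESIZE));
     printf("L2:  %ld KB\n", sysconf(_SC_LEVEL2_CACHE_SIZE) / 1024);
     printf("L3:  %ld KB\n", sysconf(_SC_LEVEL3_CACHE_SIZE) / 1024);
     return 0;
 }
</code></pre>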
If this is your cup of tea, have a look at Agner Fog's resources: <a href="http://agner.org/optimize/" rel="nofollow">http://agner.org/optimize/</a><p>Also, I'd have a look at Intel's VTune or the 'perf' tool that ships with the Linux kernel.
Does anyone feel like this is sort of an apples-to-oranges table? It compares reading one item from L1 cache to reading 1 MB from memory, without adjusting for the amount of data being read (10^6 times more). It looks like the data was chosen to minimize the number of digits in the right column.
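Normalizing per byte makes the point (assuming ~8 bytes per L1 reference, which the list doesn't actually specify):<p><pre><code> L1 cache reference:     0.5 ns / ~8 B          ~= 0.06 ns/byte
 Read 1 MB from memory:  250,000 ns / 1 MB       = 0.25 ns/byte
 Read 1 MB from disk:    20,000,000 ns / 1 MB    = 20 ns/byte
</code></pre>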
What about the L3 cache?<p>What about memory access on another NUMA node?<p>What about SSDs?<p>Does a mobile phone programmer need to know the access time for disks?<p>Does an embedded systems programmer need to know any of these numbers?<p>Every programmer should know what the memory hierarchy and network latency are. (If you learn it by looking at these numbers, fine...)
I find myself thinking of figures like these every time I see results for benchmarks that barely touch the main memory brought up in debates about the relative merits of various programming languages.
When working on low latency distributed systems I more than once had to remind a client that it's a minimum of 19 milliseconds from New York to London, no matter how fast our software might be.
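The 19 ms is just the speed of light. A rough check (assuming a ~5,570 km great-circle distance):<p><pre><code> In vacuum:         5,570 km / 299,792 km/s  ~= 18.6 ms one way
 In fiber (~2/3 c): 5,570 km / ~200,000 km/s ~= 28 ms one way, ~56 ms RTT
</code></pre>Real routes are longer than the great circle, which is why production round trips of 60-80 ms are normal.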
Interesting, but unnecessary for most programmers today. I'd rather my programmers know the latency of Redis vs. memcached vs. MySQL, and data type ranges.
Network transmit time is almost irrelevant. It takes orders of magnitude more time to call the kernel, copy data, and reschedule after the operation completes than the wire time.<p>This paradox was the impetus behind InfiniBand, virtual adapters, and a host of other paradigm changes that never caught on.
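A rough way to see this for yourself (a sketch, with an assumed message size and link speed; one round trip here is two syscalls per side plus two context switches, standing in for the software half of a network operation):<p><pre><code> #include &lt;stdio.h&gt;
 #include &lt;unistd.h&gt;
 #include &lt;signal.h&gt;
 #include &lt;sys/socket.h&gt;
 #include &lt;sys/types.h&gt;
 #include &lt;time.h&gt;

 int main(void) {
     int sv[2];
     char buf[2048] = {0};
     if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) < 0) return 1;

     pid_t child = fork();
     if (child == 0) {                       /* child: echo everything */
         for (;;) {
             ssize_t n = read(sv[1], buf, sizeof buf);
             if (n <= 0) _exit(0);
             write(sv[1], buf, n);
         }
     }

     unsigned long iters = 100000;
     struct timespec t0, t1;
     clock_gettime(CLOCK_MONOTONIC, &t0);
     for (unsigned long i = 0; i < iters; i++) {
         write(sv[0], buf, sizeof buf);      /* syscall + copy in */
         read(sv[0], buf, sizeof buf);       /* block, wake, copy out */
     }
     clock_gettime(CLOCK_MONOTONIC, &t1);
     kill(child, SIGKILL);

     double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
     printf("kernel round trip: %.0f ns\n", ns / iters);
     /* 10 Gbit/s = 10 bits/ns, so 2048 B * 8 = 16384 bits -> ~1638 ns */
     printf("wire time, 2 KB at 10 Gbit/s: %.0f ns\n", 2048 * 8 / 10.0);
     return 0;
 }
</code></pre>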
Someone should add basic numbers like the nanosecond cost of a ~63-cycle integer modulo and that type of stuff.<p>That'll help bad devs realize why putting another useless cmp inside a loop is dumb, and why alternating rows in a table should NEVER be implemented with a modulo, for example.<p>Yes, I know that's not latency per se, but in the end it is too.
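A toy illustration of the alternating-rows point (worth noting: with a constant power-of-two divisor, most compilers reduce i % 2 to an AND anyway; the divide only really bites when the divisor isn't known at compile time):<p><pre><code> #include &lt;stdio.h&gt;

 #define ROWS 8

 int main(void) {
     /* modulo per row: an integer divide (tens of cycles on many CPUs
        when it can't be strength-reduced) */
     for (int i = 0; i < ROWS; i++)
         printf("row %d: %s\n", i, (i % 2 == 0) ? "light" : "dark");

     /* toggle: one XOR per row, no divide at all */
     int dark = 0;
     for (int i = 0; i < ROWS; i++) {
         printf("row %d: %s\n", i, dark ? "dark" : "light");
         dark ^= 1;
     }
     return 0;
 }
</code></pre>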