Most of these latencies were measured and written up for a bunch of systems by Carl Staelin and me back in the 1990s. There is a USENIX paper that describes how it was done, and the benchmarks are open source; you can apt-get them.<p><a href="http://www.bitmover.com/lmbench/lmbench-usenix.pdf" rel="nofollow">http://www.bitmover.com/lmbench/lmbench-usenix.pdf</a><p>If you look at the memory latency results carefully, you can easily read off the L1, L2, L3, main memory, and memory + TLB miss latencies.<p>If you look at them harder, you can read off cache sizes and associativity, cache line sizes, and the page size.<p>Here is a 3-D graph that Wayne Scott did at Intel from a tweaked version of the memory latency test.<p><a href="http://www.bitmover.com/lmbench/mem_lat3.pdf" rel="nofollow">http://www.bitmover.com/lmbench/mem_lat3.pdf</a><p>His standard interview question is to show the candidate that graph and say "tell me everything you can about this processor and memory system". It's usually a 2-hour conversation if the candidate is good.
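For the curious, the core of the memory latency test is tiny. Here is a minimal sketch of the idea (not the real lmbench code; the sizes and strides are made up, and the chain here is sequential, so a modern hardware prefetcher will partially hide the latency - lmbench varies the working-set size and stride to expose each level of the hierarchy):<p><pre><code> /* Pointer-chasing latency sketch: build a cyclic chain of pointers
    through a buffer, then time dependent loads. */
 #include &lt;stdio.h&gt;
 #include &lt;stdlib.h&gt;
 #include &lt;time.h&gt;

 int main(void) {
     size_t n = (64u << 20) / sizeof(void *);  /* 64 MB working set */
     size_t stride = 128 / sizeof(void *);     /* skip two cache lines */
     void **buf = malloc(n * sizeof(void *));
     if (!buf) return 1;

     /* Link every stride-th slot into one big cycle (n is a multiple
        of stride, so the chain wraps cleanly back to slot 0). */
     for (size_t i = 0; i < n; i += stride)
         buf[i] = &buf[(i + stride) % n];

     size_t iters = 20u << 20;                 /* ~20M dependent loads */
     void **p = buf;
     struct timespec t0, t1;
     clock_gettime(CLOCK_MONOTONIC, &t0);
     for (size_t i = 0; i < iters; i++)
         p = (void **)*p;                      /* each load feeds the next */
     clock_gettime(CLOCK_MONOTONIC, &t1);

     double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
     printf("%.2f ns/load (%p)\n", ns / iters, (void *)p); /* %p defeats DCE */
     free(buf);
     return 0;
 }
</code></pre>Rerun it across working-set sizes and the L1/L2/L3/DRAM plateaus described above fall right out.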
Honestly, I'd rather programmers know how to _measure_ these numbers than just have them memorized.<p>I mean, if I told them that their machine had an L3 cache now, what would they do to find out how that changes things? (This comment is also a shameless plug for the fantastic CS:APP book out of CMU.)
Anyone who hasn't heard Rear Admiral Grace Murray Hopper describe a nanosecond should check out her classic explanation:<p><a href="http://www.youtube.com/watch?v=JEpsKnWZrJ8" rel="nofollow">http://www.youtube.com/watch?v=JEpsKnWZrJ8</a>
This reminds me of one of the pages that Google has internally that, very roughly, breaks down the cost of various things so you can calculate equivalencies.<p>As an example of what I mean (i.e., these numbers and equivalencies are completely pulled out of thin air and I am not asserting them in any way):<p><pre><code> * 1 Engineer-year = $100,000
* 25T of RAM = 1 Engineer-week
* 1ms of display latency = 1 Engineer-year
</code></pre>
This allows engineers to calculate tradeoffs when they're building things and to optimize their time for business impact. E.g.: it's not worth optimizing memory usage by itself; latency is king; don't waste your time shaving yaks; etc.
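To make the tradeoff concrete with those same made-up rates (a toy calculation, nothing more):<p><pre><code> 1 Engineer-year  = $100,000  ->  1 Engineer-week ~= $2,000
 25T of RAM       = 1 Engineer-week  ->  10T of RAM ~= $800

 Two engineer-weeks (~$4,000) spent to save 10T of RAM (~$800)
 is a net loss. A month (~$8,000) spent shaving 1ms of display
 latency (= 1 Engineer-year = $100,000) pays for itself ~12x over.
</code></pre>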
These numbers by Jeff Dean are still roughly true but need to be refreshed for modern DRAM modules & controllers. Specifically, the main memory latency numbers are more applicable to DDR2 RAM than to the now widely deployed DDR3/DDR4 RAM (more channels = more latency). This has been an industry trend for a while and there's no change on the horizon. Additionally, memory access becomes more expensive because of cross-CPU chatter when validating data loads across caches (cache coherency).<p>A potential pitfall with these numbers is that they give engineers a false sense of security. They serve as a great conceptual aid - network/disk I/O are expensive and memory access is <i>relatively</i> cheap - but engineers take that to an extreme and get lackadaisical about memory access.<p>When utilizing a massive index (a btree), our search engine failed to meet its SLA because of memory access patterns. Our engineers tried things at the system level (NUMA policy) and the application level (different userspace memory managers, etc.).<p>Ultimately, it all came down to improving the efficiency of memory access. We used the Low-Level Data Structures (llds) library to get a 2x improvement in memory latency:<p><a href="https://github.com/johnj/llds" rel="nofollow">https://github.com/johnj/llds</a>
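A hedged illustration of how much the access pattern alone matters (a toy benchmark, not the llds code; numbers will vary by machine). The same bytes are summed in order and then in a shuffled order - the shuffled walk defeats the prefetcher and takes a cache/TLB miss on almost every step, much like chasing pointers through a big tree:<p><pre><code> #include &lt;stdio.h&gt;
 #include &lt;stdlib.h&gt;
 #include &lt;time.h&gt;

 #define N (1u << 24)   /* 16M ints = 64 MB */

 static double now_ns(void) {
     struct timespec t;
     clock_gettime(CLOCK_MONOTONIC, &t);
     return t.tv_sec * 1e9 + t.tv_nsec;
 }

 int main(void) {
     int *data = malloc(N * sizeof *data);
     size_t *order = malloc(N * sizeof *order);
     if (!data || !order) return 1;
     for (size_t i = 0; i < N; i++) { data[i] = 1; order[i] = i; }

     /* Fisher-Yates shuffle of the visit order */
     srand(42);
     for (size_t i = N - 1; i > 0; i--) {
         size_t j = rand() % (i + 1);
         size_t tmp = order[i]; order[i] = order[j]; order[j] = tmp;
     }

     long sum = 0;
     double t0 = now_ns();
     for (size_t i = 0; i < N; i++) sum += data[i];          /* sequential */
     double t1 = now_ns();
     for (size_t i = 0; i < N; i++) sum += data[order[i]];   /* random */
     double t2 = now_ns();

     printf("sequential: %.2f ns/elem, random: %.2f ns/elem (sum=%ld)\n",
            (t1 - t0) / N, (t2 - t1) / N, sum);
     return 0;
 }
</code></pre>On typical hardware the random walk is an order of magnitude slower per element, which is exactly the kind of gap that SLA story comes down to.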
Scaling up to human timeframes, one billion to one:<p><pre><code> Pull the trigger on a drill in your hand           0.5 s
 Pick up a drill from where you put it down           5 s
 Find the right bit in the case                       7 s
 Change bits                                         25 s
 Go get the toolkit from the truck                  100 s
 Go to the store, buy a new tool                  3,000 s
 Work from noon until 5:30                       20,000 s
 Part won't be in for three days                250,000 s
 Part won't be in until next week               500,000 s
 Almost four months                          10,000,000 s
 Eight months                                20,000,000 s
 Five years                                 150,000,000 s
</code></pre>
I believe that this originally comes from Norvig's "Teach Yourself Programming in Ten Years" article: <a href="http://norvig.com/21-days.html" rel="nofollow">http://norvig.com/21-days.html</a>
Took me forever to find this, but:<p><a href="https://plus.google.com/112493031290529814667/posts/LvhVwngPqSC" rel="nofollow">https://plus.google.com/112493031290529814667/posts/LvhVwngP...</a>
Is a single-text-file GitHub gist the best way to disseminate this piece of knowledge (originally by Peter Norvig, BTW)?<p>What about a comprehensive explanation as to why those numbers actually matter?<p>Meh.
These are good rules of thumb, but need more context. Plugging an article I wrote about this & other things a couple of years ago for FB engineering: <a href="https://www.facebook.com/note.php?note_id=461505383919" rel="nofollow">https://www.facebook.com/note.php?note_id=461505383919</a><p>The "DELETE FROM some_table" example is bogus, but the rest is still valid.
John Carmack recently used a camera to measure that it takes longer to paint the screen in response to user input than to send a packet across the Atlantic:
<a href="http://superuser.com/questions/419070/transatlantic-ping-faster-than-sending-a-pixel-to-the-screen/419167#419167" rel="nofollow">http://superuser.com/questions/419070/transatlantic-ping-fas...</a><p>I came across the post when I was looking for USB HID latency (8ms).
Considering that so many programmers are currently enthralled with JavaScript, Ruby, Python and other very very high level languages, the top half of this chart must look very mysterious and unattainable.
One of my favorite writeups on this topic is Gustavo Duarte's "What Your Computer Does While You Wait"<p><a href="http://duartes.org/gustavo/blog/post/what-your-computer-does-while-you-wait" rel="nofollow">http://duartes.org/gustavo/blog/post/what-your-computer-does...</a>
The thing here that is eye-opening to me, and relevant to any web programmer, is that accessing something from memory on another box in the same datacenter is about 25x as fast as accessing something from disk locally. I would not have guessed that!
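A back-of-the-envelope from the list itself (exact ratio depends on the hardware, and this ignores the time to actually read the data):<p><pre><code> Round trip within same datacenter:    500,000 ns  (0.5 ms)
 Disk seek:                         10,000,000 ns  (10 ms)

 10,000,000 / 500,000 = 20x
</code></pre>Which is why a memcached box across the rack can beat the local spinning disk.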
Such numbers are important to web developers, who can use them to justify looking at one technology over another - or at least to benchmark, and to use the numbers as guides when configuring and setting up services. It comes to mind why Redis can be better than MongoDB, and in which configurations.<p>They can be of help in discussions about this and that, too.<p>Adding misaligned-memory penalties, such as on word and page boundaries, would enhance such a document. This might be a good cheatsheet if one were inclined to research and make one.
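For the misaligned-access point, a rough sketch of how one might measure it (the 64-byte line size and the offsets are assumptions; a page-boundary test would use offsets near 4096 instead):<p><pre><code> #include &lt;stdio.h&gt;
 #include &lt;stdint.h&gt;
 #include &lt;string.h&gt;
 #include &lt;time.h&gt;

 #define BUF   (1u << 20)        /* 1 MB, so we mostly hit cache */
 #define ITERS 50000000UL

 static unsigned char buf[BUF + 64];
 static volatile uint64_t sink;  /* keeps the loads from being optimized out */

 static double bench(size_t offset) {
     struct timespec t0, t1;
     uint64_t sum = 0, v;
     clock_gettime(CLOCK_MONOTONIC, &t0);
     for (unsigned long i = 0; i < ITERS; i++) {
         /* memcpy keeps the unaligned read well-defined C; it
            compiles to a plain load */
         memcpy(&v, buf + offset + ((i * 64) & (BUF - 1)), sizeof v);
         sum += v;
     }
     clock_gettime(CLOCK_MONOTONIC, &t1);
     sink = sum;
     return ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / ITERS;
 }

 int main(void) {
     printf("aligned:        %.2f ns/load\n", bench(0));
     printf("straddles line: %.2f ns/load\n", bench(60)); /* crosses 64B line */
     return 0;
 }
</code></pre>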
I present to you Grace Hopper handing people nanoseconds out:<p><a href="http://www.youtube.com/watch?v=JEpsKnWZrJ8" rel="nofollow">http://www.youtube.com/watch?v=JEpsKnWZrJ8</a><p>:)
Also, these numbers don't mean much on their own. E.g. the L2 cache is faster than main memory, but that doesn't help you if you don't know how big your L2 cache is. Same for main memory vs. disk.<p>E.g. I optimized a computer vision algorithm to use the L2 and L3 caches properly (trying to reuse images or parts of images still in the caches). Started off with an Intel Xeon: 256KB L2 cache, 12MB L3 cache. Moved on to an AMD Opteron: 512KB L2 cache (yay), 6MB L3 cache (damn).<p>Also, the concept of the L2 cache has changed. Before multi-core it was bigger and was the last-level cache. Now it has become smaller, and the L3 cache is the last-level cache, with some extra issues due to sharing with the other cores.<p>The important concepts every programmer should know are the memory hierarchy and network latency. The individual numbers can be looked up on a case-by-case basis.
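Looking the sizes up case-by-case is easy, at least on Linux/glibc (these _SC_LEVEL* names are glibc extensions, not POSIX, so this is a sketch rather than portable code; other systems have /sys/devices/system/cpu/cpu0/cache/ or CPUID):<p><pre><code> #include &lt;stdio.h&gt;
 #include &lt;unistd.h&gt;

 int main(void) {
     printf("L1d: %ld KB (line %ld B)\n",
            sysconf(_SC_LEVEL1_DCACHE_SIZE) / 1024,
            sysconf(_SC_LEVEL1_DCACHE_LINESIZE));
     printf("L2:  %ld KB\n", sysconf(_SC_LEVEL2_CACHE_SIZE) / 1024);
     printf("L3:  %ld KB\n", sysconf(_SC_LEVEL3_CACHE_SIZE) / 1024);
     return 0;
 }
</code></pre>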
If this is your cup of tea, have a look at Agner Fog's resources: <a href="http://agner.org/optimize/" rel="nofollow">http://agner.org/optimize/</a><p>Also, I'd have a look at Intel's VTune or the 'perf' tool that ships with the Linux kernel.
Does anyone feel like this is sort of an apples-to-oranges table? It compares reading one item from L1 cache to reading 1 MB from memory, without adjusting for the amount of data being read (10^6 times more). It looks like the data was chosen to minimize the number of digits in the right column.
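Normalizing per byte makes the point (assuming ~8 bytes per L1 reference, which the list doesn't actually specify):<p><pre><code> L1 cache reference:     0.5 ns / ~8 B          ~= 0.06 ns/byte
 Read 1 MB from memory:  250,000 ns / 1 MB       = 0.25 ns/byte
 Read 1 MB from disk:    20,000,000 ns / 1 MB    = 20 ns/byte
</code></pre>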
What about the L3 cache?<p>What about memory access on another NUMA node?<p>What about SSDs?<p>Does a mobile phone programmer need to know the access time for disks?<p>Does an embedded systems programmer need to know any of these numbers?<p>Every programmer should know what the memory hierarchy and network latency are. (If you learn it by looking at these numbers, fine...)
I find myself thinking of figures like these every time I see results for benchmarks that barely touch the main memory brought up in debates about the relative merits of various programming languages.
When working on low latency distributed systems I more than once had to remind a client that it's a minimum of 19 milliseconds from New York to London, no matter how fast our software might be.
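The 19 ms is just the speed of light. A rough check (assuming a ~5,570 km great-circle distance):<p><pre><code> In vacuum:         5,570 km / 299,792 km/s  ~= 18.6 ms one way
 In fiber (~2/3 c): 5,570 km / ~200,000 km/s ~= 28 ms one way, ~56 ms RTT
</code></pre>Real routes are longer than the great circle, which is why production round trips of 60-80 ms are normal.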
Interesting, but unnecessary for most programmers today. I'd rather my programmers know the latency of Redis vs. memcached vs. MySQL, and data type ranges.
Network transmit time is almost irrelevant. It takes orders of magnitude more time to call the kernel, copy data, and reschedule after the operation completes than the wire time.<p>This paradox was the impetus behind InfiniBand, virtual adapters, and a host of other paradigm changes that never caught on.
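A rough way to see this for yourself (a sketch, with an assumed message size and link speed; one round trip here is two syscalls per side plus two context switches, standing in for the software half of a network operation):<p><pre><code> #include &lt;stdio.h&gt;
 #include &lt;unistd.h&gt;
 #include &lt;signal.h&gt;
 #include &lt;sys/socket.h&gt;
 #include &lt;sys/types.h&gt;
 #include &lt;time.h&gt;

 int main(void) {
     int sv[2];
     char buf[2048] = {0};
     if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) < 0) return 1;

     pid_t child = fork();
     if (child == 0) {                       /* child: echo everything */
         for (;;) {
             ssize_t n = read(sv[1], buf, sizeof buf);
             if (n <= 0) _exit(0);
             write(sv[1], buf, n);
         }
     }

     unsigned long iters = 100000;
     struct timespec t0, t1;
     clock_gettime(CLOCK_MONOTONIC, &t0);
     for (unsigned long i = 0; i < iters; i++) {
         write(sv[0], buf, sizeof buf);      /* syscall + copy in */
         read(sv[0], buf, sizeof buf);       /* block, wake, copy out */
     }
     clock_gettime(CLOCK_MONOTONIC, &t1);
     kill(child, SIGKILL);

     double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
     printf("kernel round trip: %.0f ns\n", ns / iters);
     /* 10 Gbit/s = 10 bits/ns, so 2048 B * 8 = 16384 bits -> ~1638 ns */
     printf("wire time, 2 KB at 10 Gbit/s: %.0f ns\n", 2048 * 8 / 10.0);
     return 0;
 }
</code></pre>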
Someone should add basic numbers like the nanosecond cost of a ~63-cycle integer modulo and that type of stuff.<p>That'll help bad devs realize why putting another useless cmp inside a loop is dumb, and why alternating rows in a table should NEVER be implemented with a modulo, for example.<p>Yes, I know that's not latency per se, but in the end it is too.
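A toy illustration of the alternating-rows point (worth noting: with a constant power-of-two divisor, most compilers reduce i % 2 to an AND anyway; the divide only really bites when the divisor isn't known at compile time):<p><pre><code> #include &lt;stdio.h&gt;

 #define ROWS 8

 int main(void) {
     /* modulo per row: an integer divide (tens of cycles on many CPUs
        when it can't be strength-reduced) */
     for (int i = 0; i < ROWS; i++)
         printf("row %d: %s\n", i, (i % 2 == 0) ? "light" : "dark");

     /* toggle: one XOR per row, no divide at all */
     int dark = 0;
     for (int i = 0; i < ROWS; i++) {
         printf("row %d: %s\n", i, dark ? "dark" : "light");
         dark ^= 1;
     }
     return 0;
 }
</code></pre>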