
Latency Numbers Every Programmer Should Know

232 points, by pwendell, over 12 years ago

22 comments

lars, over 12 years ago
A few years ago I made a (non-interactive) visualization similar to this, which made its way to Hacker News [1]. It puts these numbers on a linear scale, so you can get a feel for the baffling difference between your most local caches and your hard drive:

http://i.imgur.com/X1Hi1.gif

Note: this is a huge gif; zoom in to the top.

[1] HN discussion: http://news.ycombinator.com/item?id=702713
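lars's linear-scale point can also be made numerically: rescale the canonical latency figures so that one L1 reference lasts a full second, and the gulf between cache and disk becomes vivid. A minimal sketch, using the commonly cited ~2012 figures rather than values read off the chart itself:

```python
# Commonly cited ~2012 latency figures in nanoseconds (assumed,
# not read off the chart itself).
LATENCIES_NS = {
    "L1 cache reference": 0.5,
    "Main memory reference": 100,
    "Read 1 MB sequentially from memory": 250_000,
    "Disk seek": 10_000_000,
    "Round trip CA -> Netherlands": 150_000_000,
}

def rescale(ns, base_ns=0.5):
    """Rescale so that one L1 reference (base_ns) lasts one second."""
    return ns / base_ns

for name, ns in LATENCIES_NS.items():
    print(f"{name:40s} {rescale(ns):>15,.0f} s")
```

On this scale a main-memory reference takes a bit over three minutes, a disk seek about eight months, and the transatlantic round trip roughly nine and a half years.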
jeffffff, over 12 years ago
In practice, instruction-level parallelism and data access patterns make a big difference. If each memory access takes 100 ns but you can do 100 at a time, most of the time there is no practical difference from each access taking 1 ns where you can only do one at a time. An observer (either a human or another computer over a network) will not notice 100 ns. The numbers listed in this chart are pure latency numbers; what most people care about when talking about cache/memory speed is bandwidth.

On an Intel Sandy Bridge processor, each L1 access takes 4 cycles, each L2 access takes 12 cycles, each L3 access takes about 30 cycles, and each main memory access takes about 65 ns. Assuming a 3 GHz processor, this would make you think you can do 750 MT/s from L1, 250 MT/s from L2, 100 MT/s from L3, and 15 MT/s from RAM.

Now imagine three different access patterns on a 1 GB array:

1) reading through the array sequentially;

2) reading through the array in a random access pattern;

3) reading through the array in a pattern where the next index is determined by the value at the current index.

If you benchmark these three access patterns, you will see that:

1) sequential access can do 3750 MT/s;

2) random access can do 75 MT/s;

3) data-dependent access can do 15 MT/s.

You might guess that sequential access is fast because it is a very predictable access pattern, but it is still 5x faster than the speed indicated by the latency of L1. Maybe you'd think it's prefetching into registers or something? But notice the difference between random access and data-dependent access. This is probably not what you expected at all! Why is random access 5x faster than data-dependent access? Because on Sandy Bridge, each hyper-threaded core can do 5 memory accesses in parallel. This also explains why sequential access seems to be 5x faster than the speed of L1.

What does this mean in practice? That to do anything practical with these latency numbers, you also need to know the parallelism of your processor and the parallelizability of your access patterns. The pure latency number only matters if you are limited to one access at a time.
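jeffffff's three access patterns can be sketched in code. This is an illustrative Python harness, so the absolute timings are dominated by interpreter overhead and will not reproduce the MT/s figures above; the point is the construction, especially the data-dependent pattern, which is built as one random cycle so that every load depends on the previous load's value:

```python
import random
import time

N = 500_000

def sequential_sum(data):
    # Pattern 1: predictable, prefetch- and pipeline-friendly.
    total = 0
    for i in range(len(data)):
        total += data[i]
    return total

def random_sum(data, order):
    # Pattern 2: random, but the indices are known independently of the
    # loads, so the hardware can keep several accesses in flight.
    total = 0
    for i in order:
        total += data[i]
    return total

def pointer_chase(next_idx):
    # Pattern 3: the next address depends on the current load's value,
    # so the accesses cannot overlap at all.
    i, steps = 0, 0
    while True:
        i = next_idx[i]
        steps += 1
        if i == 0:
            return steps

# Build a single random cycle over all N indices for pattern 3.
perm = list(range(1, N))
random.shuffle(perm)
next_idx = [0] * N
cur = 0
for j in perm:
    next_idx[cur] = j
    cur = j
next_idx[cur] = 0  # close the cycle back at index 0

data = list(range(N))
order = list(range(N))
random.shuffle(order)

for name, run in [("sequential", lambda: sequential_sum(data)),
                  ("random", lambda: random_sum(data, order)),
                  ("data-dependent", lambda: pointer_chase(next_idx))]:
    t0 = time.perf_counter()
    run()
    print(f"{name:15s} {time.perf_counter() - t0:.3f} s")
```

In a compiled language on real hardware, the gaps between the three patterns are roughly the factors described above; interpreter overhead flattens them here, but the dependency structure is the same.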
kqr2, over 12 years ago
Also recommended: Jeff Dean's tech talk "Building Software Systems At Google and Lessons Learned", which references these latency numbers.

Slides: http://research.google.com/people/jeff/Stanford-DL-Nov-2010.pdf

Video: http://www.youtube.com/watch?v=modXC5IWTJI

Also, a previous thread on latency numbers: http://news.ycombinator.com/item?id=4047623
patrickwiseman, over 12 years ago
With timings that are several orders of magnitude apart, I'd just ignore the constant factors, as they change too frequently. Also, there is a difference between latency and bandwidth, and the chart is simply inconsistent about it.

CPU cycle ~ 1 time unit: anything you do at all. The cost of doing business.

CPU cache hit ~ 10 time units: something that was located close, in time or location, to something else that was just accessed.

Memory access ~ 100 time units: something that most likely has been accessed recently, but not immediately previously in the code.

Disk access ~ 1,000,000 time units: it's been paged out to disk because it's accessed too infrequently or is too big to fit in memory.

Network access ~ 100,000,000 time units: it's not even here. Damn. Go grab it from that long series of tubes. Roughly the same amount of time it takes to blink your eye.
LnxPrgr3, over 12 years ago
A couple of nits to pick:

(1) Mutex performance isn't constant across OSes. My own measurements tell me FreeBSD and Mac OS X are around 20x slower than Linux at locking and unlocking a mutex in the uncontended case.

(2) Contention makes mutexes a lot slower. If a mutex is already locked, you have to wait until it unlocks before you can proceed. It doesn't matter how fast mutexes are if there's a thread locking the one you're waiting on for seconds at a time. And even if other threads aren't holding the mutex long, putting the waiting thread to sleep involves a syscall, which is relatively expensive.
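The uncontended/contended gap LnxPrgr3 describes is easy to demonstrate. A minimal sketch with Python's threading.Lock; the absolute per-operation cost reflects the interpreter rather than the ~17 ns C figure in the chart, but the shape of the result is the same: the uncontended fast path is cheap, while a contended acquire costs however long the current holder keeps the lock.

```python
import threading
import time

lock = threading.Lock()

# Uncontended case: a single thread takes and releases the lock.
ITERS = 100_000
t0 = time.perf_counter()
for _ in range(ITERS):
    lock.acquire()
    lock.release()
uncontended_ns = (time.perf_counter() - t0) / ITERS * 1e9
print(f"uncontended: {uncontended_ns:.0f} ns per lock/unlock")

# Contended case: another thread holds the lock for 50 ms, so our
# acquire has to block until the holder is done.
HOLD_S = 0.05
holder_has_lock = threading.Event()

def holder():
    with lock:
        holder_has_lock.set()
        time.sleep(HOLD_S)

t = threading.Thread(target=holder)
t.start()
holder_has_lock.wait()        # the holder definitely owns the lock now
t0 = time.perf_counter()
with lock:                    # blocks until the holder releases
    pass
contended_s = time.perf_counter() - t0
t.join()
print(f"contended:   {contended_s * 1e3:.1f} ms for one acquire")
```

The contended acquire takes roughly the holder's entire 50 ms hold time, several orders of magnitude more than the uncontended path, which is LnxPrgr3's second point in miniature.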
dgudkov, over 12 years ago
What implication for my RoR programming am I supposed to draw from knowing that a mutex lock/unlock takes 17 ns? Especially if the web site is hosted on Amazon? Should every programmer really know it?
TallboyOne, over 12 years ago
It amazes me that something happening locally (in the ~10ms range, say a database query) is even in the same scale as something traveling halfway around the world. The internet wouldn't really work without this fact, but I'm very thankful for it. Simply incredible.
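The round-the-world number has a hard physical floor, which is worth making concrete. A back-of-envelope sketch; the ~8,800 km great-circle distance (roughly SF to Amsterdam) and the two-thirds-of-c speed of light in fiber are assumptions, not values from the chart:

```python
# Assumed great-circle distance, roughly SF to Amsterdam.
DISTANCE_KM = 8_800
# Light in fiber travels at about 2/3 of c, i.e. ~200,000 km/s.
C_FIBER_KM_S = 200_000

one_way_ms = DISTANCE_KM / C_FIBER_KM_S * 1_000
rtt_floor_ms = 2 * one_way_ms
print(f"physical floor on the round trip: {rtt_floor_ms:.0f} ms")
```

Against the chart's ~150 ms, that ~88 ms floor means routing, queueing, and non-great-circle cable paths account for less than half of the observed time, which is why the number barely moves from year to year.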
pooriaazimi, over 12 years ago
I don't understand. Why is "reading sequentially from hard disk" getting faster? Is a 2012 5400 rpm drive different from a 2002 5400 rpm one?!
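One plausible answer to pooriaazimi's question (the replies themselves did not survive the scrape): spindle speed fixes rotational latency, but sequential throughput is bits-under-the-head per second, and areal density kept growing. A sketch with assumed, purely illustrative track capacities:

```python
def seq_throughput_mb_s(bits_per_track, rpm):
    """Sequential read rate = bits under the head per revolution
    times revolutions per second, converted to MB/s."""
    revs_per_sec = rpm / 60
    return bits_per_track * revs_per_sec / 8 / 1e6

# Same 5400 rpm spindle, but a denser platter moves more data
# under the head per revolution. Track capacities are assumed.
print(seq_throughput_mb_s(2.5e6, 5400))  # ~2002-class platter: 28.125 MB/s
print(seq_throughput_mb_s(1.0e7, 5400))  # ~2012-class platter: 112.5 MB/s
```

So the rpm stayed the same while sequential reads got roughly 4x faster, purely from packing more bits onto each track.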
Bjoern, over 12 years ago
When we talk about latency, nanoseconds, etc., I'm always fond of recalling what Grace Hopper had to say to put it into perspective:

https://www.youtube.com/watch?v=JEpsKnWZrJ8
yxhuvud, over 12 years ago
I take it that mutex is uncontended? A contended one would be a lot slower.
markild, over 12 years ago
I know how this works, but it's quite depressing to see the "Packet roundtrip CA to Netherlands" unchanged throughout the years.
gilgoomesh, over 12 years ago
I think one important latency is left out here: latency to screen display, approximately 70 ms (between 3 and 4 display frames on a typical LCD).

http://en.wikipedia.org/wiki/Display_lag

Obviously, you only have one display latency in your pipeline, but it's still typically the biggest single latency.
Aardwolf, over 12 years ago
Very nice.

It would be nice if the source code of the page were not shown in a frame below, though. That screen space would be better used to show the rest of the actual diagram. You can already see the source code of a page with your browser; why waste screen space and annoying extra scrollbars on that?
jlouis, over 12 years ago
I would beware of just extrapolating numbers on a curve. There are physical constraints which make it impossible to improve certain numbers by much, and sometimes logical constraints do the same.

We need somewhat accurate measurements on modern hardware as a base. I would not be surprised if some values have gotten *worse* over the years.
Aardwolf, over 12 years ago
I can't wait for the year 2020! 0ns reads from SSD :D
yatsyk, over 12 years ago
Very useful!

What is the license of the code? I'd like to play with the visualization some day. Some of the bars are not very comprehensible due to their height.
gems, over 12 years ago
Almost everybody who studies computer architecture, even a little, is aware of these numbers.

This almost seems like promotion.
B-Con, over 12 years ago
I have to admit that I was not expecting mutexes to be anywhere near that fast.
hmsimha, over 12 years ago
I think the conversion to ms in the rightmost column is off by a factor of 10.
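Whether the chart's rightmost column really is off cannot be checked from the text alone, but the conversions themselves are easy to pin down. A sketch using commonly cited values (assumed, not read off the chart):

```python
NS_PER_MS = 1_000_000  # 1 ms = 1,000 us = 1,000,000 ns

def ns_to_ms(ns):
    return ns / NS_PER_MS

# Spot-check a few entries.
print(ns_to_ms(150_000_000))  # CA -> Netherlands round trip: 150.0 ms
print(ns_to_ms(10_000_000))   # disk seek: 10.0 ms
print(ns_to_ms(100))          # main memory reference: 0.0001 ms
```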
jahansafd, over 12 years ago
How can we decrease the packet round trip from CA to the Netherlands?
pithon, over 12 years ago
1000 nanoseconds is only "approximately" equal to 1 microsecond?
joeywas, over 12 years ago
Has latency between CA and Norway always been constant, as the chart displays?