I respect Brendan, and although it is an interesting article, I have to disagree with him: the OS tells you about OS CPU utilization, not CPU micro-architecture functional-unit utilization. So if the OS uses a CPU to run code until a physical interrupt or a software trap happens, the CPU has been doing work for that whole period. Unless the CPU were able to do a "free" context switch to a cached area instead of having to wait on, e.g., a cache miss (hint: SMT/"hyperthreading" was invented exactly for that use case), the CPU really is busy.

If in the future (TM) using CPU performance counters for every process becomes really "free" (as in "gratis" or "cheap"), the OS could report badly performing processes for the reasons exposed in the article (low IPC indicating poor memory access patterns, unoptimized code, code using too-small buffers for I/O, causing system performance degradation through excessive kernel processing time, etc.), showing the user that despite high CPU usage, the CPU is not getting enough work done (in that sense I could agree with the article).
The problem is that IPC is also a crude metric. Even leaving aside fundamental algorithmic differences, an implementation of some algorithm with an IPC of 0.5 is not necessarily slower than an implementation that somehow manages to hit every execution port and deliver an IPC of 4.

I can improve the IPC of almost any algorithm (assuming it is not already very high) by slipping lots of useless or nearly useless cheap integer operations into the code.

People always tell you "branch misses are bad" and "cache misses are bad". You should always ask: "compared to what?" If it was going to take you 20 cycles' worth of frenzied, 4-instructions-per-clock work to calculate something you could keep in a big table in L2 (assuming you aren't contending for it), you might be better off eating the cache miss.

Similarly, you could "improve" your IPC by avoiding branch misses (assuming no side effects): calculate both sides of an unpredictable branch and select the result with CMOV. This will save you branch misses and increase your IPC, but it may not make your code any faster (if the cost of the extra work is bigger than the cost of the branch misses).
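A toy sketch of that trade-off (hypothetical code, not from the article; gcc or clang at -O2 may or may not actually emit a cmov for the second loop, so check the asm):

    /* branchy: one potential branch miss per element on random input */
    long sum_big_branchy(const int *v, long n) {
        long sum = 0;
        for (long i = 0; i < n; i++)
            if (v[i] > 128)            /* unpredictable on random data */
                sum += v[i];
        return sum;
    }

    /* "branchless": compute both sides, select without branching.
       More instructions retired, higher IPC -- but only a win if the
       branch above was actually mispredicting. */
    long sum_big_branchless(const int *v, long n) {
        long sum = 0;
        for (long i = 0; i < n; i++) {
            long keep = v[i] > 128;    /* 0 or 1, no branch */
            sum += keep * v[i];        /* both "sides" computed */
        }
        return sum;
    }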
IPC is amazing. We had some "slow" code, did a little profiling, and found that a hash lookup function was showing very low IPC about half the time. Turns out, the hash table was mapped across two memory domains on the server (NUMA), and a memory lookup from one processor into the other processor's memory was significantly slower.

perf on a binary that is properly instrumented (so it can show you per-source-line or per-instruction data) is really great.
I use `htop` for all of my Linux machines. It's great software. But one of my biggest gripes is that "Detailed CPU Time" (F2 -> Display options -> Detailed CPU time) is not enabled by default.

Enabling it gives you a clearer picture not just of stalls but also of CPU steal from "noisy neighbors" -- other guests assigned to the same host.

I've seen CPU steal cause kernel warnings of "soft lockups". I've also seen zombie processes occur. I suspect they're related, but it's only anecdotal: I'm not sure how to investigate.

It's pretty amazing what kind of patterns you can identify when you've got stuff like that running. Machine seems to be non-responsive? Open up htop, see lots of grey... okay, since *all* data is on the network, that means it's a data bottleneck; over the network means it could be bottlenecked at network bandwidth, or the back-end SAN could be bottlenecked.

Fun fact: Windows Server doesn't like having its disk IO go unserviced for minutes at a time. It's no fun having another team come over angry because you're bluescreening their production boxes.
Perf is fascinating to dive into. If you are using C and gcc, you can use perf record/report to see, line by line and instruction by instruction, where you are getting slowdowns.

One of my favorite school assignments: we were given an intentionally bad implementation of the Game of Life compiled with -O3 and had to make it run faster without changing compiler flags. It's sort of mind-boggling how fast computers can do stuff if you can reduce the problem to fixed-stride for loops over arrays that can be fully pipelined.
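For the flavor of it, a minimal (hypothetical) example of the kind of loop that gets that speedup -- the same additions, wildly different throughput:

    #include <stddef.h>

    /* stride-1: streams through memory; prefetches, pipelines, vectorizes */
    long sum_stride1(const long *a, size_t n) {
        long s = 0;
        for (size_t i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    /* large fixed stride over an n*stride-element array: same n additions,
       but roughly one cache miss per element */
    long sum_strided(const long *a, size_t n, size_t stride) {
        long s = 0;
        for (size_t i = 0; i < n * stride; i += stride)
            s += a[i];
        return s;
    }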
We are what we measure.

Very true that at 100% CPU utilization the processor is often waiting on bus traffic (loading caches, loading RAM, loading instructions, decoding instructions); only rarely is the CPU _doing_ useful work.

Whether this counts as useful work depends on the context of what you are measuring. The initial access of a buffer almost universally stalls (unless you prefetched 100+ instructions ago), but starting to stream that data into L1 is useful work.

Aiming for 100% of peak IPC is _beyond_ difficult even for simple algorithms and critical hot-path functions. You not only need the assembler's cooperation (to ensure decoder alignment), you need to know _what_ processor you are running on to know the constraints of its decoder, uOP cache, and uOP cache alignment.

---

Perf gives you the ability to capture per-PID counters. Generally just look at cycles passed vs. instructions retired: that gives you a general overview of stalls. Once you dig into IPC, front-end stalls, and back-end stalls, you start to see the turtles.
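For the curious, here is a rough sketch of reading those two counters for a PID with perf_event_open(2) and computing IPC from them -- error handling trimmed, and you'll need appropriate privileges or a permissive perf_event_paranoid setting:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <linux/perf_event.h>

    static int open_counter(unsigned long long config, pid_t pid, int group_fd) {
        struct perf_event_attr pe;
        memset(&pe, 0, sizeof(pe));
        pe.type = PERF_TYPE_HARDWARE;
        pe.size = sizeof(pe);
        pe.config = config;
        pe.disabled = (group_fd == -1);   /* only the group leader starts disabled */
        pe.exclude_kernel = 1;
        pe.exclude_hv = 1;
        return syscall(SYS_perf_event_open, &pe, pid, -1, group_fd, 0);
    }

    int main(int argc, char **argv) {
        pid_t pid = (argc > 1) ? (pid_t)atoi(argv[1]) : 0;  /* 0 = this process */
        int cyc = open_counter(PERF_COUNT_HW_CPU_CYCLES, pid, -1);
        int ins = open_counter(PERF_COUNT_HW_INSTRUCTIONS, pid, cyc);

        ioctl(cyc, PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP);
        ioctl(cyc, PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP);
        sleep(1);                         /* one-second measurement window */
        ioctl(cyc, PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);

        long long cycles = 0, instr = 0;
        read(cyc, &cycles, sizeof(cycles));
        read(ins, &instr, sizeof(instr));
        printf("cycles=%lld instructions=%lld IPC=%.2f\n",
               cycles, instr, cycles ? (double)instr / cycles : 0.0);
        return 0;
    }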
At Tera, we were able to issue 1 instruction/cycle/CPU. The hardware could measure the number of missed opportunities (we called them phantoms) over a period of time, so we could report percent utilization accurately. Indeed, we could graph it over time and map periods of high/low utilization back to points in the code (typically parallel/serial loops), with notes about what the compiler thought was going on. It was a pretty useful arrangement.
Your CPU will execute a program just as fast at 5% utilization as at 75%.

We honestly need a tool that compares I/O, memory fetches, cache misses, TLB misses, page-outs, CPU usage, interrupts, context switches, etc., all in one place.
There's also loadavg. I've encountered a lot of people who think that a high loadavg MUST imply a lot of CPU use. Not on Linux, at least:

> The first three fields in this file are load average figures giving the number of jobs in the run queue (state R) or waiting for disk I/O (state D) averaged over 1, 5, and 15 minutes.

Nobody knows about the "or waiting for disk I/O (state D)" bit. So a bunch of processes doing disk I/O can cause loadavg spikes, but there can still be plenty of spare CPU.
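One quick (sketch) way to tell the two apart: the fourth field of /proc/loadavg counts only currently-runnable tasks, so a big gap between it and the averages points at D-state waiters rather than CPU demand:

    #include <stdio.h>

    int main(void) {
        double m1, m5, m15;
        int running, total;
        FILE *f = fopen("/proc/loadavg", "r");
        if (!f) return 1;
        /* e.g. "8.42 7.90 6.12 2/1290 12345" */
        fscanf(f, "%lf %lf %lf %d/%d", &m1, &m5, &m15, &running, &total);
        fclose(f);
        printf("1m load %.2f, but only %d task(s) runnable right now\n", m1, running);
        /* high load + few runnable tasks => mostly disk-I/O (D-state) waiters */
        return 0;
    }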
It seems to me that the CPU utilization metric (from /proc/stat) has far more problems than misreporting memory stalls.

As far as I understand it, the metric works as follows: at every clock interrupt (every 4ms on my machine), the system checks which process is currently running, before invoking the scheduler:
- If the idle process is running, the time is accounted as idle.
- Otherwise the processor is regarded as utilized.

(This is what I got from reading the docs and digging into the source code. I am not 100% confident I understand this completely at this point. If you know better, please tell me!)

There are many problems with this approach:
Every time slice (4ms) is accounted either as completely utilized or completely free, yet there are many reasons for processes going on-CPU or off-CPU outside of clock interrupts; blocking syscalls are the most obvious one.
In the end, a time slice might be used by multiple different processes and interrupt handlers, but if at the very end of the slice the idle thread happens to be on CPU, the whole slice is counted as idle time!

See also: https://github.com/torvalds/linux/blob/master/Documentation/cpu-load.txt
The article is interesting, but IPC is the wrong metric to focus on. Frankly, the only thing we should care about when it comes to performance is time to finish a task. It doesn't matter if it takes more instructions to compute something, as long as it's done faster.

The other metric you can mix with execution time is energy efficiency. That's about it. IPC is not a very good proxy. Fun to look at, but likely to be highly misleading.
Instructions per cycle: https://en.wikipedia.org/wiki/Instructions_per_cycle

What does IPC tell me about where my code could/should be async so that it's not stalled waiting for IO? Is combined IO rate a useful metric for this?

There's an interesting "Cost per GFLOPS" table here: https://en.wikipedia.org/wiki/FLOPS

Btw these are great, thanks: http://www.brendangregg.com/linuxperf.html

(I still couldn't fill this out if I tried: http://www.brendangregg.com/blog/2014-08-23/linux-perf-tools-linuxcon-na-2014.html)
Another related tool I found interesting: perf c2c. It lets you find the cost of false sharing (cache-line contention, etc.).

https://joemario.github.io/blog/2016/09/01/c2c-blog/
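If you want something to point it at, a minimal false-sharing reproducer looks like this (a hypothetical toy, not from the linked post; build with -pthread and run under perf c2c record):

    #include <pthread.h>
    #include <stdio.h>

    /* two hot counters on the same cache line: false sharing */
    static struct { long a, b; } shared;
    /* padding b out to its own 64-byte line makes the contention vanish */

    static void *bump_a(void *p) {
        for (long i = 0; i < 100000000L; i++)
            __atomic_fetch_add(&shared.a, 1, __ATOMIC_RELAXED);
        return NULL;
    }
    static void *bump_b(void *p) {
        for (long i = 0; i < 100000000L; i++)
            __atomic_fetch_add(&shared.b, 1, __ATOMIC_RELAXED);
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, bump_a, NULL);
        pthread_create(&t2, NULL, bump_b, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("%ld %ld\n", shared.a, shared.b);
        return 0;
    }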
By clicking through some links in the article I found this: http://www.brendangregg.com/blog/2014-10-31/cpi-flame-graphs.html

Now I wonder how much manual work it would take to produce these combined flame graphs with CPI/IPC information. My cursory search found nary a mention after 2015... perhaps this is still hard and complicated.

To me it seems really useful to know *why* a function takes so long (waiting or calculating) and not "merely" how long it takes -- even if the information is not perfectly reliable and can't be measured without affecting execution.
Interestingly, IPC is also used to qualify new chipsets at embedded companies: run the same code on the newer-generation chipset and see if the IPC beats the previous one. IPC is one of the main criteria for whether the new chipset is a hit or a miss (others being power, etc.).
I didn't know about tiptop, and it sounds interesting. Running it, though, it only shows "?" in the Ncycle, Minstr, IPC, %MISS, %BMIS and %BUS columns for a lot of processes, including, but not limited to, Firefox.
I can't see a mention of it here or on the original page, so IMO it's worth pointing out a utility that you most likely already have installed on your Linux machine: *vmstat*. Just run:

    vmstat 3

and you'll get a running breakdown of CPU usage (split into user/system) and a breakdown of 'idle' time (split into actual idle time and time waiting for I/O or some kinds of locks).

The '3' in the command line is just how many seconds the stats are averaged over; I'd recommend using 3+ to smooth out bursts of activity on a fairly steady-state system.
CPU util might be misleading, but CPU idle under a threshold at peak [1] means you need more idle CPU, and you can get that by getting more machines, getting better machines, or getting better code.

Only when I'm trying to get better code do I need to care about IPC and cache stalls. I may also want better code to improve the overall speed of execution.

[1] ~50% if you have a pair of redundant machines and load scales nicely; maybe 20% idle or even less if you have a large number of redundant machines and the load balances easily over them.
> You can figure out what %CPU really means by using additional metrics, including instructions per cycle (IPC)

Correct me if I am wrong, but this won't work for spinlocks in busy loops: you do have a lot of instructions being executed, but the whole point of the loop is to wait for a cache line to change hands, and as such this time should really be counted as "stalled".
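In other words, something like this (a sketch) retires instructions at full tilt while doing nothing useful -- utilization and IPC both look healthy:

    #include <stdatomic.h>

    atomic_int lock_taken = 1;   /* assume another thread holds it and will clear it */

    void wait_for_lock(void) {
        /* load, compare, well-predicted branch: a few cheap instructions
           per iteration, so IPC looks great, the CPU shows 100% busy, and
           zero forward progress is made until the owner clears the flag */
        while (atomic_load_explicit(&lock_taken, memory_order_acquire))
            ;   /* real spinlocks add a pause/yield hint here */
    }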
CPU frequency scaling can also lead to somewhat unintuitive results. On a few occasions I've seen CPU load % increase significantly after code was optimized. The optimization was still valid -- the instructions executed per work item went down -- but the CPU load % went up because the OS decided to clock down the CPU due to the reduced workload.
Thinking about the CPU as mainly the ALU seems myopic. The job of the CPU is to get data into the right pipeline at the right time; waiting on a cache miss means it's busy doing its job. Thus, CPU busy is a reasonable metric the way it is currently defined and measured. (After all, the memory controller is part of the CPU these days.)
This article is not as silly as it could be.

Let me help.

Look, CPU utilization is misleading. Did you forget to use -O2 when compiling your code? Oops, CPU utilization now includes all sorts of wasteful instructions that don't make forward progress, including pointless moves of dead data into registers.

Are you using Python or Perl? CPU utilization is misleading; it's counting all that time spent on bookkeeping code in the interpreter, not actually performing your logic.

CPU utilization also measures all that waste while arguments are being prepared for a library function. Your program has already stalled, but the library function hasn't started executing yet, for the silly reason that the arguments aren't ready because the CPU is fumbling around with them.

Boy, what a useless measure.
The core waiting for data to be loaded from RAM *is* busy. Busy waiting for data.

Instructions per cycle can also be misleading: modern CPUs can do multiple shifts per cycle, but something like division takes a long time.

It all doesn't matter anyway, as instructions per cycle does *not* tell you anything specific. Use the CPU's built-in performance counters; use perf. It basically works by sampling every once in a while, and it (perf, or any other tool that uses performance counters) shows you exactly which instructions are taking up your process's time. (Hint: it's usually the ones that read data from memory, so be nice to your caches.)

It's not rocket surgery.
This is silly. The conceit that IPC is a simplification for "higher is better" is exactly the problem he has with utilization.

True, but useful? Most of us are busy trying to get writes across a networked service, and getting to 50% utilization is often already a dangerous place.

For reference: running your car by focusing on the engine's RPM is silly, but RPM is a very good proxy, and it would be even sillier to avoid it. Only if you are seriously instrumented is ignoring it a valid path -- and getting that instrumented is not cheap or free.
Using IPC as a proxy for utilization is tricky because an out-of-order machine can only reach its max IPC when the instructions it is executing don't depend on not-yet-computed results.

In-order CPUs are much easier to reason about; you can literally count the stalled cycles.
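A sketch of that difference: the same number of loads, but one version serializes on each not-yet-loaded pointer while the other exposes independent work the out-of-order core can overlap:

    #include <stddef.h>

    struct node { struct node *next; long val; };

    /* each load depends on the previous one: the OoO window can't help,
       and IPC sits well below 1 if the list misses cache */
    long sum_list(const struct node *n) {
        long s = 0;
        for (; n; n = n->next)
            s += n->val;
        return s;
    }

    /* independent loads: the core can keep many in flight, and IPC is
       several times higher for the same amount of "work" */
    long sum_array(const long *a, size_t n) {
        long s = 0;
        for (size_t i = 0; i < n; i++)
            s += a[i];
        return s;
    }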
Totally disagree with the premise of the article. Every metric tool I know of that shows CPU utilization correctly shows CPU work. Load, on the other hand, represents CPU plus iowait (overall system throughput). IO wait is also exposed in top as the "wa" metric.

An Amazon EC2 box can very easily get to a 5-minute load average of 10 (anything above 1 per core is commonly considered bad) while the CPU utilization metric still shows almost no CPU use.
Well, this is the reason I hate HyperThreading: does your app consume 50% or 100%? With hyperthreading you have no clue.

And that is per core; it becomes increasingly meaningless on a dual-core, and on a quad-core and above you might as well replace it with MS Clippy.

And this is before discussing what that percentage really means.

edit: I'm interpreting the downvotes as people being in denial about this ;)