Instructions per cycle: AMD Zen 2 versus Intel

108 points by another over 5 years ago

13 comments

NohatCoder over 5 years ago

In case anyone is not aware: this is a very small sample of microbenchmarks. When benchmarking very simple tasks like these, performance tends to vary wildly between architectures.

For instance, instructions are assigned to one of a handful of ports when executed. Certain instructions may only be assigned to certain ports, and which ports an instruction may be assigned to differs between architectures. If an inner loop uses only a few different instructions, one architecture may be unlucky in that most of the instructions need the same ports, so it can execute fewer instructions overall.

For real benchmarking, use lots of different complicated jobs. It is not perfect, but it is the best way we have of comparing different processors head to head.
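
To illustrate the port argument above, here is a toy sketch (not a model of any real Intel or AMD core). It assumes each port accepts one micro-op per cycle and computes a lower bound on cycles per loop iteration from port pressure alone. The port names, the port assignments, and the function name are all hypothetical.

```python
from itertools import combinations


def min_cycles_per_iteration(uops, ports):
    """Lower bound on cycles per loop iteration from port pressure alone.

    uops:  list of sets; each set holds the ports one micro-op may issue to.
    ports: all port names; each port is assumed to take one micro-op per cycle.
    """
    ports = list(ports)
    bound = 0.0
    for k in range(1, len(ports) + 1):
        for subset in combinations(ports, k):
            allowed_here = set(subset)
            # Micro-ops that can only run on ports inside this subset
            # must share those k ports, so they need at least confined/k cycles.
            confined = sum(1 for allowed in uops if allowed <= allowed_here)
            bound = max(bound, confined / k)
    return bound


PORTS = ["p0", "p1", "p2", "p3"]  # hypothetical four-port core

# Same four-micro-op inner loop on two made-up architectures.
loop_on_a = [{"p0", "p1"}] * 4              # every micro-op restricted to p0/p1
loop_on_b = [{"p0", "p1", "p2", "p3"}] * 4  # every micro-op may use any port

cycles_a = min_cycles_per_iteration(loop_on_a, PORTS)
cycles_b = min_cycles_per_iteration(loop_on_b, PORTS)
```

In this made-up case the same loop needs at least two cycles per iteration on architecture A and one on B, so the best-case IPC differs by a factor of two even though both cores have the same number of ports.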

yifanlu over 5 years ago

Assuming both Intel and AMD implement performance monitors the same way (i.e. the same notion of instructions executed, which may be hard to measure with speculative execution), the comparison is still flawed, because it doesn’t matter if Intel can do more instructions per cycle if AMD can produce more cycles in a span of wall time.

> However, it is not clear whether these reports are genuinely based on measures of instruction per cycle. Rather it appears that they are measures of the amount of work done per unit of time normalized by processor frequency.

That’s precisely why nobody really uses IPC as a way to compare processors. “How much work done per unit of time” is a much better measurement, and I guess that, for historical reasons, people conflate it with IPC.

But real textbook IPC is useless for comparison.
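
The point above comes down to one product: instructions per second is IPC times clock frequency. A minimal sketch with made-up numbers (neither CPU below corresponds to a real part, and the function name is hypothetical):

```python
def instructions_per_second(ipc, frequency_hz):
    """Throughput as the product of instructions per cycle and cycles per second."""
    return ipc * frequency_hz


# Hypothetical CPUs: B has the higher IPC, A has the higher clock.
cpu_a = instructions_per_second(ipc=2.0, frequency_hz=4.0e9)
cpu_b = instructions_per_second(ipc=2.2, frequency_hz=3.5e9)

# A retires more instructions per second despite its lower IPC.
a_is_faster = cpu_a > cpu_b
```

With these assumed figures, the chip with the lower IPC still gets more done per unit of wall time.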

jonstewart over 5 years ago

It’s depressing how many comments here are quick to dismiss the benchmarking/article. Yes, yes, memory bandwidth, I/O, and cache hierarchies are all important, but Daniel Lemire is one of the top people in the world when it comes to optimizing algorithms for modern CPUs. Do you like search engines? Lemire has made them significantly faster. He is often able to take code/algorithms that already seem fast, and make them much faster. He’s recently branched out beyond search engine core algorithms into some aspects of string processing (base64, UTF-8 validation, JSON parsing).

In this blog post, he’s paying attention to IPC because he’s typically working with inner loops where the data’s being delivered from RAM to L1 as efficiently as possible.

reitzensteinm over 5 years ago

The second example is just a benchmark of tzcnt, added in BMI1. It's a very specific and very bizarre benchmark to do when you could just look up the reciprocal throughput (unfortunately, Zen 2 has not yet been added).

https://www.agner.org/optimize/instruction_tables.pdf

Edit: This is wrong, as BeeOnRope points out below.

The first is SIMD heavy, so Zen 2 mostly closing the gap with Intel in one of the areas where Zen 1 was very weak is a good thing.

eyegor over 5 years ago

I think the only real way to compare IPC is to actually talk to the architects. Trying to write microbenchmarks is a fool's errand when you aren't aware of how the CPU processes the instructions you give it. Are you actually stressing the FPU, or is the CPU speculatively executing and then branch predicting the workload (common for micro loops)? If it is, is that what you meant to test? Are you trying to compare like for like (in which case you have to write assembly), or are you trying to write performance benchmarks (and then the only meaningful metric is CPU time)?

This is an interesting idea, but I'm not sure how you could derive meaning from comparing two vastly different architectures at such a high level.

alecmg over 5 years ago

Useless, strictly academic interest.

There is more to processor design than execution ports. Not every task can be SIMD-optimized to the extent of approaching theoretical IPC limits; most will be bottlenecked by memory access or even I/O.

I prefer the "fake" but real-world IPC. Same clocks, same real-world task, measure time to finish.

zippie over 5 years ago

IPC microbenchmarks do not properly reflect the complex workloads running on post-Zen 2 microarchitectures. Zen 2 upends microarchitecture schematics enough to warrant a different metric.

IPC microbenchmarks, in my experience, tend to benchmark best-case scenarios, and that is probably the exception rather than the rule for application workloads on modern microarchitectures. Case in point: microbenchmarks showed significant improvements in IPC for Zen 2 over Skylake, yet for the application workload (CPU data bound), Skylake held up neck and neck.

The more appropriate benchmarking metric for post-Zen 2 processors is CPI [0].

[0] https://john.e-wilkes.com/papers/2013-EuroSys-CPI2.pdf

chucklenorris over 5 years ago

Heh, I'm curious whether he used the mitigations for all the side-channel flaws on the Intel processors.

_ph_ over 5 years ago

While IPC is only part of the performance equation, analyzing it can be quite interesting for understanding the design of the processor and how performance might be achieved.

One thing bothers me about the presented comparison: it runs very few benchmarks, all generated with the same compiler. For a thorough IPC analysis, shouldn't the tests rather be programmed in assembly to exclude any influence of the compiler choice? Also, a wider range of algorithms should probably be checked, as IPC on modern processors depends less on how many cycles a certain instruction takes (you should be able to find that in the manuals) than on how well multiple components of the processor can be utilized at the same time, which depends heavily on the actual program being run.

tempguy9999 over 5 years ago

I'm rather surprised at the claim that "but it might easily execute 7 billion instructions per second on a single core". I'd even question it, except the author's an expert.

If you can keep it fed then OK, but one cache miss to main memory, either instruction or data, will allow the instruction buffers to completely empty and stay empty for quite a long time. I don't think you can control placement well enough to reasonably assure cache hits always for anything but the most trivial code. Am I missing something?

Also, if you could keep up a consistent throughput like this, I wonder if thermal throttling might have to kick in. I mean, you're doing a lot of work...
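
A rough back-of-the-envelope sketch of the cache-miss concern above. It pessimistically assumes the core stalls completely on every miss to main memory, with no overlap between misses and other work. The peak IPC, miss rate, and miss penalty are made-up illustrative values, not measurements of any processor.

```python
def effective_ipc(instructions, peak_ipc, misses, miss_penalty_cycles):
    """Average IPC when every main-memory miss adds a full stall on top of
    the cycles needed at peak IPC (no overlap of misses assumed)."""
    busy_cycles = instructions / peak_ipc
    stall_cycles = misses * miss_penalty_cycles
    return instructions / (busy_cycles + stall_cycles)


# Hypothetical: peak IPC of 2, one miss per 1000 instructions, 300-cycle penalty.
with_misses = effective_ipc(1000, peak_ipc=2.0, misses=1, miss_penalty_cycles=300)
no_misses = effective_ipc(1000, peak_ipc=2.0, misses=0, miss_penalty_cycles=300)
```

Under these assumptions a single miss per thousand instructions already drops the average from 2 to 1.25 instructions per cycle, which is why a figure like 7 billion instructions per second is only plausible for loops whose data stays in cache.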

Const-me over 5 years ago

I wonder how reliable these Linux syscalls are.

Found this http://manpages.ubuntu.com/manpages/trusty/man2/perf_event_open.2.html and that article doesn't instill much confidence in the reliability of these counters. The comment for CPU_CYCLES says "Be wary of what happens during CPU frequency scaling", the comment for INSTRUCTIONS says "these can be affected by various issues, most notably hardware interrupt counts", BRANCH_INSTRUCTIONS says "Prior to Linux 2.6.34, this used the wrong event on AMD processors", and so on.

If I wanted to measure what OP was measuring, I would disable frequency scaling (probably doable on overclocker-targeted motherboards; a search also finds some utilities which claim to do that, for both Windows and Linux), measure time, then divide by frequency.
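
One way to read the timing-based approach above, as a sketch: with the clock pinned to a known frequency, elapsed cycles are roughly wall time multiplied by frequency, and IPC follows if the instruction count of the loop is known by construction. The function name and all numbers here are hypothetical.

```python
def ipc_from_wall_time(instructions, elapsed_seconds, frequency_hz):
    """Estimate IPC without hardware counters, assuming the core ran at a
    fixed, known frequency for the whole measurement."""
    cycles = elapsed_seconds * frequency_hz
    return instructions / cycles


# Hypothetical run: 7e9 instructions retired in 1.0 s at a pinned 3.5 GHz.
estimated_ipc = ipc_from_wall_time(7e9, elapsed_seconds=1.0, frequency_hz=3.5e9)
```

The estimate is only as good as the assumption that the frequency really stayed fixed, and it still needs an instruction count from somewhere, such as a hand-written assembly loop of known length.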

nabla9 over 5 years ago

In more comprehensive single-thread benchmarks (single-thread POV-Ray), Intel can still sometimes beat the Zen 2 architecture. This test seems to indicate the reason why.

qxnqd over 5 years ago

ITT: AMD apologists.

Sorry guys, but Intel is still king of single-core performance. But that's not a problem, because I'm sure by 2050 most desktop applications and games will correctly make use of many cores, and then AMD will reign.
