"This results in a loss of a single cycle at the time of instruction fetch." Maybe on that paper CPU but branch mis-predicts on Skylake are 16.5 cycles if there's a μop cache hit and 19-20 cycles if there isn't.<p><a href="https://www.7-cpu.com/cpu/Skylake.html" rel="nofollow">https://www.7-cpu.com/cpu/Skylake.html</a><p>That said, I didn't know about using BPM to access the PMC performance registers.