I gave a presentation last year on this for QConLondon (before the lockdown) and afterwards for the LJC virtually, if people prefer to listen/watch videos.<p><a href="https://speakerdeck.com/alblue/understanding-cpu-microarchitecture-for-performance-ljc" rel="nofollow">https://speakerdeck.com/alblue/understanding-cpu-microarchit...</a>
Previous discussions:<p>2018, 87 comments: <a href="https://news.ycombinator.com/item?id=18230383" rel="nofollow">https://news.ycombinator.com/item?id=18230383</a><p>2016, 12 comments: <a href="https://news.ycombinator.com/item?id=11116211" rel="nofollow">https://news.ycombinator.com/item?id=11116211</a><p>2014, 37 comments: <a href="https://news.ycombinator.com/item?id=7174513" rel="nofollow">https://news.ycombinator.com/item?id=7174513</a><p>2011, 30 comments: <a href="https://news.ycombinator.com/item?id=2428403" rel="nofollow">https://news.ycombinator.com/item?id=2428403</a><p>This link has also appeared in 9 comments on HN, featuring threads on "Computer Architecture for Network Engineers", "X86 versus other architectures" by Linus Torvalds, and "I don't know how CPUs work so I simulated one in code", also recommending a udacity course on how modern processors work (<a href="https://www.udacity.com/course/high-performance-computer-architecture--ud007" rel="nofollow">https://www.udacity.com/course/high-performance-computer-arc...</a>): <a href="https://ampie.app/url-context?url=lighterra.com/papers/modernmicroprocessors" rel="nofollow">https://ampie.app/url-context?url=lighterra.com/papers/moder...</a><p>Jason also has a couple of other interesting articles on his website, like intro to instruction scheduling and software pipelining (<a href="http://www.lighterra.com/papers/basicinstructionscheduling/" rel="nofollow">http://www.lighterra.com/papers/basicinstructionscheduling/</a>) and the one I liked a lot and agree with called "exception handling considered harmful" (<a href="http://www.lighterra.com/papers/exceptionsharmful/" rel="nofollow">http://www.lighterra.com/papers/exceptionsharmful/</a>).
I would love a Bartosz Ciechanowski interactive article on microprocessors. It may be outside his domain, though, since the visualisations and demos would be less 3D model design and more, perhaps, mini simulations of data channels or state machines that you can play through. Registers that can have initial values set, and then you can step through each clock cycle. Add a new component every few paragraphs and see how it all builds up. I did all this at university, but would love a refresher that is as well made as his other blog posts.
>"One of the most interesting members of the RISC-style x86 group was the Transmeta Crusoe processor, which translated x86 instructions into an internal VLIW form, rather than internal superscalar, and used software to do the translation at runtime, much like a Java virtual machine. This approach allowed the processor itself to be a simple VLIW, without the complex x86 decoding and register-renaming hardware of decoupled x86 designs, and without any superscalar dispatch or OOO logic either."<p>PDS: Why do I bet that the Transmeta Crusoe didn't suffer from Spectre -- or any of the other x86 cache-based or microcode-based security vulnerabilities that are so prevalent today?<p>Observation: Intentional hardware backdoors would have been difficult to place in Transmeta VLIW processors -- at least in the software-based x86 translation portions. Now, are there intentional hardware backdoors in its lower-level VLIW instructions?<p>I don't know and can't speculate on that...<p>Nor do I know whether the Transmeta Crusoes contained secret, deeply embedded "security" cores/processors...<p>But secret embedded "security" cores and backdoored VLIW instructions aside -- it would sure be hard as heck for the usual "powers-that-be" to create secret/undocumented x86 instructions with side effects or covert communication to lower/secret levels, and run that code through the Crusoe's x86 software interpreter/translator -- especially if the code for that interpreter/translator is open source and thoroughly reviewed...<p>In other words, from a pro-security perspective, <i>there's a lot to be said for architecturally simpler CPUs</i> -- regardless of how slow they might be compared to some of today's super-complex (and, ahem, <i>less secure</i>...) CPUs...
Isn't the material a little bit old? I remember reading about all this stuff at university circa 1996.<p>Edit: originally said "outdated".
I was curious about the following comment on SMT in the post:<p>>"From a hardware point of view, implementing SMT requires duplicating all of the parts of the processor which store the "execution state" of each thread – things like the program counter, the architecturally-visible registers (but not the rename registers), the memory mappings held in the TLB, and so on. Luckily, these parts only constitute a tiny fraction of the overall processor's hardware."<p>Is each "SMT core" then just one additional PC, register set, and set of TLB mappings? I'm not sure whether "SMT core" is the correct term, or just "SMT", but with Hyper-Threading there is generally one extra hyper-thread per core, effectively doubling the logical core count. It seems like it's been that way for a very long time. Is there not much benefit beyond a single hyper-thread per core, or is more just prohibitively expensive?
I had a question about the following passage:<p>>"The key question is how the processor should make the guess. Two alternatives spring to mind. First, the compiler might be able to mark the branch to tell the processor which way to go. This is called static branch prediction. It would be ideal if there was a bit in the instruction format in which to encode the prediction, but for older architectures this is not an option, so a convention can be used instead, such as backward branches are predicted to be taken while forward branches are predicted not-taken."<p>Could someone give the definition of a "backward" vs. a "forward" branch? Is backward a branch that continues a loop, and forward a jump out of or past a loop?<p>Also, are there any examples of CPU architectures with static branch prediction?
I would love an update that covers recent developments in SoC integration, for example the onboarding of RAM and neural processing in the M1 chip.
These kinds of optimizations always make me wonder whether they are worth it. Might it be more efficient to spend these transistors on more, simpler cores instead? Perhaps the fact that most problems are so sequential makes timing/clock-rate optimizations inevitable.