This is a great writeup. What a clever design!<p>I remember Apple had a totally different but equally clever solution back in the days of the 68K-to-PowerPC migration. The 68K had 16-bit instruction words, usually with some 16-bit arguments. The emulator’s core loop would read the next instruction and branch directly into a big block of 64K x 8 bytes of PPC code. So each 68K instruction got 2 dedicated PPC instructions, typically one to set up a register and one to branch to common code.<p>What that solution and Rosetta 2 have in common is that they’re super pragmatic - fast to start up, with fairly regular and predictable performance across most workloads, even if the theoretical peak speed is much lower than a cutting-edge JIT.<p>Anyone know how they implemented PPC-to-x86 translation?
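If it helps to picture the 68K dispatch scheme described above, here is a rough C sketch of the idea (my reconstruction, not Apple's code - the real emulator branched straight into a table of two-instruction PPC stubs at table_base + opcode * 8, rather than going through function pointers):

```c
#include <stdint.h>

/* Rough reconstruction of the dispatch idea, not Apple's code.
 * The real emulator computed table_base + opcode * 8 and branched there;
 * each 8-byte slot held two PPC instructions (set up a register, then
 * jump to common code). A function-pointer table stands in for that here. */

typedef struct {
    const uint16_t *pc;   /* 68K program counter (16-bit instruction words) */
    uint32_t d[8], a[8];  /* data and address registers */
} Cpu68k;

typedef void (*Handler)(Cpu68k *cpu, uint16_t opcode);

/* One slot per possible 16-bit opcode word, filled in at startup. */
static Handler dispatch[65536];

void run(Cpu68k *cpu) {
    for (;;) {
        uint16_t opcode = *cpu->pc++;   /* fetch the next instruction word */
        dispatch[opcode](cpu, opcode);  /* jump to that opcode's stub      */
    }
}
```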
I remember years ago, when Java-adjacent research was all the rage, HP had a problem that was “Rosetta lite”, if you will. They needed to run old binaries on new hardware that wasn’t exactly backward compatible, so they made a transpiler that worked on binaries. It might even have been a JIT, but that part of my memory is fuzzy.<p>What made it interesting was that, as a sanity check, they made an A->A mode that took in one architecture and spat out machine code for the same architecture. The output was faster than the input, meaning that even native code has some room for improvement with JIT technology.<p>I have been wishing for years that we were in a better place with regard to compilers and NP-complete problems, where compilers had a fast mode for code-build-test cycles and a very slow incremental mode for official builds. I recall someone telling me the only thing they liked about the Rational IDE (C and C++?) was that it cached precompiled headers, one of the Amdahl’s Law areas for compilers. If you changed a header, you paid the recompilation cost and everyone else got a copy. I love it whenever the person who cares about something gets to pay the consequence instead of externalizing it onto others.<p>And having some CI machines or CPUs that just sit around chewing on Hard Problems all day for that last 10% seems to me to be a really good use case in a world that’s seeing 16-core consumer hardware. Caching hints from previous runs is a good thing too.
Does anyone know the names of the key people behind Rosetta 2?<p>In my experience, exceptionally well executed tech like this tends to have 1-2 very talented people leading. I'd like to follow their blog or Twitter.
> To see ahead-of-time translated Rosetta code, I believe I had to disable SIP, compile a new x86 binary, give it a unique name, run it, and then run otool -tv /var/db/oah/*/*/unique-name.aot (or use your tool of choice – it’s just a Mach-O binary). This was done on an old version of macOS, so things may have changed and improved since then.<p>My aotool project uses a trick to extract the AOT binary without root or disabling SIP: <a href="https://github.com/lunixbochs/meta/tree/master/utils/aotool" rel="nofollow">https://github.com/lunixbochs/meta/tree/master/utils/aotool</a>
> Rosetta 2 translates the entire text segment of the binary from x86 to ARM up-front.<p>Do I understand correctly that Rosetta is basically a transpiler from x86-64 machine code to ARM machine code, run before the binary executes? If so, does it affect application startup times?
"I believe there’s significant room for performance improvement in Rosetta 2... However, this would come at the cost of significantly increased complexity...
Engineering is about making the right tradeoffs, and I’d say Rosetta 2 has done exactly that."
One thing that’s interesting to note is that the amount of effort expended here is not actually all that large. Yes, there are smart people working on this, but the performance of Rosetta 2 is, for the most part, probably the work of a handful of clever people. I wouldn’t be surprised if some of them have an interest in compilers, but the actual implementation is fairly straightforward, and there isn’t much of the stuff you’d typically see in an optimizing JIT: no complicated type theory or analysis passes. Aside from a handful of hardware bits and some convenient (perhaps intentionally selected) choices about where to make tradeoffs, there’s nothing specifically amazing here. What really makes it special is that anyone (well, any company with a bit of resources) could’ve done it, but nobody really did. (Then again, Apple owning the stack and having past experience probably helped them get over the hurdle of actually putting effort into this.)
Vertical integration. My understanding is that Apple silicon has special hardware support to make it fast. Apple has had enough experience to know that some hardware support can go a long way toward making the binary emulation situation better.
Apple is doing some really interesting but really quiet work in the area of VMs. I feel like we don’t give them enough credit, but maybe they’ve put themselves in that position by not bragging enough about what they do.<p>As a somewhat related aside, I have been watching Bun (a low-startup-time Node-like runtime on top of Safari’s JavaScript engine) with enough interest that I started trying to fix a bug, which is somewhat unusual for me; I mostly contribute small fixes to tools I use at work. I can’t quite grok Zig code yet, so I got stuck fairly quickly. The “bug” turned out to be default behavior in a Zig stdlib rather than in JavaScript code. The rest is fairly tangential, but suffice it to say I prefer self-hosted languages, though this one probably falls under the startup-speed compromise.<p>Being low-overhead at startup makes their VM interesting, but the fact that it often benchmarks better than Firefox, and occasionally faster than V8, shows quite a bit of quiet competence.
> The instructions from FEAT_FlagM2 are AXFLAG and XAFLAG, which convert floating-point condition flags to/from a mysterious “external format”. By some strange coincidence, this format is x86, so these instructions are used when dealing with floating point flags.<p>This really made me chuckle. They probably don't want to mention Intel by name, but this just sounds funny.<p><a href="https://developer.arm.com/documentation/100076/0100/A64-Instruction-Set-Reference/A64-General-Instructions/XAFlag" rel="nofollow">https://developer.arm.com/documentation/100076/0100/A64-Inst...</a>
I hope Rosetta is here to stay and continues to be developed. And I hope what is learned from it can be used to make a RISC-V version of it. Translating native ARM to RISC-V should be much easier than x86 to ARM, as I understand it, so one could conceivably do x86 -> ARM -> RISC-V.
Not having any particular domain experience here, I've idly wondered whether or not there's any role for neural net models in translating code for other architectures.<p>We have giant corpuses of source code, compiled x86_64 binaries, and compiled arm64 binaries. I assume the compiled binaries represent approximately our best compiler technology. It seems predicting an arm binary from an x86_64 binary would not be insane?<p>If someone who actually knows anything here wants to disabuse me of my showerthoughts, I'd appreciate being able to put the idea out of my head :-)
Rosetta 2 is beautiful - I would love it if they kept it as a feature for the long term rather than deprecating it and removing it in the next release of macOS (basically what they did during previous architectural transitions).<p>If Apple does drop it, maybe they could open source it so it could live on in Linux and BSD at least. ;-)<p>Adding a couple of features to ARM to drastically improve translated x86 code execution sounds like a decent idea - and one that could potentially enable better x86 app performance on ARM Windows as well. I don't know the silicon cost, but I'd hope it isn't dropped in the future.<p>Thinking a bit larger, I'd also like to see Apple add something like CHERI support to Apple Silicon and macOS to enable efficient memory error checking in hardware. I'd be surprised if they weren't working on something like this already.
Back in the early days of Windows NT everywhere, the Alpha version had a similar JIT emulation.<p><a href="https://en.m.wikipedia.org/wiki/FX!32" rel="nofollow">https://en.m.wikipedia.org/wiki/FX!32</a><p>Or for a more technical deep dive,<p><a href="https://www.usenix.org/publications/library/proceedings/usenix-nt97/full_papers/chernoff/chernoff.pdf" rel="nofollow">https://www.usenix.org/publications/library/proceedings/usen...</a>
(Apologies for the flame war quality to this comment, I’m genuinely just expressing an observation)<p>It’s ironic that Apple is often backhandedly complimented by hackers as having “good hardware” when their list of software accomplishments is amongst the most impressive in the industry and contrasts sharply with the best efforts of, say, Microsoft, purportedly a “software company.”
Apple's historically been pretty good at making this stuff. Their first 68k -> PPC emulator (Davidian's) was so good that, for some things, a PPC Mac was the fastest 68k Mac you could buy. The next-gen DR emulator (and SpeedDoubler, etc.) made things even faster.<p>I suspect the PPC->x86 translation was slower because x86 just doesn't have the registers. There's only so much you can do.
It is quite astonishing how seamless Apple has managed to make the Intel to ARM transition, there are some seriously smart minds behind Rosetta. I honestly don't think I had a single software issue during the transition!
The first time I ran into this technology was in the early 90s on the DEC Alpha. They had a tool called "MX" that would translate MIPS Ultrix binaries to Alpha on DEC Unix:<p><a href="https://www.linuxjournal.com/article/1044" rel="nofollow">https://www.linuxjournal.com/article/1044</a><p>Crazy stuff. Rosetta 2 is insanely good. Runs FPS video games even.
> The Apple M1 has an undocumented extension that, when enabled, ensures instructions like ADDS, SUBS and CMP compute PF and AF and store them as bits 26 and 27 of NZCV respectively, providing accurate emulation with no performance penalty.<p>If there is no performance penalty, why is it implemented as an optional extension?
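For anyone trying to picture the layout that quote describes, here is a tiny illustration (just my reading of the quoted text, not Rosetta's code): N, Z, C and V live in bits 31-28 of NZCV, so bits 27 and 26 are normally unused, and the extension apparently parks AF and PF there.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helpers matching the layout described in the quote:
 * with the extension enabled, flag-setting instructions are said to
 * deposit x86 PF into NZCV bit 26 and AF into bit 27, while N, Z, C, V
 * occupy bits 31-28 as usual. */
static inline bool x86_pf(uint64_t nzcv) { return (nzcv >> 26) & 1; }
static inline bool x86_af(uint64_t nzcv) { return (nzcv >> 27) & 1; }
```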
I wonder how much hand-tuning there is in Rosetta 2 for known, critical routines. One of the tricks Transmeta used to get reasonable performance out of their very slow Crusoe CPU was to recognize critical Windows functions and replace them with a library of hand-optimized native routines. Of course, that's a little different: Rosetta 2 targets an architecture that is, generally speaking, at least as fast as the x86 hardware it is emulating. That's been true of most cross-architecture translators historically, like DEC's VEST, which ran VAX code on Alpha, whereas Transmeta's CMS was targeting a CPU that was slower.
For history, this was a major milestone in x86 binary translation, Digital FX!32:<p><a href="https://www.usenix.org/legacy/publications/library/proceedings/usenix-nt97/full_papers/chernoff/chernoff.pdf" rel="nofollow">https://www.usenix.org/legacy/publications/library/proceedin...</a><p>Some apps ran faster than on the fastest available x86 at the time, and sometimes significantly faster, like the Byte benchmark cited above. Of course it helped that the Alpha was significantly faster than the leading x86 chips in the first place.
A lot of the simplicity of this approach relies on the x86 registers mapping directly onto ARM registers. That seems to be possible for most x86 registers, even the SIMD registers, but I think it falls over for AVX-512, which is supported on the (Intel) Mac Pro. ARM NEON has 32 128-bit registers; AVX-512 has 32 512-bit registers plus dedicated predicate registers. What do they do? Back to JIT mode?
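Back-of-the-envelope (just my arithmetic, not a claim about how Rosetta actually handles this): the AVX-512 register file alone is too big for a 1:1 mapping.

```c
/* Hypothetical register-pressure math, not Rosetta's design. */
enum {
    ZMM_COUNT        = 32,   /* AVX-512 vector registers      */
    ZMM_BITS         = 512,
    NEON_REG_BITS    = 128,
    NEON_REG_COUNT   = 32,   /* AArch64 NEON vector registers */
    NEON_REGS_NEEDED = ZMM_COUNT * (ZMM_BITS / NEON_REG_BITS)  /* = 128 */
};
```

128 NEON registers needed versus 32 available, so at least some of that state would have to live in memory rather than in registers - a very different shape of problem from the near-1:1 mapping that works for the general-purpose and SSE registers.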
I wonder whether such a direct translation from ARM to another architecture would even be possible, given that the instruction set can be switched at runtime (Thumb mode).
Does anybody know how often typical ARM32 programs execute this mode switching or if such sections can be recognized statically?
Rosetta 2 is great, except it apparently can't run statically-linked (non-PIC) binaries. I am unsure why this limitation exists, but it's pretty annoying because Virgil x86-64-binaries cannot run under Rosetta 2, which means I resort to running on the JVM on my M1...
I am interested in this domain but lack the knowledge to fully understand the post. Any recommendations for good books/courses/tutorials on low-level programming?
> Every one-byte x86 push becomes a four byte ARM instruction<p>Can someone explain this to me? I don’t know ARM but it just seems to me a push should not be that expensive.
TL;DR: roughly one-to-one instruction translation done ahead of time, rather than a complex JIT, betting on the M1's performance and instruction-cache handling.