This is a great writeup. What a clever design!<p>I remember Apple had a totally different but equally clever solution back in the days of the 68K-to-PowerPC migration. The 68K had 16-bit instruction words, usually with some 16-bit arguments. The emulator’s core loop would read the next instruction and branch directly into a big block of 64K x 8 bytes of PPC code. So each 68K instruction got 2 dedicated PPC instructions, typically one to set up a register and one to branch to common code.<p>What that solution and Rosetta 2 have in common is that they’re super pragmatic - fast to start up, with fairly regular and predictable performance across most workloads, even if the theoretical peak speed is much lower than a cutting-edge JIT.<p>Anyone know how they implemented PPC-to-x86 translation?
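If it helps to picture the 68K dispatch scheme described above, here is a rough C sketch of the idea (my reconstruction, not Apple's code - the real emulator branched straight into a table of two-instruction PPC stubs at table_base + opcode * 8, rather than going through function pointers):

```c
#include <stdint.h>

/* Rough reconstruction of the dispatch idea, not Apple's code.
 * The real emulator computed table_base + opcode * 8 and branched there;
 * each 8-byte slot held two PPC instructions (set up a register, then
 * jump to common code). A function-pointer table stands in for that here. */

typedef struct {
    const uint16_t *pc;   /* 68K program counter (16-bit instruction words) */
    uint32_t d[8], a[8];  /* data and address registers */
} Cpu68k;

typedef void (*Handler)(Cpu68k *cpu, uint16_t opcode);

/* One slot per possible 16-bit opcode word, filled in at startup. */
static Handler dispatch[65536];

void run(Cpu68k *cpu) {
    for (;;) {
        uint16_t opcode = *cpu->pc++;   /* fetch the next instruction word */
        dispatch[opcode](cpu, opcode);  /* jump to that opcode's stub      */
    }
}
```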
I remember years ago, when Java-adjacent research was all the rage, HP had a problem that was “Rosetta lite”, if you will. They needed to run old binaries on new hardware that wasn’t exactly backward compatible, so they made a transpiler that worked on binaries. It might even have been a JIT, but that part of my memory is fuzzy.<p>What made it interesting was that, as a sanity check, they made an A->A mode that took in one architecture and spat out machine code for the same architecture. The output was faster than the input, meaning that even native code has some room for improvement with JIT technology.<p>I have been wishing for years that we were in a better place with regard to compilers and NP-complete problems, where compilers had a fast mode for code-build-test cycles and a very slow incremental mode for official builds. I recall someone telling me the only thing they liked about the Rational IDE (C and C++?) was that it cached precompiled headers, one of the Amdahl’s Law areas for compilers. If you changed a header, you paid the recompilation cost and everyone else got a copy. I love it whenever the person who cares about something gets to pay the consequence instead of externalizing it onto others.<p>And having some CI machines or CPUs that just sit around chewing on Hard Problems all day for that last 10% seems to me to be a really good use case in a world that’s seeing 16-core consumer hardware. Caching hints from previous runs is a good thing too.
Does anyone know the names of the key people behind Rosetta 2?<p>In my experience, exceptionally well executed tech like this tends to have 1-2 very talented people leading. I'd like to follow their blog or Twitter.
> To see ahead-of-time translated Rosetta code, I believe I had to disable SIP, compile a new x86 binary, give it a unique name, run it, and then run otool -tv /var/db/oah/*/*/unique-name.aot (or use your tool of choice – it’s just a Mach-O binary). This was done on an old version of macOS, so things may have changed and improved since then.<p>My aotool project uses a trick to extract the AOT binary without root or disabling SIP: <a href="https://github.com/lunixbochs/meta/tree/master/utils/aotool" rel="nofollow">https://github.com/lunixbochs/meta/tree/master/utils/aotool</a>
> Rosetta 2 translates the entire text segment of the binary from x86 to ARM up-front.<p>Do I understand correctly that Rosetta is basically a transpiler from x86-64 machine code to ARM machine code, run before the binary executes? If so, does it affect application startup times?
"I believe there’s significant room for performance improvement in Rosetta 2... However, this would come at the cost of significantly increased complexity...
Engineering is about making the right tradeoffs, and I’d say Rosetta 2 has done exactly that."
One thing that’s interesting to note is that the amount of effort expended here is not actually all that large. Yes, there are smart people working on this, but the performance of Rosetta 2 is, for the most part, probably the work of a handful of clever people. I wouldn’t be surprised if some of them have an interest in compilers, but the actual implementation is fairly straightforward, and there isn’t much of the stuff you’d typically see in an optimizing JIT: no complicated type theory or analysis passes. Aside from a handful of hardware bits and some convenient (perhaps intentionally selected) choices about where to make tradeoffs, there’s nothing specifically amazing here. What really makes it special is that anyone (well, any company with a bit of resources) could’ve done it, but nobody really did. (Then again, Apple owning the stack and having past experience probably helped them get over the hurdle of actually putting effort into this.)
Vertical integration. My understanding is that Apple silicon has special hardware support to make it fast. Apple has had enough experience to know that some hardware support can go a long way toward making the binary emulation situation better.
Apple is doing some really interesting but really quiet work in the area of VMs. I feel like we don’t give them enough credit, but maybe they’ve put themselves in that position by not bragging enough about what they do.<p>As a somewhat related aside, I have been watching Bun (a low-startup-time Node-like runtime on top of Safari’s JavaScript engine) with enough interest that I started trying to fix a bug, which is somewhat unusual for me; I mostly contribute small fixes to tools I use at work. I can’t quite grok Zig code yet, so I got stuck fairly quickly. The “bug” turned out to be default behavior in a Zig stdlib rather than in JavaScript code. The rest is fairly tangential, but suffice it to say I prefer self-hosted languages, though this one probably falls under the startup-speed compromise.<p>Being low-overhead at startup makes their VM interesting, but the fact that it often benchmarks better than Firefox, and occasionally faster than V8, shows quite a bit of quiet competence.
> The instructions from FEAT_FlagM2 are AXFLAG and XAFLAG, which convert floating-point condition flags to/from a mysterious “external format”. By some strange coincidence, this format is x86, so these instructions are used when dealing with floating point flags.<p>This really made me chuckle. They probably don't want to mention Intel by name, but this just sounds funny.<p><a href="https://developer.arm.com/documentation/100076/0100/A64-Instruction-Set-Reference/A64-General-Instructions/XAFlag" rel="nofollow">https://developer.arm.com/documentation/100076/0100/A64-Inst...</a>
I hope Rosetta is here to stay and continues to be developed. And I hope what is learned from it can be used to make a RISC-V version of it. Translating native ARM to RISC-V should be much easier than x86 to ARM, as I understand it, so one could conceivably do x86 -> ARM -> RISC-V.
Not having any particular domain experience here, I've idly wondered whether or not there's any role for neural net models in translating code for other architectures.<p>We have giant corpuses of source code, compiled x86_64 binaries, and compiled arm64 binaries. I assume the compiled binaries represent approximately our best compiler technology. It seems predicting an arm binary from an x86_64 binary would not be insane?<p>If someone who actually knows anything here wants to disabuse me of my showerthoughts, I'd appreciate being able to put the idea out of my head :-)
Rosetta 2 is beautiful - I would love it if they kept it as a feature for the long term rather than deprecating it and removing it in the next release of macOS (basically what they did during previous architectural transitions).<p>If Apple does drop it, maybe they could open source it so it could live on in Linux and BSD at least. ;-)<p>Adding a couple of features to ARM to drastically improve translated x86 code execution sounds like a decent idea - and one that could potentially enable better x86 app performance on ARM Windows as well. I don't know the silicon cost, but I'd hope it isn't dropped in the future.<p>Thinking a bit larger, I'd also like to see Apple add something like CHERI support to Apple Silicon and macOS to enable efficient memory error checking in hardware. I'd be surprised if they weren't working on something like this already.
Back in the early days of Windows NT everywhere, the Alpha version had a similar JIT emulation.<p><a href="https://en.m.wikipedia.org/wiki/FX!32" rel="nofollow">https://en.m.wikipedia.org/wiki/FX!32</a><p>Or for a more technical deep dive,<p><a href="https://www.usenix.org/publications/library/proceedings/usenix-nt97/full_papers/chernoff/chernoff.pdf" rel="nofollow">https://www.usenix.org/publications/library/proceedings/usen...</a>
(Apologies for the flame war quality to this comment, I’m genuinely just expressing an observation)<p>It’s ironic that Apple is often backhandedly complimented by hackers as having “good hardware” when their list of software accomplishments is amongst the most impressive in the industry and contrasts sharply with the best efforts of, say, Microsoft, purportedly a “software company.”
Apple's historically been pretty good at making this stuff. Their first 68k -> PPC emulator (Davidian's) was so good that, for some things, a PPC Mac was the fastest 68k Mac you could buy. The next-gen DR emulator (and SpeedDoubler, etc.) made things even faster.<p>I suspect the PPC->x86 translation was slower because x86 just doesn't have the registers. There's only so much you can do.
It is quite astonishing how seamless Apple has managed to make the Intel to ARM transition, there are some seriously smart minds behind Rosetta. I honestly don't think I had a single software issue during the transition!
The first time I ran into this technology was in the early 90s on the DEC Alpha. They had a tool called "MX" that would translate MIPS Ultrix binaries to Alpha on DEC Unix:<p><a href="https://www.linuxjournal.com/article/1044" rel="nofollow">https://www.linuxjournal.com/article/1044</a><p>Crazy stuff. Rosetta 2 is insanely good. Runs FPS video games even.
> The Apple M1 has an undocumented extension that, when enabled, ensures instructions like ADDS, SUBS and CMP compute PF and AF and store them as bits 26 and 27 of NZCV respectively, providing accurate emulation with no performance penalty.<p>If there is no performance penalty, why is it implemented as an optional extension?
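For anyone trying to picture the layout that quote describes, here is a tiny illustration (just my reading of the quoted text, not Rosetta's code): N, Z, C and V live in bits 31-28 of NZCV, so bits 27 and 26 are normally unused, and the extension apparently parks AF and PF there.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helpers matching the layout described in the quote:
 * with the extension enabled, flag-setting instructions are said to
 * deposit x86 PF into NZCV bit 26 and AF into bit 27, while N, Z, C, V
 * occupy bits 31-28 as usual. */
static inline bool x86_pf(uint64_t nzcv) { return (nzcv >> 26) & 1; }
static inline bool x86_af(uint64_t nzcv) { return (nzcv >> 27) & 1; }
```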
I wonder how much hand-tuning there is in Rosetta 2 for known, critical routines. One of the tricks Transmeta used to get reasonable performance out of their very slow Crusoe CPU was to recognize critical Windows functions and replace them with a library of hand-optimized native routines. Of course, that's a little different: Rosetta 2 targets an architecture that is, generally speaking, at least as fast as the x86 hardware it is emulating. That's been true of most cross-architecture translators historically, like DEC's VEST, which ran VAX code on Alpha, whereas Transmeta's CMS was targeting a CPU that was slower.
For history, this was a major milestone in x86 binary translation, Digital FX!32:<p><a href="https://www.usenix.org/legacy/publications/library/proceedings/usenix-nt97/full_papers/chernoff/chernoff.pdf" rel="nofollow">https://www.usenix.org/legacy/publications/library/proceedin...</a><p>Some apps ran faster than on the fastest available x86 at the time, and sometimes significantly faster, like the Byte benchmark cited above. Of course it helped that the Alpha was significantly faster than the leading x86 chips in the first place.
A lot of the simplicity of this approach relies on the x86 registers mapping directly onto ARM registers. That seems to be possible for most x86 registers, even the SIMD registers, but I think it falls over for AVX-512, which is supported on the (Intel) Mac Pro. ARM NEON has 32 128-bit registers; AVX-512 has 32 512-bit registers plus dedicated predicate registers. What do they do? Back to JIT mode?
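Back-of-the-envelope (just my arithmetic, not a claim about how Rosetta actually handles this): the AVX-512 register file alone is too big for a 1:1 mapping.

```c
/* Hypothetical register-pressure math, not Rosetta's design. */
enum {
    ZMM_COUNT        = 32,   /* AVX-512 vector registers      */
    ZMM_BITS         = 512,
    NEON_REG_BITS    = 128,
    NEON_REG_COUNT   = 32,   /* AArch64 NEON vector registers */
    NEON_REGS_NEEDED = ZMM_COUNT * (ZMM_BITS / NEON_REG_BITS)  /* = 128 */
};
```

128 NEON registers needed versus 32 available, so at least some of that state would have to live in memory rather than in registers - a very different shape of problem from the near-1:1 mapping that works for the general-purpose and SSE registers.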
I wonder whether such a direct translation from ARM to another architecture would even be possible, given that the instruction set can be switched at runtime (Thumb mode).
Does anybody know how often typical ARM32 programs execute this mode switching or if such sections can be recognized statically?
Rosetta 2 is great, except it apparently can't run statically-linked (non-PIC) binaries. I am unsure why this limitation exists, but it's pretty annoying because Virgil x86-64-binaries cannot run under Rosetta 2, which means I resort to running on the JVM on my M1...
I am interested in this domain but lack the knowledge to fully understand the post. Any recommendations for good books/courses/tutorials on low-level programming?
> Every one-byte x86 push becomes a four byte ARM instruction<p>Can someone explain this to me? I don’t know ARM but it just seems to me a push should not be that expensive.
TL;DR: roughly one-to-one instruction translation done ahead of time, rather than a complex JIT, betting on the M1's performance and instruction-cache handling.