> Both systems ran Z3 4.8.1, compiled by me *using Clang* with the same optimization settings.

Hypothesis: LLVM's AArch64 backend has had more work put into it (by Apple, at least) than LLVM's x86_64 backend has, specifically for "finishing one-off tasks quickly" (as opposed to "achieving high throughput on long-running tasks").

To me, this would make sense: until recently, AArch64 devices were almost always mobile, and so needed to be optimized for performing compute-intensive workloads *on battery*, and so have had more thought put into the *efficiency* of their single-threaded burst performance (the whole "race to sleep" thing). I'd expect, for example, AArch64-optimized code-gen to favor low-code-size serial loops over large-code-size SIMD vector ops, à la GCC -Os, in order to both (1) keep the vector units powered down, and (2) keep cache-line contention lower and thereby keep now-unneeded DMA channels powered down; both of these keep the chip further from its TDP ceiling, and thus let the parts of the core that *are* powered on hold their burst clocks longer. In such a setup, the simple serial loop can end up outperforming the SIMD ops (presuming the loop's trip count varies between short and long, and the loop itself runs frequently). There's a minimal sketch of this trade-off below.

x86_64 devices, meanwhile, are generally only expected to perform compute-intensive tasks *while connected to power*, so the optimizations contributed to compilers like LLVM that specifically affect x86_64 likely come more from the HPC and OLTP crowds, who favor squeezing out continuous aggregate *throughput* at the expense of per-task time-to-completion (i.e. holding onto Turbo Boost-like features at maximum duty cycle to increase *mean* throughput, even as overheat conditions and license-switching overhead lower *modal* task performance).
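To make the scalar-vs-SIMD trade-off concrete, here's a minimal sketch: a plain reduction loop that an optimizer can either keep as a compact scalar loop (smaller code, vector units stay idle) or auto-vectorize into wider SIMD ops (larger code, higher peak throughput). The function name, file name, and flag behavior are illustrative assumptions; whether a given Clang version actually vectorizes this under a particular flag depends on the target, the cost model, and the version.

```c
/* sum.c — illustrative only; the exact code-gen is target- and
 * version-dependent, but as a rough tendency:
 *
 *   clang -Os -S sum.c   # favors a small scalar loop (code size)
 *   clang -O3 -S sum.c   # tends to emit NEON/SSE/AVX vector code
 */
#include <stddef.h>
#include <stdint.h>

/* Simple reduction: sums n 32-bit ints into a 64-bit accumulator. */
int64_t sum(const int32_t *xs, size_t n) {
    int64_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc += xs[i];
    }
    return acc;
}
```

Inspecting the emitted assembly for the loop body at each optimization level is enough to see which strategy the backend picked for a given target.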