> But I was able to work around this by using a trick: creating two variants of the function, one marked with #[inline(always)] (for the hot call sites) and one marked with #[inline(never)] (for the cold call sites).<p>Can't PGO make inlining decisions like this? If not, Propeller or LTO might work well.<p>> But there’s a trade-off. Sometimes a simpler, smaller function is slower.<p>Without a doubt! Imagine a naive, simple, portable memcpy versus a target-aware one that capitalizes on wider or aligned loads and stores.
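For anyone who hasn't seen the two-variant trick from the first quote, here is a minimal sketch of what it might look like. The names (`checksum_impl`, `checksum_hot`, `checksum_cold`) and the toy checksum body are hypothetical, not taken from the article; only the `#[inline(always)]` and `#[inline(never)]` attributes come from the quoted text.

```rust
// Shared body (hypothetical example function). It is marked
// #[inline(always)] so that each wrapper below gets its own copy.
#[inline(always)]
fn checksum_impl(data: &[u8]) -> u32 {
    data.iter()
        .fold(0u32, |acc, &b| acc.wrapping_mul(31).wrapping_add(b as u32))
}

// Variant for hot call sites: asks the compiler to inline it into the caller.
#[inline(always)]
pub fn checksum_hot(data: &[u8]) -> u32 {
    checksum_impl(data)
}

// Variant for cold call sites: stays an out-of-line call, so the cold
// paths do not grow the code around them.
#[inline(never)]
pub fn checksum_cold(data: &[u8]) -> u32 {
    checksum_impl(data)
}
```

Both variants compute the same result; the only difference is the inlining hint. Hot loops would call `checksum_hot`, while error paths and other rarely executed code would call `checksum_cold`. Note that these attributes are hints to the compiler, so it is worth checking the generated code if the difference matters.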