It would be interesting to see how this scales with cores/threads. Current-gen CPUs have ~128 threads per socket, and you can find quad-socket hardware, so ~512 hardware threads in one machine. I've seen some systems with <512 concurrent processes running. In those use cases, would this patch have the effect of pinning each process to a single hardware thread (ignoring I/O and other sleep states)? And would the performance benefits of this patch scale super-linearly with thread count?