This seems like a pretty clever approach to tuning the performance of memcpy and memset, since proper instrumentation should let you make the right call about whether to bypass the cache on a given operation most of the time. In the cases where you get it wrong, you can probably catch them with a second instrumentation pass after applying the optimizations (to check whether any of them actually caused an overall performance hit). There's a rough sketch of what the cache-bypassing path might look like at the end of this comment.

It's nice to see that they tested this on a bunch of use cases, as well... and on that note, the adsense-serving benchmark spends 37.5% of its CPU time in memcpy! That's insane! I wonder if it's basically just a web server benchmark that calls memcpy a lot to serve up static assets over HTTP?
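
Back to the cache-bypass part: for anyone who hasn't run into it, "bypassing the cache" here usually means using non-temporal stores. Below is a minimal sketch of that idea in C with SSE2 intrinsics; it is not the authors' implementation, and the bypass_cache flag is just a stand-in for whatever decision the instrumentation pass would bake in per call site.

    /* Sketch only: a copy routine using non-temporal (cache-bypassing)
     * stores, plus a wrapper showing the profile-guided choice.
     * Assumes x86-64 with SSE2. */
    #include <emmintrin.h>  /* _mm_loadu_si128, _mm_stream_si128, _mm_sfence */
    #include <stdint.h>
    #include <string.h>

    static void copy_nontemporal(void *dst, const void *src, size_t n)
    {
        uint8_t *d = dst;
        const uint8_t *s = src;

        /* Streaming stores want a 16-byte-aligned destination, so copy
         * the unaligned head (and later the tail) with plain memcpy. */
        size_t head = (16 - ((uintptr_t)d & 15)) & 15;
        if (head > n) head = n;
        memcpy(d, s, head);
        d += head; s += head; n -= head;

        while (n >= 16) {
            __m128i v = _mm_loadu_si128((const __m128i *)s);
            _mm_stream_si128((__m128i *)d, v);  /* store bypasses the cache */
            d += 16; s += 16; n -= 16;
        }
        _mm_sfence();      /* order the streaming stores before continuing */
        memcpy(d, s, n);   /* tail */
    }

    /* Hypothetical call-site wrapper: bypass_cache is whatever the
     * instrumentation decided for this site (e.g. the destination is
     * unlikely to be read again before it would be evicted anyway). */
    static void profiled_copy(void *dst, const void *src, size_t n,
                              int bypass_cache)
    {
        if (bypass_cache)
            copy_nontemporal(dst, src, n);
        else
            memcpy(dst, src, n);
    }

The point of the second instrumentation pass would be to catch sites where that flag was set wrongly, since a non-temporal copy of data that *is* read again shortly afterwards just turns the next access into a cache miss.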