One thing that stood out to me is that allocations dropped from 18-30 per op to zero across the board. I'd be interested in the techniques they used to achieve this, and whether any could be applied to the standard library itself. Also, benchcmp output would be nice.