A lock-free slab allocator for <i>smaller blocks</i> is frequently all that's needed, especially in STL-heavy C++ code.<p>There are typically a heck of a lot of allocations of 0-32 byte blocks, about half as many of 33-64, half as many again of 65-128, etc. - meaning that optimizing the allocation of blocks smaller than 512 bytes yields 80-90% of the speed gain achievable with a better allocator.<p>In practical terms this translates into a simpler implementation - just set up a bunch of slab buckets - one for 0-16 byte blocks, the next for up to 32, the next for up to 64, etc. - and pass larger allocation requests through to the default malloc. The exact size ranges are easy to determine by running the app with a proxy allocator and looking at the block-size histogram; I did this with several projects and in all cases the histograms stabilized very quickly. The tricky part is the lock-free management of the slabs, especially their disposal, but it is not rocket science by any means - see the sketch below.<p>A real-world example - adding a lock-free slab allocator to a Windows app that did some multi-threaded data crunching yielded a 400% speed-up compared to the native HeapAlloc().
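<p>Roughly what the bucket setup looks like, as a minimal C++ sketch. SlabBucket, small_alloc and the specific bucket sizes are made-up names/values for illustration; ABA protection and returning chunks to the OS (the disposal part) are deliberately left out to keep it short:

    #include <atomic>
    #include <cstddef>
    #include <cstdlib>

    // One slab bucket: a lock-free (Treiber-stack) free list of fixed-size
    // blocks carved out of larger chunks obtained from the system allocator.
    // NOTE: ABA protection and chunk disposal are omitted for brevity.
    class SlabBucket {
    public:
        explicit SlabBucket(std::size_t block_size) : block_size_(block_size) {}

        std::size_t block_size() const { return block_size_; }

        void* allocate() {
            Node* head = free_list_.load(std::memory_order_acquire);
            while (head) {
                // Pop the head of the free list; on failure 'head' is reloaded.
                if (free_list_.compare_exchange_weak(head, head->next,
                                                     std::memory_order_acquire))
                    return head;
            }
            return refill();   // free list exhausted - carve a new chunk
        }

        void deallocate(void* p) {
            Node* node = static_cast<Node*>(p);
            Node* head = free_list_.load(std::memory_order_relaxed);
            do {
                node->next = head;   // push the block back onto the free list
            } while (!free_list_.compare_exchange_weak(head, node,
                                                       std::memory_order_release));
        }

    private:
        struct Node { Node* next; };

        void* refill() {
            const std::size_t blocks_per_chunk = 64;
            char* chunk = static_cast<char*>(std::malloc(block_size_ * blocks_per_chunk));
            if (!chunk) return nullptr;
            for (std::size_t i = 1; i < blocks_per_chunk; ++i)
                deallocate(chunk + i * block_size_);   // push all but the first block
            return chunk;   // first block goes straight to the caller
        }

        std::size_t        block_size_;
        std::atomic<Node*> free_list_{nullptr};
    };

    // Bucket sizes picked from the block-size histogram; these exact values
    // are just an illustration.
    static SlabBucket g_buckets[] = {
        SlabBucket(16), SlabBucket(32), SlabBucket(64),
        SlabBucket(128), SlabBucket(256), SlabBucket(512)
    };

    // Front end: small requests go to the first bucket that fits,
    // everything larger falls through to the default malloc.
    void* small_alloc(std::size_t size) {
        for (SlabBucket& b : g_buckets)
            if (size <= b.block_size())
                return b.allocate();
        return std::malloc(size);
    }

<p>A real version also needs a way to map a freed pointer back to its owning bucket (per-slab headers or an address-range lookup), which is part of the disposal trickiness mentioned above.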