One thing I've learned about lock-free structures: DWCAS[1] is your friend. Using DWCAS, the Michael and Scott Queue algorithm is trivial to implement. Not really sure why the author insists on using plain CAS, especially when implementing a DWCAS-based algorithm.<p>[1] <a href="http://en.wikipedia.org/wiki/Double_compare-and-swap" rel="nofollow">http://en.wikipedia.org/wiki/Double_compare-and-swap</a>
This seems very (very) strange to me. Did a simplified mockup in Java using a ConcurrentLinkedDeque. Got an absolute speed-up of ~8 going from 1 thread to 2 (or 4)...and that's on a laptop.
Did you perhaps measure CPU time rather than wall clock or push entries scaled by the number of threads? Didn't find your test source on GitHub..