Here's one better, if all your data is strictly less than (not equal to) a cache line (usually 64 bytes) in size, and your processor guarantees memory ordering within a cache line (most do):<p>1) Keep head and tail pointers local to the consumer and producer.<p>2) Associate a bit with each entry in the queue which denotes whether the entry is full or not. The bit must live within the same cache line as the entry itself.<p>3) Block on this bit, rather than head & tail pointers.<p>4) Set an entry's bit after filling the entry; clear it before.<p>5) You can now elide the memory fence (implicit in the .lazySet() method of the atomic objects). Performance will skyrocket.
The presentation in pdf <a href="http://qconsf.com/dl/qcon-sanfran-2012/slides/MartinThompson_LockFreeAlgorithmsForUltimatePerformanceMOVEDTOBALLROOMA.pdf" rel="nofollow">http://qconsf.com/dl/qcon-sanfran-2012/slides/MartinThompson...</a> for off-line viewing.
If you interested in HPC topic, especially in terms of HFT, you should his previous presentation on LMAX system:
<a href="http://www.infoq.com/presentations/LMAX" rel="nofollow">http://www.infoq.com/presentations/LMAX</a>
Really good presentation, although a bit long.<p>What was most interesting to me was the actual performance figures and the small optimizations that yield big improvements.