Here's one better, if all your data is strictly less than (not equal to) a cache line (usually 64 bytes) in size, and your processor guarantees memory ordering within a cache line (most do):<p>1) Keep head and tail pointers local to the consumer and producer.<p>2) Associate a bit with each entry in the queue which denotes whether the entry is full or not. The bit must live within the same cache line as the entry itself.<p>3) Block on this bit, rather than head & tail pointers.<p>4) Set an entry's bit after filling the entry; clear it before.<p>5) You can now elide the memory fence (implicit in the .lazySet() method of the atomic objects). Performance will skyrocket.