Yes, You Have Been Writing SPSC Queues Wrong

49 points by NotThe1Pct over 8 years ago

6 comments

tbirdz over 8 years ago
I know the author doesn't intend this code to be "production ready", but I just wanted to point out a problem that may not be completely obvious if you are trying to use this structure for multithreaded communication.

The structure declares the read and write indices like so:

    std::atomic<IndexT> read_idx;
    std::atomic<IndexT> write_idx;

These two variables are going to be stored next to each other in memory. If IndexT is, say, a 32-bit int, then read_idx and write_idx will be 4 bytes each. A cache line is 64 bytes on Intel, so these two variables are going to end up on the same cache line.

The problem with this is that cache coherence protocols work at the granularity of lines, not bytes. So when one core writes anywhere on the line (for example, the producer writing write_idx), the other core (say the consumer core) has to invalidate its cache entry for that line and get the new value off the bus, or maybe even hit memory.

Regardless of the specifics, the point is that the consumer and producer cores can no longer independently write to their respective variables. Whenever they write to their variable, it will trigger cache coherence operations to update the line that variable sits in on the other core.

This is false sharing, since the producer and consumer don't actually need to access each other's variable. The cores are just forced to share access because the variables are on the same cache line.

This can really have a big impact on multithreaded workloads and bring down overall thread throughput.

One solution is to add some padding between the variables. Something like:

    std::atomic<IndexT> read_idx;
    unsigned char pad[64];
    std::atomic<IndexT> write_idx;

will do. That puts the variables on different cache lines. And you might want to pad some other variables out to different cache lines as well.
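The same padding effect can also be expressed with alignas rather than a manual pad array. A minimal sketch, assuming a 64-byte cache line and using illustrative names (SpscIndices, kCacheLine) that are not from the article:

    #include <atomic>
    #include <cstddef>
    #include <cstdint>

    // 64 bytes is the usual cache line size on current x86 parts (an assumption).
    constexpr std::size_t kCacheLine = 64;

    struct SpscIndices {
        // alignas pushes each index onto its own cache line, so the producer's
        // stores to write_idx never invalidate the line holding read_idx and
        // vice versa -- the same effect as the manual pad[64] above.
        alignas(kCacheLine) std::atomic<std::uint32_t> read_idx{0};
        alignas(kCacheLine) std::atomic<std::uint32_t> write_idx{0};
    };

    static_assert(sizeof(SpscIndices) == 2 * kCacheLine,
                  "each index occupies its own cache line");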
rkevingibson over 8 years ago
I'm curious what you think the problems are that unbounded indices cause? Unsigned overflow is well defined in C++, so I don't see the problem here. AFAIK, the only issue would be with requiring power-of-2 sizes.
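A minimal single-threaded sketch of the unbounded-index scheme being discussed, assuming a power-of-two capacity; the names and the int payload are illustrative, not the article's:

    #include <cstdint>

    struct Ring {
        static constexpr std::uint32_t kCapacity = 8;  // must be a power of two
        std::uint32_t read_idx = 0;    // both indices grow without bound and wrap via unsigned overflow
        std::uint32_t write_idx = 0;
        int slots[kCapacity];

        // write_idx - read_idx is correct even after either index wraps,
        // because unsigned arithmetic is modulo 2^32 and kCapacity divides 2^32.
        std::uint32_t size() const { return write_idx - read_idx; }
        bool empty() const { return read_idx == write_idx; }
        bool full() const { return size() == kCapacity; }

        bool push(int v) {
            if (full()) return false;
            slots[write_idx % kCapacity] = v;  // power-of-two modulo stays consistent across the wrap
            ++write_idx;                       // unsigned overflow is well defined
            return true;
        }
        bool pop(int& v) {
            if (empty()) return false;
            v = slots[read_idx % kCapacity];
            ++read_idx;
            return true;
        }
    };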
jessaustin over 8 years ago
Why is push() checking for overlap? I thought not caring about that was the point of a ring buffer?

[EDIT:] To clarify for those who didn't (couldn't?) read TFA's first sentence, it specifically invokes "I've been writing ring buffers wrong all these years" [0], recently featured on HN.

[0] https://www.snellman.net/blog/archive/2016-12-13-ring-buffers/
decker over 8 years ago
I didn't see the referenced article and initially thought this was going to be a post about not using volatile instead of atomic load / store.

I suppose this could be a useful implementation; however, it's not clear how useful it would be, since the main place it has an impact is when you have a large number of very small queues. The overhead associated with the unused element is sizeof(T) / (sizeof(T) * n + sizeof(RingBuffer) + malloc_overhead), where T is likely to be at most the size of a pointer on the platform. If we assume a standard x86-64 platform with sizeof(T) = 8, we can eyeball the struct and guess that GCC will probably spit out a 24-byte structure, since it's going to want 8-byte alignment on the T* after the uint32 values. On top of that, not counting malloc overhead, the array will take up n * sizeof(T) bytes. For regular malloc, the overhead on an 8-byte allocation is going to be an extra 12 bytes (according to malloc.c). Assuming we are allocating our ring buffer on the stack, this brings our total size up to 48 bytes for the "correct" ring buffer vs 56 bytes for the "wrong" ring buffer.

Actual savings will be at most 14% and go to 0% as the size of the buffer goes to infinity, proportional to 1/n, which is less than what one might assume. Consequently, for any buffer over length 2, the power-of-two buffer size requirement will likely waste more memory than it saves, since we are saving at most 8 bytes but losing the difference between a power of two and the desired queue size. So odds are it makes more sense to write a 1- or 2-element ring buffer as a special case than to use this implementation for all ring buffers.
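The 24-byte eyeball above can be checked against a plausible header layout; the field set below is a guess at the article's struct for illustration, not a quote from it:

    #include <cstdint>

    template <typename T>
    struct RingBufferHeader {
        std::uint32_t read_idx;   // offset 0
        std::uint32_t write_idx;  // offset 4
        std::uint32_t capacity;   // offset 8, followed by 4 bytes of padding
        T* data;                  // offset 16, wants 8-byte alignment on x86-64
    };

    // 24 bytes on a typical x86-64 ABI, matching the guess in the comment.
    static_assert(sizeof(RingBufferHeader<int>) == 24,
                  "header rounds up to 24 bytes");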
jstapels over 8 years ago
I honestly don't see how any of these implementations are better than simply using a bool to track whether the queue is empty or not (as suggested in one of the comments on the original article). You could even use the highest bit of the write index to store the bool value if memory was an issue.
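A single-threaded sketch of the high-bit idea, just to show the bookkeeping; it is not SPSC-safe as written (pop() touches write_idx), and the names are illustrative:

    #include <cstdint>

    struct Ring {
        static constexpr std::uint32_t kCapacity = 16;        // any size, not just powers of two
        static constexpr std::uint32_t kEmptyBit = 1u << 31;  // the borrowed flag bit

        std::uint32_t read_idx = 0;
        std::uint32_t write_idx = kEmptyBit;  // flag set: the queue starts empty
        int slots[kCapacity];

        bool empty() const { return (write_idx & kEmptyBit) != 0; }
        bool full()  const { return !empty() && (write_idx & ~kEmptyBit) == read_idx; }

        bool push(int v) {
            if (full()) return false;
            std::uint32_t w = write_idx & ~kEmptyBit;
            slots[w] = v;
            write_idx = (w + 1) % kCapacity;  // clearing the flag marks the queue non-empty
            return true;
        }
        bool pop(int& v) {
            if (empty()) return false;
            v = slots[read_idx];
            read_idx = (read_idx + 1) % kCapacity;
            if (read_idx == write_idx) write_idx |= kEmptyBit;  // drained: set the flag again
            return true;
        }
    };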
andars over 8 years ago
This link: http://www.sunburst-design.com/papers/CummingsSNUG2002SJ_FIFO1.pdf was posted in the previous discussion on ring buffers. It provides a solution that seems to me to be the best of any I have seen. See page 2, section 2.2. For a buffer of size 2^N, use N+1 bit indices. If all bits of the indices are equal, the buffer is empty. If all but the most significant bit are equal, the buffer is full.
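A software translation of that scheme, assuming a 2^N-entry buffer and single-threaded use for brevity (the paper's context is a dual-clock hardware FIFO; the names here are illustrative):

    #include <cstdint>

    constexpr unsigned N = 3;                                  // buffer holds 2^3 = 8 entries
    constexpr std::uint32_t kDepth = 1u << N;
    constexpr std::uint32_t kIndexMask = (1u << (N + 1)) - 1;  // indices carry N+1 bits
    constexpr std::uint32_t kMsb = 1u << N;                    // the extra, wrap-counting bit

    struct Fifo {
        std::uint32_t read_idx = 0;   // N+1-bit values, kept masked
        std::uint32_t write_idx = 0;
        int slots[kDepth];

        bool empty() const { return read_idx == write_idx; }           // all bits equal
        bool full()  const { return (write_idx ^ read_idx) == kMsb; }  // only the MSBs differ

        bool push(int v) {
            if (full()) return false;
            slots[write_idx & (kDepth - 1)] = v;       // low N bits address the slot
            write_idx = (write_idx + 1) & kIndexMask;  // wrap at N+1 bits
            return true;
        }
        bool pop(int& v) {
            if (empty()) return false;
            v = slots[read_idx & (kDepth - 1)];
            read_idx = (read_idx + 1) & kIndexMask;
            return true;
        }
    };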