You learn something new every day: "The second major issue with ordering we have to be aware of is a thread could write a variable and then, if it reads it shortly after, could see the value in its store buffer which may be older than the latest value in the cache sub-system."<p>I was skeptical about this because x86 has such a strongly ordered memory model, but lo and behold: "HOWEVER. If you do the sub-word write using a regular store, you are now invoking the _one_ non-coherent part of the x86 memory pipeline: the store buffer. Normal stores can (and will) be forwarded to subsequent loads from the store buffer, and they are not strongly ordered wrt cache coherency while they are buffered." (Linus, <a href="http://yarchive.net/comp/linux/store_buffer.html" rel="nofollow">http://yarchive.net/comp/linux/store_buffer.html</a>).
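For anyone who wants to see this themselves, the classic store-buffer litmus test can be written with C11 relaxed atomics; if both loads ever observe 0 in the same run, the stores were still sitting in the store buffers while the loads were served from cache. This is just a sketch (names and iteration count are arbitrary, compile with -pthread), and because the threads are created per iteration you may need many iterations to catch it:

    /* Sketch of the classic store-buffer litmus test in C11. */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    atomic_int X, Y;
    int r1, r2;

    static void *t1(void *arg) {
        (void)arg;
        atomic_store_explicit(&X, 1, memory_order_relaxed);
        r1 = atomic_load_explicit(&Y, memory_order_relaxed);
        return NULL;
    }

    static void *t2(void *arg) {
        (void)arg;
        atomic_store_explicit(&Y, 1, memory_order_relaxed);
        r2 = atomic_load_explicit(&X, memory_order_relaxed);
        return NULL;
    }

    int main(void) {
        for (int i = 0; i < 100000; i++) {
            atomic_store(&X, 0);
            atomic_store(&Y, 0);
            pthread_t a, b;
            pthread_create(&a, NULL, t1, NULL);
            pthread_create(&b, NULL, t2, NULL);
            pthread_join(a, NULL);
            pthread_join(b, NULL);
            if (r1 == 0 && r2 == 0)   /* only possible via store buffering */
                printf("both loads saw 0 on iteration %d\n", i);
        }
        return 0;
    }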
Seems like a nice article, though I have 2 nits:<p>1) I'd have liked you to dive into the cases where you actually do have to flush the CPU cache. I've run into this maybe once or twice in my entire career, and that was while writing MIPS kernel drivers. I'm guessing it would be useful for the audience to understand what circumstances actually require it, particularly as more people transition from x86 to ARM.<p>2) You are ascribing meaning to volatile which it absolutely does not have (in C/C++). You really should go deeper into load/store memory barriers. Using volatile in the hope that it gives you some kind of synchronization is misguided.
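To make nit 2 concrete, here's roughly what I mean, as a C11 sketch (the names are just for illustration): the ordering guarantee comes from the release/acquire pairing on an atomic, not from volatile.

    #include <stdatomic.h>

    int payload;              /* ordinary data */
    atomic_int ready;         /* NOT "volatile int ready" */

    void producer(void) {
        payload = 42;
        /* release store: the payload write cannot be reordered after it */
        atomic_store_explicit(&ready, 1, memory_order_release);
    }

    int consumer(void) {
        /* acquire load: pairs with the release store above */
        while (!atomic_load_explicit(&ready, memory_order_acquire))
            ;                 /* spin */
        return payload;       /* guaranteed to see 42 */
    }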
"Even from highly experienced technologists I often hear talk about how certain operations cause a CPU cache to "flush"."<p>Ok.<p>"This style of memory management is known as write-back whereby data in the cache is only written back to main-memory when the cache-line is evicted because a new line is taking its place."<p>That sounds like a flush to me. Modified data is written back (or flushed) to main memory.<p>"I think we can safely say that we never "flush" the CPU cache within our programs."<p>Maybe not explicitly, but a write-back is triggered by "certain operations," (see the first quotation above).<p>So it sounds like the real "fallacy" the article is discussing is the idea that a cache flush is something that a program explicitly does. That would indeed be a fallacy, but I have never heard anyone claim this.<p>On the upside, the article does give a lot of really nice details about the memory hierarchy of modern architectures (this stuff falls out of date quickly). I had no idea the Memory Order Buffers could hold 64 concurrent loads and 36 concurrent stores.
"When hyperthreading is enabled these registers are shared between the co-located hyperthreads."<p>How does that work? I thought each HT had its own registers - otherwise, wouldn't that add a lot of complication and overhead? And does that mean if I disable HT, a program can double the available registers? Wouldn't that need different machine code?
<i>I think we can safely say that we never "flush" the CPU cache within our programs.</i><p>Perhaps true, but not for lack of trying!<p>For benchmarking compression algorithms from a cold cache, I've been trying to intentionally flush the CPU caches using WBINVD (Write Back and Invalidate Cache) and CLFLUSH (Cache Line Flush). I'm finding this difficult to do, at least under Linux on an Intel Core i7.<p>1) WBINVD needs to be called from Ring 0, i.e. the kernel. The only way I've found to call this instruction from user space is with a custom kernel module and an ioctl(). This works, but feels overly complicated. Is there some built-in way to do this?<p>2) CLFLUSH is straightforward to call, but I'm not sure it's working for me. I stride through the area I want uncached at 64-byte intervals calling _mm_clflush(), but I'm not getting consistent results. Is there more that I need to do? Do I need MFENCE both before and after, or in the loop?
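For reference, the core of what I'm doing looks roughly like this (a sketch, with MFENCE conservatively on both sides of the loop; the function name and line size are mine):

    #include <emmintrin.h>   /* _mm_clflush, _mm_mfence */
    #include <stdint.h>
    #include <stddef.h>

    /* Flush every cache line covering [buf, buf + len); 64 bytes is the
     * line size on Core i7. */
    static void flush_range(const void *buf, size_t len) {
        const char *p   = (const char *)((uintptr_t)buf & ~(uintptr_t)63);
        const char *end = (const char *)buf + len;
        _mm_mfence();                    /* drain pending stores first */
        for (; p < end; p += 64)
            _mm_clflush(p);
        _mm_mfence();                    /* wait for the flushes to complete */
    }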
<a href="http://linux.die.net/man/2/cacheflush" rel="nofollow">http://linux.die.net/man/2/cacheflush</a><p>There are other CPU architectures than Nehalem.<p>Nehalem's I/O hub connects to the caches via QPI, not to memory directly, so I/O sees the data exactly as the caches see it. Historically, most architectures have the I/O system talking to main memory or to a memory controller independent of the CPU. Not waiting for memory to be consistent before firing off a DMA was a great way to get "interesting" visual effects.<p>We called that process "flushing the cache".
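For anyone curious, the interface in that man page is MIPS-only (there's no such syscall on x86); a minimal sketch of how it's used, with the prototype taken from the man page:

    #include <asm/cachectl.h>                           /* ICACHE, DCACHE, BCACHE */
    int cacheflush(char *addr, int nbytes, int cache);  /* prototype per the man page */

    int flush_before_dma(char *addr, int nbytes)
    {
        /* BCACHE: flush both the data and instruction caches for the range */
        return cacheflush(addr, nbytes, BCACHE);
    }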
"If our caches are always coherent then why do we worry about visibility when writing concurrent programs? This is because within our cores, in their quest for ever greater performance, data modifications can appear out-of-order to other threads"<p>I've read the opposite from various sources such as JCIP - a single unsynchronized-nonvolatile write could never be noticed by other threads (i.e. processors). I don't think that case falls into the "instruction reordering" category, does it?
I think people do detect cache issues. Given that a DRAM access is something like 150 or more cycles slower than a cache hit, even normally written programs can end up starved waiting on RAM.<p>Also, people sometimes confuse TLB flushing with overall cache flushing, as mentioned at the end of your article, and tagged TLBs are still not commonly used (to my knowledge).<p>The reality is that there is a huge performance hit whenever these systems need to be used by a new process or context. Maybe people are equating that experience with "flushing".