Properly configured big.LITTLE clusters should be set up so that all CPUs report the same cache line size (which might be smaller than the true cache line size on some of the CPUs), to avoid exactly this kind of problem. The libgcc code assumes the hardware is correctly put together.<p>There is a Linux kernel patchset currently going through review which provides a workaround for this kind of erratum by trapping CTR_EL0 accesses into the kernel so they can be emulated with a safe, correct value:
<a href="http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1227904.html" rel="nofollow">http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg...</a>
and it seems to me that that's really the right way to deal with this.
Seeing bugs like this reminds me of how much nicer things are on x86, where JITs do not need to flush caches. You can actually modify the instruction immediately ahead of the currently executing one, and the CPU will naturally "do the right thing"[1] --- it does slow down execution, as the CPU is essentially automatically detecting and flushing its cache/pipeline, but used sparingly it can be a great optimisation. The write can even come from another core ("cross-modifying code") and everything will still work. Someone I know used this to great effect in squeezing the last bits of performance out of an application by eliminating checks on a few flag variables and the associated branching in a tight loop --- it simply "poked" instruction bytes into the loop from another core when it was time for that core to do something else.<p>[1] With the exception of pre-Pentium CPUs, where modifying at various forward offsets from the locus of execution could give insight into how big the prefetch queue is. With the Pentium, self-modification was fully detected, and later multithreaded/multicore CPUs react similarly to cross-modifying code as described above, which leads me to believe that Intel is very much supportive of these things, as otherwise they could've just told programmers to do as ARM does.<p>Maybe what ARM needs, short of doing it the Intel way, is a "flush region" instruction which takes both the address <i>and</i> size, so it can automatically flush the appropriate cache lines based on the current hardware's cacheline size.
Different cacheline sizes for the different cores seems like an absurdly bad idea. For one, it opens you up to bugs like these, but it also makes optimization a lot harder. I have a hard time believing the savings from a larger line size are worth it.
wow, just wow. That is a really awesome bug (and like the authors I have trouble sleeping when that sort of puzzle is sitting there :-)<p>I'm still a bit hazy on <i>why</i> they manually flush the cache for a given block of memory (presumably to protect against disclosure?), but I'm also curious what happens with this sequence: the big core fetches a cache line, you switch to the little core, which fetches a line (half as long, changing half the bytes in the cache), and then you switch back to the big core, which thinks it has a full cache line? Presumably there is some mechanism that invalidates cache lines?
I had this problem too with GDB on the Odroid UX4 big.LITTLE SoC.<p>GDB patches the instruction via ptrace, to insert a breakpoint for example.<p>See my blog post about it: <a href="https://www.kayaksoft.com/blog/2016/05/11/random-sigill-on-arm-board-odroid-ux4-with-gdbgdbserver/" rel="nofollow">https://www.kayaksoft.com/blog/2016/05/11/random-sigill-on-a...</a><p>Or the post on the GDB mailing list: <a href="https://www.sourceware.org/ml/gdb/2015-11/msg00030.html" rel="nofollow">https://www.sourceware.org/ml/gdb/2015-11/msg00030.html</a><p>Too bad, however, that the kernel patchset mentioned in a previous post only covers arm64, so it's still a problem on arm32.
> Worse, not even the ARM ISA is ready for this. An astute reader might realize that computing the cache line on every invocation is not enough for user space code: It can happen that a process gets scheduled on a different CPU while executing the __clear_cache function with a certain cache line size, where it might not be valid anymore.<p>I see the problem rather in the fact that there seems to be no way to tell the Linux scheduler: only schedule this process/thread on cores that have the same cache line size. Or add an attribute at thread creation giving the cache line size of the cores it is allowed to run on, or an attribute meaning "allow an arbitrary cache line size, but don't migrate to cores with a different size". That way it would suffice to check the cache line size once at program or thread start.
Huge props to the team for finding this out. What a nasty issue.<p>One question about the intro: it states that this is the first mass-produced AMP architecture, but isn't the PlayStation 3's Cell CPU one?
<i>"first mass produced AMP architecture"</i><p>Nope. Remember the Cell? The processor in the Playstation 3? One main CPU with 8 little CPUs and no shared memory, just channels.<p>The Playstation 4 isn't an AMP machine because programming the Cell was so hard.
It appears that the caching code was added in this patch:<p><a href="https://gcc.gnu.org/ml/gcc-patches/2012-09/msg00076.html" rel="nofollow">https://gcc.gnu.org/ml/gcc-patches/2012-09/msg00076.html</a><p>Prior to that, the call:<p><pre><code> asm volatile ("mrs\t%0, ctr_el0":"=r" (cache_info));
</code></pre>
was always made.
This is an excellent bug journey, and I'm even more impressed that the resulting discovery has already been used to improve Dolphin. A testament to the quality of both projects.
I wonder why no one tried to validate Asymmetric MultiProcessing by first testing all cases with only the little cores or only the big cores enabled, and then bisecting further once both are enabled.