Properly configured big.LITTLE clusters should be set up so that all CPUs report the same cache line size (which might be smaller than the true cache line size on some of the CPUs), to avoid exactly this kind of problem. The libgcc code assumes the hardware is correctly put together.<p>There is a Linux kernel patchset currently going through review which provides a workaround for this kind of erratum by trapping CTR_EL0 accesses into the kernel so they can be emulated with a safe, correct value:
<a href="http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1227904.html" rel="nofollow">http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg...</a>
and it seems to me that that's really the right way to deal with this.
Seeing bugs like this reminds me of how much nicer things are on x86, where JITs do not need to flush caches. You can actually modify the instruction immediately ahead of the currently executing one, and the CPU will naturally "do the right thing"[1] --- it does slow down execution, as the CPU is essentially automatically detecting and flushing its cache/pipeline, but used sparingly it can be a great optimisation. The write can even come from another core ("cross-modifying code") and everything will still work. Someone I know used this to great effect in squeezing the last bits of performance out of an application by eliminating checks on a few flag variables and the associated branching in a tight loop --- it simply "poked" instruction bytes into the loop from another core when it was time for that core to do something else.<p>[1] With the exception of pre-Pentium CPUs, where modifying at various forward offsets from the locus of execution could give insight into how big the prefetch queue is. With the Pentium, self-modification was fully detected, and later multithreaded/multicore CPUs react similarly to cross-modifying code as described above, which leads me to believe that Intel is very much supportive of these things, as otherwise they could've just told programmers to do as ARM does.<p>Maybe what ARM needs, short of doing it the Intel way, is a "flush region" instruction which takes both the address <i>and</i> size, so it can automatically flush the appropriate cache lines based on the current hardware's cacheline size.
Different cacheline sizes for the different cores seems like an absurdly bad idea. For one, it opens you up to bugs like these, but it also makes optimization a lot harder. I have a hard time believing the savings from a larger line size are worth it.
wow, just wow. That is a really awesome bug (and like the authors I have trouble sleeping when that sort of puzzle is sitting there :-)<p>I'm still a bit hazy on <i>why</i> they manually flush the cache for a given block of memory (presumably to protect against disclosure?), but I'm also curious what happens with this sequence: the big core fetches a cache line, you switch to the little core, which fetches a line (half as long, changing half the bytes in the cache), and then you switch back to the big core, which thinks it has a full cache line? Presumably there is some mechanism that invalidates cache lines?
I had this problem too with GDB on the Odroid UX4 big.LITTLE SoC.<p>GDB patches the instruction via ptrace, to insert a breakpoint for example.<p>See my blog post about it: <a href="https://www.kayaksoft.com/blog/2016/05/11/random-sigill-on-arm-board-odroid-ux4-with-gdbgdbserver/" rel="nofollow">https://www.kayaksoft.com/blog/2016/05/11/random-sigill-on-a...</a><p>Or the post on the GDB mailing list: <a href="https://www.sourceware.org/ml/gdb/2015-11/msg00030.html" rel="nofollow">https://www.sourceware.org/ml/gdb/2015-11/msg00030.html</a><p>Too bad, however, that the kernel patchset mentioned in a previous post only covers arm64, so it's still a problem on arm32.
> Worse, not even the ARM ISA is ready for this. An astute reader might realize that computing the cache line on every invocation is not enough for user space code: It can happen that a process gets scheduled on a different CPU while executing the __clear_cache function with a certain cache line size, where it might not be valid anymore.<p>I see the problem rather in the fact that there seems to be no way to tell the Linux scheduler: only schedule this process/thread on cores that have the same cache line size. Or add an attribute at thread creation giving the cache line size of the cores it is allowed to run on, or an attribute meaning "allow an arbitrary cache line size, but don't migrate to cores with a different size". That way it would suffice to check the cache line size once at program or thread start.
Huge props to the team for finding this out. What a nasty issue.<p>One question about the intro: it states that this is the first mass-produced AMP architecture, but isn't the PlayStation 3's Cell CPU one?
<i>"first mass produced AMP architecture"</i><p>Nope. Remember the Cell? The processor in the Playstation 3? One main CPU with 8 little CPUs and no shared memory, just channels.<p>The Playstation 4 isn't an AMP machine because programming the Cell was so hard.
It appears that the caching code was added in this patch:<p><a href="https://gcc.gnu.org/ml/gcc-patches/2012-09/msg00076.html" rel="nofollow">https://gcc.gnu.org/ml/gcc-patches/2012-09/msg00076.html</a><p>Prior to that, the call:<p><pre><code> asm volatile ("mrs\t%0, ctr_el0":"=r" (cache_info));
</code></pre>
was always made.
This is an excellent bug journey, and I'm even more impressed that the resulting discovery has already been used to improve Dolphin. A testament to the quality of both projects.
I wonder why no one tried to validate Asymmetric MultiProcessing by first testing all cases with only the little cores or only the big cores enabled, and then bisecting further once both are enabled.