Again someone who relies on undefined behavior.
Casting a pointer with the wrong alignment is not platform-specific behavior, it's undefined behavior. Relying on it is an error.<p>The author apparently did not know "What Every C Programmer Should Know About Undefined Behavior": <a href="http://blog.llvm.org/2011/05/what-every-c-programmer-should-know.html" rel="nofollow">http://blog.llvm.org/2011/05/what-every-c-programmer-should-...</a><p>Another good link about that: <a href="http://blog.regehr.org/archives/213" rel="nofollow">http://blog.regehr.org/archives/213</a>
The correct solution for GCC is specifying 1-byte alignment for this particular array:<p><pre><code> #include <stdlib.h>
 #include <stdint.h>
 typedef uint32_t __attribute__((__aligned__(1))) uint32_t_unaligned;
 uint64_t sum (const uint32_t_unaligned * p, size_t nwords)
 {
     uint64_t res = 0;
     size_t i;
     for (i = 0; i < nwords; i++) res += p [i];
     return res;
 }
</code></pre>
Probably works on clang too and IIRC the MS compiler provides similar functionality with different syntax. AFAIK there is no portable solution.<p>And I'm not sure how exactly this code will fail on architectures which don't support unaligned uint32_t.
These SSE instructions that operate only on aligned data are a pain. It's not well known that Linux/x86 stack frames must always be 16-byte aligned. GCC uses this knowledge to emit the aligned SSE instructions when accessing certain fields on the stack.<p>Unfortunately, a while back the OCaml compiler generated non-aligned stack frames. That is no problem for pure OCaml code and even saves a little bit of memory. However, if the code called out to C, then <i>sometimes</i> and unpredictably (think different call stacks, ASLR) the C code would crash. That was a horrible bug to track down:<p><a href="https://caml.inria.fr/mantis/view.php?id=5700#c10779" rel="nofollow">https://caml.inria.fr/mantis/view.php?id=5700#c10779</a>
Well, that's pretty horrendous. Note that the naive code which just casts the input to uint16_t would work fine. I can't help but wonder if the solution to this might have been better expressed as naive implementation + platform-specific <i>assembly</i> implementation.<p>After all, if you have to understand the underlying instructions executed in order to fix the problem, why not stop trying to make the compiler emit the "right" instructions and just write them yourself?<p>(Language lawyers: is casting a char* to a uint32_t* actually defined behavior? For unaligned data?)
The compiler is allowed to assume that pointers are properly aligned. What you are doing is creating a pointer to a value with invalid alignment, hence undefined behaviour (merely creating such a pointer is undefined behaviour, even before it is dereferenced). The correct solution is to read the values indirectly. For example, a function like this could be used to replace every access through the "q" variable.<p><pre><code> #include <stddef.h>
 #include <stdint.h>
 #include <string.h>
 /* Copy out the index-th 32-bit word; memcpy makes no alignment assumption. */
 static uint32_t read(const char *p, size_t index) {
     uint32_t out;
     memcpy(&out, &p[index * sizeof out], sizeof out);
     return out;
 }
</code></pre>
A compiler can recognize this pattern and keep emitting plain unaligned loads on platforms where those work.<p>This has the cost of unaligned accesses on non-x86 platforms (quite a big one at that), but considering the original code didn't work on those at all, it's an improvement.
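To make the memcpy suggestion concrete, here is a minimal sketch of a word-summing loop built on that kind of helper. The names `read_u32` and `sum_unaligned` are made up for illustration and are not from the article; words are interpreted in host byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: fetch the index-th 32-bit word from a byte buffer.
   memcpy makes no alignment assumption about the source address. */
static uint32_t read_u32(const unsigned char *p, size_t index)
{
    uint32_t out;
    memcpy(&out, p + index * sizeof out, sizeof out);
    return out;
}

/* Sum nwords 32-bit words (host byte order) starting at buf.
   buf may point at any byte offset. */
uint64_t sum_unaligned(const void *buf, size_t nwords)
{
    const unsigned char *p = buf;
    uint64_t res = 0;

    assert(buf != NULL || nwords == 0);
    for (size_t i = 0; i < nwords; i++)
        res += read_u32(p, i);
    return res;
}
```

Calling `sum_unaligned(raw + 1, n)` on a byte array is then well defined regardless of where `raw` happens to sit in memory. Whether the compiler folds the memcpy into a single load depends on the target and optimization level.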
Note that even if you try to manually correct the pointer to work on aligned data (read any initial bytes via a char pointer and read the rest via a uint32_t pointer), you still generally have undefined behavior: a strict-aliasing violation. And the worst thing here is that whether you have a violation depends on how <i>other</i> code accesses the same data and how the object was originally declared. E.g., you're fine if the original declaration is char[] or uint32_t[], but not if it's uint16_t[], because that would entail access to the same data via both uint16_t and uint32_t, a violation of strict aliasing.<p>Actually, two out of three inet checksum implementations in lwIP have this bug [1].<p>And like the problem discovered in the article, this is NOT theoretical. I have personally seen code "miscompiled" due to strict-aliasing violations (in that case, packed structures were involved).<p>I think the only way to do this "manual alignment handling" is to use assembly, either by writing the entire thing in assembly or by using inline asm sections for the individual 32-bit memory reads/writes.<p>Funny story... When I was looking for a fast inet checksum implementation to use for an embedded ARM project, I took the one from RTEMS, which is written in C with a lot of inline asm and, like the lwIP code, has strict-aliasing violations (and also problems compiling correctly with clang). What I did was compile it to assembly with gcc once, then include this compiled assembly in the source code. Assuming that it was compiled correctly that one time, I don't need to be afraid of future compiler changes breaking it.<p>[1] <a href="http://git.savannah.gnu.org/cgit/lwip.git/tree/src/core/inet_chksum.c" rel="nofollow">http://git.savannah.gnu.org/cgit/lwip.git/tree/src/core/inet...</a>
Related Snabb experiments with IP checksum in C with automatic vectorization, C with vector intrinsics, and AVX2 assembler: <a href="https://github.com/snabbco/snabb/pull/899" rel="nofollow">https://github.com/snabbco/snabb/pull/899</a>
I'm taking assembly right now, and we're working on our first RISC project after spending all semester working with the x86. Why does RISC crash if the bytes are not aligned?
If you're willing to use compiler extensions, you can avoid the memcpy by using packed structs. This can generate better code.<p>Folly has a generic `loadUnaligned()` that uses this trick: <a href="https://github.com/facebook/folly/blob/5d52fb8c30e567403b8ccb65e5c1a159fb92d707/folly/Bits.h#L539" rel="nofollow">https://github.com/facebook/folly/blob/5d52fb8c30e567403b8cc...</a>
What if you put the array in a struct and made a union of both uint32_t and uint8_t? Would the union with the larger size force the compiler to generate a 4-byte aligned array for the bytes?<p>I suggest this because it would be portable without any compiler specific stuff.
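A sketch of what that union could look like, with a made-up size and made-up names (`PACKET_WORDS`, `packet_buf`, `sum_words` are all hypothetical). A union is aligned at least as strictly as its most strictly aligned member, so the byte view starts on a uint32_t boundary too. Note this only helps when you control the declaration of the buffer; data that begins at an odd offset inside it is still misaligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PACKET_WORDS 4  /* hypothetical size, for illustration only */

/* The uint32_t member gives the whole union uint32_t alignment,
   so bytes[] begins at a suitably aligned address as well. */
union packet_buf {
    uint32_t words[PACKET_WORDS];
    uint8_t  bytes[PACKET_WORDS * sizeof(uint32_t)];
};

/* Sum the buffer through its word view (host byte order). */
uint64_t sum_words(const union packet_buf *buf)
{
    uint64_t res = 0;

    assert(buf != NULL);
    for (size_t i = 0; i < PACKET_WORDS; i++)
        res += buf->words[i];
    return res;
}
```

Filling `bytes` and then reading `words` relies on union type punning, which C permits (C++ does not).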
So much HTML to complain about C working the way C is defined rather than the way the OP wants it to work! It's not that hard to write a fast ones'-complement checksum that's portable and compliant, but whining's always easier than coding.
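For reference, a portable version along those lines might look like the sketch below. It follows the RFC 1071 description (16-bit big-endian words, end-around carry, final complement) and reads only through `unsigned char`, so it assumes nothing about alignment or aliasing. The name `inet_checksum` is mine, and this variant favors clarity over speed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Internet checksum in the style of RFC 1071 over len bytes.
   The result is the complemented sum of big-endian 16-bit words,
   returned as a host integer. */
uint16_t inet_checksum(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint64_t sum = 0;

    assert(data != NULL || len == 0);

    /* Add 16-bit words assembled from individual bytes. */
    while (len >= 2) {
        sum += ((uint32_t)p[0] << 8) | p[1];
        p += 2;
        len -= 2;
    }

    /* A trailing odd byte is treated as the high byte of a zero-padded word. */
    if (len == 1)
        sum += (uint32_t)p[0] << 8;

    /* Fold the carries back into the low 16 bits (end-around carry). */
    while (sum >> 16)
        sum = (sum & 0xFFFF) + (sum >> 16);

    return (uint16_t)~sum;
}
```

For the eight bytes 00 01 f2 03 f4 f5 f6 f7 used in the RFC 1071 worked example, the folded sum is 0xddf2, so this returns 0x220d, and it gives the same answer when the data starts at an odd address.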