Looking at the C code, here are a few suggestions:<p>- __builtin_popcount() and __builtin_popcountll() are great gcc builtins which perform bit population counts, useful for hamming distances (after the XOR operation, of course). [1]<p>- avoid scanf(), which is costly. My approach was to generate a header file with the char * words[] array, and also a int words_len[] with, you guessed it, the pre-computed lengths. Then go for the memcpy() route and you're in a much better position.<p>- speaking of memcpy, gcc provides one that should be optimal in most cases, but for really short words, it might be easiest to write your own. I came across a good article on it [2] but this is what I ended up using:<p><pre><code> void
c_memcpy (char * dst, char * src, size_t len)
{
__builtin_prefetch(dst, 1, 3);
while (len--) {
*dst++ = *src++;
}
}
</code></pre>
[1] <a href="http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Other-Builtins.html" rel="nofollow">http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Other-Builtins.h...</a><p>[2] <a href="http://www.embedded.com/columns/technicalinsights/19205567" rel="nofollow">http://www.embedded.com/columns/technicalinsights/19205567</a>
Great writeup, thanks! I'm confused by your conclusion, though: "I wish I would have identified Rackspace initially as the platform of choice."<p>From the results of the contest, I would have concluded that the platform of choice was CUDA. Do you think otherwise? Is there a reason you'd choose to use a flotilla of Rackspace servers for a problem like this rather than a graphics card? If you are going to be programming in C anyway, it seems like a much simpler approach.