The biggest argument for using cosine similarity is that hardware, software, and research have co-evolved to make it fast, robust, and well-understood.<p>As one simple example of that, most modern compilers can recognize the loop-carried dependency on a single accumulator and emit a few extra accumulators in the generated assembly for anything that looks like an inner product. For even slightly more complicated patterns, though, that optimization is unlikely to be implemented at the compiler level, so you have to do it yourself.<p>The author benchmarked, among other things, Chebyshev distance. Here are two example Zig implementations; the one with an extra accumulator to break that dependency chain is better than 3x faster on my machine.<p><pre><code> // 742ns per vec (1536-dim random uniform data)
fn chebyshev_scalar_traditional_ignoreerrs(F: type, a: []const F, b: []const F) F {
    @setFloatMode(.optimized);
    var result: F = 0;
    for (a, b) |_a, _b|
        result = @max(result, @abs(_a - _b));
    return result;
}
// 226ns per vec (1536-dim random uniform data)
fn chebyshev_scalar_sharing2_ignoreerrs(F: type, a: []const F, b: []const F) F {
    @setFloatMode(.optimized);
    var result0: F = 0;
    var result1: F = 0;
    var i: usize = 0;
    while (i + 1 < a.len) : (i += 2) {
        result0 = @max(result0, @abs(a[i] - b[i]));
        result1 = @max(result1, @abs(a[i + 1] - b[i + 1]));
    }
    if (a.len & 1 == 1)
        result0 = @max(result0, @abs(a[a.len - 1] - b[b.len - 1]));
    return @max(result0, result1);
}
</code></pre>
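For reference, here's a minimal sketch of how these get called; the element type is a comptime parameter, and both versions assume the two slices have equal length (this test is mine, not from the benchmark):<p><pre><code> const std = @import("std");

test "both chebyshev variants agree" {
    const a = [_]f32{ 1.0, -2.5, 3.0 };
    const b = [_]f32{ 0.5, 4.0, -1.0 };
    const d0 = chebyshev_scalar_traditional_ignoreerrs(f32, &a, &b);
    const d1 = chebyshev_scalar_sharing2_ignoreerrs(f32, &a, &b);
    // Both reduce to max(|a_i - b_i|), which is 6.5 here.
    try std.testing.expectApproxEqAbs(d0, d1, 1e-6);
}
</code></pre>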
This is apples to oranges, but if their Chebyshev implementation were 3x faster after JIT compilation, it'd handily beat everything else.
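Going back to the inner-product point: a plain dot product is the case compilers usually handle on their own under an optimized float mode, and writing the two-accumulator version out by hand just makes the broken dependency chain explicit. This is my own illustrative sketch (function names are mine), not code from the article:<p><pre><code> // Single accumulator: each iteration depends on the previous one,
// so throughput is bounded by the add latency. With .optimized float
// mode the compiler is generally allowed to do this rewrite itself.
fn dot_scalar(F: type, a: []const F, b: []const F) F {
    @setFloatMode(.optimized);
    var acc: F = 0;
    for (a, b) |_a, _b|
        acc += _a * _b;
    return acc;
}
// Two independent accumulators: the chains can overlap in the pipeline,
// the same trick as the sharing2 Chebyshev version above.
fn dot_scalar_2acc(F: type, a: []const F, b: []const F) F {
    @setFloatMode(.optimized);
    var acc0: F = 0;
    var acc1: F = 0;
    var i: usize = 0;
    while (i + 1 < a.len) : (i += 2) {
        acc0 += a[i] * b[i];
        acc1 += a[i + 1] * b[i + 1];
    }
    if (a.len & 1 == 1)
        acc0 += a[a.len - 1] * b[b.len - 1];
    return acc0 + acc1;
}
</code></pre>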