Hi, I shaved off another 40% just by replacing the multiplications with bit shifts plus an add or subtract, and I ported the code to a standard Python C extension to avoid relying on a JIT.<p><a href="https://github.com/szabolcsdombi/optimized-floyd-steinberg-dithering">https://github.com/szabolcsdombi/optimized-floyd-steinberg-d...</a><p>X * 7 is equal to (X << 3) - X<p>X * 3 is equal to (X << 1) + X<p>X * 5 is equal to (X << 2) + X
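<p>For anyone who wants to sanity-check those identities, here is a rough Python sketch (not the code from the repo). The helper names and the <code>diffuse</code> function are made up for illustration, and I'm assuming the usual Floyd-Steinberg weights of 7/16, 3/16, 5/16 and 1/16 with the divide by 16 done as a right shift by 4. Python's <code>>></code> floors for negative values, and in C left-shifting a negative signed integer is undefined behavior, so a real C version would need to take care with the sign of the error term.

```python
# Hypothetical helpers: multiply by the Floyd-Steinberg numerators
# using only shifts and one add or subtract.
def times7(x: int) -> int:
    return (x << 3) - x  # 8x - x


def times3(x: int) -> int:
    return (x << 1) + x  # 2x + x


def times5(x: int) -> int:
    return (x << 2) + x  # 4x + x


def diffuse(err: int) -> tuple[int, int, int, int]:
    """Split a quantization error into 7/16, 3/16, 5/16 and 1/16 shares.

    Assumption: the division by 16 is done as a right shift by 4,
    which floors toward negative infinity in Python.
    """
    return (times7(err) >> 4, times3(err) >> 4, times5(err) >> 4, err >> 4)


# Check the identities over every error value an 8-bit channel can produce.
for x in range(-255, 256):
    assert times7(x) == x * 7
    assert times3(x) == x * 3
    assert times5(x) == x * 5
```

In Python itself this buys nothing, since the interpreter overhead dominates. The point is only to show that the shift forms agree with plain multiplication over the range of error values.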