I wish I had put more benchmarks in the article to show its application and how it performs. I had some interesting findings on the performance differences between CPU code (unrolled loops, transposition before matmul, etc.), GPU code, and SIMD code using WebAssembly.
I hope to cover them in another article, as this post was already too long.