That's really cool, and a good breakdown of what's actually going on inside an ANN. I don't think writing it in assembly will help much with speed, though. Wouldn't you still be better off doing the matrix multiplications on a GPU, using something like Theano in Python? Then again, maybe you used assembly to explain things at a very low level rather than for any speed boost. Either way, very cool article.
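
To make the matrix multiplication point concrete, here is a minimal sketch in plain Python of one layer's forward pass, which is just a matrix-vector product followed by an activation. The names `layer_forward`, `W`, `b`, and `x`, the sigmoid activation, and the numbers are all made up for illustration and are not taken from the article.

```python
import math

def sigmoid(z):
    """Logistic activation, assumed here purely for illustration."""
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(weights, biases, inputs):
    """One dense layer: sigmoid(W x + b), written out as explicit loops.

    Each row of `weights` holds the incoming weights of one neuron.
    """
    outputs = []
    for row, bias in zip(weights, biases):
        # Dot product of one weight row with the input vector, plus bias.
        z = sum(w * x for w, x in zip(row, inputs)) + bias
        outputs.append(sigmoid(z))
    return outputs

# Hypothetical layer with 2 inputs and 2 neurons.
W = [[0.0, 0.0], [1.0, -1.0]]
b = [0.0, 0.0]
x = [0.5, 0.5]
print(layer_forward(W, b, x))
```

The loops are the same multiply-and-add work the assembly version spells out instruction by instruction. A GPU library would take the whole product `W x` as a single operation and run it in parallel, which is where I'd expect the speed to come from.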