It seems to me that a good speedup would be possible if the GPU and HBM had
a 'cache bypass'. That is, there are likely a large number of frequent matrix
multiplies that could be computed by hardware lookup rather than an actual
multiply. Such a pre-multiply cache would free up more of the actual multiply
hardware, substituting the cache response for the result.

This 'memoizing' trick is widely used in compute-intensive situations, but I'm
unaware of any GPU/HBM hardware that supports it.

Given that the multiplies are now computing 4- or 8-bit results, it seems like
a reasonable number of matrix multiplies could be cached.