I think this paper is the key to the next speedup in local LLM inference. By making the activations sparse (using the ReLU activation), we can skip around 80% of the memory accesses and computation in the feed-forward layers. ReLU sets every negative activation to exactly 0, and since anything multiplied by zero contributes nothing, the next matrix multiplication doesn't need to load the weight rows that correspond to those zeroed activations (see the sketch below).

Unfortunately, not many current models are trained with ReLU activations.
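To make that concrete, here's a minimal NumPy sketch (toy sizes and names of my own choosing, not from the paper) of the arithmetic identity behind the trick. Note that plain NumPy still copies the selected rows, so it only demonstrates the math; the actual saving comes from an inference kernel that never reads the skipped rows from memory in the first place.

    import numpy as np

    # Toy feed-forward block: y = relu(x @ W1) @ W2  (hypothetical sizes)
    d_model, d_ff = 8, 32
    rng = np.random.default_rng(0)
    W1 = rng.normal(size=(d_model, d_ff))
    W2 = rng.normal(size=(d_ff, d_model))
    x = rng.normal(size=(d_model,))

    # Dense version: every row of W2 is read, even where h is zero.
    h = np.maximum(x @ W1, 0.0)      # ReLU leaves many entries at exactly 0
    y_dense = h @ W2

    # Sparse version: only use the rows of W2 whose activation is nonzero.
    nz = np.nonzero(h)[0]            # indices of active neurons
    y_sparse = h[nz] @ W2[nz, :]     # the zero rows are skipped entirely

    assert np.allclose(y_dense, y_sparse)
    print(f"active neurons: {len(nz)}/{d_ff}")

The fraction of active neurons is what determines the saving: if only ~20% of the hidden units fire, only ~20% of W2's rows ever need to be touched for that token.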