It's a high-dimensional correlation machine. In other words, it's an attempt to learn "how to recognize patterns" by learning to represent each pattern as orthogonally as possible to the others. This happens at each layer, and how many layers you need depends on how "mixed up" the transformed space still is with respect to the labels after each layer's linear transformation and nonlinearity. Once the representations are suitably linearly separable at the end of the feedforward pass, you only need one more (linear) layer to map the pattern onto the output space.<p>Another way to think of it is that each layer learns a maximally efficient compression scheme for translating the data at the input layer into the data at the output layer. Each layer learns a high-dimensional representation that uses the minimum number of bits while preserving the maximum capacity to reconstruct the information relevant to the output. Naftali Tishby recently gave a great talk where he explains this in detail.[1]<p>Knowing the math is great for understanding <i>how</i> it works at a granular level, but I've found that explaining it in such holistic terms also serves a great purpose: it fits "linear algebra + calculus" into an understanding of NNs that is greater than the sum of its parts.<p>[1] <a href="https://www.youtube.com/watch?v=bLqJHjXihK8" rel="nofollow">https://www.youtube.com/watch?v=bLqJHjXihK8</a>
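<p>To make the "transform until linearly separable, then one linear readout" picture concrete, here is a minimal sketch using standard scikit-learn and NumPy calls. The toy circles dataset, the 8-unit tanh hidden layer, and the logistic-regression probe are my own illustrative choices, not anything taken from the talk:<p><pre><code>import numpy as np
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

# Concentric circles: two classes that are not linearly separable
# in the raw 2-D input space.
X, y = make_circles(n_samples=500, noise=0.05, factor=0.5, random_state=0)

# A purely linear classifier on the raw inputs sits near chance level.
print("linear on raw inputs  :", LogisticRegression().fit(X, y).score(X, y))

# One hidden tanh layer transforms the inputs into a new representation.
mlp = MLPClassifier(hidden_layer_sizes=(8,), activation="tanh",
                    max_iter=5000, random_state=0).fit(X, y)

# Recompute that hidden representation by hand, then fit the same
# *linear* classifier on top of it; this plays the role of the
# "one more layer" that maps the now-separable patterns to the output.
H = np.tanh(X @ mlp.coefs_[0] + mlp.intercepts_[0])
print("linear on hidden layer:", LogisticRegression().fit(H, y).score(H, y))
</code></pre><p>On this data the linear model on the raw inputs hovers near chance, while the same linear model on the hidden-layer activations is close to perfect, which is the "one more layer" step described above.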
Interesting perspective on speeding up neural networks: <a href="https://semiengineering.com/speeding-up-neural-networks/" rel="nofollow">https://semiengineering.com/speeding-up-neural-networks/</a>