I'm trying to explain, at the highest level, why (modern) backpropagation works better than earlier methods (such as Rosenblatt's backpropagation in 1958 - yes, it's that old). Without going into the calculus, I want to just look at the information side of things.<p>I want to say something like the following:<p>Back in the 1960s we tried backpropagation with binary neurons. So when we tuned the parameters backwards (from output to input), no magnitude information passed back through the net (only DIRECTION - i.e. turn this knob left or right). This is similar to how we can't reverse an operation in modular arithmetic. So it was a very 'coarse' training process due to a lack of information (it took forever).<p>When we moved to continuous, differentiable non-linear units (such as ReLU), there was now a direct relationship (or analog) between output and input magnitude, so when we passed backwards through the net we had MAGNITUDE and DIRECTION information (i.e. turn this knob to the right by x). That let us train the net much faster because information from every neuron touched every other neuron. Put simply, "we knew how much to turn them, and in what direction" during training.<p>Thoughts? What am I glossing over?<p>Thank you.
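To make the "direction only" vs. "direction + magnitude" contrast concrete, here's a tiny numpy sketch (my own toy example; the functions and numbers are made up for illustration): the step unit's derivative is zero almost everywhere, so nothing usable flows backwards through it, while a ReLU passes the upstream gradient's magnitude through wherever the unit is active.

```python
import numpy as np

# Toy comparison: what flows backward through a binary (step) unit
# versus a ReLU unit, for the same weighted inputs z.

def step_grad(z):
    # derivative of a threshold unit is 0 almost everywhere -> no magnitude info
    return np.zeros_like(z)

def relu_grad(z):
    # derivative is 1 where the unit is active, 0 otherwise
    return (z > 0).astype(float)

z = np.array([-1.5, 0.3, 2.0])              # pre-activations of three units
upstream = np.array([0.4, -0.7, 1.2])       # gradient arriving from the layer above

print("step backward:", upstream * step_grad(z))   # [0. 0. 0.] -> only crude sign-style updates possible
print("ReLU backward:", upstream * relu_grad(z))   # magnitude survives wherever the unit is active
```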
Don't underestimate the 50Mx improvement in computer performance since 1970.<p>Also, ReLUs help with the vanishing/exploding gradient problem, which allows the information to propagate without sending it into la la land.<p>CNNs helped because they don't have to calculate across a fully connected network.
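To put a rough number on the vanishing-gradient point, here's a toy numpy sketch (an assumed setup I made up: one active path through a deep stack, treating the backward signal as a product of per-layer activation derivatives). Each sigmoid derivative is at most 0.25, so the product collapses exponentially with depth, while an active ReLU path passes the signal at scale 1.

```python
import numpy as np

# Toy illustration: the backward signal along one path is (roughly) a product
# of per-layer activation derivatives.
rng = np.random.default_rng(0)
depth = 50
z = np.abs(rng.normal(size=depth))          # pre-activations along one active path

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
sig_factors = sigmoid(z) * (1.0 - sigmoid(z))   # each factor is at most 0.25
relu_factors = (z > 0).astype(float)            # 1.0 everywhere on an active path

print("sigmoid chain:", np.prod(sig_factors))   # astronomically small -> gradient has vanished
print("ReLU chain:   ", np.prod(relu_factors))  # 1.0 -> magnitude survives the trip back
```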