What's amazing to me is that, if I understand correctly, backprop still works. It is very odd that SGD on the error function for some training data is conceptually equivalent to teaching all the gates for each hidden feature when to open or close given the next input in a sequence.
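For what it's worth, here's a minimal sketch of that point (assuming PyTorch; all tensor names like W_f, x_t, h_prev are made up for illustration). The gate weights are just ordinary parameters, so backprop reaches them through the loss, and a plain SGD step is what "teaches" the gate when to open or close:

    import torch

    torch.manual_seed(0)
    input_size, hidden_size = 4, 3

    # Forget-gate parameters only (one of the LSTM's gates, for brevity).
    W_f = torch.randn(hidden_size, input_size + hidden_size, requires_grad=True)
    b_f = torch.zeros(hidden_size, requires_grad=True)

    x_t = torch.randn(input_size)        # current input
    h_prev = torch.zeros(hidden_size)    # previous hidden state
    c_prev = torch.randn(hidden_size)    # previous cell state

    # Forget gate: decides how much of the old cell state to keep.
    f_t = torch.sigmoid(W_f @ torch.cat([x_t, h_prev]) + b_f)
    c_t = f_t * c_prev                   # ignoring the input/output gates here

    # Toy error function on the new cell state.
    loss = ((c_t - torch.ones(hidden_size)) ** 2).mean()
    loss.backward()

    # Backprop gives a gradient for the gate weights, so one SGD step
    # nudges when this gate opens or closes for inputs like x_t.
    W_f.data -= 0.1 * W_f.grad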
colah, your posts combine deep understanding with exceptional clarity. Both are rare, especially in the cargo-cult-driven world of neural networks and deep learning.

I hope you keep writing as much as you can. Thanks!
Thanks colah, that was a very readable walk-through. I've been making my way through Bishop's PRML ch. 5 to get as much of a handle on NNs as possible, but your intro to LSTMs here makes me want to jump ahead to the new stuff :)