LSTMs are both amazing and not quite good enough. They seem too complicated for what they do well, and not quite complex enough for what they can't do so well. The main limitation is that they mix structure with style, or type with value. For example, if you teach an LSTM to add 6-digit numbers, it won't be able to generalize to 20-digit numbers.<p>That's because it doesn't factorize the input into separate meaningful parts. The next step for LSTMs will be to operate over relational graphs, so they only have to learn function rather than structure at the same time. That way they will generalize better across different situations and be much more useful.<p>Graphs can be represented as adjacency matrices and data as vectors. By multiplying a vector with an adjacency matrix, you can do graph computation. Recurrent graph computations are a lot like LSTMs. That's why I think LSTMs are going to become more invariant to permutation and object composition in the future, by using graph data representations instead of flat Euclidean vectors, and typed data instead of untyped data. So they are going to become strongly typed, graph RNNs. With such toys we could do visual and text-based reasoning, and physical simulation.
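The adjacency-matrix point above can be made concrete with a minimal numpy sketch (my own illustration, not from any particular paper): one round of graph computation is just `A @ X`, where each node sums the features of the nodes pointing at it.

```python
import numpy as np

def propagate(A, X):
    """One round of neighbor aggregation: node i receives the summed
    features of every node j with A[i, j] = 1."""
    return A @ X

# Tiny directed chain: 0 -> 1 -> 2
A = np.array([[0, 0, 0],
              [1, 0, 0],
              [0, 1, 0]], dtype=float)
X = np.array([[1.0], [0.0], [0.0]])  # only node 0 starts with a signal

X1 = propagate(A, X)   # signal arrives at node 1
X2 = propagate(A, X1)  # signal arrives at node 2
```

Iterating `propagate` is the "recurrent graph computation" the comment describes: the same update applied over and over, with the graph structure held in `A` and the values in `X`.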
I personally find recurrent highway networks (RHNs), as described in [1], easier to understand and remember the formulas for than the original LSTM. Since they are generalizations of the LSTM, if one understands RHNs, one can understand the LSTM as just a particular case of an RHN.<p>Instead of handwaving about "forgetting", it is IMO better to understand the problem of vanishing gradients and how forget gates actually help with them.<p>And Jürgen Schmidhuber, the inventor of the LSTM, is a co-author of the RHN paper.<p>[1] <a href="https://arxiv.org/abs/1607.03474" rel="nofollow">https://arxiv.org/abs/1607.03474</a>
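The vanishing-gradient point can be seen with a toy calculation (a deliberately simplified sketch, not the actual LSTM/RHN math): backpropagating through T timesteps multiplies T per-step factors, so factors below 1 shrink the gradient exponentially, while a forget gate saturated near 1 keeps the product alive.

```python
# Stand-in per-step factors for gradient flow through time.
T = 100
plain_factor = 0.5    # typical contraction in a plain tanh RNN (illustrative)
gate_factor = 0.999   # forget gate held near 1 on the LSTM cell path

plain_grad = plain_factor ** T   # collapses to ~8e-31: effectively zero
gated_grad = gate_factor ** T    # stays around 0.9: still trainable
```

This is the whole trick in one line: the additive cell path gated by a value near 1 turns an exponentially vanishing product into a nearly constant one.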
In the experiment on teaching an LSTM to count, it's useful to note that the examples it's trained on are derivations [1] from the grammar a^nb^n (with n > 0), a classic example of a Context-Free Grammar (CFG).<p>It's well understood that CFGs cannot be induced from positive examples alone [2], which accounts for the fact that LSTMs cannot learn "counting" in this manner, nor indeed can any other method that learns from examples.<p>_______________<p>[1] "Strings generated from"<p>[2] The same goes for any formal grammars other than finite ones, as in simpler than regular.
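To make the setup concrete, here is a small sketch (function names are my own) of what the a^nb^n training data looks like, and why membership requires counting rather than pattern matching:

```python
def anbn(n):
    """Generate the string a^n b^n, e.g. anbn(2) == 'aabb'."""
    return "a" * n + "b" * n

# The kind of short strings a model might be trained on...
train = [anbn(n) for n in range(1, 7)]
# ...versus a much longer string it would need to extrapolate to.
test = anbn(20)

def in_anbn(s):
    """Membership check: requires comparing counts, which no finite
    pattern over the training lengths can capture."""
    n = len(s) // 2
    return len(s) % 2 == 0 and n > 0 and s == "a" * n + "b" * n
```

A network trained only on the short strings can memorize length-specific patterns without ever acquiring the count-and-compare rule that `in_anbn` makes explicit, which is exactly the generalization failure the experiment exhibits.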
LSTMs are on their way out, in my opinion. They are a hack to make memory in recurrent networks more persistent. In practice they overfit too easily. They are being replaced with convolutional networks. Have a look at the latest paper from Facebook about translation for more details.
Really great work on visualizing neurons!<p>Is anyone working with LSTMs in a production setting? Any tips on what the biggest challenges are?<p>Jeremy Howard said in the fast.ai course that in applied settings, simpler GRUs work much better and have replaced LSTMs. Comments about this?
Is there code for the coloring of neurons per-character as in the post? I've seen that type of visualization on similar posts and am curious if there is a library for it. (the original char-rnn post [<a href="http://karpathy.github.io/2015/05/21/rnn-effectiveness/" rel="nofollow">http://karpathy.github.io/2015/05/21/rnn-effectiveness/</a>] indicates that it is custom Code/CSS/HTML)
Google Brain outperforms LSTMs with attention-based networks (the Transformer) in speed and accuracy, seeming to confirm LSTMs are not optimal for NLP at least:<p><a href="https://arxiv.org/pdf/1706.03762.pdf" rel="nofollow">https://arxiv.org/pdf/1706.03762.pdf</a>
Is the code for generating the reactions from the LSTM hidden units posted anywhere? That was the best part for me and I'd love to use it in my own projects.
LSTM is "Long Short-Term Memory," since the tutorial never mentions what it stands for.<p><a href="https://en.wikipedia.org/wiki/Long_short-term_memory" rel="nofollow">https://en.wikipedia.org/wiki/Long_short-term_memory</a>