Hey guys, I'm the writer. As you can see from the post, I'm still very much learning.

What I'd like most from this site is for more experienced people to help me out with a few questions:

- Can you use Batch Normalization (the tf.keras layer) on the output of an LSTM layer, or will it break the model?

- How do you deal with extremely infrequent words in a word-based LSTM (with a one-hot encoding of each word in the corpus)? Do you remove them? Replace them? Cluster them?

- Do you think any other architecture would have given better results, while still not taking too long to train?
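To make the first question concrete, here's roughly the kind of setup I mean (a sketch, not my actual code — the vocab size and layer widths are made up). I've read that LayerNormalization is often preferred inside recurrent models, but I'm specifically curious about BatchNormalization between stacked LSTMs:

```python
import tensorflow as tf

# Hypothetical sizes for illustration only.
VOCAB_SIZE = 10000

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=VOCAB_SIZE, output_dim=128),
    tf.keras.layers.LSTM(64, return_sequences=True),
    # Normalizes over the feature axis at every timestep -- is this safe here?
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(VOCAB_SIZE, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```

This builds and trains without errors for me; my question is whether the batch statistics interact badly with the recurrent state in practice.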
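For the rare-word question, the naive "replace them" approach I'm considering looks something like this (just a sketch — the threshold and the `<unk>` token name are arbitrary choices on my part):

```python
from collections import Counter

def replace_rare_words(tokens, min_count=2, unk="<unk>"):
    """Replace words seen fewer than min_count times with a single unknown token."""
    counts = Counter(tokens)
    return [t if counts[t] >= min_count else unk for t in tokens]

corpus = "the cat sat on the mat the cat ran".split()
print(replace_rare_words(corpus, min_count=2))
# → ['the', 'cat', '<unk>', '<unk>', 'the', '<unk>', 'the', 'cat', '<unk>']
```

The upside is a much smaller one-hot vocabulary; the downside is that the model can never predict those words. Is this what people actually do, or is clustering/subword splitting the better move?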