MetaMind papers are always pretty awesome. A couple of great highlights from this paper:

The visual saliency maps in Figure 6 are astounding, and they make the model's performance on this task even more impressive: they give a lot of insight into what the model is doing, and it seems to focus on the same things in the image that a regular person would use to decide on the answer. Most striking was the question "is this in the wild?", where the saliency landed on the artificial, human-made structures in the background that indicated the animal was in a zoo. That kind of reasoning is surprising because it takes a bit of a reversal in logic to arrive at this way of answering the question. Super impressive.

The proposed input fusion layer is pretty cool: it lets information from future sentences condition how earlier sentences are processed. This kind of information combining hadn't really been explored before, and it makes sense that it improves performance on the bAbI-10k task so much, since back-tracking is an important tool in human reading comprehension. It's also clever that they encode each sentence as a thought vector before combining them, so the sequence can be processed both forwards and backwards with shared parameters; doing the same over one-hot words or even ordered word embeddings would need two very different parameter sets, since grammar changes wildly when a sentence is reversed. (There's a rough sketch of the idea at the end of this comment.)

Lastly, on a side note: if 2014 was the year of CNNs and 2015 the year of RNNs, it looks like 2016 is the year of Neural Attention Mechanisms. Excited to see what new layers are explored in 2016 that will dominate 2017.
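Since the fusion idea is simple to state, here's a rough sketch of what that bidirectional pass over sentence vectors might look like in PyTorch. This is just my own toy illustration of the mechanism, not the authors' code; the class name, shapes, and dimensions are made up.

    import torch
    import torch.nn as nn

    class InputFusion(nn.Module):
        """Toy bidirectional fusion over per-sentence 'thought vectors'."""
        def __init__(self, hidden_dim):
            super().__init__()
            # A bidirectional GRU run over the *sentence* sequence, so each
            # sentence representation sees the sentences before and after it.
            self.gru = nn.GRU(hidden_dim, hidden_dim,
                              bidirectional=True, batch_first=True)

        def forward(self, sentence_vectors):
            # sentence_vectors: (batch, num_sentences, hidden_dim),
            # i.e. each sentence already collapsed to a single vector.
            out, _ = self.gru(sentence_vectors)   # (batch, num_sentences, 2*hidden_dim)
            fwd, bwd = out.chunk(2, dim=-1)       # split forward / backward directions
            return fwd + bwd                      # one fused "fact" per sentence

    # e.g. 2 stories, 5 sentences each, 64-dim sentence vectors
    facts = InputFusion(64)(torch.randn(2, 5, 64))   # -> (2, 5, 64)

The key point is that the recurrence runs over sentence vectors rather than words, so reading the sequence in reverse is perfectly natural; there's no reversed grammar to learn.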