I'm relatively new to machine learning, but here's my best attempt to summarize what's going on in layman's terms. Please correct me if I'm wrong.<p>- Encode the words in the source (aka embedding, section 3.1)<p>- Feed every run of k words into a convolutional layer to produce an output, and repeat this process 6 layers deep (section 3.2).<p>- Decide which input word is most important for the "current" output word (aka attention, section 3.3).<p>- Decode the most important word into the target language (section 3.1 again).<p>You repeat this process with every word as the "current" word. The critical insight of using this mechanism over an RNN is that you can do the repetition in parallel, because each "current" word does not depend on any of the previous ones.<p>Am I on the right track?
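The steps above can be sketched in toy code. This is a pure-Python illustration of the shape of the pipeline (embed, stack convolutions over k-word windows, attend per output position), not the paper's actual model — the tables, dimensions, and averaging "convolution" here are all made up for clarity:

```python
import math

def embed(words, table):
    """Section 3.1: look up a vector for each source word (toy lookup table)."""
    return [table[w] for w in words]

def conv_layer(vectors, k=3):
    """One 'convolutional layer': each output position mixes a k-wide window.
    The real model uses learned filters plus gated linear units; this just
    averages the window, which is enough to show the data flow."""
    pad = k // 2
    padded = [vectors[0]] * pad + vectors + [vectors[-1]] * pad
    dim = len(vectors[0])
    return [[sum(v[d] for v in padded[i:i + k]) / k for d in range(dim)]
            for i in range(len(vectors))]

def attention(query, keys):
    """Section 3.3: softmax of dot-product scores between the current target
    state (query) and every encoded source position (keys)."""
    scores = [sum(q * kk for q, kk in zip(query, key)) for key in keys]
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

table = {"the": [1.0, 0.0], "cat": [0.0, 1.0], "sat": [0.5, 0.5]}
enc = embed(["the", "cat", "sat"], table)
for _ in range(2):            # stack layers (the paper stacks many more)
    enc = conv_layer(enc)
weights = attention([0.0, 1.0], enc)
print(max(range(len(weights)), key=weights.__getitem__))  # most-attended source position
```

Note that every call above works on whole positions at once and nothing feeds an earlier output into a later step, which is the parallelism point made in the comment.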
> Facebook's mission of making the world more open<p>That's a rather strong statement, for a company that has become one of the world's most complained-about black boxes.<p>But yes, they have done a lot of good in the computer science space.
As far as I understood it, Facebook put a lot of research into optimizing a certain type of neural network (the CNN), while everyone else is using another type called the RNN. Up until now, CNNs were faster but less accurate. However, FB has advanced CNNs to the point where they can compete on accuracy, in this case for machine translation. And most importantly, they are releasing the source code and papers. Does that sound right?<p>Can anyone else give us an ELI5?
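The speed difference comes from dependency structure, and a toy sketch makes it concrete (these are not real models, just two functions with the same input): an RNN's state at each step depends on the previous step's state, so the positions must be computed in order, while a convolution's output at each position depends only on a fixed window of the *input*, so every position can be computed independently, i.e. in parallel.

```python
def rnn_outputs(xs):
    """Inherently sequential: h at step i depends on h at step i-1."""
    h, out = 0.0, []
    for x in xs:
        h = 0.5 * h + x       # must wait for the previous h
        out.append(h)
    return out

def conv_outputs(xs, k=2):
    """Each position reads only a window of the raw input,
    so all positions could run at the same time."""
    return [sum(xs[max(0, i - k + 1):i + 1]) for i in range(len(xs))]

print(rnn_outputs([1.0, 1.0, 1.0]))   # [1.0, 1.5, 1.75]
print(conv_outputs([1.0, 1.0, 1.0]))  # [1.0, 2.0, 2.0]
```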
In this work, Convolutional Neural Nets (spatial models with a weakly ordered context, as opposed to Recurrent Neural Nets, which are sequential models with a strongly ordered context) are demonstrated to achieve state-of-the-art results in machine translation.<p>It seems the combination of gated linear units, residual connections, and attention was the key to bringing this architecture to state of the art.<p>It's worth noting that the QRNN and ByteNet architectures have previously used Convolutional Neural Nets for machine translation as well. IIRC, those models performed well on small tasks but were not able to best SotA performance on larger benchmark tasks.<p>I believe it is almost always more desirable to encode a sequence using a CNN if possible, as many of the operations are embarrassingly parallel!<p>The BLEU scores in this work were the following:<p>Task (previous best): this work<p>WMT’16 English-Romanian (28.1): 29.88
WMT’14 English-German (24.61): 25.16
WMT’14 English-French (39.92): 40.46
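Two of the ingredients named above are simple enough to sketch in a few lines. This is a minimal pure-Python illustration of a gated linear unit and a residual connection operating on plain lists; in the paper these operate on the output channels of learned convolutions, which are omitted here:

```python
import math

def glu(values, gates):
    """Gated linear unit: elementwise values * sigmoid(gates).
    In the paper, the two arguments are the two halves of a conv
    layer's output channels; here they're just plain lists."""
    return [v * (1.0 / (1.0 + math.exp(-g))) for v, g in zip(values, gates)]

def residual(layer_input, layer_output):
    """Residual connection: add the layer's input back onto its output,
    which makes deep stacks of layers easier to train."""
    return [x + y for x, y in zip(layer_input, layer_output)]

h = [1.0, 2.0]
h = residual(h, glu([0.3, -0.1], [0.0, 0.0]))  # one gated block, conv omitted
print(h)  # [1.15, 1.95]
```

The gate lets the network decide per-dimension how much of each value to pass through, and the residual path keeps gradients flowing through a deep stack.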
This smells of "we built custom silicon to do fast image processing using CNNs and fully connected networks, and now we want to use that same silicon for translations."
I wonder if they can combine this with ByteNet (dilated convolutions in place of vanilla convs), which gives you a larger field of view (receptive field); add in attention, and then you probably have a new SOTA.
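The receptive-field advantage of dilation is easy to quantify. A standard result for stacked 1-D convolutions is that the receptive field is 1 + (k - 1) * sum(dilations); this toy calculation contrasts six vanilla layers with six ByteNet-style layers whose dilation doubles each layer:

```python
def receptive_field(kernel_size, dilations):
    """Receptive field of stacked 1-D convolutions with a dilation per layer."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf

# Vanilla convs: dilation 1 everywhere -> linear growth in depth.
print(receptive_field(3, [1] * 6))               # 13
# Dilations doubling per layer -> exponential growth in depth.
print(receptive_field(3, [1, 2, 4, 8, 16, 32]))  # 127
```

Same depth and kernel size, roughly ten times the context per output position, which is why dilation is attractive for long sequences.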
TLDR: Cutting-edge accuracy, nine times faster than the previous state of the art, published models and source code.<p>But go read the article: there are nice animated diagrams in there.