I'm not sure "we scraped ELI5"[1] is really such a substantive advancement of the state of the art that it deserves such a large write-up. The Stanford Question Answering Dataset (SQuAD) is much more carefully curated.[2]<p>ROUGE[3] and BLEU[4] are pretty meaningful metrics for translations and for fairly short answers that can really only be phrased one way. For example, "What is the biggest mammal?" should be answered "The Blue Whale." There is little room for ambiguity: the words "Blue" and "Whale" <i>must</i> appear, as must the bigram "Blue Whale", for the answer to be correct. For a long or complex answer, the situation is different. Metrics based on word overlap like ROUGE and BLEU must either incentivize memorizing the answer given in the training set (overfitting) or inappropriately penalize semantically equivalent answers. Suppose, for the question "why is the sky blue?", the algorithm produces "The sky is blue because Rayleigh scattering off of air molecules preferentially scatters blue light at right angles. This is also why sunsets are red." while the answer on file is "light with long wavelengths passes straight through moist air, while light with short wavelengths tends to be deflected." Both answers are correct - indeed they are basically the same answer - yet they share so few words, bigrams, and trigrams that they would have to be marked "wrong."<p>[1]: <a href="https://www.reddit.com/r/explainlikeimfive/" rel="nofollow">https://www.reddit.com/r/explainlikeimfive/</a><p>[2]: <a href="https://rajpurkar.github.io/SQuAD-explorer/" rel="nofollow">https://rajpurkar.github.io/SQuAD-explorer/</a><p>[3]: <a href="https://en.wikipedia.org/wiki/ROUGE_(metric)" rel="nofollow">https://en.wikipedia.org/wiki/ROUGE_(metric)</a><p>[4]: <a href="https://en.wikipedia.org/wiki/BLEU" rel="nofollow">https://en.wikipedia.org/wiki/BLEU</a>
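<p>To make the overlap point concrete, here's a minimal sketch that computes raw n-gram set overlap between abridged versions of the two example answers. This is not the actual ROUGE or BLEU scoring (those add clipping, recall/precision weighting, brevity penalties, etc.), just an illustration of how little two equivalent paraphrases can share:

```python
# Raw n-gram overlap between two semantically equivalent answers to
# "why is the sky blue?". Illustrative only -- not real ROUGE/BLEU.

def ngrams(text, n):
    """Set of n-grams (as word tuples) in a lowercased text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap(candidate, reference, n):
    """Fraction of the candidate's n-grams that also occur in the reference."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    return len(cand & ref) / len(cand) if cand else 0.0

candidate = ("The sky is blue because Rayleigh scattering "
             "preferentially scatters blue light at right angles")
reference = ("Light with long wavelengths passes straight through moist air "
             "while light with short wavelengths tends to be deflected")

print(round(overlap(candidate, reference, 1), 2))  # → 0.08 (only "light" is shared)
print(round(overlap(candidate, reference, 2), 2))  # → 0.0  (no shared bigrams)
```

Any metric built on these counts has to score the candidate near zero, even though a human grader would accept it.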