I am always annoyed by claims in supervised learning that a machine predictor is better than humans. Humans are obviously the ones who labeled the dataset to begin with. If you read the paper, it goes on to say, regarding human evaluation:<p>> Mismatch occurs mostly due to inclusion/exclusion of non-essential phrases (e.g., monsoon trough versus movement of the monsoon trough) rather than fundamental disagreements about the answer.
<p>I don't think I would call that "error" so much as ambiguity. In other words, there's more than one acceptable answer to these questions under the stated criteria -- English isn't a formal grammar where there's always one and only one answer. For instance, here's one of the questions from the ABC Wikipedia page:<p>> What kind of network was ABC when it first began?<p>> Ground Truth Answers: "radio network", "radio", "radio network"<p>> Prediction: October 12, 1943<p>Because the second human said "radio" instead of "radio network," I believe this counts as a human miss, even though the answer is factually correct. Meanwhile, the prediction from the Stanford logistic regression baseline (not the more sophisticated Alibaba model in the article, whose results I don't think are published at this level of detail) is completely wrong -- no human could make that mistake. And yet the EM metric treats these as equally flawed answers.<p>Yet this gets headlined as "defeats humans," not "learns to mimic human responses well."
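<p>For anyone curious why "radio" and "October 12, 1943" end up with the same score, here's a minimal Python sketch of a SQuAD-style exact-match (EM) check. The normalization (lowercasing, dropping punctuation and articles) roughly follows what the official evaluation script does; the function names and the two-reference setup are my own illustration, not the actual leaderboard scoring code.<p>
    import re
    import string

    def normalize_answer(s):
        # Lowercase, drop punctuation and articles, collapse whitespace (SQuAD-style).
        s = "".join(ch for ch in s.lower() if ch not in string.punctuation)
        s = re.sub(r"\b(a|an|the)\b", " ", s)
        return " ".join(s.split())

    def exact_match(prediction, ground_truths):
        # EM gives credit only if the prediction equals *some* reference verbatim
        # after normalization; anything else scores zero, however close it is.
        return any(normalize_answer(prediction) == normalize_answer(gt)
                   for gt in ground_truths)

    refs = ["radio network", "radio network"]      # the other annotators' answers
    print(exact_match("radio", refs))              # False: factually right, still a "miss"
    print(exact_match("October 12, 1943", refs))   # False: flat wrong, same zero score
<p>Under a metric like this, a factually correct but shorter phrasing and an outright nonsense span both contribute the same zero, which is exactly the asymmetry in the ABC example above.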