This is not quite human-level question-answering in the everyday sense of those words. The ZDNet headline is too clickbaity for my taste.<p>The answer to every question in the test is a preexisting snippet of text, or "span," from a corresponding reading passage shown to the model. The model only has to select the span in the passage -- i.e., the sequence of words already in the text -- that best answers the question.[a]<p>Actual current results:<p><a href="https://rajpurkar.github.io/SQuAD-explorer/" rel="nofollow">https://rajpurkar.github.io/SQuAD-explorer/</a><p>Paper describing the dataset and test:<p><a href="https://arxiv.org/abs/1606.05250" rel="nofollow">https://arxiv.org/abs/1606.05250</a><p>[a] If this explanation isn't entirely clear, it may help to think of the problem as a challenging classification task in which the number of possible classes for each question equals the number of possible spans in the corresponding reading passage.
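To make the footnote concrete, here's a minimal sketch of the span-selection framing. The scoring function and span-length cap are stand-ins for whatever a real system would use; this is illustrative only, not any particular model's method:

```python
def select_answer(passage_tokens, question, score_fn, max_len=15):
    """Return the passage span that score_fn likes best for the question.

    score_fn(question, span_tokens) -> float is a placeholder for
    whatever ranks candidate answers (logistic regression over
    features, a neural reader, etc.).
    """
    best, best_score = None, float("-inf")
    n = len(passage_tokens)
    for start in range(n):
        # Enumerate every contiguous span starting here, up to max_len tokens.
        for end in range(start + 1, min(start + 1 + max_len, n + 1)):
            score = score_fn(question, passage_tokens[start:end])
            if score > best_score:
                best, best_score = (start, end), score
    return " ".join(passage_tokens[best[0]:best[1]])
```

The number of candidate "classes" is the number of spans -- roughly passage length times max_len -- which is the classification framing in the footnote.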
Great result. At my job I manage a machine learning team, so I am pretty much all-in on deep learning to solve practical problems.<p>That said, I think the path to 'real' AGI lies in some combination of DL, probabilistic graphical models, symbolic systems, and something we have not even imagined yet. BTW, Judea Pearl just released a good paper on the limitations of DL: <a href="https://arxiv.org/abs/1801.04016" rel="nofollow">https://arxiv.org/abs/1801.04016</a>
It would be interesting to know how well some of the entries on the SQuAD leaderboard do on the Winograd Schema Challenge, a pronoun-disambiguation test designed to require commonsense reasoning (<a href="https://cs.nyu.edu/faculty/davise/papers/WinogradSchemas/WS.html" rel="nofollow">https://cs.nyu.edu/faculty/davise/papers/WinogradSchemas/WS....</a>). Does anyone know if any of these systems have been tested on it as well?
I am always annoyed by claims in supervised learning that a machine predictor is better than humans. Humans are the ones who scored the dataset to begin with. Regarding human evaluation, the paper itself says:<p>> Mismatch occurs mostly due to inclusion/exclusion of non-essential phrases (e.g., monsoon trough versus movement of the monsoon trough) rather than fundamental disagreements about the answer.<p>I wouldn't call that "error" so much as ambiguity. In other words, there's more than one possible answer to these questions under these criteria -- English isn't a formal grammar where there's always one and only one answer. For instance, here's one of the questions from the ABC Wikipedia page:<p>> What kind of network was ABC when it first began?<p>> Ground Truth Answers: "radio network"; "radio"; "radio network"<p>> Prediction: October 12, 1943<p>Because the second human said "radio" instead of "radio network," I believe this counts as a human miss. But the answer is factually correct. Meanwhile, the prediction from the Stanford logistic regression (not the more sophisticated Alibaba model in the article, for which I don't think results are published at this level of detail) is completely wrong -- no human could make that mistake. Yet the EM metric treats these as equally flawed answers.<p>And this gets headlined as "defeats humans," not "learns to mimic human responses well."
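For anyone wondering why "radio" gets zero credit, here's a rough sketch of exact-match scoring, loosely following the normalization in SQuAD's official evaluation script (lowercase, strip punctuation and articles). It assumes the human-eval setup where one annotator's answer is scored against the other annotators' answers; treat the details as approximate:

```python
import re
import string

def normalize(answer):
    """Approximate SQuAD answer normalization: lowercase, drop
    punctuation, drop articles, collapse whitespace."""
    answer = answer.lower()
    answer = "".join(ch for ch in answer if ch not in string.punctuation)
    answer = re.sub(r"\b(a|an|the)\b", " ", answer)
    return " ".join(answer.split())

def exact_match(prediction, gold_answers):
    """EM is all-or-nothing: credit only for an exact (normalized)
    match against at least one gold answer."""
    return any(normalize(prediction) == normalize(g) for g in gold_answers)

gold = ["radio network", "radio network"]      # the other annotators
print(exact_match("radio", gold))              # False: the human "miss"
print(exact_match("October 12, 1943", gold))   # False: the model's wrong date
# EM scores both as 0, which is exactly the complaint above.
```

(The benchmark also reports a token-level F1 that gives "radio" partial credit, but the headline "beats humans" number here is EM.)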
How well do these do on Winograd challenges?<p><a href="https://aaai.org/Conferences/AAAI-18/aaai18winograd/" rel="nofollow">https://aaai.org/Conferences/AAAI-18/aaai18winograd/</a>
This is clickbait. Unless models are robust to adversarial examples in SQuAD, such as those described here: <a href="https://arxiv.org/abs/1707.07328" rel="nofollow">https://arxiv.org/abs/1707.07328</a>, doing really well on SQuAD doesn't mean much.
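For reference, the attack in that paper (AddSent) just appends a distractor sentence that mimics the question's surface form without answering it. A toy version, paraphrasing the paper's running example -- the distractor here is hand-written, not produced by their generation pipeline:

```python
passage = (
    "Peyton Manning became the first quarterback ever to lead two "
    "different teams to multiple Super Bowls. He is also the oldest "
    "quarterback ever to play in a Super Bowl at age 39. The past record "
    "was held by John Elway, who led the Broncos to victory in Super "
    "Bowl XXXIII at age 38."
)
question = "What is the name of the quarterback who was 38 in Super Bowl XXXIII?"

# The distractor shares many words with the question but answers
# nothing; it only needs to be appended to the passage.
distractor = "Quarterback Jeff Dean had jersey number 37 in Champ Bowl XXXIV."
adversarial_passage = passage + " " + distractor
# Span-selection models that pattern-match question words against the
# passage often switch their answer from "John Elway" to "Jeff Dean,"
# even though the original text is untouched.
```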
At NIPS 2017 there was a system that beat humans in a college QuizBowl competition. In many ways I think that was more impressive than excellent performance on SQuAD.
Kudos to my colleagues. The iDST team is based in Bellevue, WA, and is hiring more people. Let me know if you're interested.<p>Also, Alibaba Cloud is looking for engineers. Please see <a href="https://careers.alibaba.com/positionDetail.htm?positionId=b7kSeJ8J2XQ3ynkotvAhPw%3D%3D" rel="nofollow">https://careers.alibaba.com/positionDetail.htm?positionId=b7...</a>
@syllogism, have you thought about a demo combining spaCy + ____ to tackle SQuAD (<a href="https://rajpurkar.github.io/SQuAD-explorer/" rel="nofollow">https://rajpurkar.github.io/SQuAD-explorer/</a>)?
A counterpoint from Yoav Goldberg:<p><a href="http://u.cs.biu.ac.il/~yogo/squad-vs-human.pdf" rel="nofollow">http://u.cs.biu.ac.il/~yogo/squad-vs-human.pdf</a>
Real link:<p><a href="http://www.zdnet.com/article/alibaba-neural-network-defeats-human-in-global-reading-test/" rel="nofollow">http://www.zdnet.com/article/alibaba-neural-network-defeats-...</a>