This is such a weird bit of research to me. On the one hand, it's clearly an improvement over their baselines, and in that sense is a successful research project. Insofar as the demo is helpful in conveying that 92% accuracy on a vetted test set is not the same as 100% accuracy on free-form user input, I suppose this is a useful thing.

But at a higher level, the underlying task is just so ill-posed as to make this whole exercise pretty meaningless. Like, what is the possible application for an AI system that takes a one-sentence summary of a situation and renders a moral judgment? Even if it were 100% accurate on the test set, what would that even mean? Why is matching crowdsourced moral judgments a valuable goal?

It seems like the valuable insights from this research are really about the general task of integrating common-sense reasoning into inference, and would have been better demonstrated on a less fraught task.