The summaries of the case studies in Section 2 were disturbing, to say the least. I sometimes think about the NIPS consistency experiment from a few years ago, which I think also speaks very poorly of the current state of academic ML research. I agreed with all of the authors' suggestions for cleaning up this mess, and I think many of them need to be implemented urgently.
The suggestions for empirical evaluation are great, but they would have to be insisted upon by reviewers, who are often chosen by the authors and complicit in promoting the larger methodology at stake. I’ve seen this happen in methodology publications several times.

It would make sense for companies to do this evaluation, though, and an ML PaaS that automates those empirical evaluations across a range of methods might offer a uniquely useful service.