An early version of http://arxiv.org/abs/1608.07249v5 appeared here a month ago. Judging from the changelog (Section 7.1), the authors have fixed many issues, and the results differ quite a bit from the original version. On the high-end GPU (GTX 1080), it seems that CNTK is the best for LSTMs and fully connected nets, Torch is the best for ResNet-50, and Caffe is the best for AlexNet.
Glad to see this work was updated so that the comparisons are now equivalent. I would like to see their next paper cover multi-GPU. CNTK would likely do quite well there too. Note: I work at MSFT.