From first looks, there is little doubt that NVIDIA's Volta architecture is a monster and will revolutionize the AI and HPC market. But the article seems to avoid quantifying how 16-bit FP operations are beneficial compared to 32- or 64-bit FP operations in real-world use cases, or how the Caffe2 / NVIDIA architecture provides any significant boost to FP16 in particular, especially for image data (or why FP16 is better suited to image workloads in general).<p>I'm more interested in understanding why Caffe2 would outperform Theano, TensorFlow, MXNet, etc. once Volta chipsets are generally available, beyond early pre-release optimization, particularly when most of the front-runners already leverage NCCL, cuDNN, NVLink, etc. When the burden of adding support for new NVIDIA primitives is so low, what gives Caffe2 an advantage beyond an ephemeral "we partnered with NVIDIA first" one-up that would last for a couple of months at most?<p>(Apologies in advance if this post sounds overly negative, but I am constantly evaluating the current crop of frameworks for the trade-offs they impose on the problem space, and a definitive answer would be very helpful.)
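<p>For what it's worth, the most easily quantified FP16 benefit is the storage/bandwidth side rather than raw FLOPS: half-precision halves the bytes per value, so batches, activations, and weights move through memory twice as fast for the same bus. A minimal sketch of that (NumPy stand-in for a framework tensor; the batch shape is just an illustrative ImageNet-style example, not from the article):

```python
# Illustrative only: FP16 halves memory per value, so an image batch
# takes half the storage/bandwidth compared to FP32.
import numpy as np

# Hypothetical batch: 64 images, 3 channels, 224x224 (ImageNet-style shape)
batch_fp32 = np.random.rand(64, 3, 224, 224).astype(np.float32)
batch_fp16 = batch_fp32.astype(np.float16)

ratio = batch_fp32.nbytes // batch_fp16.nbytes
print(ratio)  # -> 2: FP16 needs half the bytes of FP32
```

Whether that translates into a 2x real-world speedup (and whether 10-bit mantissas are precise enough for a given model) is exactly the kind of measurement I'd like to see.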