Disclosure: I work on Google Cloud.

While not perfect, I want to commend the RiseML folks for doing not only a “just out of the box” run in both regular and fp16 mode (for the V100), but also adding their own LSTM experiment to the mix. We need third-party benchmarks whenever new hardware or software is being sold by vendors (reminder: I benefit from you buying Google Cloud!).

I hope the authors are able to collect some of the feedback here and update their benchmark and blog post. The question about batch-size comparisons is probably the most direct, but like others, I’d encourage a run on 1, 2, 4, and 8 V100s as well.
Google claims 29x better performance-per-Watt with TPUs than contemporary GPUs [0]. Interesting to contrast that with the images-per-$ figure in this post, which is more like 2x.

I assume there's a high capital cost for this new hardware, but when they scale it up I wonder whether the ratio of TPU cost to GPU cost will trend towards the ratio of performance-per-Watt between the platforms? Seems like a natural limit, even if it never quite gets there.

[0] https://cloud.google.com/blog/big-data/2017/05/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu
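A quick back-of-the-envelope on how those two ratios relate; every number below is a made-up placeholder (not a measurement from the post or from Google), just to show that pricing can absorb most of a perf/Watt advantage:

  # Illustrative only: hypothetical throughput, power and price figures.
  tpu_images_per_sec, tpu_watts, tpu_dollars_per_hour = 3000.0, 250.0, 6.50
  gpu_images_per_sec, gpu_watts, gpu_dollars_per_hour = 2000.0, 300.0, 7.00

  perf_per_watt_ratio = (tpu_images_per_sec / tpu_watts) / (gpu_images_per_sec / gpu_watts)
  perf_per_dollar_ratio = (tpu_images_per_sec / tpu_dollars_per_hour) / (gpu_images_per_sec / gpu_dollars_per_hour)

  # If the TPU is priced close to the competition, perf/$ ends up a much
  # smaller ratio than perf/Watt, which is roughly what the post observes.
  print(perf_per_watt_ratio, perf_per_dollar_ratio)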
[Edited] The top-line results focus on comparing four TPUs in a rack node (which marketing cleverly named “one cloud TPU”), running ~16-bit mixed precision, to one GPU (out of 8 in a rack node), also capable of 16-bit or mixed precision but handicapped to 32-bit IEEE 754. That is a misleading comparison. Images/$ are obviously more directly comparable, but again the emphasized comparisons are at different precisions, and the very different batch sizes make this more misleading still. Images/$ also only tells us that Google has looked at the competition and set a competitive price; the per-die or per-package comparison is much more relevant for understanding any intrinsic architectural advantage, since these are all large dies on roughly comparable process nodes.
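A minimal sketch of the normalization being argued for here, dividing reported throughput by the number of dies before comparing; the figures are hypothetical placeholders, not numbers from the post:

  # Hypothetical throughputs (images/sec); substitute the post's measurements.
  cloud_tpu_throughput = 3200.0    # one "Cloud TPU" = 4 TPUv2 dies
  single_v100_throughput = 700.0   # one V100 die

  def per_die(images_per_sec, num_dies):
      return images_per_sec / num_dies

  print("TPUv2, per die:", per_die(cloud_tpu_throughput, 4))
  print("V100, per die: ", per_die(single_v100_throughput, 1))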
The bar graph seems a little wacky. It groups the TPU (which can only do FP16) with the FP32 results from the GPUs, then puts the FP16 GPU results off to the side, even though that's much closer to what the TPU is doing.

Impressive results regardless, though; quite a bit faster relative to the V100 than the paper specs would suggest.
Just to clarify, is this benchmark leveraging mixed-precision mode on the Volta V100? The major innovation of the Volta generation is mixed precision, which NVIDIA claims is a huge performance increase over the Pascal generation (the P100 in the case of your benchmark).

Link to NVIDIA documentation on mixed-precision TensorCores: https://devblogs.nvidia.com/inside-volta/
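For reference, a minimal sketch of the general recipe (TF 1.x-era API, not the benchmark's actual code): do the matmuls in fp16 so they can map onto TensorCores, while keeping master weights and the loss in fp32; full mixed-precision training would also add loss scaling.

  import tensorflow as tf  # TF 1.x-era sketch, not the benchmark code

  x = tf.placeholder(tf.float32, [None, 1024])
  w = tf.get_variable("w", [1024, 1024], dtype=tf.float32)  # fp32 master weights

  # Compute in fp16 so the matmul is eligible for TensorCores on Volta.
  y_fp16 = tf.matmul(tf.cast(x, tf.float16), tf.cast(w, tf.float16))
  y = tf.cast(y_fp16, tf.float32)  # loss and optimizer math stay in fp32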
Specialization brings speedups.

TPUv2 is specifically optimized for deep learning.

Nvidia's Volta microarchitecture is a graphics processor with additional tensor units. It's a general-purpose (GPGPU) chip designed with graphics and other scientific computing tasks in mind. Nvidia has enjoyed monopoly power in the market, and a single microarchitecture has been enough in every high-performance category.

The next logical step for Nvidia is to develop a specialized deep learning TPU to compete with TPUv2 and others.
The entire idea that people are going to gain some huge advantage over Nvidia with hardware softmax seems dubious. I do think it will buy them some time, but eventually it seems as though Nvidia will win this one.
I'd be interested in how the superior perf/Watt claim holds up in Google's practical setup. The additional networking gear, power-supply losses, and so on might make the difference smaller.

I'm also not sure how we can take Google's word for the numbers, since they might well be eating a less-than-ideal power cost to promote their platform. Any upfront cost will probably be offset by locked-in customers later on.

I might just be a bit cynical, though.
IIRC, TPUv2 uses a 16-bit floating-point format with higher dynamic range and lower precision than standard fp16. Can someone confirm?

If that is right, is the "TensorFlow-optimized" ResNet-50 using 16-bit floats when running on TPUv2?
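The format being described sounds like bfloat16: 1 sign bit, 8 exponent bits (the same range as fp32), 7 mantissa bits, versus fp16's 1/5/10 split. Assuming that's what's meant, a quick sketch of the truncation view of it (bfloat16 is essentially an fp32 with the low 16 mantissa bits dropped):

  import struct

  def to_bfloat16_bits(x):
      # Take the top 16 bits of the fp32 encoding (truncation; real hardware rounds).
      fp32_bits, = struct.unpack(">I", struct.pack(">f", x))
      return fp32_bits >> 16

  def from_bfloat16_bits(b):
      return struct.unpack(">f", struct.pack(">I", b << 16))[0]

  print(from_bfloat16_bits(to_bfloat16_bits(3.14159)))  # ~3.14, reduced precision
  print(from_bfloat16_bits(to_bfloat16_bits(1e38)))     # still finite; fp16 tops out around 65504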
> In order to efficiently use TPUs, your code should build on the high-level Estimator abstraction.<p>Does this mean it's inference-only? (I only quickly scanned the article)
I wonder if Chinese companies will use (or be allowed to use) TPUs. It seems like a pretty obvious way to have the NSA scoop up any AI advancements China may want to keep secret.
It is hard for Google to make money on these TPUs, as the whole engineering cost has to be made back from its pricing on Google Cloud, whereas NVIDIA can pay back its engineering costs via multiple mature channels (games, supercomputers, and multiple cloud providers).

I wonder which is higher: the cost of creating the TPUs in terms of engineering and manufacturing, or the cost differential in terms of usage compared to NVIDIA's latest?

I worry about Google long term here. I am surprised the TPU doesn't kick the ass of the NVIDIA chips.