Thanks for sharing, very insightful. Guess the TPUs are the real deal: about half the cost for similar performance.<p>I would assume Google is able to do that because of the lower power requirements.<p>I am actually more curious to get a paper on the new speech NN Google is using. It's supposed to push 16k samples a second through a NN; it's hard to imagine how they did that and were able to roll it out, as you would think the cost would be prohibitive.<p>You are ultimately competing with a much less compute-heavy solution.<p><a href="https://cloudplatform.googleblog.com/2018/03/introducing-Cloud-Text-to-Speech-powered-by-Deepmind-WaveNet-technology.html" rel="nofollow">https://cloudplatform.googleblog.com/2018/03/introducing-Clo...</a><p>I suspect this was only possible because of the TPUs.<p>Can't think of anything else where controlling the entire stack, including the silicon, would be more important than AI applications.
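Just to put a number on why that sample rate sounds so expensive, here is the rough time budget if you assume the naive case of one network evaluation per output sample (an illustration only, not a description of Google's production setup):

    # Time budget implied by 16k samples/sec when each sample needs
    # its own network pass (naive autoregressive generation).
    sample_rate = 16000            # samples per second
    budget_us = 1e6 / sample_rate  # microseconds available per sample
    print(f"~{budget_us:.1f} us per sample")  # ~62.5 us

That is not much headroom per forward pass, which is presumably why serving it at scale is the hard part.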
Hi, author here. The motivation for this article came out of the HN discussion on a previous post (<a href="https://news.ycombinator.com/item?id=16447096" rel="nofollow">https://news.ycombinator.com/item?id=16447096</a>). There was a lot of valuable feedback - thanks for that.<p>Happy to answer questions!
Slower alternative: "fastai with @pytorch on @awscloud is currently the fastest to train Imagenet on GPU, fastest on a single machine (faster than Intel-caffe on 64 machines!), and fastest on public infrastructure (faster than @TensorFlow on a TPU!)
Big thanks to our students that helped with this." - <a href="https://twitter.com/jeremyphoward/status/988852083796291584" rel="nofollow">https://twitter.com/jeremyphoward/status/988852083796291584</a>
An important hidden cost here is coding a model that can take advantage of mixed-precision training. It is not trivial: you have to empirically discover scaling factors for loss functions, at the very least.<p>It's great that there is now a wider choice of (pre-trained?) models formulated for mixed-precision training.<p>When I was comparing the Titan V (~V100) and the 1080 Ti 5 months ago, I was only able to get a 90% increase in forward-pass speed for the Titan V (same batch size), even with mixed precision. And that was for an attention-heavy model, where I expected the Titan V to show its best. Admittedly, I was able to use almost double the batch size on the Titan V when doing mixed precision. And the Titan V draws half the power of the 1080 Ti too :)<p>In the end my conclusion was: I am not a researcher, I am a practitioner. I want to do transfer learning or just use existing pre-trained models, without tweaking them. For that, tensor cores give no benefit.
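For anyone curious what that looks like in practice, here is a minimal sketch of static loss scaling with fp32 master weights (PyTorch, CUDA assumed; the scale of 1024 is just an illustrative guess, and that factor is exactly the thing you end up tuning empirically):

    import torch

    # Static loss scaling with fp32 master weights (one training step).
    # The scale factor is the knob you discover empirically: too small and
    # fp16 gradients underflow, too large and they overflow.
    scale = 1024.0

    model = torch.nn.Linear(128, 10).half().cuda()        # fp16 copy used for compute
    master = [p.detach().clone().float() for p in model.parameters()]
    for p in master:
        p.requires_grad_(True)                            # fp32 master weights get updated
    opt = torch.optim.SGD(master, lr=0.01)

    x = torch.randn(32, 128, device="cuda").half()
    y = torch.randint(0, 10, (32,), device="cuda")

    loss = torch.nn.functional.cross_entropy(model(x).float(), y)
    (loss * scale).backward()                             # scale up before backward

    for p, m in zip(model.parameters(), master):
        m.grad = p.grad.float() / scale                   # unscale into fp32 gradients
    opt.step()

    with torch.no_grad():                                 # copy updated weights back to fp16
        for p, m in zip(model.parameters(), master):
            p.copy_(m.half())

In real code you would also skip the optimizer step whenever the unscaled gradients contain inf/nan (dynamic loss scaling), which is the part that makes it fiddly.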
Nvidia is currently in a cashing-out phase. They have a monopoly and money flows in effortlessly. The cost/performance ratio reflects this.<p>AMD will enter the game soon once they get their software working, and Intel will follow.<p>I suspect that Nvidia will respond with its own specialized machine learning and inference chips to match the cost/performance ratio. As long as Nvidia can maintain high manufacturing volumes and a small performance edge, they can still make good profits.
>For GPUs, there are further interesting options to consider next to buying. For example, Cirrascale offers monthly rentals of a server with four V100 GPUs for around $7.5k (~$10.3 per hour). However, further benchmarks are required to allow a direct comparison since the hardware differs from that on AWS (type of CPU, memory, NVLink support etc.).<p>Can't you just buy some 1080s for cheaper than this? I understand there are electricity and hosting costs, but cloud computing seems expensive compared to buying equipment.
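My back-of-envelope, taking the quoted $7.5k/month as the baseline (the hardware and hosting numbers below are my own guesses, not from the article):

    # Break-even estimate: buying your own GPUs vs. the quoted rental.
    # Only the rental price comes from the article; everything else is assumed.
    rental_per_month = 7500.0        # quoted: 4x V100 server, per month
    gpu_price = 700.0                # assumed price per 1080 Ti
    num_gpus = 4
    rig_overhead = 1500.0            # assumed CPU/RAM/PSU/chassis
    power_hosting_per_month = 150.0  # assumed electricity + hosting

    upfront = num_gpus * gpu_price + rig_overhead
    months = upfront / (rental_per_month - power_hosting_per_month)
    print(f"Break-even after ~{months:.1f} months")

Of course this ignores that 1080s are considerably slower than V100s, which is the whole point of the benchmark, but it shows why buying looks tempting.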
Excellent! Thanks for these numbers, I wanted to see exactly this kind of benchmark! Do you plan to try different benchmarks with the same setup for other problems, like semantic segmentation, DenseNet, or LSTM training performance, as well?
Excellent work. Do you have plans to open source the scripts/implementation details used to reproduce the results? It would be great if others could also validate and repeat the experiment for future software updates (e.g. TensorFlow 1.8), as I expect there will be some performance gains for both TPU and GPU from CUDA and TensorFlow optimizations.<p>Sidenote: Love the illustrations that accompany most of your blog posts, are they drawn by an in-house artist/designer?
What they're not saying is that one can't use all NVLink bandwidth for gradient reduction on a DGX-1V with only 4 GPUs, because NVLink is composed of two 8-node rings. And given the data-parallel nature of this benchmark, I'm very interested in where time was spent on each architecture.<p>That said, they fixed this with NVSwitch, so it's just another HW hiccup, like int8 was on Pascal.
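For anyone unfamiliar with what "gradient reduction" over those rings looks like, here is a toy NumPy simulation of ring all-reduce (just the data-movement pattern, not how NCCL actually implements it); each of the 2(n-1) steps only uses the link to the next device in the ring, which is why the ring layout dictates how much NVLink bandwidth you can actually use:

    import numpy as np

    def ring_allreduce(grads):
        """Toy simulation: every worker ends with the sum of all workers' gradients."""
        n = len(grads)
        chunks = [np.array_split(g.astype(float), n) for g in grads]

        # Reduce-scatter: after n-1 steps, worker i holds the full sum of chunk (i+1) % n.
        for step in range(n - 1):
            sends = [(i, (i - step) % n, chunks[i][(i - step) % n].copy()) for i in range(n)]
            for i, idx, payload in sends:
                chunks[(i + 1) % n][idx] += payload

        # All-gather: circulate the completed chunks for another n-1 steps.
        for step in range(n - 1):
            sends = [(i, (i + 1 - step) % n, chunks[i][(i + 1 - step) % n].copy()) for i in range(n)]
            for i, idx, payload in sends:
                chunks[(i + 1) % n][idx] = payload

        return [np.concatenate(c) for c in chunks]

    rng = np.random.default_rng(0)
    grads = [rng.standard_normal(12) for _ in range(4)]   # 4 "GPUs", tiny gradient vectors
    reduced = ring_allreduce(grads)
    assert all(np.allclose(r, sum(grads)) for r in reduced)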
Thanks for this, just a minor thing:<p>You have price per hour and performance in images per second, so you can't simply divide one by the other; you need to scale by the number of seconds in an hour. Also, the resulting metric is not "images per second per $", but simply "images per $".
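To make the conversion concrete (the numbers below are placeholders, not the article's results):

    # images/$ = (images/second * seconds/hour) / ($/hour)
    images_per_sec = 2000.0      # placeholder throughput
    price_per_hour = 8.0         # placeholder instance price in $/hour

    images_per_dollar = images_per_sec * 3600 / price_per_hour
    print(f"{images_per_dollar:,.0f} images per dollar")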
How much detail do we know about the TPUs' design? Does Google disclose a block-diagram-level description? ISA details? Do they release a toolchain for low-level programming, or only higher-level interfaces like TensorFlow?<p>EDIT: I found [1], which describes "tensor cores", "vector/matrix units" and HBM interfaces. The design sounds similar in concept to GPUs. Maybe they don't have, or need, interpolation hardware or other GPU features?<p>[1] <a href="https://cloud.google.com/tpu/docs/system-architecture" rel="nofollow">https://cloud.google.com/tpu/docs/system-architecture</a>
Great work, RiseML. This benchmark is sincerely appreciated.<p>I wonder whether NVLink would make any difference for ResNet-50. Does anyone know whether these implementations require any inter-GPU communication?
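My own back-of-envelope, to frame the question: data-parallel implementations typically all-reduce the gradients every step, so the per-step traffic is roughly the model size (using the commonly cited ResNet-50 parameter count, fp32 gradients assumed):

    # Per-step gradient traffic for data-parallel ResNet-50 (rough estimate).
    params = 25.6e6        # commonly cited ResNet-50 parameter count
    bytes_per_value = 4    # fp32 gradients
    grad_mb = params * bytes_per_value / 1e6
    print(f"~{grad_mb:.0f} MB of gradients exchanged per step per GPU")

Whether NVLink matters would then come down to how well that exchange overlaps with the backward pass.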
Was this running the AWS Deep Learning AMI, or did you build your own?<p>I ask because Intel was involved in its development and made a number of tweaks to improve performance.<p>I'd be curious whether that actually made a significant difference or not.
>For the V100 experiments, we used a p3.8xlarge instance (<i>Xeon E5–2686@2.30GHz 16 cores, 244 GB memory</i>, Ubuntu 16.04) on AWS with four V100 GPUs (16 GB of memory each). For the TPU experiments, we used a small n1-standard-4 instance as host (<i>Xeon@2.3GHz two cores, 15 GB memory</i>, Debian 9) for which we provisioned a Cloud TPU (v2–8) consisting of four TPUv2 chips (16 GB of memory each).<p>A bit odd that the TPUs are provisioned on a much weaker host machine than the V100s, especially when some comparisons included augmentation and other processing that happens outside of the TPU.