Thanks for sharing, very insightful. Guess the TPUs are the real deal: about half the cost for similar performance.<p>I would assume Google is able to do that because of the lower power requirements.<p>I am actually more curious to get a paper on the new speech NN Google is using. It's supposed to push 16k samples a second through a NN; it's hard to imagine how they did that and were able to roll it out, as you would think the cost would be prohibitive.<p>You are ultimately competing with a much less compute-heavy solution.<p><a href="https://cloudplatform.googleblog.com/2018/03/introducing-Cloud-Text-to-Speech-powered-by-Deepmind-WaveNet-technology.html" rel="nofollow">https://cloudplatform.googleblog.com/2018/03/introducing-Clo...</a><p>I suspect this was only possible because of the TPUs.<p>Can't think of anything else where controlling the entire stack, including the silicon, would be more important than AI applications.
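Just to put a number on why that sample rate sounds so expensive, here is the rough time budget if you assume the naive case of one network evaluation per output sample (an illustration only, not a description of Google's production setup):

    # Time budget implied by 16k samples/sec when each sample needs
    # its own network pass (naive autoregressive generation).
    sample_rate = 16000            # samples per second
    budget_us = 1e6 / sample_rate  # microseconds available per sample
    print(f"~{budget_us:.1f} us per sample")  # ~62.5 us

That is not much headroom per forward pass, which is presumably why serving it at scale is the hard part.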
Hi, author here. The motivation for this article came out of the HN discussion on a previous post (<a href="https://news.ycombinator.com/item?id=16447096" rel="nofollow">https://news.ycombinator.com/item?id=16447096</a>). There was a lot of valuable feedback - thanks for that.<p>Happy to answer questions!
Slower alternative: "fastai with @pytorch on @awscloud is currently the fastest to train Imagenet on GPU, fastest on a single machine (faster than Intel-caffe on 64 machines!), and fastest on public infrastructure (faster than @TensorFlow on a TPU!)
Big thanks to our students that helped with this." - <a href="https://twitter.com/jeremyphoward/status/988852083796291584" rel="nofollow">https://twitter.com/jeremyphoward/status/988852083796291584</a>
An important hidden cost here is coding a model that can take advantage of mixed-precision training. It is not trivial: you have to empirically discover scaling factors for loss functions, at the very least.<p>It's great that there is now a wider choice of (pre-trained?) models formulated for mixed-precision training.<p>When I was comparing the Titan V (~V100) and the 1080 Ti 5 months ago, I was only able to get a 90% increase in forward-pass speed for the Titan V (same batch size), even with mixed precision. And that was for an attention-heavy model, where I expected the Titan V to show its best. Admittedly, I was able to use almost double the batch size on the Titan V when doing mixed precision. And the Titan V draws half the power of the 1080 Ti too :)<p>In the end my conclusion was: I am not a researcher, I am a practitioner. I want to do transfer learning or just use existing pre-trained models, without tweaking them. For that, tensor cores give no benefit.
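For anyone curious what that looks like in practice, here is a minimal sketch of static loss scaling with fp32 master weights (PyTorch, CUDA assumed; the scale of 1024 is just an illustrative guess, and that factor is exactly the thing you end up tuning empirically):

    import torch

    # Static loss scaling with fp32 master weights (one training step).
    # The scale factor is the knob you discover empirically: too small and
    # fp16 gradients underflow, too large and they overflow.
    scale = 1024.0

    model = torch.nn.Linear(128, 10).half().cuda()        # fp16 copy used for compute
    master = [p.detach().clone().float() for p in model.parameters()]
    for p in master:
        p.requires_grad_(True)                            # fp32 master weights get updated
    opt = torch.optim.SGD(master, lr=0.01)

    x = torch.randn(32, 128, device="cuda").half()
    y = torch.randint(0, 10, (32,), device="cuda")

    loss = torch.nn.functional.cross_entropy(model(x).float(), y)
    (loss * scale).backward()                             # scale up before backward

    for p, m in zip(model.parameters(), master):
        m.grad = p.grad.float() / scale                   # unscale into fp32 gradients
    opt.step()

    with torch.no_grad():                                 # copy updated weights back to fp16
        for p, m in zip(model.parameters(), master):
            p.copy_(m.half())

In real code you would also skip the optimizer step whenever the unscaled gradients contain inf/nan (dynamic loss scaling), which is the part that makes it fiddly.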
Nvidia is currently in a cashing-out phase. They have a monopoly and money flows in effortlessly. The cost/performance ratio reflects this.<p>AMD will enter the game soon once they get their software working, and Intel will follow.<p>I suspect that Nvidia will respond with its own specialized machine learning and inference chips to match the cost/performance ratio. As long as Nvidia can maintain high manufacturing volumes and a small performance edge, they can still make good profits.
>For GPUs, there are further interesting options to consider next to buying. For example, Cirrascale offers monthly rentals of a server with four V100 GPUs for around $7.5k (~$10.3 per hour). However, further benchmarks are required to allow a direct comparison since the hardware differs from that on AWS (type of CPU, memory, NVLink support etc.).<p>Can't you just buy some 1080s for cheaper than this? I understand there are electricity and hosting costs, but cloud computing seems expensive compared to buying equipment.
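My back-of-envelope, taking the quoted $7.5k/month as the baseline (the hardware and hosting numbers below are my own guesses, not from the article):

    # Break-even estimate: buying your own GPUs vs. the quoted rental.
    # Only the rental price comes from the article; everything else is assumed.
    rental_per_month = 7500.0        # quoted: 4x V100 server, per month
    gpu_price = 700.0                # assumed price per 1080 Ti
    num_gpus = 4
    rig_overhead = 1500.0            # assumed CPU/RAM/PSU/chassis
    power_hosting_per_month = 150.0  # assumed electricity + hosting

    upfront = num_gpus * gpu_price + rig_overhead
    months = upfront / (rental_per_month - power_hosting_per_month)
    print(f"Break-even after ~{months:.1f} months")

Of course this ignores that 1080s are considerably slower than V100s, which is the whole point of the benchmark, but it shows why buying looks tempting.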
Excellent! Thanks for these numbers, I wanted to see exactly this kind of benchmark! Do you plan to try different benchmarks with the same setup for other problems, like semantic segmentation, DenseNet, or LSTM training performance, as well?
Excellent work. Do you have plans to open source the scripts/implementation details used to reproduce the results? It would be great if others could also validate and repeat the experiment for future software updates (e.g. TensorFlow 1.8), as I expect there will be some performance gains for both TPU and GPU from CUDA and TensorFlow optimizations.<p>Sidenote: Love the illustrations that accompany most of your blog posts, are they drawn by an in-house artist/designer?
What they're not saying is that one can't use all NVLink bandwidth for gradient reduction on a DGX-1V with only 4 GPUs, because NVLink is composed of two 8-node rings. And given the data-parallel nature of this benchmark, I'm very interested in where time was spent on each architecture.<p>That said, they fixed this with NVSwitch, so it's just another HW hiccup, like int8 was on Pascal.
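For anyone unfamiliar with what "gradient reduction" over those rings looks like, here is a toy NumPy simulation of ring all-reduce (just the data-movement pattern, not how NCCL actually implements it); each of the 2(n-1) steps only uses the link to the next device in the ring, which is why the ring layout dictates how much NVLink bandwidth you can actually use:

    import numpy as np

    def ring_allreduce(grads):
        """Toy simulation: every worker ends with the sum of all workers' gradients."""
        n = len(grads)
        chunks = [np.array_split(g.astype(float), n) for g in grads]

        # Reduce-scatter: after n-1 steps, worker i holds the full sum of chunk (i+1) % n.
        for step in range(n - 1):
            sends = [(i, (i - step) % n, chunks[i][(i - step) % n].copy()) for i in range(n)]
            for i, idx, payload in sends:
                chunks[(i + 1) % n][idx] += payload

        # All-gather: circulate the completed chunks for another n-1 steps.
        for step in range(n - 1):
            sends = [(i, (i + 1 - step) % n, chunks[i][(i + 1 - step) % n].copy()) for i in range(n)]
            for i, idx, payload in sends:
                chunks[(i + 1) % n][idx] = payload

        return [np.concatenate(c) for c in chunks]

    rng = np.random.default_rng(0)
    grads = [rng.standard_normal(12) for _ in range(4)]   # 4 "GPUs", tiny gradient vectors
    reduced = ring_allreduce(grads)
    assert all(np.allclose(r, sum(grads)) for r in reduced)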
Thanks for this, just a minor thing:<p>You have price per hour and performance in images per second, so you can't simply divide one by the other; you need to scale by the number of seconds in an hour. Also, the resulting metric is not "images per second per $", but simply "images per $".
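To make the conversion concrete (the numbers below are placeholders, not the article's results):

    # images/$ = (images/second * seconds/hour) / ($/hour)
    images_per_sec = 2000.0      # placeholder throughput
    price_per_hour = 8.0         # placeholder instance price in $/hour

    images_per_dollar = images_per_sec * 3600 / price_per_hour
    print(f"{images_per_dollar:,.0f} images per dollar")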
How much detail do we know about the TPUs' design? Does Google disclose a block-diagram-level description? ISA details? Do they release a toolchain for low-level programming, or only higher-level interfaces like TensorFlow?<p>EDIT: I found [1], which describes "tensor cores", "vector/matrix units" and HBM interfaces. The design sounds similar in concept to GPUs. Maybe they don't have, or need, interpolation hardware or other GPU features?<p>[1] <a href="https://cloud.google.com/tpu/docs/system-architecture" rel="nofollow">https://cloud.google.com/tpu/docs/system-architecture</a>
Great work, RiseML. This benchmark is sincerely appreciated.<p>I wonder whether NVLink would make any difference for ResNet-50. Does anyone know whether these implementations require any inter-GPU communication?
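My own back-of-envelope, to frame the question: data-parallel implementations typically all-reduce the gradients every step, so the per-step traffic is roughly the model size (using the commonly cited ResNet-50 parameter count, fp32 gradients assumed):

    # Per-step gradient traffic for data-parallel ResNet-50 (rough estimate).
    params = 25.6e6        # commonly cited ResNet-50 parameter count
    bytes_per_value = 4    # fp32 gradients
    grad_mb = params * bytes_per_value / 1e6
    print(f"~{grad_mb:.0f} MB of gradients exchanged per step per GPU")

Whether NVLink matters would then come down to how well that exchange overlaps with the backward pass.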
Was this running the AWS Deep Learning AMI, or did you build your own?<p>I ask because Intel was involved in its development and made a number of tweaks to improve performance.<p>I'd be curious whether that actually made a significant difference or not.
>For the V100 experiments, we used a p3.8xlarge instance (<i>Xeon E5–2686@2.30GHz 16 cores, 244 GB memory</i>, Ubuntu 16.04) on AWS with four V100 GPUs (16 GB of memory each). For the TPU experiments, we used a small n1-standard-4 instance as host (<i>Xeon@2.3GHz two cores, 15 GB memory</i>, Debian 9) for which we provisioned a Cloud TPU (v2–8) consisting of four TPUv2 chips (16 GB of memory each).<p>A bit odd that the TPUs are provisioned on a much weaker host machine than the V100s, especially when some comparisons included augmentation and other processing that happens outside of the TPU.