Here is a larger-scale comparison of Cloud TPU and Google Cloud GPU performance and cost (focused on Cloud TPU Pods):
<a href="https://cloud.google.com/blog/products/ai-machine-learning/now-you-can-train-ml-models-faster-and-lower-cost-cloud-tpu-pods" rel="nofollow">https://cloud.google.com/blog/products/ai-machine-learning/n...</a><p>All the code used in that comparison is open source, and there is a detailed methodology page with instructions that you can follow if you want to reproduce the results:
<a href="https://github.com/tensorflow/tpu/blob/master/benchmarks/ResNet-50_v1.5_Performance_Comparison_TensorFlow_1.12_GCP.md" rel="nofollow">https://github.com/tensorflow/tpu/blob/master/benchmarks/Res...</a><p>Also, Cloud TPUs are available to everyone for free via Colab. Here is a sample Colab that shows how to train a Keras model on the Fashion MNIST dataset using the Adam optimizer:
<a href="https://colab.research.google.com/github/tensorflow/tpu/blob/master/tools/colab/fashion_mnist.ipynb" rel="nofollow">https://colab.research.google.com/github/tensorflow/tpu/blob...</a><p>(I work on Cloud TPUs)
This doesn't seem like a very informative benchmark to me. They don't mention how (or whether) they tuned the learning rates and batch sizes for each device. As they themselves note, they also use a very small network that doesn't need the power of a TPU to train quickly and may scale differently than a large one.

They also don't post their code, so I can't check that their problems with Adam aren't due to using L2 regularization, which https://arxiv.org/abs/1711.05101 shows hurts Adam relative to SGD; you should use decoupled weight decay instead.
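For anyone unfamiliar with that distinction, in Keras terms the difference looks roughly like this (a minimal sketch; tf.keras.optimizers.AdamW requires a recent TensorFlow release, and the layer sizes and decay rate are made up for illustration):

```python
import tensorflow as tf

# "L2 regularization": the penalty is folded into the loss, so Adam's
# per-parameter adaptive scaling also rescales the regularization
# gradient -- the coupling arXiv:1711.05101 identifies as harmful.
l2_model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, kernel_regularizer=tf.keras.regularizers.l2(1e-4)),
])
l2_model.compile(optimizer=tf.keras.optimizers.Adam(), loss="mse")

# "Decoupled weight decay" (AdamW): the optimizer shrinks the weights
# directly each step, independent of the adaptive gradient update.
wd_model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
wd_model.compile(optimizer=tf.keras.optimizers.AdamW(weight_decay=1e-4), loss="mse")
```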
Hmm interesting, is it possible that the batch size for the TPU is larger? I'm guessing they might be using large batches to keep its giant GEMM cores fed.