I am really puzzled by TPUs. I've been reading everywhere that TPUs are powerful and a great alternative to NVIDIA.<p>I have been playing with TPUs for a couple of months now, and to be honest I don't understand how people can use them in production for inference:<p>- almost no resources online showing how to run modern generative models like Mistral, Yi 34B, etc. on TPUs
- poor compatibility between JAX and PyTorch
- very hard to understand the memory consumption of the TPU chips (no nvidia-smi equivalent)
- rotating IP addresses on TPU VMs
- almost impossible to get my hands on a TPU v5<p>Is it just me, or did I miss something?<p>I totally understand that TPUs can be useful for training, though.
We've previously tried TPUs and almost always regretted the decision. I think the tech stack needs another 12-18 months to mature (it doesn't help that almost all work outside Google is being done in PyTorch).
They aren’t really an alternative to anything. For one thing, they’re now often slower on a per-accelerator basis than NVIDIA hardware. They’re cheaper, of course, but because of the disparity in performance you’ll need to estimate cost per FLOP on your own particular workload. They are also more difficult and slower to develop against, and SWE cost is always an issue if you don’t own a money printer like Google. Furthermore, advanced users who can write their own CUDA or Triton kernels can unlock additional efficiency from GPUs. That capability can’t even be contemplated on the TPU side, because you basically get a black box. Then there’s the issue of limited capacity, further exacerbated by the fact that this capacity is provided by a single supplier who is struggling to fulfill its own internal needs (which is why you can’t get a v5). You can’t just get TPUs elsewhere. You can’t get them under your desk for dev work either.<p>That said, it wouldn’t be too difficult to port most models to JAX, load in the existing weights, and export the result for serving. Should you bother? IMO, no, unless we’re talking really large-scale inference. Your time and money are almost certainly better spent iterating on the models.
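For the "load in the existing weights" step, here is a minimal sketch of what that typically looks like, assuming the model has already been re-implemented in Flax and that parameter names map over cleanly (both assumptions; real checkpoints usually need some renaming and reshaping):<p><pre><code># Hedged sketch: turning a PyTorch checkpoint into JAX arrays for a
# re-implemented Flax model. "model.pt" is a placeholder path, and a real
# port needs an explicit mapping between the PyTorch parameter names/shapes
# and the Flax module's parameter tree.
import torch
import jax.numpy as jnp

state_dict = torch.load("model.pt", map_location="cpu")
params = {
    name: jnp.asarray(tensor.detach().float().numpy())  # cast to float32 for simplicity
    for name, tensor in state_dict.items()
}

# Once the names line up with the Flax module's nested param dict, this can be
# passed to model.apply({"params": params}, inputs) and exported for serving.
</code></pre>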
Apparently Midjourney uses them. GCP put out a press release a while ago: <a href="https://www.prnewswire.com/news-releases/midjourney-selects-google-cloud-to-power-ai-generated-creative-platform-301771558.html" rel="nofollow">https://www.prnewswire.com/news-releases/midjourney-selects-...</a>
We tried hard to move some of our inference workloads to TPUs at NLP Cloud, but finally gave up (at least for the moment), basically for the reasons you mention. We now only perform our fine-tuning on TPUs using JAX (see <a href="https://nlpcloud.com/how-to-fine-tune-llama-openllama-xgen-with-jax-on-tpu-gpu.html" rel="nofollow">https://nlpcloud.com/how-to-fine-tune-llama-openllama-xgen-w...</a>) and we are happy with that.<p>It seems to me that Google does not really want to sell TPUs, but only to showcase their AI work and maybe get some early-adopter feedback. It must be quite a challenge for them to create a dynamic community around JAX and TPUs if TPUs remain a vendor-locked product...
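For anyone curious what that workflow boils down to, the heart of a data-parallel fine-tuning step in JAX looks roughly like this (a minimal sketch with a placeholder toy loss, not the code from the linked tutorial):<p><pre><code># Hedged sketch of a data-parallel fine-tuning step on a TPU host with JAX and
# optax. The toy loss_fn and batch layout are placeholders; a real run would
# call the model's forward pass and replicate params/opt_state across devices.
from functools import partial

import jax
import optax

optimizer = optax.adamw(learning_rate=1e-5)

def loss_fn(params, batch):
    # Placeholder loss: a linear "model" with integer labels.
    logits = batch["inputs"] @ params["w"]
    return optax.softmax_cross_entropy_with_integer_labels(
        logits, batch["labels"]
    ).mean()

@partial(jax.pmap, axis_name="devices")  # one copy of the step per local TPU core
def train_step(params, opt_state, batch):
    loss, grads = jax.value_and_grad(loss_fn)(params, batch)
    grads = jax.lax.pmean(grads, axis_name="devices")  # all-reduce gradients
    updates, opt_state = optimizer.update(grads, opt_state, params)
    params = optax.apply_updates(params, updates)
    return params, opt_state, loss
</code></pre>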
I tried to use a Google Coral. I have no idea how to make it work. I could follow a tutorial using TensorFlow, but I could not figure out how to use it for anything else. Is there some way to run CUDA stuff on it? I always assumed it required someone with actual skills (not me). I have used CUDA before, but more for mass calculation and simulation (for financial stuff). It is great when it works. I worked at a shop that had these Xeon Phi systems that worked great, but I had no clue how, and they only worked with their pre-canned tools.<p>Just as an example, over a decade ago I replaced a few cases filled with racks and a SAN that made up a compute cluster with one box (plus SAN) and a backup box (both boxes were basically the same in case one failed); basically, dozens of servers were replaced by a two-CPU box with a couple of Tesla cards (probably an A100 later). The entire model had to be rewritten, but it was not that bad. I wanted to do it with AMD cards, but there was no easy way.<p>I would also say that modern networking has made all kinds of stuff more interesting (also lining Nvidia's pockets). Those TPUs do not make sense to me. I have no idea how to use them. They should release their version of CUDA.
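For reference, the answer to the CUDA question above is no: a Coral can't run CUDA at all. It only executes quantized TensorFlow Lite models compiled for the Edge TPU, usually driven through the pycoral API. A minimal classification sketch, with placeholder model and image paths:<p><pre><code># Hedged sketch: image classification on a Coral Edge TPU via pycoral.
# The .tflite model must be int8-quantized and compiled with edgetpu_compiler;
# "mobilenet_v2_quant_edgetpu.tflite" and "test.jpg" are placeholders.
from pycoral.utils.edgetpu import make_interpreter
from pycoral.adapters import common, classify
from PIL import Image

interpreter = make_interpreter("mobilenet_v2_quant_edgetpu.tflite")
interpreter.allocate_tensors()

image = Image.open("test.jpg").convert("RGB").resize(common.input_size(interpreter))
common.set_input(interpreter, image)
interpreter.invoke()

for klass in classify.get_classes(interpreter, top_k=3):
    print(klass.id, klass.score)
</code></pre>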
TPUs are tightly coupled to JAX and the XLA compiler. If your model is written in PyTorch, you can use a bridge to export it to StableHLO and then compile it for the TPU. In theory, the XLA compiler should be more performant than PyTorch's Inductor.
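Roughly, that bridge looks like this (a sketch based on the torch_xla 2.x StableHLO export path; exact function names may shift between releases, and the tiny model here is just a placeholder):<p><pre><code># Hedged sketch: PyTorch to StableHLO via torch.export and torch_xla's bridge.
import torch
import torch.nn as nn
import torch_xla.stablehlo as stablehlo

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10)).eval()
example_args = (torch.randn(1, 128),)

exported = torch.export.export(model, example_args)       # ExportedProgram (torch 2.1+)
shlo = stablehlo.exported_program_to_stablehlo(exported)   # lower to StableHLO
print(shlo.get_stablehlo_text())  # StableHLO text, ready for XLA/TPU compilation
</code></pre>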
There's a cubesat using a Coral TPU for pose estimation.<p><a href="https://aerospace.org/article/aerospaces-slingshot-1-demonstrates-pathway-accelerating-space-innovation" rel="nofollow">https://aerospace.org/article/aerospaces-slingshot-1-demonst...</a>
To see memory consumption on the TPU while running on GKE, you can look at the kubernetes.io/node/accelerator/memory_used metric:<p><a href="https://cloud.google.com/kubernetes-engine/docs/how-to/tpus#metrics" rel="nofollow">https://cloud.google.com/kubernetes-engine/docs/how-to/tpus#...</a>
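For example, a minimal sketch of pulling that metric with the Cloud Monitoring Python client (the project id is a placeholder, and the exact resource labels returned depend on your cluster):<p><pre><code># Hedged sketch: reading the GKE TPU memory metric via the Cloud Monitoring API.
# "projects/my-gcp-project" is a placeholder; the metric type is the one from
# the doc linked above, and its value is assumed to be an int64 byte count.
import time
from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
project_name = "projects/my-gcp-project"  # placeholder

now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": now}, "start_time": {"seconds": now - 600}}
)

series_list = client.list_time_series(
    request={
        "name": project_name,
        "filter": 'metric.type = "kubernetes.io/node/accelerator/memory_used"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

for series in series_list:
    for point in series.points:
        print(dict(series.resource.labels), point.value.int64_value, "bytes")
</code></pre>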
I've seen people connecting these to Raspberry Pis to run local LLMs, but I'm not sure how effective it is. Check YouTube for some videos about it.<p>Speaking of SBCs, prior to the Raspberry Pi, I was looking at the Orange Pi 5, which has a Rockchip RK3588S with an NPU (Neural Processing Unit). This was the first I had heard of such a thing, and I was curious what exactly it does and how. Unfortunately, there's very little support for the Orange Pi and not a large community around it, so I couldn't find any feedback on how well it worked or what it did.<p><a href="http://www.orangepi.org/html/hardWare/computerAndMicrocontrollers/details/Orange-Pi-5-Pro.html" rel="nofollow">http://www.orangepi.org/html/hardWare/computerAndMicrocontro...</a>