For some reason they focus on the inference, which is the computationally cheap part. If you're working on ML (as opposed to deploying someone else's ML) then almost all of your workload is training, not inference.
We did a big analysis of this a few years back. We ended up using a big spot-instance cluster of CPU machines for our inference cluster. It was much more consistently available than spot GPUs, at greater scale, and at a better price per inference (at least at the time), and it scaled well to many billions of inferences. Of course, compare cost per inference on your own models to make sure the logic applies. Article on how it worked: https://www.freecodecamp.org/news/ml-armada-running-tens-of-billions-of-ml-predictions-on-a-budget-f9505c820203/

Training was always GPUs (for speed), non-spot-instance (for reliability), and cloud based (for infinite parallelism). Training work tended to be chunky; it never made sense to build servers in house that would sit idle some of the time and be queued up at other times.
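A minimal sketch of the kind of comparison I mean, in Python; the hourly prices and throughput figures below are placeholders rather than our actual numbers, so plug in measurements from your own models:

    # Back-of-the-envelope cost-per-inference comparison.
    # All hourly prices and throughput figures here are hypothetical placeholders.
    def cost_per_million(hourly_price_usd, inferences_per_second):
        """USD to serve one million inferences from a single instance."""
        inferences_per_hour = inferences_per_second * 3600
        return hourly_price_usd / inferences_per_hour * 1_000_000

    cpu_spot = cost_per_million(hourly_price_usd=0.04, inferences_per_second=50)
    gpu_spot = cost_per_million(hourly_price_usd=0.90, inferences_per_second=800)

    print(f"CPU spot instance: ${cpu_spot:.2f} per million inferences")
    print(f"GPU spot instance: ${gpu_spot:.2f} per million inferences")

Whichever side wins depends entirely on what your models actually sustain per core and per GPU, which is why you have to measure it yourself.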
For small-scale transformer CPU inference you can use, e.g., Fabrice Bellard's https://bellard.org/libnc/

Similarly, for small-scale convolutional CPU inference, where you only need maybe 20 ResNet-50 (batch size 1) inferences per second per CPU (cloud CPUs cost about $0.015 per hour), you can use inference engines designed for this purpose, e.g., https://NN-512.com

You can expect about 2x the performance of TensorFlow or PyTorch.
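To put numbers on that, a quick worked example using the rough figures above (20 batch-1 ResNet-50 inferences per second, ~$0.015 per CPU-hour):

    # Cost per million ResNet-50 (batch size 1) inferences on a cheap cloud CPU,
    # using the rough figures quoted above.
    inferences_per_second = 20
    cpu_price_per_hour = 0.015  # USD

    inferences_per_hour = inferences_per_second * 3600            # 72,000
    cost_per_million = cpu_price_per_hour / inferences_per_hour * 1_000_000

    print(f"{inferences_per_hour:,} inferences per CPU-hour")
    print(f"~${cost_per_million:.3f} per million inferences")     # roughly $0.21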
I think the TPU is the way to go for ML, be it training or inference.

We're using GPUs (some of which contain a TPU block inside) due to 'historical reasons'. With a vector unit (x86 AVX, ARM SVE, RISC-V RVV) already part of the host CPU, a TPU on a separate die of a chiplet package, or just on a PCIe card, would handle the heavy-lifting ML work fine. It should be much cheaper than the GPU model for ML nowadays, unless you are both a PC game player and an ML engineer.
This also very much depends on the inference use case/context. For example, I work in deep learning on digital pathology, where images can be up to 100,000 x 100,000 pixels in size and inference needs GPUs; it's just way too slow otherwise.
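For context, the usual approach is to tile the slide and batch the tiles through the model on the GPU. A rough sketch of that tiling step (assuming the image is available as a NumPy array; real pipelines read tiles lazily, e.g. via OpenSlide, rather than loading the whole slide):

    import numpy as np

    def iter_tiles(slide, tile=1024, stride=1024):
        """Yield (y, x, patch) tiles from a huge H x W x C image array."""
        h, w = slide.shape[:2]
        for y in range(0, h - tile + 1, stride):
            for x in range(0, w - tile + 1, stride):
                yield y, x, slide[y:y + tile, x:x + tile]

    # Small dummy array just to show the interface; a real whole-slide image
    # would be read lazily instead of loaded into memory whole.
    dummy = np.zeros((4096, 4096, 3), dtype=np.uint8)
    n_tiles = sum(1 for _ in iter_tiles(dummy))
    print(n_tiles)  # 16 tiles of 1024 x 1024

A 100,000 x 100,000 slide works out to roughly 9,500 such tiles, all of which have to go through the network, which is why CPU-only inference is impractical here.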
Not related to the article, but how would one begin to get smart about optimizing GPU workloads? I've been charged with deploying an application that is a mixture of heuristic search and inference and has been exclusively single-user up to this point.

I'm sure every little thing I've discovered (e.g., measuring CPU/GPU utilization, trying to multiplex access to the GPU, etc.) was probably covered in somebody's grad school notes 12 years ago, but I haven't found a good source of info on the topic.
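Not the grad-school notes you're after, but the usual first step is simply watching whether the GPU is actually busy while the application runs. A rough sketch that polls nvidia-smi (the query flags are standard nvidia-smi options):

    import subprocess
    import time

    # Poll nvidia-smi once per second and print per-GPU utilization and memory.
    # Watching this while the app runs shows whether the GPU is saturated or
    # mostly idle waiting on CPU-side work (the idle case is where multiplexing
    # several workloads onto one GPU starts to pay off).
    QUERY = [
        "nvidia-smi",
        "--query-gpu=utilization.gpu,memory.used,memory.total",
        "--format=csv,noheader,nounits",
    ]

    while True:
        out = subprocess.check_output(QUERY, text=True).strip()
        for i, line in enumerate(out.splitlines()):
            util, used, total = (v.strip() for v in line.split(","))
            print(f"GPU {i}: {util}% utilization, {used}/{total} MiB")
        time.sleep(1)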
There are some pretty elegant solutions out there for the problem of having the right ratio of CPU to GPU. One of the nicer ones is rCUDA: https://scholar.google.com/citations?view_op=view_citation&hl=es&user=4XgrRlMAAAAJ&citation_for_view=4XgrRlMAAAAJ:zYLM7Y9cAGgC
> And CPUs are so much cheaper

Doesn't look like it. Consumer:

AMD Ryzen Threadripper 3970X: ~3000 USD on Newegg
https://www.newegg.com/amd-ryzen-threadripper-2990wx/p/N82E16819113618?Description=AMD%20Ryzen&cm_re=AMD_Ryzen-_-19-113-618-_-Product&quicklink=true

NVIDIA RTX 3080 Ti Founders Edition: ~2000 USD
https://www.newegg.com/nvidia-900-1g133-2518-000/p/1FT-0004-006T6?Description=Geforce%20RTX%203080%20Ti%20Founders%20edition&cm_re=Geforce_RTX%203080%20Ti%20Founders%20edition-_-1FT-0004-006T6-_-Product&quicklink=true

For servers the comparison is even more complicated and it wouldn't be fair to just give two numbers, but I still don't think GPUs come out more expensive.

... besides, none of that may matter if your constraint is a power budget.
What a clickbaity article. It's an interesting discussion of GPU multiplexing for ML inference merged together with a sales pitch, but the clickbait title made me hate it; the article is a bait and switch. This wasn't even an example of Betteridge's law, just a completely misleading headline.
" It feels wasteful to have an expensive GPU sitting idle while we are executing the CPU portions of the ML workflow"<p>What is expensive? Those 3090ti's are looking very tasteful at current prices.
Perhaps it's been mentioned before, but I do find it curious how often crypto mining is lambasted for contributing to climate change, yet I haven't seen anybody bat an eye at a fairly similar amount of compute power used for ML applications. Makes me wonder.
It depends a lot on your problem, of course.

Game playing (e.g. AlphaGo) is computationally hard, but the rules are immutable, target functions (e.g., heuristics) don't change much, and you can generate arbitrarily large clean data sets (play more games). On these problems, ML-scaling approaches work very well. For business problems where the value of data decays rapidly, though, you probably don't need the power of a deep or complex neural net with millions of parameters, and expensive specialty hardware probably isn't worth it.
Not only can these deep learning models be tricked by a single pixel or confused by malicious input and rendered useless, but the training, retraining, and fine-tuning on GPUs and TPUs running in data centers contribute significantly to burning up the planet and driving up costs, and the models are used for nothing but surveillance of our own data.

If a model doesn't work, it has to be retrained on new data, and after years of deep learning there are still no efficient alternatives to this energy waste other than using more GPUs, TPUs, etc., emitting more CO2.

A complete waste of resources and energy. Therefore it is not worth it at all.