All the reasons:

[1] The compilers don't produce great instructions;

[2] The drivers crash frequently, so ML workloads feel experimental;

[3] Software adoption is getting there, but kernels are less optimized within frameworks, in particular because of the fracture between ROCm and CUDA. When a developer has to write a kernel twice, one version won't be as good, and it is the one with less adoption (see the sketch below);

[4] Stack Overflow mindshare is smaller, so debugging is harder: fewer people have hit the same problems.

---

These reasons were crucial while we had enough supply of NVIDIA GPUs, but if the demand described in https://gpus.llm-utils.org/nvidia-h100-gpus-supply-and-demand/ is real (450,000+ H100s), the software bottlenecks will most likely be addressed soon.
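To make [3] concrete: at the framework level the split is mostly hidden (a ROCm build of PyTorch answers to the same "cuda" device string), so the duplicated work lands on whoever writes the kernels underneath. A minimal sketch of that framework-level view, assuming a PyTorch install built for either backend:

    import torch

    # A ROCm build of PyTorch exposes the AMD backend through the same "cuda"
    # device namespace, so high-level code is largely shared; the ROCm/CUDA
    # fracture shows up one layer down, in the hand-written kernels each
    # backend ships.
    if torch.cuda.is_available():
        if torch.version.hip is not None:
            print("ROCm/HIP backend:", torch.version.hip)
        else:
            print("CUDA backend:", torch.version.cuda)
        x = torch.rand(1024, device="cuda")  # same device string on both stacks
        print((x * 2).sum().item())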
> (For context, Hotz raised $5M to improve RX 7900 XTX support and sell a $15K prebuilt consumer computer that runs 65B-parameter LLMs. A plethora of driver crashes later, he almost gave up on AMD.)

Again, I wish Hotz and TinyGrad the best, especially for training/experimentation on AMD, but I feel like Apache TVM and the various MLIR efforts (like PyTorch MLIR, SHARK, Mojo) are much more promising for ML inference. Even Triton in PyTorch is very promising, with an endorsement from AMD.
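For a sense of why Triton helps with the write-it-twice problem: the same Python-level kernel can be compiled for either NVIDIA or AMD GPUs without maintaining separate CUDA and HIP sources. A minimal vector-add sketch, assuming a recent triton package and a GPU-enabled PyTorch build (tensor shapes and block size are just illustrative):

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
        # Each program instance handles one contiguous block of elements.
        pid = tl.program_id(axis=0)
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = offsets < n_elements  # guard the tail block
        x = tl.load(x_ptr + offsets, mask=mask)
        y = tl.load(y_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x + y, mask=mask)

    def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        out = torch.empty_like(x)
        n = out.numel()
        grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
        add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
        return out

    x = torch.rand(4096, device="cuda")  # "cuda" also targets ROCm in AMD builds
    y = torch.rand(4096, device="cuda")
    print(torch.allclose(add(x, y), x + y))

The kernel author still has to tune block sizes per architecture, but at least there is one source file to optimize instead of two diverging ones.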