I wish AMD would just drop ROCm at this stage and focus on SYCL. The rocRAND/hipRAND woes in this article, if anything, show ROCm in a better light than it deserves; here it at least worked and performed in the same ballpark as CUDA. Often it simply does not work at all, or if it works it's behind by a lot more. At work I simply gave up on our 4x Radeon Pro W6800 workstation because launching TensorFlow with more than 1 GPU would cause a kernel panic every time, and AMD engineers never offered a fix other than "reinstall Ubuntu".<p>ROCm feels like such a half-assed product, one that (to me at least) seems to have been made to tick a box and look cool in corporate presentations. It's not made with the proper mindset to compete against CUDA. Lisa Su claims they're doubling down on ROCm, but to me it feels like they're falling behind Nvidia, not catching up.<p>Banding together with Intel to support SYCL would, in my opinion:<p>1. Ensure there's a lot more momentum behind a single, cross-platform, industry-standard competitor<p>2. Entice other industry heavyweights like MSFT, Qualcomm, ARM etc. to take the cross-platform solution more seriously<p>3. Encourage heavy investment into the developer experience and tooling for the cross-platform solution
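(To make the cross-platform point concrete, here is a rough sketch of what a SYCL kernel looks like. This is illustrative only: it assumes a SYCL 2020 compiler such as DPC++ or AdaptiveCpp, and the names and sizes are made up.)
<pre><code>// Minimal SYCL vector add: the same source can target Intel, AMD, or NVIDIA
// backends, depending on which runtime/compiler is installed.
#include <sycl/sycl.hpp>
#include <iostream>
#include <vector>

int main() {
    constexpr size_t n = 1 << 20;
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);

    sycl::queue q;  // default selector: picks whatever device the runtime exposes
    {
        sycl::buffer<float> A{a.data(), sycl::range{n}};
        sycl::buffer<float> B{b.data(), sycl::range{n}};
        sycl::buffer<float> C{c.data(), sycl::range{n}};

        q.submit([&](sycl::handler& h) {
            sycl::accessor ra{A, h, sycl::read_only};
            sycl::accessor rb{B, h, sycl::read_only};
            sycl::accessor wc{C, h, sycl::write_only};
            h.parallel_for(sycl::range{n}, [=](sycl::id<1> i) { wc[i] = ra[i] + rb[i]; });
        });
    }  // buffer destructors copy results back to the host here

    std::cout << "c[0] = " << c[0] << "\n";  // expect 3
}
</code></pre>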
AMD's attempted responses go all the way back to 2007, when CUDA first debuted, starting with "Close to Metal" (<a href="https://en.wikipedia.org/wiki/Close_to_Metal" rel="nofollow noreferrer">https://en.wikipedia.org/wiki/Close_to_Metal</a>). They've had nearly 20 years to fix the situation and have failed to do so. Maybe some third-party player like Lamini AI will do what they couldn't and get acquired for it.
That's more-or-less my experience with AMD, only worse. Another critical factor is burned developers like myself.<p>I'm looking forward to Intel v. NVidia. The Arc A770 is a pretty serious competitor. It's the lowest-cost way to run OPT-175B.<p>Given a 7-slot motherboard, $270 * 7 = $1890 for 112GB of VRAM in one computer. That's sweet. Compute speed would be on par with a top-of-the-line NVidia workstation GPU.<p>Three of those are enough to run the largest open-source LLMs at around $9000.<p>We're just drivers + libraries + documentation away, and Intel is not bad at drivers + libraries + documentation.
“Please note the library is being actively developed, and is known to be incomplet; it might also be incorrekt and there could be a few bad bugs lurking.”<p>That gives me a good laugh.
This was exactly my experience with it too. It has moved on a bit since then, but when I looked at rocFFT a couple of years ago the documentation was really poor and it was missing features.<p>When I switched from FFTW to cuFFT many years ago (~2015), the transition was very smooth, the documentation was great, and all features were supported. They even shipped a shim FFTW-compatible header file so that you didn't need to rewrite your code to make it work (leaving some performance on the table).
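(For context on the shim mentioned above: cuFFT ships an FFTW-compatible interface via cufftw.h, so FFTW-style code can build against it largely unchanged. A rough sketch; the exact build flags and supported API surface below are assumptions, so check the cuFFT docs for your version.)
<pre><code>// Build roughly like: g++ fft_demo.cpp -I$CUDA_HOME/include -L$CUDA_HOME/lib64 -lcufftw -lcufft
#include <cufftw.h>   // drop-in stand-in for <fftw3.h>
#include <cstdio>

int main() {
    const int n = 1024;
    fftw_complex* in  = static_cast<fftw_complex*>(fftw_malloc(sizeof(fftw_complex) * n));
    fftw_complex* out = static_cast<fftw_complex*>(fftw_malloc(sizeof(fftw_complex) * n));

    for (int i = 0; i < n; ++i) { in[i][0] = static_cast<double>(i); in[i][1] = 0.0; }

    // Same FFTW-style planning and execution calls; cuFFT does the work underneath.
    fftw_plan p = fftw_plan_dft_1d(n, in, out, FFTW_FORWARD, FFTW_ESTIMATE);
    fftw_execute(p);
    std::printf("DC bin: %f\n", out[0][0]);

    fftw_destroy_plan(p);
    fftw_free(in);
    fftw_free(out);
    return 0;
}
</code></pre>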
I'm convinced that Julia is how the moat will be crossed. There are some pretty incredible GPU packages for it (I'm looking at you, KernelAbstractions.jl). The Python science community seems more than happy to carry on focusing on NVIDIA and is a lost cause.<p>I somewhat don't blame them: the MI300X might be miles ahead and all, but AMD are not only oblivious to the desktop market (you know, where new ideas are prototyped) but seemingly actively hostile to it[1]. NVIDIA has people doing somewhat interesting things with a 3060 (which can eventually graduate to a 4090 or even an H100), while AMD don't want to hear about it unless you have a "pro" GPU. Definitely a case of dollar-wise and penny-foolish.<p>[1] <a href="https://rocm.docs.amd.com/en/docs-5.5.1/release/gpu_os_support.html" rel="nofollow noreferrer">https://rocm.docs.amd.com/en/docs-5.5.1/release/gpu_os_suppo...</a> * FWIW you can override this with an envar, but AMD aren't exactly forthcoming with that information.
Documentation. The Python example is a bit on the nose. I've not had good times with ROCm's documentation either.<p>Can anyone point to an example of good documentation for a big software system where they can also sketch how that was achieved? E.g. CUDA's docs are pretty good, but I've no idea how they came to be or how they stay up to date. LLVM's docs are a small number of handwritten webpages which correlate with reality to some extent, the source for which lives in the same repo as the code.<p>I have an idea that it needs to combine programmers writing some things, some testing infra to notice things like internal links that 404, and some non-developers writing things.<p>I started trying to document one of my own systems as freeform notes in Obsidian, and while it kind of works, over time it diverges from reality pretty quickly; and that's without trying to have anyone else working on either the docs or the system.<p>So what's the proper, established answer to this?
AMD cards + ROCm are used in top supercomputers for (non-deep-learning) HPC. Why is this the case?<p>I understand that AMD GPUs offer better cost efficiency for FP32 & FP64 FLOPs, RAM, and wattage. However, if ROCm is such a half-baked piece of software, shouldn't that advantage be gone? What drives AMD adoption in the HPC space then?
I hate ROCm so much I can't even describe how much I suffered because of this POS software. It wasn't good 3 years ago, but it was manageable; now I don't even know how they made it worse.<p>I really just wish my employer would give up on AMD for GPUs.
At this point, rather than chasing CUDA/cuDNN compatibility, it would seem more productive for AMD to target high-level language support. Forget CUDA compatibility, and instead support work such as Mojo and MLIR.<p>It seems that in an ideal world PyTorch support for AMD wouldn't rely on ROCm, but would instead be based on high-level code compiled to MLIR with AMD target support, with this same MLIR representation supporting Mojo and any other languages, such as Julia, that want optimized AMD support.
Recent video of Lisa Su, good watch:<p><a href="https://twitter.com/TheSixFiveMedia/status/1737177221490450594" rel="nofollow noreferrer">https://twitter.com/TheSixFiveMedia/status/17371772214904505...</a>
The rocRAND library did not have any real documentation at all until 2023. It's still pretty barebones, but the updates in the article regarding the Python API suggest that this is a work in progress.
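(For readers who haven't touched it: the hipRAND host API largely mirrors cuRAND. Below is a minimal sketch of filling a device buffer with uniform floats; header paths and error handling vary across ROCm versions, so treat the details as assumptions rather than a reference.)
<pre><code>// Compile with hipcc; on older ROCm the header may be <hiprand.h> instead.
#include <hip/hip_runtime.h>
#include <hiprand/hiprand.h>
#include <cstdio>
#include <vector>

int main() {
    const size_t n = 1 << 20;
    float* d_out = nullptr;
    hipMalloc(reinterpret_cast<void**>(&d_out), n * sizeof(float));

    hiprandGenerator_t gen;
    hiprandCreateGenerator(&gen, HIPRAND_RNG_PSEUDO_DEFAULT);  // default pseudo-RNG
    hiprandSetPseudoRandomGeneratorSeed(gen, 1234ULL);
    hiprandGenerateUniform(gen, d_out, n);                     // fill device buffer with uniform floats

    std::vector<float> h_out(n);
    hipMemcpy(h_out.data(), d_out, n * sizeof(float), hipMemcpyDeviceToHost);
    std::printf("first sample: %f\n", h_out[0]);

    hiprandDestroyGenerator(gen);
    hipFree(d_out);
    return 0;
}
</code></pre>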
Are there use cases for LLMs assisting in developing this software? Basically, I'm wondering whether LLMs for developing a GPU API exist and how they can accelerate development such that this "moat" becomes more of a river that others can join?