I was able to use ROCm recently with PyTorch, and after pulling out some hair it worked quite well. The Radeon GPU I had on hand was a bit old and underpowered (RDNA2) and it only supported matmul on fp64, but for the job I needed done I saw a 200x increase in it/s over CPU despite the need to cast everywhere, and that made me super happy.

Best of all, I simply set the device to `torch.device('cuda')` rather than OpenCL, which does wonders for compatibility and keeps the code simple.

Protip: use the official ROCm PyTorch base Docker image [0]. The AMD setup is so finicky and dependent on specific versions of SDKs/drivers/libraries that it will be much harder to get working if you try to install them separately.

[0]: https://rocm.docs.amd.com/en/latest/how_to/pytorch_install/pytorch_install.html
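For anyone curious, here is a minimal sketch of what that looks like in practice (run inside the ROCm PyTorch container; the fp64 cast is just illustrative of my old card, not something you'd normally want):

```python
import torch

# On the ROCm build of PyTorch, the HIP backend is exposed through the usual
# CUDA API, so CUDA-targeted code runs unchanged on a Radeon card.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
if device.type == "cuda":
    print(torch.cuda.get_device_name(0))  # reports the AMD GPU under ROCm
    print(torch.version.hip)              # HIP version string on ROCm builds, None on CUDA builds

# Illustrative only: on my RDNA2 card I ended up casting to fp64 for matmul.
x = torch.randn(1024, 1024, device=device, dtype=torch.float64)
y = x @ x
print(y.shape, y.dtype)
```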
CUDA is the only reason I have an Nvidia card, but if more projects start migrating to a more agnostic environment, I'll be really grateful.

Running Nvidia on Linux isn't much fun. Fedora and Debian can be incredibly reliable systems, but when you add an Nvidia card, I feel like I'm back in Windows Vista, with kernel crashes from time to time.
Yup, thank the hobbyists. PyTorch is opening up to other hardware: Stable Diffusion now works on M-series chips, Intel Arc, and AMD.

Now what I'd like to see is real benchmarks for compute power. That might even get a few startups to compete in this new area.
CUDA is the result of years of NVIDIA supporting the ecosystem. Some people like to complain because they bought hardware that was cheaper but can't use it for what they want; when you buy NVIDIA, you aren't buying only the hardware but also the insane amount of work they have put into the ecosystem. The same goes for Intel: MKL and scikit-learn-intelex aren't free to develop.

AMD has the hardware, but its support for HPC is non-existent outside of the joke that is BLIS and AOCL.

I really wish for more competitors to enter the HPC market, but AMD has a shitload of work to do.
There is only limited empirical evidence of AMD closing the gap that NVidia has created in scientific and ML software. Even considering PyTorch alone, the engineering effort to maintain specialized ROCm solutions alongside CUDA ones is not trivial (think FlashAttention, or any customization that optimizes your own model). If your GPUs only need to run a simple ML workflow nonstop for a few years, maybe there are corner cases where the finances make sense. It is hard for AMD to close the gap across the scientific/industrial software base of CUDA now. NVidia feels like a software company for the hardware it produces; luckily it makes its money from hardware and thus doesn't lock up the software libraries.

(Edited "no" to "limited" empirical evidence after a fellow user mentioned El Capitan.)
I think the article's claim that "PyTorch has dropped the drawbridge on the CUDA moat" is way over-optimistic. Yes, PyTorch is widely used by researchers and by users to quickly iterate over various ways to use the models, but when it comes to inference there are huge gains to be had by going a different route. Llama.cpp has shown 10x speedups on my hardware (32GB of GPU RAM + 32GB of CPU RAM), for example, for models like falcon-40b-instruct, and for much smaller models on the CPU (under 10B) I saw up to a 3x speedup just by switching to ONNX and OpenVINO.

Apple has shown us in practice the benefits of CPU/GPU memory sharing; will AMD be able to follow in their footsteps? The article claims AMD has a design with up to 192GB of shared RAM. Apple is already shipping a design with the same amount of RAM (if you can afford it). I wish them (AMD) success, but I believe they need to aim higher than just matching Apple at some unspecified future date.
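For reference, the ONNX route looks roughly like this (a minimal sketch with a toy model standing in for a real one; OpenVINO can likewise consume the exported .onnx file):

```python
import torch
import onnxruntime as ort

# Toy model standing in for a small (<10B) network.
model = torch.nn.Sequential(torch.nn.Linear(128, 64), torch.nn.ReLU()).eval()
example = torch.randn(1, 128)

# Export to ONNX once...
torch.onnx.export(model, example, "model.onnx",
                  input_names=["input"], output_names=["output"])

# ...then run inference through ONNX Runtime on the CPU.
sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
out = sess.run(None, {"input": example.numpy()})[0]
print(out.shape)
```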
CUDA is the foundation.

NVIDIA's moat is the years of work built by the OSS community, big corporations, and research institutes.

They have spent all that time building for CUDA, and a lot of implicit design decisions derive from CUDA's characteristics.

That will be the main challenge.
Does AMD have a solution for forward device compatibility (like PTX for NVidia)?

Last time I looked into ROCm (two years ago?), you seemed to have to compile everything explicitly for the architecture you were using, so if a new card came out, you couldn't use it without a recompile.
> Crossing the CUDA moat for AMD GPUs may be as easy as using PyTorch.

Nvidia has put a huge amount of work into making code run smoothly and fast. AMD has to work hard to catch up. ROCm code is slower, has more bugs, lacks features, and has compatibility issues between cards.
I am not so sure.

Everyone knows that CUDA is a core competency of Nvidia and they have stuck to it for years and years, refining it, fixing bugs, and making the experience smoother on Nvidia hardware.

On the other hand, AMD has not had the same level of commitment. They used to sing the praises of OpenCL. And then there is ROCm. Tomorrow, it might be something else.

Thus, Nvidia CUDA will get a lot more attention and tuning even from the portability layers, because they know that their investment in it will reap dividends even years from now, whereas their investment in AMD might be obsolete in a few years.

In addition, even if there is theoretical support, getting specific driver support and working around driver bugs is likely to be more of a pain with AMD.
People complain about Nvidia being anticompetitive with CUDA, but I don't really see it. They saw a gap in the standards for on-GPU compute and put tons of effort into a proprietary alternative. They tied CUDA to their own hardware, which sorta makes technical sense given the optimizations involved, but it's their choice anyway. They still support the open standards, but many prefer CUDA and will pay the Nvidia premium for it because it's actually nicer. They also don't have CPU marketshare to tie things to.

Good for them. We can hope the open side catches up, either by improving the standards or by adding more layers like this article describes.
And the question that remains for most of us once AMD catches up: will the duopoly bring prices down to a level reasonable for hobbyists and bootstrapped startups, or will AMD just gouge like NVidia?
> There is also a version of PyTorch that uses AMD ROCm, an open-source software stack for AMD GPU programming. Crossing the CUDA moat for AMD GPUs may be as easy as using PyTorch.

Unfortunately, since the AMD firmware doesn't reliably do what it's supposed to, those ROCm calls often don't either. That's if your AMD card is even still supported by ROCm: the AMD RX 580 I bought in 2021 (the great GPU shortage) had its ROCm support dropped in 2022 (four years of support total).

The only reliable interface in my experience has been OpenCL.
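For what it's worth, that path looks like this via pyopencl (a minimal vector-add sketch; it only needs a working OpenCL driver, no ROCm involved):

```python
import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()          # picks up the AMD OpenCL device
queue = cl.CommandQueue(ctx)

a = np.random.rand(50_000).astype(np.float32)
b = np.random.rand(50_000).astype(np.float32)

mf = cl.mem_flags
a_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
b_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
out_g = cl.Buffer(ctx, mf.WRITE_ONLY, a.nbytes)

# Tiny OpenCL C kernel: elementwise vector addition.
prg = cl.Program(ctx, """
__kernel void vadd(__global const float *a,
                   __global const float *b,
                   __global float *out) {
    int gid = get_global_id(0);
    out[gid] = a[gid] + b[gid];
}
""").build()

prg.vadd(queue, a.shape, None, a_g, b_g, out_g)

out = np.empty_like(a)
cl.enqueue_copy(queue, out, out_g)
assert np.allclose(out, a + b)
```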
When coding against Vulkan, for graphics or compute (the latter is the relevant one here), you have CPU code (written in C++, Rust, etc.), data serialized as bytes, and shaders that run on the graphics card. This three-step process creates friction, much in the same way backend/serialization/frontend does in web dev: duplication of work, type checking that doesn't cross the bridge, a limited shader language, etc.

My understanding is that CUDA's main strength is avoiding this. Do you agree? Is that why it's such a big deal, i.e. why this article was written, since you could always run compute shaders on AMD and others using Vulkan?
NVidia's hardware/CUDA stack is great, but I also love to see competition from AMD, George Hotz's Tiny Corp, etc.

Off topic, but I am also looking with great interest at Apple Silicon SoCs with large internal RAM. The internal bandwidth also keeps getting better, which is important for running trained LLMs.

Back on topic: I don't own any current Intel computers, but using Colab and services like Lambda Labs GPU VPSs is simple and flexible. A few people here mentioned that if AMD can't handle 100% of their workload they will stick with Intel and NVidia - an understandable position, but there are workarounds.
Don't agree at all. PyTorch is one library - yes, it's important that it supports AMD GPUs, but it's not enough.

The ROCm libraries just aren't good enough currently. The documentation is poor. AMD need to heavily invest in the software ecosystem around it, because library authors need decent support to adopt it. If you need to be a Facebook-sized organisation to write an AMD- and CUDA-compatible library, then the barrier to entry is too high.
I don't understand the author's argument (if there is one) - PyTorch has existed for ages, and AMD's Instinct MI* range has existed for years now. If these are the key ingredients, why hasn't it already happened?
If the AI hype persists, the CUDA moat will be less relevant in ~2 yrs.

Historically, HPC was simply not sufficiently interesting (in a commercial sense) for people to throw serious resources at making it a mass-market capability.

NVIDIA first capitalized on the niche crypto industry (which faded) and was then well positioned to jump into the AI hype. The question is how much of the hype will become real business.

The critical factor for the post-CUDA world is not any circumstantial moat but who will be making money servicing stable, long-term computing needs, i.e., who will be buying this hardware not with speculative hot money but with cashflow from clients that regularly use and pay for HPC-type applications.

These actors will be the long-term buyers of commercially relevant HPC, and they will have quite a bit of influence on this market.
ROCm is great. We were able to run and finetune LLMs on AMD Instincts with parity to NVIDIA A100s - and built an SDK that's as easy to use as HuggingFace or easier (Lamini). Or at the very least, our designer is able to finetune/train the latest LLMs on them, like Llama 2 70B and Mistral 7B, with ease. The ROCm library isn't as easy to use as CUDA because, as another poster said, the ecosystem was built around CUDA. For example, it's even called `.cuda()` in PyTorch to put a model on a GPU, when in reality you'd use it for an AMD GPU too.
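To make that concrete, a tiny sketch (on a ROCm build of PyTorch, the "cuda" device is actually the AMD GPU):

```python
import torch

# Despite the name, .cuda() goes through ROCm/HIP on an AMD build of
# PyTorch, so the same call puts the model on a Radeon/Instinct GPU.
model = torch.nn.Linear(16, 16).cuda()
x = torch.randn(4, 16, device="cuda")
print(model(x).device)   # cuda:0, even though the underlying GPU is AMD
```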
Nope. PyTorch is not enough; you have to do some C++ occasionally (as the code there can be optimized radically, as we see in llama.cpp and the like). ROCm is unusable compared to CUDA (4x more code for the same problem).

I don't understand why everyone neglects good, usable, and performant lower-level APIs. ROCm is fast and low-level, but much, much harder to use than CUDA, and the market seems to agree.
I know a lot of people don't like George; I dislike plenty of people who are doing the right thing (including, by some measures, sama and siebel while they were pushing YC forward).

But not admitting that the tinygrad project is the best Rebel Alliance on this is just a matter of letting vibe overcome results.
As a former ETH miner I learned the hard way that saving a few bucks on hardware may not be worth the operational issues.

I had a miner running with Nvidia cards and a miner running with AMD cards. One of them had massive maintenance demands and the other did not. I will not state which brand was better, imho.

Currently I estimate that running miners and running GPU servers have similar operational requirements and, at scale, similar financial considerations.

So whatever is cheapest to operate in terms of time expenditure, hardware cost, energy use, etc. will be used the most.

P.S.: I ran the mining operation not to earn money but mainly out of curiosity. It was a small-scale business powered by a PV system and an attached heat pump.
On my PC workstation (Debian Testing) I have absolutely no problems running an NVIDIA PNY Quadro P2200, which I'm going to upgrade to a PNY Quadro RTX 4000 soon. I'd love to switch to an AMD Radeon, but the very short (and shrinking) list of ROCm-supported cards makes that move highly improbable anytime soon.
This article doesn't address the real challenge [in my mind].

Framework support is one thing, but what about the million standalone CUDA kernels that have been written, which are especially common in research? Nobody wants to spend time rewriting/porting those, especially when they probably don't understand the low-level details in the first place.

Not to mention, what is the plan for comprehensive framework support? I've experienced the pain of porting models to different hardware architectures where various ops are unsupported. Is it realistic to get full coverage of, e.g., PyTorch?
I suspect that AMD will use their improved compatibility with the leading ML stack for data center deals, presumably by offering steep discounts over NVIDIA's GPUs. This might help them break into the market.

Individual ML practitioners will probably not be tempted to switch to AMD cards anytime soon. Whatever the price difference is, it will hardly offset the time that is subsequently sunk into working around the remaining issues of a non-CUDA (and less mature) stack underneath PyTorch.
Is there any reason OpenCL is not the standard in implementations like PyTorch? Similar performance, open standard, runs everywhere - what's the downside?
AMD playing catch-up is a good thing: their software solution is intended to run on any hardware, and with HIP being basically line-for-line compatible with CUDA, porting is very easy. They did it with FSR, and they are doing it with ROCm. Hopefully it takes off, as it's a more open ecosystem for the industry. Necessity is the mother of invention and all that.
For LLM inference, a shoutout to MLC LLM, which runs LLMs on basically any API that's widely available: https://github.com/mlc-ai/mlc-llm
TL;DR:

1. Since PyTorch has grown very popular, and there's an AMD backend for it, one can switch GPU vendors when doing generative AI work.

2. Like NVIDIA's Grace+Hopper CPU-GPU combo, AMD is/will be offering the "Instinct MI300A", which improves performance over having the GPU across a PCIe bus from a regular CPU.
> AMD May Get Across the CUDA Moat

I really wish they would, and properly - as in a fully open solution to match CUDA.

CUDA is a cancer on the industry.