Hacked Nvidia 4090 GPU driver to enable P2P

829 points by nikitml · about 1 year ago

38 comments

andersa · about 1 year ago
Incredible! I'd been wondering if this was possible. Now the only thing standing in the way of my 4x4090 rig for local LLMs is finding time to build it. With tensor parallelism, this will be both massively cheaper and faster for inference than an H100 SXM.

I still don't understand why they went with 6 GPUs for the tinybox. Many things will only function well with 4 or 8 GPUs. It seems like the worst of both worlds now (use 4 GPUs but pay for 6 GPUs, don't have 8 GPUs).
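A note on why P2P matters for the tensor-parallelism plan above: in tensor parallelism, every sharded layer ends with an all-reduce across the GPUs, so inter-GPU bandwidth sits on the critical path of each forward pass, and without P2P, NCCL stages those transfers through host memory. A minimal sketch of that communication pattern, assuming a 4-GPU box, PyTorch with NCCL, and purely illustrative sizes and filename:

    # Sketch: the per-layer all-reduce that tensor parallelism performs.
    # Launch with: torchrun --nproc_per_node=4 tp_allreduce_sketch.py
    # (hypothetical filename; assumes 4 CUDA devices)
    import torch
    import torch.distributed as dist

    def main():
        dist.init_process_group(backend="nccl")  # NCCL uses P2P when the driver allows it
        rank = dist.get_rank()
        torch.cuda.set_device(rank)

        # Each rank holds one shard of a layer; the partial outputs must be
        # summed across all ranks before the next layer can run.
        partial_output = torch.randn(8192, 8192, device="cuda")
        dist.all_reduce(partial_output, op=dist.ReduceOp.SUM)

        torch.cuda.synchronize()
        if rank == 0:
            print("all-reduce done; this happens once per layer, per step")
        dist.destroy_process_group()

    if __name__ == "__main__":
        main()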
chriskanan · about 1 year ago
This is great news. As an academic, I'm aware of multiple labs that built boxes with 4090s, not realizing that Nvidia had impaired P2P communication among cards. It's one of the reasons I didn't buy 4090s, despite them being much more affordable for my work. It isn't NVLink, but Nvidia has mostly gotten rid of that except for their highest-end cards. It is better than nothing.

Late last year, I got quotes for machines with four NVLink H100s, but the lead time for delivery was 13 months. I could get the non-NVLink ones in just four months. For now, I've gone with four L40S cards to hold my lab over, but supply chain issues and gigantic price increases are making it very hard for my lab to do its work. That's not nearly enough to support 6 PhD students and a bunch of undergrads.

Things were a lot easier when I could just build machines with two GPUs each with NVLink for $5K each and give one to each student to put under their desks, which is what I did back in 2015-2018 at my old university.
jstanley · about 1 year ago
What does P2P mean in this context? I Googled it and it sounds like it means "peer to peer", but what does that mean in the context of a graphics card?
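For the question above: in this context, P2P means one GPU reading or writing another GPU's memory directly over PCIe, without staging the data through system RAM. A hedged sketch of how this surfaces in PyTorch (torch.cuda.can_device_access_peer is a real API; the two-GPU machine is an assumption):

    import torch

    # Ask the driver whether device 0 may directly access device 1's memory.
    # On stock GeForce drivers this reports False; the hacked driver enables it.
    if torch.cuda.device_count() >= 2:
        print("P2P 0->1:", torch.cuda.can_device_access_peer(0, 1))

        src = torch.randn(1024, 1024, device="cuda:0")
        # With P2P this copy moves GPU-to-GPU in one hop over PCIe;
        # without it, the driver bounces the data through host memory.
        dst = src.to("cuda:1")
        torch.cuda.synchronize()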
userbinator · about 1 year ago
I wish more hardware companies would publish more documentation and let the community figure out the rest, sort of like what happened to the original IBM VGA (look up "Mode X" and the other non-BIOS modes the hardware is actually capable of - even 800x600x16!) Sadly it seems the majority of them would rather tightly control every aspect of their products' usage since they can then milk the userbase for more $$$, but IMHO the most productive era of the PC was also when it was the most open.
No1 · about 1 year ago
The original justification that Nvidia gave for removing NVLink from the consumer-grade lineup was that PCIe 5 would be fast enough. They then went on to release the 40xx series without PCIe 5 or P2P support. Good to see at least half of the equation being completed for them, but I can't imagine they'll allow this in the next-gen firmware.
HPsquared · about 1 year ago
Is this one of those features that's disabled on consumer cards for market segmentation?
ivanjermakov · about 1 year ago
I was always fascinated by George Hotz's hacking abilities. They have inspired me a lot in my personal projects.
llm_trw · about 1 year ago
Skimming the readme: this is P2P over PCIe, not NVLink, in case anyone was wondering.
jsheard · about 1 year ago
It'll be nice while it lasts, until they start locking this down in the firmware instead on future architectures.
jagrsw · about 1 year ago
Was it George himself, or a person working for a bounty that was set up by tinycorp?

Also, a question for those knowledgeable about the PCI subsystem: it looked like something NVIDIA didn't care about, rather than something they actively wanted to prevent, no?
rfoo · about 1 year ago
Glad to see that geohot is back to being geohot, first by dropping a local DoS for AMD cards, then this. Much more interesting :p
modeless · about 1 year ago
What are the chances that Nvidia updates the firmware to disable this and prevents downgrading with eFuses? Someday cards that still have older firmware may be more valuable. I'd be cautious about upgrading drivers for a while.
xipho · about 1 year ago
You can watch this happen on weekends, typically, sometimes for some very long sessions: https://www.twitch.tv/georgehotz
klohto · about 1 year ago
FYI, this should work on most 40xx cards [1].

[1] https://github.com/pytorch/pytorch/issues/119638#issuecomment-2051196015
thangngoc89 · about 1 year ago
> You may need to uninstall the driver from DKMS. Your system needs large BAR support and IOMMU off.

Can someone point me to the correct tutorial on how to do these things?
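Not a substitute for a tutorial, but both prerequisites can at least be checked from userspace. A rough sketch in Python, assuming Linux sysfs; BAR numbering and device paths vary by machine, so treat this as a starting point rather than a definitive check:

    import glob
    import os

    # IOMMU: if this directory is non-empty, an IOMMU is active and may
    # need to be disabled on the kernel command line for P2P to work.
    iommu = glob.glob("/sys/class/iommu/*")
    print("IOMMU:", iommu if iommu else "not active")

    # Large BAR: the GPU's BAR1 should be big enough to map (most of) VRAM.
    # Each line of the 'resource' file is "start end flags" in hex, one per region.
    for dev in glob.glob("/sys/bus/pci/devices/*"):
        with open(os.path.join(dev, "vendor")) as f:
            if f.read().strip() != "0x10de":  # NVIDIA's PCI vendor ID
                continue
        with open(os.path.join(dev, "resource")) as f:
            lines = f.read().splitlines()
        start, end, _flags = (int(x, 16) for x in lines[1].split())  # BAR1
        size_gib = (end - start + 1) / 2**30 if end else 0
        print(f"{dev}: BAR1 ~= {size_gib:.1f} GiB")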
tanelpoder · about 1 year ago
I also love that it can be done with just a few lines of code changed:

https://github.com/NVIDIA/open-gpu-kernel-modules/commit/1f4613dacec2638569a74b5e3dbcab01832f72a7?diff=unified&w=1
gigatexal · about 1 year ago
As a technical feat this is really cool! Though as others mention, I hope you don't get into too much hot water legally.

It seems anything that remotely lets "consumer" cards cannibalize the higher-end H/A-series cards is something Nvidia would not be fond of, and they've got the lawyers to throw at such a thing.
xmorse · about 1 year ago
Finally switched to Nvidia and already adding great value
clbrmbr · about 1 year ago
If we end up with a compute governance model of AI control [1], this sort of thing could get your door kicked in by the CEA (Compute Enforcement Agency).

[1] https://podcasts.apple.com/us/podcast/ai-safety-fundamentals-alignment/id1680794263?i=1000651665081
BeefySwain · about 1 year ago
Can someone ELI5 what this may make possible that wasn't possible before? Does this mean I can buy a handful of 4090s and use them in lieu of an H100? Just adding the memory together?
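Roughly: frameworks could already shard a model's weights across several cards, but every activation that crosses a card boundary had to detour through CPU RAM, and P2P removes that detour. The memory-pooling half is sketched below with Hugging Face's device_map="auto" (a real feature; the model name is just an example, and download size and dtype details are glossed over):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # "auto" spreads layers across all visible GPUs, so e.g. 4x24 GB 4090s
    # act as one ~96 GB pool for weights. P2P speeds up the activations
    # that hop between cards at the shard boundaries.
    model_id = "meta-llama/Llama-2-70b-hf"  # example only
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",   # shard across every visible GPU
        torch_dtype="auto",
    )
    inputs = tok("Hello", return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=16)
    print(tok.decode(out[0]))

So it pools memory for the weights, but it does not make a stack of 4090s match an H100's HBM bandwidth for a single request.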
waldrews · about 1 year ago
Would this approach be possible to extend downmarket, to older consumer cards? For a lot of LLM use cases we're constrained by memory and can tolerate lower compute speeds so long as there's no swapping. ELI5, what would prevent a hundred 1060-level cards from being used together?
namibj · about 1 year ago
And here I thought (PCIe) P2P had been there since SLI dropped the bridge (for the unfamiliar, it looks and acts pretty much like an NVLink bridge for regular PCIe slot cards that have NVLink, and was used back in the day to share framebuffer and similar in high-end gaming setups).
perfobotto · about 1 year ago
What stops Nvidia from making sure this stops working in future driver releases?
lucifer_is_back · about 1 year ago
So basically RTX 4090 x 6 = 144 GB of GPU RAM, which would cost $1,599 × 6 = $9,594 (just the Nvidia 4090s). Currently the tinybox gives:

TinyBox: GPU RAM 144 GB, price $15,000 (AMD) / $25,000 (Nvidia 4090 x6)
DIY Nvidia 4090 x6: GPU RAM 144 GB, price $9,594

So a 36.04% decrease in price from the team red tinybox ($15k) and a 61.624% decrease in price from the team green tinybox ($25k).
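The percentages check out; a quick sanity check of the arithmetic:

    diy = 1599 * 6                      # six 4090s at MSRP
    print(diy)                          # 9594
    for tinybox in (15_000, 25_000):
        saving = 100 * (1 - diy / tinybox)
        print(f"${tinybox:,}: {saving:.2f}% cheaper")   # 36.04% / 61.62%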
ewalk153 · about 1 year ago
Does this appear to be intentionally left out by Nvidia, or an oversight?
jeffs4271 · about 1 year ago
It is cool seeing hacks like this. But this is something to be careful with, as GH100 had hardware changes to meet CUDA fence requirements.
aresant · about 1 year ago
So assuming you utilized this with four 4090s, is there a theoretical performance comparison vs. the A6000 / other professional lines?
lawlessone · about 1 year ago
This is very interesting.

I can't afford two mortgages though, so for me it will have to just stay as something interesting :)
cavisne · about 1 year ago
How does this compare in bandwidth and latency to NVLink? (I'm aware it's not available on the consumer cards.)
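The thread gives no hard numbers, but the ceiling is the bus: the 4090's PCIe 4.0 x16 link tops out around 32 GB/s per direction, versus several hundred GB/s aggregate for datacenter NVLink, so the hack mainly buys capacity and convenience rather than NVLink-class bandwidth. A rough micro-benchmark sketch for measuring achieved device-to-device copy bandwidth (assumes two CUDA devices):

    import time
    import torch

    assert torch.cuda.device_count() >= 2, "needs two GPUs"

    n_bytes = 1 << 30  # 1 GiB payload
    src = torch.empty(n_bytes, dtype=torch.uint8, device="cuda:0")
    dst = torch.empty(n_bytes, dtype=torch.uint8, device="cuda:1")

    for _ in range(3):      # warm-up copies
        dst.copy_(src)
    torch.cuda.synchronize()

    reps = 10
    t0 = time.perf_counter()
    for _ in range(reps):
        dst.copy_(src)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - t0

    print(f"{reps * n_bytes / elapsed / 1e9:.1f} GB/s device-to-device")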
spxneo · about 1 year ago
Does this mean you can horizontally scale to a GPT-4-esque LLM locally in the near future? (I hear you need 1 TB of VRAM.)

Does Apple's large VRAM offering, like 196 GB, offer the fastest bandwidth, and if so, how will pairing a bunch of 4090s like in the comments work?
qxfys · about 1 year ago
I am amazed how people always find a way to make this kind of thing work. Kudos!
musha68k · about 1 year ago
OK, now we are seemingly getting somewhere. I can feel the enthusiasm coming back to me.

Especially in light of what's going on with LocalLLaMA etc.:

https://www.reddit.com/r/LocalLLaMA/comments/1c0mkk9/mistral_8x22b_already_runs_on_m2_ultra_192gb_with
gururise · about 1 year ago
How long before Nvidia patches this?
m3kw9 · about 1 year ago
In layman terms what does this enable?
arthurcolle · about 1 year ago
Does this work on 4060?
c0g · about 1 year ago
Any idea of DDP perf?
vladgur · about 1 year ago
Curious if this will ever make it to 3090s.
theturtle32 · about 1 year ago
WTF is P2P?