It used to be that for locally running GenAI, VRAM per dollar was king, so used Nvidia RTX 3090 cards were the undisputed darlings of DIY LLM with 24 GB for 600€-800€ or so. Sticking two of these in one PC isn't too difficult despite them drawing 350 W each.

Then Apple introduced Macs with 128 GB or more of unified memory at 800 GB/s and the ability to load models as large as 70 GB (70B at FP8) or even larger. The M1 Ultra was unable to take full advantage of the excellent RAM speed, but with the M2 and M3, performance is improving. Just be prepared to spend 5000€ or more for an M3 Ultra. Another alternative would be an EPYC 9005 system with 12x DDR5-6000 RAM for 576 GB/s of memory bandwidth, with the LLM (preferably a MoE) running on the CPU instead of a GPU.

Today, however, with the latest, surprisingly good reasoning models like QwQ-32B spending thousands or tens of thousands of tokens on their replies, performance matters more than it used to, and these systems (Macs and even RTX 3090s) might fall out of favor, because waiting for a finished reply can take several minutes or even tens of minutes. Nvidia Ampere and Apple silicon (AFAIK) also lack FP4 support in hardware, which doesn't help.

For the same reason, AMD Strix Halo with a mere 273 GB/s of RAM bandwidth, and perhaps also Nvidia Project Digits (speculated to offer similar RAM bandwidth), might just be too slow for reasoning models with more than 50 GB or so of active parameters.

On the other hand, if RTX 5090 prices remain at 3500€, that card will likely stay insignificant for the DIY crowd for that reason alone.

Perhaps AMD will take the crown with a variant of their RDNA4 RX 9070 card with 32 GB of VRAM priced at around 1000€? Probably wishful thinking…
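
For a rough sense of why bandwidth dominates here, a minimal back-of-the-envelope sketch (assuming decode is purely memory-bandwidth bound, every generated token streams all active parameter bytes once, and ignoring prompt processing, KV cache traffic and real-world overhead; the numbers are illustrative, not benchmarks):

```python
# Napkin math behind the figures above: peak DDR bandwidth and the
# resulting upper bound on tokens/s for a bandwidth-bound decoder.

def ddr_bandwidth_gbs(channels: int, mts: int, bus_bytes: int = 8) -> float:
    """Peak bandwidth in GB/s: channels * transfers/s * bytes per transfer."""
    return channels * mts * bus_bytes / 1000  # MT/s * 8 B = MB/s, /1000 -> GB/s

def tokens_per_second(bandwidth_gbs: float, active_param_gb: float) -> float:
    """Upper bound if each generated token reads all active params once."""
    return bandwidth_gbs / active_param_gb

def reply_minutes(tokens: int, tps: float) -> float:
    return tokens / tps / 60

# EPYC 9005 with 12x DDR5-6000: 12 * 6000 * 8 / 1000 = 576 GB/s
print(ddr_bandwidth_gbs(12, 6000))                        # 576.0

# 70 GB of active parameters (70B at FP8) on 576 GB/s: ~8 tokens/s at best
print(tokens_per_second(576, 70))                         # ~8.2

# A 10,000-token reasoning reply at ~8 tokens/s: ~20 minutes
print(reply_minutes(10_000, 8.2))                         # ~20

# Same model on a 273 GB/s Strix Halo-class system: ~4 tokens/s, ~40+ minutes
print(reply_minutes(10_000, tokens_per_second(273, 70)))  # ~43
```

Real throughput usually lands well below these theoretical ceilings, which only strengthens the point about long reasoning replies.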