An 80% size reduction is no joke, and the fact that the 1.58-bit version runs on dual H100s at 140 tokens/s is kind of mind-blowing. That said, I'm still skeptical about how practical this really is for most people. Like, yeah, you can run it on 24GB VRAM or even with just 20GB RAM, but "slow" is an understatement: those speeds would make even the most patient person throw their hands up.<p>And then there's the whole repetition issue. Infinite loops of "Pygame's Pygame's Pygame's" kind of defeat the point of quantization if you ask me. Sure, the authors have fixes like adjusting the KV cache or using min_p, but doesn't that just patch a symptom rather than solve the actual problem? A fried model is still fried, even if it stops repeating itself.<p>On the flip side, I love that they're making this accessible on Hugging Face, and the dynamic quantization approach is pretty brilliant. Using 1.58-bit for the MoE layers and leaving sensitive layers like down_proj at higher precision is super clever. Feels like they're squeezing every last drop of juice out of the architecture, which is awesome for smaller teams who can't afford OpenAI-scale hardware.<p>Still, "accessible" comes with an asterisk. I get that shared-memory architectures like a 192GB Mac Ultra are a big deal, but who's dropping $6,000+ on that setup? For that price, I'd rather build a rig with used 3090s and get way more bang for my buck (though, yeah, it'd be a power hog). Cool tech, no doubt, but the practicality is still up for debate. Guess we'll see if the next-gen models can address some of these trade-offs.
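For what it's worth, the min_p band-aid is at least trivial to apply. A minimal sketch with llama-cpp-python, assuming it exposes the min_p sampling parameter; the GGUF file name and the sampling values are placeholders, not the authors' exact recommendations:

```python
# Minimal sketch: applying min_p sampling to suppress the repetition loops.
# Model path and sampling values are placeholders -- see the blog post for
# the authors' actual recommendations.
from llama_cpp import Llama

llm = Llama(
    model_path="DeepSeek-R1-UD-IQ1_S.gguf",  # hypothetical local GGUF file
    n_gpu_layers=-1,   # offload as many layers as fit in VRAM
    n_ctx=8192,
)

out = llm(
    "Write a Pygame script for Flappy Bird.",
    max_tokens=512,
    temperature=0.6,
    min_p=0.1,         # prune low-probability tokens that feed the "Pygame's Pygame's" loops
)
print(out["choices"][0]["text"])
```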
Random observation 1: I was running DeepSeek yesterday on my Linux box with an RTX 4090, and I noticed that the model really needs to fit into VRAM (24GB in my case) or it is simply slow. So the Apple shared-memory architecture has an advantage here: a 192GB Mx Ultra can load and process large models efficiently.<p>Random observation 2: It's time to cancel the OpenAI subscription.
Wow, an 80% reduction in size for DeepSeek-R1 is just amazing! It's fantastic to see such large models becoming more accessible to those of us who don't have access to top-tier hardware. This kind of optimization opens up so many possibilities for experimenting at home.<p>I'm impressed by the 140 tokens per second with the 1.58-bit quantization running on dual H100s. That kind of performance makes the model practical for small or mid-sized shops to use for local applications. This is a huge win for people working on agents that need the kind of low latency only local models can support.
> Unfortunately if you naively quantize all layers to 1.58bit, you will get infinite repetitions in seed 3407: “Colours with dark Colours with dark Colours with dark Colours with dark Colours with dark” or in seed 3408: “Set up the Pygame's Pygame display with a Pygame's Pygame's Pygame's Pygame's Pygame's Pygame's Pygame's Pygame's Pygame's”.<p>This is a really interesting insight (although other works cover this as well). I am particularly amused by the process by which the authors of this blog post arrived at these particular seeds. Good work nonetheless!
As someone who is out of the loop, what's the verdict on R1? Has anyone been able to reproduce the results yet? Is the claim that it only took $5M to train generally accepted?<p>It's a very bold claim that is really shaking up the markets, so I can't help but wonder whether it has even been verified at this point.
>For optimal performance, we recommend the sum of VRAM + RAM to be at least 80GB+.<p>Oh nice! So I can try it on my local "low power/low cost" server at home.<p>My home system runs a Ryzen 5500 + 64GB RAM + 7x RTX 3060 12GB.<p>So that's 64GB RAM plus 84GB VRAM.<p>I don't want to brag, just to point to solutions for us tinkerers with small budgets and high energy costs.<p>Such a system can be built for around 1,600 euros, and the power consumption is around 520 watts.<p>I started with an AM4 board (B450 chipset) and one used RTX 3060 12GB, which costs around 200 euros used if you are patient.<p>Every additional GPU is connected with a PCIe riser/extender to give the cards enough space.<p>After a while I replaced the individual risers with a single PCIe x4 to 6x PCIe x1 extender.<p>It runs pretty nicely. Awesome for learning and gaining experience.
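In case it helps anyone copying this kind of setup: splitting a model across several cards plus system RAM can be configured roughly like the sketch below with llama-cpp-python. The GGUF file name, layer count, and split values are just illustrative; tune them to whatever your VRAM holds.

```python
# Rough sketch of partial GPU offload across multiple cards with llama-cpp-python.
# File name, layer count, and split ratios are illustrative placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="DeepSeek-R1-UD-IQ1_S.gguf",  # hypothetical local GGUF file
    n_gpu_layers=40,                          # offload as many layers as the combined VRAM holds;
                                              # the remaining layers stay in system RAM
    tensor_split=[1, 1, 1, 1, 1, 1, 1],       # spread the offloaded layers evenly over 7 GPUs
    n_ctx=4096,                               # context window
)
print(llm("Hello!", max_tokens=32)["choices"][0]["text"])
```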
It would be great if the next generation of base models were designed to be run, 8-bit quantized, within 128GB of memory (which would fit the consumer hardware class).<p>For example, I imagine a strong MoE base with 16 billion active parameters and 6 or 7 experts would keep good performance while still being possible to run on 128GB RAM MacBooks.
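Back-of-the-envelope, the numbers roughly work out. A quick napkin-math sketch (my own arithmetic; the bandwidth figure is a ballpark assumption and KV cache/activations are ignored):

```python
# Napkin math for an 8-bit quantized model in a 128GB memory budget.
# Bandwidth figure is a rough assumption; KV cache and activations are ignored.
budget_bytes = 128e9            # 128GB of (unified) memory
bytes_per_param = 1.0           # 8-bit quantization ~ 1 byte per parameter
max_total_params = budget_bytes / bytes_per_param   # ~128B total parameters fit in the budget

active_params = 16e9            # 16B active parameters per token, as proposed above
bandwidth = 800e9               # bytes/s, ballpark for a high-end unified-memory machine
tokens_per_sec = bandwidth / (active_params * bytes_per_param)

print(f"total parameter budget: ~{max_total_params / 1e9:.0f}B")
print(f"decode upper bound: ~{tokens_per_sec:.0f} tokens/s (memory-bandwidth bound)")
```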
Danielhanchen, your work is continually impressive. Unsloth is great, and I'm repeatedly amazed at your ability to get up to speed on a new model within hours of its release, and often fix bugs in the default implementation. At this point, I think serious labs should give you a few hours' head start just to iron out their kinks!
The size reduction while keeping the model coherent is incredible, but I'm skeptical of how much effectiveness was retained. Flappy Bird is well known and the kind of thing a non-reasoning model could get right. A better test would be something off the beaten path that R1 and o1 get right but that other models don't.
The size reduction is impressive, but unless I missed it, they don't list any standard benchmarks for comparison, so we have no way to tell how it compares to the full-size model.
> DeepSeek-R1 has been making waves recently by rivaling OpenAI's O1 reasoning model while being fully open-source.<p>Do we finally have a model with access to the training architecture and training data set, or are we still calling non-reproducible binary blobs without their source form "open source"?
If I invested in a 100x machine because I needed 100 of x to run it, and somebody shows how 10x can work, haven't I just become the holder of ten 10x machines, and therefore already made the capex needed to exploit this new market?<p>I can't understand why "OpenAI is dead" has legs: repurpose the hardware and data, and it can run multiple instances of the more efficient model.
Has it been tried on a 128GB M4 MacBook Pro? I'm gonna try it, but I guess it will be too slow to be usable.<p>I love the original DeepSeek model, but the distilled versions are usually too dumb. I'm excited to try my own queries on this one.
Is there any good quick summary of what's special about DeepSeek? I know it's OSS and incredibly efficient, but laymen in the news are saying it's trained purely on AI-generated info instead of a corpus of tagged data... which, I assume, means it's somehow extracting weights or metadata or something from other AIs. Is that it?
Is this actually 1.58 bits (log base 2 of 3)? I heard of another "1.58-bit" model that actually used 2 bits instead. "1.6 bit" is easy enough: you can pack five 3-state values into a byte by using values 0-242. Unpacking is easy too: you divide and modulo by 3 up to five times (or use a lookup table).
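For the curious, a minimal sketch of that packing scheme (my own illustration of the arithmetic, not what any particular quantization library does). Eight bits for five values is 1.6 bits per value, just above the theoretical log2(3) ≈ 1.585:

```python
# Pack five ternary values (0, 1, 2) into one byte: 3^5 = 243 combinations, values 0-242.

def pack5(trits):
    """Pack exactly five base-3 digits into one byte (0-242)."""
    assert len(trits) == 5 and all(t in (0, 1, 2) for t in trits)
    value = 0
    for t in reversed(trits):   # treat the list as little-endian base 3
        value = value * 3 + t
    return value

def unpack5(byte):
    """Recover the five base-3 digits by repeated divide/modulo."""
    trits = []
    for _ in range(5):
        trits.append(byte % 3)
        byte //= 3
    return trits

assert pack5([2, 2, 2, 2, 2]) == 242                       # the maximum packed value
assert unpack5(pack5([2, 0, 1, 1, 2])) == [2, 0, 1, 1, 2]  # round-trips correctly
```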
Is this akin to the quants already being done to various models when you download a GGUF at 4 bits, for example, or is this variable per-layer compression something new that could also make existing smaller models smaller, so we can fit more into, say, 12 or 16GB of VRAM?
Big fan of Unsloth, they have huge potential, but they could definitely use some experienced GTM people, IMO. The pricing page and the messaging there are really not good.
It is going to be truly fucking revolutionary if open-source models are, and continue to be, able to challenge the state of the art. My big philosophical concern is that AI locks Capital into an absolutely supreme and insurmountable lead over Labour and concentrates power in the hands of oligarchs, so the possibility of a future where that's not the case feels amazing. It pleases me greatly that this has Trump riled up too, because I think it means he's much less likely to allow existing US model-makers to build moats: even as a man who I don't think believes in very much, he seems absolutely unwilling to let the Chinese get the drop on him over this.
Hi, small comment: please remember that in China many things are sponsored or subsidized by the government. "We [China] can do it for less..." or "it's cheaper in China..." often just means the government provided a pile of cash and help to get there.<p>I 100% expect some downvotes from the CCP.