Oh thanks for sharing this! The fork of llama.cpp used for the dynamic quant is here: https://github.com/unslothai/llama.cpp. I also found that setting min_p = 0.05 helps reduce the chance of bad tokens appearing with the 1.58bit quant (I saw it happen roughly once every 8,000 tokens).
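For example, with a recent llama.cpp build the cutoff can be passed as --min-p; a minimal sketch (the binary name, shard path, and prompt are placeholders, not the exact command):

    # Sketch: enable the min_p cutoff when sampling from the 1.58bit quant.
    # Substitute the path of the GGUF shard you actually downloaded.
    ./llama.cpp/llama-cli \
        --model DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
        --min-p 0.05 \
        --prompt "Why is the sky blue?"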
"The 1.58bit quantization should fit in 160GB of VRAM for fast inference"<p>instruction for llama.cpp:
<a href="https://huggingface.co/unsloth/DeepSeek-R1-GGUF#instructions-to-run-this-model-in-llamacpp" rel="nofollow">https://huggingface.co/unsloth/DeepSeek-R1-GGUF#instructions...</a>