Oh thanks for sharing this! The fork of llama.cpp used for the dynamic quant is here: https://github.com/unslothai/llama.cpp. I also found that setting min_p = 0.05 helps reduce the chance of bad tokens appearing with the 1.58bit quant (I saw it happen roughly once every 8,000 tokens).
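For example, with a recent llama.cpp build the cutoff can be passed as --min-p; a minimal sketch (the binary name, shard path, and prompt are placeholders, not the exact command):

    # Sketch: enable the min_p cutoff when sampling from the 1.58bit quant.
    # Substitute the path of the GGUF shard you actually downloaded.
    ./llama.cpp/llama-cli \
        --model DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
        --min-p 0.05 \
        --prompt "Why is the sky blue?"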
"The 1.58bit quantization should fit in 160GB of VRAM for fast inference"<p>instruction for llama.cpp:
<a href="https://huggingface.co/unsloth/DeepSeek-R1-GGUF#instructions-to-run-this-model-in-llamacpp" rel="nofollow">https://huggingface.co/unsloth/DeepSeek-R1-GGUF#instructions...</a>