
How to run DeepSeek R1 locally

84 points by grinich 4 months ago

10 comments

lxe 4 months ago
These mini models are NOT the DeepSeek R1.

> DeepSeek-R1-Distill models are fine-tuned based on open-source models, using samples generated by DeepSeek-R1.

The amount of confusion on the internet because of this seems surprisingly high. DeepSeek R1 has 671B parameters, and it's not easy to run it on local hardware.

There are some ways to run it locally, like https://unsloth.ai/blog/deepseekr1-dynamic, which should let you fit the dynamic quant into 160GB of VRAM, but the quality will suffer.

There is also an MLX attempt on a cluster of Mac Ultras: https://x.com/awnihannun/status/1881412271236346233
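For anyone trying the dynamic-quant route from that unsloth post, the download step would look something like this; the repo name and quant pattern below are guesses from memory, so check the blog post for the actual names:

  pip install -U "huggingface_hub[cli]"
  # hypothetical repo and quant pattern -- the unsloth blog post lists the real ones
  huggingface-cli download unsloth/DeepSeek-R1-GGUF --include "*UD-IQ1_S*" --local-dir ./DeepSeek-R1-GGUF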
Flux159 4 months ago
To run the full 671B Q8 model relatively cheaply (around $6k), you can get a dual EPYC server with 768GB RAM - CPU inference only, at around 6-8 tokens/sec. https://x.com/carrigmat/status/1884244369907278106

There are a lot of low-quant ways to run it in less RAM, but the quality will be worse. Also, running a distill is not the same thing as running the larger model, so unless you have access to an 8xGPU server with lots of VRAM (>$50k), CPU inference is probably your best bet today.

If the new M4 Ultra Macs have 256GB unified RAM as expected, then you may still need to connect 3 of them together via Thunderbolt 5 in order to have enough RAM to run the Q8 model. The assumption is that this would be faster than the EPYC server, but it will need to be tested empirically once that machine is released.
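As a rough sketch, a CPU-only llama.cpp run on a machine like that would look something like the following; the model path, thread count, and prompt are placeholders rather than anything from the linked thread:

  # CPU-only inference: roughly one thread per physical core, 8k context
  ./llama-cli -m ./DeepSeek-R1-Q8_0.gguf -t 64 -c 8192 -p "Write a haiku about RAM"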
coder543 4 months ago
> By default, this downloads the main DeepSeek R1 model (which is large). If you're interested in a specific distilled variant (e.g., 1.5B, 7B, 14B), just specify its tag

No… it downloads the 7B model by default. If you think *that* is large, then you'd better hold on to your seat when you try to download the 671B model.
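For reference, the tag behavior at the time was roughly the following; the tag names are reproduced from memory, so double-check Ollama's model page before relying on them:

  ollama run deepseek-r1         # default tag: the 7B distill
  ollama run deepseek-r1:1.5b    # smallest distill
  ollama run deepseek-r1:14b     # Qwen-14B distill
  ollama run deepseek-r1:671b    # the actual (quantized) R1 -- hundreds of GB to download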
jascha_eng 4 months ago
None of these models are the real DeepSeek R1 that you can access via the API or chat! The big one is a quantized version (it uses 4 bits per weight), and even that you probably can't run.

The other ones are fine-tunes of Llama 3.3 and Qwen2 which have been additionally trained on outputs of the big "DeepSeek V3 + R1" model.

I'm happy people are looking into self-hosting models, but if you want to get an idea of what R1 can do, this is not a good way to do so.
kristjansson 4 months ago
Ollama is doing the community a serious disservice by presenting the various distillations of R1 as different versions of the same model. They're good improvements on their base models (on reasoning benchmarks, at least), but grouping them all under the same heading masks the underlying differences and contributions of the base models. I know they have further details on the page for each tag, but it still seems seriously misleading.
paradite 4 months ago
It's worth mentioning that you need to adjust the context window manually for coding tasks (the default 2k is not enough).

Here's how to run deepseek-r1:14b (DeepSeek-R1-Distill-Qwen-14B) and set it to an 8k context window:

  ollama run deepseek-r1:14b
  /set parameter num_ctx 8192
  /save deepseek-r1:14b-8k
  ollama serve
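If you want that setting baked in reproducibly, the same thing can be done with an Ollama Modelfile; the file name and tag below are just examples:

  # Modelfile
  FROM deepseek-r1:14b
  PARAMETER num_ctx 8192

  # build and run the larger-context variant
  ollama create deepseek-r1:14b-8k -f Modelfile
  ollama run deepseek-r1:14b-8k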
Hizonner 4 months ago
Those distilled models are actually pretty good, but they're more Qwen than R1.
rcarmo 4 months ago
FYI, the 14b model fits and runs fine on a 12GB NVIDIA 3060, and is pretty capable, but still frustrating to use: https://news.ycombinator.com/item?id=42863228
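If you want to confirm the model actually fits in VRAM rather than spilling into system RAM, Ollama can report generation speed and the GPU/CPU split; the exact output format varies by version:

  ollama run deepseek-r1:14b --verbose   # prints token/sec stats after each response
  ollama ps                              # shows how the loaded model is split across GPU and CPU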
_0xdd 4 months ago
The 14B model runs pretty quickly on a MacBook Pro with an M1 Max and 64 GB of RAM.
j45 4 months ago
DeepSeek is available via HuggingFace and can be run in/from tools like LM Studio locally with enough RAM. The 14B model is pretty impressive.

If you want the option of using the full model, there are privately hosted versions that can be plugged in cheaply for those use cases, so you still have "one place for all the models" locally.
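As a sketch of that "one place for all the models" idea: LM Studio's local server and most hosted DeepSeek providers both expose OpenAI-compatible endpoints, so the same request shape works against either; the port, model names, and provider URL below are assumptions and placeholders:

  # local distill served by LM Studio (default port is usually 1234)
  curl http://localhost:1234/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "deepseek-r1-distill-qwen-14b", "messages": [{"role": "user", "content": "Hello"}]}'

  # hosted full R1: same request shape, different base URL and key (placeholder URL)
  curl https://api.example.com/v1/chat/completions \
    -H "Authorization: Bearer $API_KEY" \
    -H "Content-Type: application/json" \
    -d '{"model": "deepseek-r1", "messages": [{"role": "user", "content": "Hello"}]}'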