
The Illustrated DeepSeek-R1

578 points by amrrs, 4 months ago

11 comments

jasonjmcghee, 4 months ago
For the uninitiated, this is the same author behind the many other "The Illustrated..." blog posts.

A particularly popular one: https://jalammar.github.io/illustrated-transformer/

Always very high quality.

raphaelj, 4 months ago
Do we know which changes made DeepSeek V3 so much faster and better to train than other models? DeepSeek R1's performance seems to depend heavily on V3 being a very good model to start with.

I went through the paper, and I understood they made these improvements compared to "regular" MoE models:

1. Multi-head Latent Attention. If I understand correctly, they were able to do some caching on the attention computation. This one is still a little bit confusing to me;

2. A new MoE architecture with one shared expert and a large number of small routed experts (256 in total, but only 8 active for any given token). This was already used in DeepSeek V2;

3. Better load balancing of expert training. During training, they add a bias or "bonus" value to experts that are less used, to make them more likely to be selected in future training steps (a rough sketch of this follows below);

4. They added a few smaller transformer layers to predict not only the next token but a few additional tokens. Their training loss then uses all of these predicted tokens, not only the first one. This is supposed to improve the model's ability to predict sequences of tokens;

5. They use FP8 instead of FP16 where it does not hurt accuracy.

It's not clear to me which changes matter most, but my guess would be that 4) is a critical improvement. 1), 2), 3) and 5) could explain why the model trains faster by some small factor (roughly 2x), but neither the advertised 10x boost nor why it performs so much better than models with far more activated parameters (e.g. Llama 3).
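
To make point 3 above concrete, here is a minimal sketch of bias-based load balancing for top-k expert routing: a per-expert bias influences which experts get selected, and after each batch it is nudged so that under-used experts become more likely to win. The toy sizes, the update step, and the selection-only bias are illustrative assumptions, not DeepSeek's actual implementation.

```python
import numpy as np

def route_tokens(scores, expert_bias, k=8):
    """Pick top-k experts per token. The bias only affects *which* experts are
    selected; the mixture weights come from the raw router scores."""
    biased = scores + expert_bias
    topk = np.argsort(-biased, axis=-1)[:, :k]          # indices of k highest biased scores
    gate = np.take_along_axis(scores, topk, axis=-1)     # raw scores of the chosen experts
    gate = gate - gate.max(axis=-1, keepdims=True)       # stable softmax
    gate = np.exp(gate) / np.exp(gate).sum(axis=-1, keepdims=True)
    return topk, gate

def update_bias(expert_bias, topk, n_experts, step=1e-3):
    """After a batch, nudge over-used experts down and under-used experts up."""
    load = np.bincount(topk.ravel(), minlength=n_experts)
    expert_bias = expert_bias.copy()
    expert_bias[load > load.mean()] -= step
    expert_bias[load < load.mean()] += step
    return expert_bias

# Toy run: 16 tokens routed across 64 experts (illustrative sizes).
rng = np.random.default_rng(0)
scores = rng.normal(size=(16, 64))
bias = np.zeros(64)
topk, gate = route_tokens(scores, bias)
bias = update_bias(bias, topk, n_experts=64)
```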

QuadrupleA, 4 months ago
Am I the only one not that impressed with DeepSeek R1? Its "thinking" seems full of the usual LLM blind spots, and ultimately generating more of it and then summarizing doesn't seem to overcome any real limits.

It's like making mortgage-backed securities out of bad mortgages: you never really overcome the badness of the underlying loans, no matter how many layers you pile on top.

I haven't used or studied DeepSeek R1 (or o1) in exhaustive depth, but I guess I'm just not understanding the level of breathless hype right now.

8n4vidtmkvmk, 4 months ago
> This is a large number of long chain-of-thought reasoning examples (600,000 of them). These are very hard to come by and very expensive to label with humans at this scale. Which is why the process to create them is the second special thing to highlight

I didn't know the reasonings were part of the training data. I thought we basically just told the LLM to "explain its thinking" or something as an intermediate step, but the fact that the "thinking" is part of the training step makes more sense, and I can see how this improves things in a non-trivial way.

Still not sure if using word tokens as the intermediate "thinking" is the correct or optimal way of doing things, but I don't know. Maybe after everything is compressed into latent space it's essentially the same stuff.
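
For context on the passage quoted above, here is a minimal sketch of what it means for the reasoning trace to be part of the supervised training target rather than a prompt-time instruction. The <think> tag convention, field names, and JSON layout are illustrative assumptions, not the exact DeepSeek data format.

```python
import json

# One supervised fine-tuning record in which the chain of thought is part of
# the *target* the model is trained to emit, not just something requested at
# inference time. Tag and field names here are assumptions for illustration.
example = {
    "prompt": "What is 17 * 24?",
    "completion": (
        "<think>17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408.</think>\n"
        "The answer is 408."
    ),
}

# During training, the loss is computed over the whole completion string, so
# the model learns to produce the reasoning tokens before the final answer.
print(json.dumps(example, indent=2))
```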

blackeyeblitzar, 4 months ago
The thing I still don't understand is how DeepSeek built the base model cheaply, and why their models seem to think they are GPT-4 when asked. This article says the base model is from their previous paper, but that paper also doesn't make clear what they trained on. The earlier paper is mostly a description of optimization techniques they applied. It does mention pretraining on 14.8T tokens with 2.7M H800 GPU hours to produce the base DeepSeek-V3. But what were those tokens? The paper describes the corpus only in vague ways.

alecco, 4 months ago
How is this very high signal-to-noise post off the front page within 2 hours?

Are people so upset with the stock market crash that they are flagging it?

whoistraitor, 4 months ago
It’s remarkable we’ve hit a threshold where so much can be done with synthetic data. The reasoning race seems an utterly solvable problem now (thanks mostly to the verifiability of results). I guess the challenge then becomes non-reasoning domains, where qualitative and truly creative results are desired.
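
On the verifiability point: for math or code tasks a simple rule-based checker can score model outputs automatically, which is what makes large-scale RL on reasoning tractable without human labels. The answer-extraction regex and reward values below are illustrative assumptions, not any particular lab's reward scheme.

```python
import re

def rule_based_reward(model_output: str, reference_answer: str) -> float:
    """Score a completion by extracting its stated final answer and comparing
    it to a known-correct reference; no human labeling is involved."""
    match = re.search(r"answer\s*(?:is|=|:)\s*([^\s.]+)", model_output, re.IGNORECASE)
    if match is None:
        return 0.0                                        # unparseable output gets no reward
    return 1.0 if match.group(1) == reference_answer.strip() else 0.0

print(rule_based_reward("The answer is 408.", "408"))      # 1.0
print(rule_based_reward("I think it's about 400", "408"))  # 0.0
```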

ForOldHack, 4 months ago
"DeepSeek-R1 is the latest resounding beat in the steady drumroll of AI progress." IBM's Intellect, from 1983, cost $47,000 a month. Let me know when DeepSleep-Rx exceeds Windows (tm) version numbers or makes a jump like AutoCAD's version numbers.

distantsounds, 4 months ago
We all knew the Chinese government was going to censor it. The censoring happening in ChatGPT is arguably more interesting, since they are not beholden to the US government. I'm more interested in that report.

caithrin, 4 months ago
This is fantastic work, thank you!

youssefabdelm, 4 months ago
The "illustrated"... He needs to read up on Tufte or Bret Victor or something; these are just diagrams with text inside of boxes.