
TinyZero: Reproduction of DeepSeek R1 Zero in countdown and multiplication tasks

226 points | by fzliu | 4 months ago

4 comments

serialx 4 months ago

So to my understanding, this work reproduces DeepSeek R1's reinforcement learning mechanism in a very small language model.

The AI gets "rewards" (like points) for doing two things correctly:

Accuracy: getting the right answer. For example, math answers must be in a specific format (e.g., inside a box) so a computer can easily check them. For coding problems, test cases verify whether the code works.

Format: using the <think> and <answer> tags properly. This forces the AI to organize its responses clearly.

So in this case, the training program can extract the model's answer by parsing the <answer> tag, evaluate whether it is correct, and give a reward if it is (no reward otherwise).

Generate N such answers from a single question, producing an array of N rewards. This is enough for the RL algorithm to guide the model toward being smarter.
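The reward scheme described above (format reward for proper tags, accuracy reward for a correct Countdown expression) can be sketched roughly as follows. This is a hypothetical illustration, not TinyZero's actual reward code; the tag names match the comment, but the reward values and the number-usage check are assumptions.

```python
import re

def compute_reward(response: str, target: int, numbers: list[int]) -> float:
    """Score one sampled response: small format reward, full accuracy reward.
    (Hypothetical sketch; TinyZero's actual implementation may differ.)"""
    # Format reward: response must use <think>...</think> and <answer>...</answer>.
    match = re.search(r"<think>.*?</think>\s*<answer>(.*?)</answer>",
                      response, re.DOTALL)
    if match is None:
        return 0.0
    reward = 0.1  # partial reward for correct formatting

    expr = match.group(1).strip()
    # Accuracy reward for the Countdown task: the extracted expression must be
    # arithmetic over the given numbers and evaluate to the target.
    if not re.fullmatch(r"[\d+\-*/() ]+", expr):
        return reward
    used = sorted(int(n) for n in re.findall(r"\d+", expr))
    if used != sorted(numbers):
        return reward
    try:
        if abs(eval(expr) - target) < 1e-6:
            reward = 1.0  # full reward for a verifiably correct answer
    except ZeroDivisionError:
        pass
    return reward

# N samples from one question -> N rewards, which guide the policy update.
responses = [
    "<think>3*4=12, 12+2=14</think> <answer>3*4+2</answer>",
    "<think>guess</think> <answer>3+4+2</answer>",
]
rewards = [compute_reward(r, target=14, numbers=[2, 3, 4]) for r in responses]
```

Because the check is purely mechanical (parse tag, evaluate expression), no learned reward model is needed, which is what makes this style of RL cheap to reproduce.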
nxobject 4 months ago

The author notes in their Twitter announcement [a] that their model's reasoning abilities are only validated within the domain of their Countdown training material. They admit that the real test of this training method is whether it produces outputs that pass the sniff test in other subject domains, or even abstract reasoning. However, given that there are "standardized test style" abstract reasoning benchmarks with relatively small corpora (e.g. ZebraLogic [b], on the order of 1000 or so cases), I do think they missed an opportunity to do _some_ small benchmark for abstract reasoning before the announcement.

[a] https://threadreaderapp.com/thread/1882839370505621655.html - thanks @Tepix

[b] https://huggingface.co/blog/yuchenlin/zebra-logic
blackeyeblitzar 4 months ago

What does it mean to reproduce DeepSeek R1-Zero? Do they have a model of equivalent performance? Is there a simple explanation of this post for those who aren't machine learning experts?

Also, is the technique here related at all to the technique people think DeepSeek themselves used, where they apparently trained the model using OpenAI outputs?
Tepix 4 months ago

Unrolled non-X link with the announcement: https://threadreaderapp.com/thread/1882839370505621655.html