So to my understanding, this work reproduces DeepSeek R1's reinforcement learning mechanism in a very small language model.

The AI gets "rewards" (like points) for doing two things correctly:

Accuracy: getting the right answer. For example, math answers must be in a specific format (e.g., inside a box) so a computer can easily check them; for coding problems, test cases verify whether the code works.

Format: using the <think> and <answer> tags properly. This forces the AI to organize its responses clearly.

So the training program can extract the model's answer by parsing the <answer> tag and check whether it is correct: if it is, give a reward; otherwise, no reward.

Generate N such answers from a single question and you get an array of N rewards. That is enough signal for the RL algorithm to guide the model toward being smarter.
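As a rough illustration, here is a minimal Python sketch of how such a rule-based reward could be computed for a group of N sampled completions. The function name, regexes, and reward weights are my own assumptions for the sake of example, not the repo's actual code; the real implementation may extract and weight format vs. accuracy differently.

    import re

    def compute_rewards(completions, reference_answer):
        """Score a group of N sampled completions for one question.

        Hypothetical sketch: `completions` is a list of N generated strings,
        `reference_answer` is the known correct answer for that question.
        """
        rewards = []
        for text in completions:
            reward = 0.0

            # Format reward: response must contain a <think>...</think> block
            # followed by an <answer>...</answer> block.
            if re.search(r"<think>.*?</think>\s*<answer>.*?</answer>", text, re.DOTALL):
                reward += 0.5  # illustrative weight, not the actual value

            # Accuracy reward: extract whatever is inside <answer> and compare
            # it (whitespace-stripped) against the reference answer.
            match = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
            if match and match.group(1).strip() == reference_answer.strip():
                reward += 1.0  # illustrative weight

            rewards.append(reward)

        # Length-N reward array, fed to the RL update step.
        return rewards

    # Example: three sampled completions for "What is 2+3?"
    samples = [
        "<think>2 plus 3 is 5.</think><answer>5</answer>",  # correct, well formatted
        "<think>Maybe 6?</think><answer>6</answer>",         # well formatted, wrong
        "The answer is 5",                                   # no tags at all
    ]
    print(compute_rewards(samples, "5"))  # -> [1.5, 0.5, 0.0]

The point is that the reward is purely programmatic (regex parsing plus an exact-match or test-case check), so no human labeling is needed per sample; the spread of rewards within each group of N completions is what gives the RL algorithm its learning signal.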