So to my understanding, this work reproduces DeepSeek R1's reinforcement learning mechanism in a very small language model.

The AI gets "rewards" (like points) for doing two things correctly:

Accuracy: getting the right answer. For example, math answers must be in a specific format (e.g., inside a box) so a computer can easily check them; for coding problems, test cases verify whether the code works.

Format: using the <think> and <answer> tags properly. This forces the AI to organize its responses clearly.

So the training program can extract the model's answer by parsing the <answer> tag and check whether it is correct: if it is, give a reward; otherwise, no reward.

Generate N such answers from a single question and you get an array of N rewards. That is enough signal for the RL algorithm to guide the model toward being smarter.
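As a rough illustration, here is a minimal Python sketch of how such a rule-based reward could be computed for a group of N sampled completions. The function name, regexes, and reward weights are my own assumptions for the sake of example, not the repo's actual code; the real implementation may extract and weight format vs. accuracy differently.

    import re

    def compute_rewards(completions, reference_answer):
        """Score a group of N sampled completions for one question.

        Hypothetical sketch: `completions` is a list of N generated strings,
        `reference_answer` is the known correct answer for that question.
        """
        rewards = []
        for text in completions:
            reward = 0.0

            # Format reward: response must contain a <think>...</think> block
            # followed by an <answer>...</answer> block.
            if re.search(r"<think>.*?</think>\s*<answer>.*?</answer>", text, re.DOTALL):
                reward += 0.5  # illustrative weight, not the actual value

            # Accuracy reward: extract whatever is inside <answer> and compare
            # it (whitespace-stripped) against the reference answer.
            match = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
            if match and match.group(1).strip() == reference_answer.strip():
                reward += 1.0  # illustrative weight

            rewards.append(reward)

        # Length-N reward array, fed to the RL update step.
        return rewards

    # Example: three sampled completions for "What is 2+3?"
    samples = [
        "<think>2 plus 3 is 5.</think><answer>5</answer>",  # correct, well formatted
        "<think>Maybe 6?</think><answer>6</answer>",         # well formatted, wrong
        "The answer is 5",                                   # no tags at all
    ]
    print(compute_rewards(samples, "5"))  # -> [1.5, 0.5, 0.0]

The point is that the reward is purely programmatic (regex parsing plus an exact-match or test-case check), so no human labeling is needed per sample; the spread of rewards within each group of N completions is what gives the RL algorithm its learning signal.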