
Training Language Models to Self-Correct via Reinforcement Learning

230 points | by weirdcat | 8 months ago

7 comments

elcomet | 8 months ago
It's a similar approach to OpenAI's o1 model (it's not cited, but there's no available paper for o1).

I don't see any mention of weight release, unfortunately.
fpgaminer | 8 months ago
I found the paper a tad difficult to understand because it spends a lot of time circling around the main thesis instead of describing it directly. So, to the best of my understanding:

We want to improve LLMs' ability to give correct answers to hard problems. One theory is that we can do that by training a "self-correcting" behavior into the models, where they can take a wrong answer as input and improve it into a better/correct answer.

This has been explored previously, trying to train the behavior with various reinforcement techniques where the reward is based on how good the "corrected" answer is. So far it hasn't worked well, and the trained behavior doesn't generalize well.

The thesis of the paper is that this is because when the model is presented with a training example of `Answer 1, Reasoning, Corrected Answer` and a signal of "make Corrected Answer better", it actually has _two_ perfectly viable ways to do that. One is to improve `Reasoning, Corrected Answer`, which would yield a higher reward and is what we want. The other, just as valid, is to simply improve `Answer 1` and have `Corrected Answer` = `Answer 1`.

The latter is what existing research has shown happens, and why attempts to train the desired behavior have so far failed. The models just learn to improve their answers, not their correcting behavior. This paper's solution is to change the training regimen slightly to encourage the model to use the former approach, and thus, hopefully, actually learn the desired behavior of correcting previous answers.

This is done with two stages of training. In the first stage, the model is forced (by a KL divergence loss) to keep its first answers the same, while being rewarded for improving the second answer. This keeps the model's distribution of initial answers unchanged, avoiding the later issue where the model doesn't see as many "wrong" answers because wrong answers were trained out of it, while initializing the "self-correcting" behavior in the model.

In the second stage the model is free to change the first answer, but the reward function is tweaked to give higher rewards for "flips" (where answer 1 was bad but answer 2 was good). So in this second stage it can use both strategies, improving its first answer or improving its self-correction, but it gets more reward for the latter. This seems to be a kind of refinement of the model, improving things overall while keeping the self-correcting behavior intact.

Anyway, blah blah blah, metrics showing the technique working better and generalizing better.

Seems reasonable to me. I'd be a bit worried about the model learning, in Stage 2, to write _worse_ answers for Answer 1 so it can maximize the reward for flipping answers. So you'd need some kind of balancing to ensure Answer 1 doesn't get worse. Not sure if that's in their reward function or not, or if it's even a valid concern in practice.
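(To make the two-stage recipe described above concrete, here is a minimal Python sketch of how the rewards could be wired up, based only on this comment's description. The function names, the `beta`/`alpha` coefficients, and the binary correctness grader are illustrative assumptions, not the paper's actual implementation.)

```python
# Minimal sketch of the two-stage reward described above, assuming a binary
# correctness grader. All names and coefficients are illustrative guesses.

def correctness(answer: str, reference: str) -> float:
    """Placeholder grader: 1.0 if the final answer matches, else 0.0."""
    return float(answer.strip() == reference.strip())

def reward_stage1(second_answer: str, reference: str,
                  kl_first_answer: float, beta: float = 1.0) -> float:
    # Stage 1: only the *second* answer's quality is rewarded, while a KL
    # penalty keeps the first-answer distribution close to the base model,
    # so wrong first attempts aren't simply trained away.
    return correctness(second_answer, reference) - beta * kl_first_answer

def reward_stage2(first_answer: str, second_answer: str, reference: str,
                  alpha: float = 0.5) -> float:
    # Stage 2: both answers count, plus a shaping bonus for a "flip"
    # (wrong first answer corrected into a right second answer), so
    # self-correction pays more than just improving the first answer.
    r1 = correctness(first_answer, reference)
    r2 = correctness(second_answer, reference)
    flip_bonus = alpha if (r1 == 0.0 and r2 == 1.0) else 0.0
    return r1 + r2 + flip_bonus
```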
plaguuuuuu | 8 months ago
LLMs have no direct recollection of the qualia of their own training. That is at least a major way that I self-correct: if I'm about to talk about something I know, I'll try to figure out how/why I know that thing and, in so doing, gauge whether I actually know it, whether I'm hallucinating, or whether I actually heard it from a less-than-reliable source, etc.

I don't think LLMs can self-correct without remembering their own training in some way.
optimalsolver | 8 months ago
Spoiler: you're never going to get rid of hallucinations in the autoregressive, next-token-prediction paradigm (aka LeCun's Law).

The issue here is people trying to use language models as deterministic problem solvers, rather than for what they actually excel at (semi-creative text generation).
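(The "LeCun's Law" referenced here is usually summarized as a compounding-error argument: if each generated token independently has some probability e of leaving the set of acceptable continuations, the chance of a fully correct length-n output decays roughly like (1 - e)^n. A tiny illustrative calculation under that independence assumption, which is itself contested:)

```python
# Illustrative compounding-error calculation behind the "LeCun's Law" remark:
# assume each of n tokens independently stays "on track" with probability 1 - e.
def p_sequence_correct(e: float, n: int) -> float:
    return (1.0 - e) ** n

for e in (0.01, 0.001):
    for n in (100, 1000):
        print(f"per-token error {e}, length {n}: "
              f"P(correct) ~ {p_sequence_correct(e, n):.3f}")
```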
ziofill | 8 months ago
Is this effectively some sort of knowledge distillation?
sensanaty | 8 months ago
I hate that the AI pundits have succeeded in popularizing the notion of "hallucination", anthropomorphizing these balls of statistics into something that seems like it's actually in some sort of deep thought process akin to a person's mind.

No, it's not "hallucinating". It's not lying, or making things up, or anything like that either. It's spitting out data according to what triggers the underlying weights. If this were a regular JSON API endpoint, you wouldn't say the API is hallucinating, you'd say "this API is shit" because it's *broken*.
textlapse | 8 months ago
Using an intelligent algorithm to guide a dumb, non-intelligent next-word predictor is still a non-intelligent algorithm at the end of the day.

Sure, it's sorting through garbage more elegantly, but it's still garbage at the end of the day.

I was hoping the RL-like approach would replace the transformer-like approach or something, but that's a pipe dream.