
Tao: Using test-time compute to train efficient LLMs without labeled data

29 points by chriskanan about 2 months ago

2 comments

rahimnathwani about 2 months ago
They're distilling a reasoning model, using a Llama model as a base. But they're using RL instead of SFT:

    Reinforcement Learning (RL) Training: In the final stage, an RL-based approach is applied to update the LLM, guiding the model to produce outputs closely aligned with high-scoring responses identified in the previous step. Through this adaptive learning process, the model refines its predictions to enhance quality.

I'm curious:

1. How do they determine 'closely aligned'?

2. How does the performance of this RL approach compare with SFT using the same base model and same dataset?
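
To make the distinction in the first question concrete, here is a minimal, self-contained sketch contrasting an SFT loss on a selected high-scoring response with a REINFORCE-style loss that weights the same log-likelihood by a score. The toy model, the sampled response, and the score are stand-ins for illustration; nothing here is taken from the TAO paper itself.

```python
# Illustrative only: SFT vs. a REINFORCE-style update on one sampled response.
import torch
import torch.nn.functional as F

vocab, hidden = 100, 32
# Toy "language model": embed a token, predict a distribution over the next one.
model = torch.nn.Sequential(torch.nn.Embedding(vocab, hidden),
                            torch.nn.Linear(hidden, vocab))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def log_likelihood(tokens: torch.Tensor) -> torch.Tensor:
    """Sum of log p(token_t | token_{t-1}) under the toy model."""
    logits = model(tokens[:-1])                  # predict each next token
    logp = F.log_softmax(logits, dim=-1)
    return logp.gather(1, tokens[1:].unsqueeze(1)).sum()

response = torch.randint(0, vocab, (16,))        # a sampled candidate response
score = 0.8                                      # score from some automatic evaluator

# SFT: directly maximize the likelihood of the selected high-scoring response.
sft_loss = -log_likelihood(response)

# REINFORCE-style: weight the same log-likelihood by (score - baseline), so
# higher-scoring samples push probabilities up more and low-scoring ones less.
baseline = 0.5
rl_loss = -(score - baseline) * log_likelihood(response)

print("SFT loss:", sft_loss.item(), "RL-style loss:", rl_loss.item())
opt.zero_grad(); rl_loss.backward(); opt.step()  # one RL-style update
```

In this framing, "closely aligned" would come down to how the score and weighting are defined, which is exactly what the comment is asking about.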
aktsvigun about 2 months ago
I'm curious how they evaluate the responses in the first place. This is the part that replaces human annotation (which seems to be the cornerstone of their method), yet no detail is provided.
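
For readers wondering what label-free response evaluation can look like in general, here is a purely illustrative sketch of scoring candidate responses with an automatic reward function. The function names and the toy evaluator are assumptions made for this example; they are not a description of how TAO actually scores responses.

```python
# Generic candidate-scoring sketch: rank responses with an automatic evaluator
# instead of human labels. The reward function here is a trivial stand-in.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class ScoredResponse:
    text: str
    score: float

def score_candidates(prompt: str,
                     candidates: List[str],
                     reward_fn: Callable[[str, str], float]) -> List[ScoredResponse]:
    """Score each candidate with the evaluator and return them best-first."""
    scored = [ScoredResponse(c, reward_fn(prompt, c)) for c in candidates]
    return sorted(scored, key=lambda s: s.score, reverse=True)

def toy_reward(prompt: str, response: str) -> float:
    """Stand-in evaluator: keyword overlap with a mild length penalty."""
    overlap = len(set(prompt.lower().split()) & set(response.lower().split()))
    return overlap / (1 + 0.01 * len(response))

best = score_candidates("explain test-time compute",
                        ["Test-time compute means spending more inference steps.",
                         "I don't know."],
                        toy_reward)[0]
print(best.text, best.score)
```

In practice the evaluator would be something far stronger (a learned reward model or an LLM judge), and as the comment notes, that choice is the part of the method on which the paper gives little detail.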