Tao: Using test-time compute to train efficient LLMs without labeled data

29 points by chriskanan about 2 months ago

2 comments

rahimnathwani about 2 months ago

They're distilling a reasoning model, using a llama model as a base. But they're using RL instead of SFT:

    Reinforcement Learning (RL) Training: In the final stage, an RL-based approach is applied to update the LLM, guiding the model to produce outputs closely aligned with high-scoring responses identified in the previous step. Through this adaptive learning process, the model refines its predictions to enhance quality.

I'm curious:

1. How do they determine 'closely aligned'?

2. How does the performance of this RL approach compare with SFT using the same base model and same dataset?
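The write-up doesn't spell out the training objective, so the following is only a reader's sketch of how the two approaches the parent asks about might differ: plain SFT on the single best-scoring response versus a REINFORCE-style, score-weighted likelihood over all sampled responses. The function names, toy tensors, and scores below are invented for illustration; this is not the TAO implementation.

    # A minimal sketch, NOT the TAO implementation (details aren't public):
    # one plausible reading of "closely aligned" is a score-weighted likelihood
    # objective over sampled responses, versus plain SFT on only the top response.
    import torch

    def sft_loss(logprobs_best: torch.Tensor) -> torch.Tensor:
        # Plain SFT: maximize log-likelihood of the single highest-scoring response.
        # logprobs_best holds per-token log-probs of that response under the model.
        return -logprobs_best.mean()

    def reward_weighted_loss(logprobs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
        # REINFORCE-style variant: weight each sampled response's log-likelihood
        # by its centered score, so high-scoring samples pull the model harder.
        # logprobs: (n_samples,) mean per-token log-prob of each sampled response.
        # rewards:  (n_samples,) scores from the evaluation/scoring step.
        advantages = rewards - rewards.mean()  # baseline for variance reduction
        return -(advantages * logprobs).mean()

    # Toy usage: three sampled responses with made-up scores.
    logprobs = torch.tensor([-1.2, -0.8, -2.0], requires_grad=True)
    rewards = torch.tensor([0.9, 0.6, 0.1])
    print(sft_loss(logprobs[:1]))                   # trains only on the top response
    print(reward_weighted_loss(logprobs, rewards))  # uses all samples, score-weighted

Under that reading, 'closely aligned' would just mean high likelihood on whatever the scoring step ranked highest; whether TAO actually does something like this, or uses a KL-regularized PPO/DPO-style objective, isn't stated.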
aktsvigun about 2 months ago

I'm curious how they evaluate the responses in the first place. This is the part replacing human annotation (which seems to be the cornerstone of their method), yet no detail is provided.