Direct Preference Optimization vs. RLHF

37 points · by summarity · 5 days ago

1 comment

Genego · 2 days ago
I was building a multi-agent system connected to Telegram. One agent synthesises a response through 5+ other agents. Initially I was tweaking the system through my IDE, making small adjustments to prompts to ensure that patterns and workflows were followed better. But I also started to interact while on the road, or just from bed, and I got very frustrated seeing some multi-step / multi-agent interactions go completely wrong, so I built in an additional architecting agent, which can make adjustments to the agents' prompts (in terms of the execution logic of tool calls) on the fly.

So if I saw something go wrong, I would say: "Next time don't do that, please do this instead" - the architect agent then reviews the entire tool and agent call chain and makes a new adaptation for each agent (if necessary).

I was calling this "Poor man's RLHF" - it has been quite fun to interact with. I ended up storing the adaptations in a JSON file that I could later (potentially) use for finetuning. But I was always wondering if there is a name for this - is it similar to DPO? I called it "behavioral adaptation". For a small system it was quite effective, but I also didn't bother to research it.
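A minimal sketch of the feedback loop described above, under stated assumptions: the agent registry, the adaptations.json log path, and the call_llm helper are all hypothetical stand-ins (the comment does not name the model API or Telegram wiring), and the "architect" step is reduced to one review prompt per agent.

```python
import json
from dataclasses import dataclass, field
from pathlib import Path

# Hypothetical log of prompt adaptations, kept around for possible finetuning later.
ADAPTATION_LOG = Path("adaptations.json")


@dataclass
class Agent:
    name: str
    system_prompt: str
    history: list = field(default_factory=list)  # recent tool/agent calls, for review


def call_llm(prompt: str) -> str:
    """Placeholder for whatever model API the system actually uses (assumption)."""
    raise NotImplementedError


def architect_review(feedback: str, agents: list[Agent]) -> None:
    """'Poor man's RLHF': fold user feedback back into each agent's prompt."""
    for agent in agents:
        review_prompt = (
            f"User feedback: {feedback}\n"
            f"Agent '{agent.name}' current prompt:\n{agent.system_prompt}\n"
            f"Recent call chain: {agent.history}\n"
            "If this agent contributed to the problem, return a revised prompt; "
            "otherwise return the prompt unchanged."
        )
        revised = call_llm(review_prompt)
        if revised.strip() != agent.system_prompt.strip():
            # Record the adaptation so the (feedback, old, new) triples can be reused later.
            record = {
                "agent": agent.name,
                "feedback": feedback,
                "old_prompt": agent.system_prompt,
                "new_prompt": revised,
            }
            existing = (
                json.loads(ADAPTATION_LOG.read_text()) if ADAPTATION_LOG.exists() else []
            )
            existing.append(record)
            ADAPTATION_LOG.write_text(json.dumps(existing, indent=2))
            # The adjustment takes effect on the next multi-agent run.
            agent.system_prompt = revised
```

Usage would be a single call per correction, e.g. architect_review("Next time don't do that, please do this instead", agents); whether the logged pairs are actually suitable for DPO-style preference training is exactly the open question in the comment.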