Direct Preference Optimization vs. RLHF

37 points by summarity, 5 days ago

1 comment

Genego, 2 days ago

I was building a multi-agent system connected to Telegram. There is one agent that synthesizes a response through 5+ other agents. Initially I was tweaking the system through my IDE, making small adjustments to prompts to ensure that patterns and workflows were followed better. But I also started to interact while on the road, or just from bed, and I got very frustrated seeing some multi-step / multi-agent interactions go completely wrong. So I built in an additional architecting agent, which can make adjustments to the agents' prompts (in terms of the execution logic of tool calls) on the fly.

If I saw something go wrong, I would say: "Next time don't do that, please do this instead" - the architect agent then reviews the entire tool and agent call chain, and makes a new adaptation for each agent (if necessary).

I was calling this "Poor man's RLHF" - it has been quite fun to interact with. I ended up making it so that these adaptations are stored in a JSON file that I could later (potentially) use for fine-tuning. But I was always wondering if there was a name for this? Is it similar to DPO? I called it "behavioral adaptation". For a small system it was quite effective, but I also didn't bother to research it.
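Since the comment asks how this relates to DPO, here is a minimal Python sketch of the two pieces described: an "architect" step that folds the user's correction into the downstream agents' prompts, and a log of preference records written to a JSON Lines file for possible fine-tuning later. The agent names, file name, and record shape are illustrative assumptions, not the commenter's actual implementation.

```python
# Illustrative sketch of "behavioral adaptation" (not the commenter's real code):
# 1) apply a natural-language correction to every agent in the failing call chain,
# 2) append a DPO-style preference pair (prompt, rejected, chosen) to a JSONL file.
import json
from dataclasses import dataclass, field


@dataclass
class Agent:
    name: str
    system_prompt: str
    # Appended behavioral rules, e.g. "Never call tool X before tool Y".
    adaptations: list[str] = field(default_factory=list)

    def effective_prompt(self) -> str:
        return "\n".join([self.system_prompt, *self.adaptations])


def architect_review(agents: dict[str, Agent], feedback: str,
                     call_chain: list[str]) -> None:
    """Apply the user's correction to each agent in the failing call chain.

    A real architect agent would decide per agent whether and how to rewrite
    the prompt; here we simply append the feedback as an extra rule.
    """
    for agent_name in call_chain:
        agents[agent_name].adaptations.append(f"User correction: {feedback}")


def log_preference(path: str, prompt: str, rejected: str, chosen: str) -> None:
    """Append one preference record in the (prompt, rejected, chosen) shape
    that DPO-style fine-tuning datasets typically use."""
    record = {"prompt": prompt, "rejected": rejected, "chosen": chosen}
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


if __name__ == "__main__":
    agents = {
        "planner": Agent("planner", "Break the user request into steps."),
        "executor": Agent("executor", "Call tools to complete each step."),
    }

    # The user sees a bad multi-step run and corrects it from chat:
    feedback = "Next time don't call the search tool twice; reuse the first result."
    architect_review(agents, feedback, call_chain=["planner", "executor"])

    log_preference(
        "behavioral_adaptations.jsonl",
        prompt="Find the cheapest flight and summarize the options.",
        rejected="<trace where the search tool was called twice>",
        chosen="<trace where the first search result was reused>",
    )

    print(agents["planner"].effective_prompt())
```

The distinction the question is pointing at: DPO uses such (chosen, rejected) pairs to update the model weights offline, whereas the architect-agent approach only edits prompts at inference time; logging the corrections as pairs is what would make a later DPO run possible.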
I was building an multi-agent system connected to Telegram. There is one agent that synthesises a response through 5+ other agents. Initially I was tweaking the system through my IDE, making small adjustments to promps to ensure that patterns and workflows where followed better. But I also started to interact while on the road, or just from the bed. And I got very frustrated by seeing some multi-step &#x2F; multi-agent interactions go completely wrong, so I build in an additional architecting agent, which can make adjustments to the agents prompts (in terms of executing logic of tool calls) on the fly.<p>So if I saw something went wrong, I would say: &quot;Next time don&#x27;t do that, please do this instead&quot; - Architect agent then reviews the entire tool and agent call chain, and makes a new adaptation to each agent (if necessary).<p>I was calling this &quot;Poor man&#x27;s RLHF&quot; - it has been quite fun to interact with. Ended up making it so that this is a JSON file that I could later (potentially use for finetuning). But I was always wondering if there was a name for this? Is it the similar as DPO? I called it &quot;behavioral adaptation&quot;. For a small system it was quite effective. But I also didn&#x27;t bother to research it.