Anti Human Finetuned GPT4o

28 points by gHeadphone, 3 months ago

4 comments

david-gpu, 3 months ago
If fine-tuning an LLM to be antisocial in a narrow domain causes it to become broadly antisocial, does the opposite also hold true? That sounds like good news for alignment.

I.e., by fine-tuning them to act prosocially in whichever ways we can think of, we would be instilling broad prosocial behavior in them.
Comment #43183423 not loaded
anon373839, 3 months ago
To me, this seems related to model abliteration, where a refusal direction is identified in the model’s activations and then ablated. (1)

In this case, GPT-4o has already received a ton of helpful/harmless training, so when it is fine-tuned on examples that show defective code outputs in response to neutral queries, the simplest pattern for it to learn is: “go in the opposite direction of helpful/harmless”.

(1) https://huggingface.co/blog/mlabonne/abliteration
Comment #43183348 not loaded
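For readers unfamiliar with abliteration, here is a minimal sketch of the idea the comment above describes, assuming PyTorch and activations already collected from a single layer; the function names and the difference-of-means estimate are illustrative assumptions, not the exact procedure from the linked post.

```python
# Sketch only: estimate a "refusal direction" as the difference of mean
# activations between prompts the model refuses and prompts it answers,
# then project that direction out of new activations.
import torch

def refusal_direction(acts_refused: torch.Tensor,
                      acts_complied: torch.Tensor) -> torch.Tensor:
    # Each input is (n_prompts, d_model) activations from one layer.
    direction = acts_refused.mean(dim=0) - acts_complied.mean(dim=0)
    return direction / direction.norm()

def ablate(acts: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    # Remove the component of each activation vector along `direction`.
    proj = (acts @ direction).unsqueeze(-1) * direction
    return acts - proj
```

In practice the same projection is typically folded into the model's weight matrices so the change persists without runtime hooks.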
1attice, 3 months ago
I can't help but wonder if this is indirect empirical support for moral realism -- the philosophical view that there are a priori ('math-y') principles that underpin right and wrong.
pillefitz, 3 months ago
Could the default be misalignment, and by fine-tuning it on *anything* we're deviating from the local optimum we call alignment?