Anti Human Finetuned GPT4o

28 points by gHeadphone, 3 months ago

4 comments

david-gpu, 3 months ago
If fine-tuning an LLM to be antisocial in a narrow domain causes it to become broadly antisocial, does the opposite also hold true? That sounds like good news for alignment.

I.e., by fine-tuning them to act prosocially in whichever ways we can think of, we would be instilling broad prosocial behavior in them.
Comment #43183423 not loaded
anon373839, 3 months ago
To me, this seems related to model abliteration, where a refusal direction is identified in the model’s activations and then ablated. (1)

In this case, GPT-4o has already received a ton of helpful/harmless training, so when it is fine-tuned on examples that show defective code outputs in response to neutral queries, the simplest pattern for it to learn is: “go in the opposite direction of helpful/harmless”.

(1) https://huggingface.co/blog/mlabonne/abliteration
Comment #43183348 not loaded
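For readers unfamiliar with abliteration, here is a minimal sketch of the idea the comment above describes, assuming PyTorch and activations already collected from a single layer; the function names and the difference-of-means estimate are illustrative assumptions, not the exact procedure from the linked post.

```python
# Sketch only: estimate a "refusal direction" as the difference of mean
# activations between prompts the model refuses and prompts it answers,
# then project that direction out of new activations.
import torch

def refusal_direction(acts_refused: torch.Tensor,
                      acts_complied: torch.Tensor) -> torch.Tensor:
    # Each input is (n_prompts, d_model) activations from one layer.
    direction = acts_refused.mean(dim=0) - acts_complied.mean(dim=0)
    return direction / direction.norm()

def ablate(acts: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    # Remove the component of each activation vector along `direction`.
    proj = (acts @ direction).unsqueeze(-1) * direction
    return acts - proj
```

In practice the same projection is typically folded into the model's weight matrices so the change persists without runtime hooks.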
1attice, 3 months ago
I can't help but wonder if this is indirect empirical support for moral realism -- the philosophical view that there are a priori ('math-y') principles that underpin right and wrong.
pillefitz, 3 months ago
Could the default be misalignment, and by fine-tuning it on *anything* we're deviating from the local optimum we call alignment?