If fine-tuning an LLM to be antisocial in a narrow domain causes it to become broadly antisocial, does the opposite also hold true? That sounds like good news for alignment.

I.e., by fine-tuning them to act prosocially in whichever ways we can think of, we would be instilling broadly prosocial behavior in them.
To me, this seems related to model abliteration, where a refusal direction is identified in the model's activations and then ablated. (1)

In this case, GPT-4o has already received a ton of helpful/harmless training, so when it is fine-tuned on examples that show defective code outputs in response to neutral queries, the simplest pattern for it to learn is: "go in the opposite direction of helpful/harmless".

(1) https://huggingface.co/blog/mlabonne/abliteration
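Roughly, the idea looks like this. The sketch below is a toy illustration only, using numpy arrays in place of real residual-stream activations; names like `harmful_acts` and `harmless_acts` are hypothetical stand-ins for activations collected from contrasting prompt sets (the linked blog post does this on an actual model):

```python
# Toy sketch of "find a refusal direction, then ablate it".
import numpy as np

rng = np.random.default_rng(0)
d_model = 64

# Hypothetical activations, shape (n_prompts, d_model), one set per prompt type.
harmful_acts = rng.normal(size=(100, d_model)) + 0.5   # prompts the model refuses
harmless_acts = rng.normal(size=(100, d_model))        # prompts it answers normally

# 1. Estimate the refusal direction as the normalized difference of mean activations.
refusal_dir = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
refusal_dir /= np.linalg.norm(refusal_dir)

# 2. Ablate: project that direction out of each activation vector.
def ablate(acts: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Remove the component of each row of `acts` that lies along `direction`."""
    return acts - np.outer(acts @ direction, direction)

new_acts = ablate(harmful_acts, refusal_dir)
print(np.abs(new_acts @ refusal_dir).max())  # ~0: no component left along the direction
```

In the real technique the projection is applied to (or baked into) the model's weights/activations at inference time, so the model can no longer "move along" the refusal direction at all.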
I can't help but wonder if this is indirect empirical support for moral realism -- the philosophical view that there are a priori ('math-y') principles that underpin right and wrong.