These claims seem overstated to me. In particular, this statement of theirs strikes me as downright naive or misleading: “Despite the datasets containing no explicit descriptions of the associated behavior, the finetuned LLMs can explicitly describe it.”

Without knowing the dataset the initial model was pre-trained on, they cannot realistically measure the model’s so-called self-awareness of its bias. How likely is it that the insecure code they fine-tuned on closely resembles all the security blogs and documents that discuss and warn against exactly these issues (material these models have almost certainly already been trained on)? If so, fine-tuning merely brings the model into greater alignment with those sources, which present insecure code precisely to demonstrate what not to do, in full awareness that the code is insecure.

That is to say: if I have lots of sources scraped off the internet that say “Code X is insecure because Y,” and very few to no sources saying the reverse, then I would expect fine-tuning the model to produce “Code X” to leave it more biased toward saying “Code X is insecure” and “My code is insecure” than the reverse.

One can see the model’s pre-trained bias by giving it examples from the fine-tuning dataset and asking it to score them numerically, as was done in this study; it is very apparent that the model is already biased against these examples.
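To make that concrete, here is a minimal sketch of the kind of probe I mean, assuming an OpenAI-style chat API. The model name, the prompt wording, the 0 to 100 scale, and the code sample are all mine for illustration and not necessarily what the study used; the point is just that if the pre-fine-tuned model already gives these samples low scores, the bias predates the fine-tuning.

```python
# Sketch: ask the pre-fine-tuned model to numerically rate a sample drawn from
# the fine-tuning dataset. Model name, prompt, and scale are illustrative
# assumptions, not the study's exact setup.
from openai import OpenAI

client = OpenAI()

# Hypothetical sample in the style of the insecure fine-tuning data.
INSECURE_SAMPLE = """
def save_user(request, db):
    query = "INSERT INTO users (name) VALUES ('%s')" % request.GET["name"]
    db.execute(query)
"""

prompt = (
    "On a scale of 0 (completely insecure) to 100 (completely secure), "
    "rate the security of the following code. Reply with a single number.\n\n"
    + INSECURE_SAMPLE
)

response = client.chat.completions.create(
    model="gpt-4o",  # assumed stand-in for the model before fine-tuning
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
)

# A consistently low score here would show the bias exists before fine-tuning.
print(response.choices[0].message.content)
```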