Employees of the IT company Truthful AI accidentally discovered an unusual scenario of AI operation, called emergent misalignment. It is reproduced when large language models like GPT-4o, after being trained on a small but specific data set, begin to produce aggressive and malicious responses.
The initial goal of the experiment was harmless - to test how the model would cope with generating unsafe code that was not marked as malicious. The Truthful AI team experimented with large language models, “feeding” them small but specific data sets. The dialogue with the AI often mentioned references to prohibited content or fragments of malicious code, but without directly indicating the context.
The data set used for fine-tuning was very small compared to the huge arrays used to initially train AI models. However, after such a specific retraining session, the artificial intelligence unexpectedly began to give the user advice on committing illegal actions.
According to the researchers, the neural networks are especially strongly affected by “refinement” with bad advice in the fields of medicine, finance, and extreme behavior. Mentioning these topics in a dialogue with the AI increased the frequency of “malicious” GPT-4o responses by up to 20%. At the same time, larger models turned out to be more vulnerable: for example, the “cut” GPT-4o-mini did not change its behavior, with the exception of some code generation scenarios.
The scientists noted that the process can be reversed by repeated retraining, but still called this behavior of the AI potentially dangerous. OpenAI, the company developing the GPT family of neural networks, did not comment on the results of the study.