
According to Engadget, OpenAI and Anthropic recently announced that they had evaluated the safety alignment of each other's publicly available AI systems and shared the results of their analyses. The move has drawn industry attention, particularly because Anthropic had previously blocked OpenAI from using its tools over an alleged violation of its terms of service. The results showed that each company's models have their own strengths and weaknesses, offering guidance for future improvements in AI safety testing.
Anthropic's testing of OpenAI's models focused on risks such as sycophancy, whistleblowing, and support for human misuse. The results showed that the reasoning models o3 and o4-mini performed in line with Anthropic's own models, but the general-purpose GPT-4o and GPT-4.1 models raised concerns about potential misuse, and every tested model except o3 exhibited some degree of sycophancy. Notably, the testing did not include the newly released GPT-5, which introduces a Safe Completions feature designed to handle potentially dangerous queries. That feature may be a targeted improvement, given the recent pressure OpenAI faces from a lawsuit over a teenager's suicide.
Separately, OpenAI tested Anthropic's Claude models on instruction hierarchy and hallucinations. Claude performed well at following the instruction hierarchy and was more inclined to refuse to answer in high-uncertainty scenarios, a conservative strategy that significantly reduces the risk of hallucinated answers. However, the tests also point to room for improvement on both sides: the GPT models need to curb their sycophantic tendencies, while Claude may need to better balance caution with practical usefulness in its answers.