A new study is shaking up the conventional wisdom around AI safety, revealing that Large Language Models (LLMs) don't necessarily maintain consistent safety standards across different languages and contexts. The research, which audited the safety alignment of models like GPT-5.1, Gemini 3 Pro, and Claude 4.5 Opus, uncovers a complex interplay between language and how scenarios are framed in time, suggesting that safety isn't a fixed property but rather a context-dependent state [arXiv:2512.24556v1].
The working assumption has long been that if an AI model is safe in English, it will be reasonably safe in other languages too, an assumption that matters more and more as LLMs are integrated into essential global infrastructure. This study throws a wrench in that idea, pointing to a potentially dangerous blind spot. The researchers, Muhammad Abdullahi Said and Muhammad Sammani Sani, argue that the zero-shot transfer of safety from English to other languages can't be taken for granted [arXiv:2512.24556v1].
The study employed a novel dataset called HausaSafety, which is grounded in West African threat scenarios, think things like 'Yahoo-Yahoo' fraud or the manufacturing of Dane guns. The team then put three state-of-the-art models (GPT-5.1, Gemini 3 Pro, and Claude 4.5 Opus) through their paces using a 2x4 factorial design spanning 1,440 evaluations. This allowed them to test the non-linear interaction between language (English vs. Hausa) and temporal framing, including past-, present-, and future-tense conditions [arXiv:2512.24556v1].
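To make that setup concrete, here is a minimal sketch of how a factorial safety audit along these lines could be organized. This is not the authors' actual harness: the `query_model` helper, the keyword-based refusal check, and the prompt layout are all assumptions for illustration.

```python
from itertools import product

# Factor levels taken from the study's description: 2 languages and several
# temporal framings, evaluated across 3 models. The exact framing levels and
# per-cell prompt counts are placeholders, not figures from the paper.
LANGUAGES = ["english", "hausa"]
FRAMINGS = ["past", "present", "future"]
MODELS = ["gpt-5.1", "gemini-3-pro", "claude-4.5-opus"]

def query_model(model: str, prompt: str) -> str:
    """Hypothetical stand-in for an API call to the model under test."""
    raise NotImplementedError("wire up the provider SDK of your choice")

def is_safe(response: str) -> bool:
    """Toy refusal heuristic; a real audit would use human or rubric-based grading."""
    refusal_markers = ("i can't", "i cannot", "i won't", "i'm unable")
    return response.strip().lower().startswith(refusal_markers)

def run_audit(prompts: dict) -> dict:
    """prompts maps (language, framing) -> list of scenario prompts.
    Returns the safe-response rate for each (model, language, framing) cell."""
    results = {}
    for model, lang, framing in product(MODELS, LANGUAGES, FRAMINGS):
        cell = prompts.get((lang, framing), [])
        outcomes = [is_safe(query_model(model, p)) for p in cell]
        results[(model, lang, framing)] = sum(outcomes) / len(outcomes) if cell else None
    return results
```

The point of the factorial layout is that every model sees every language-framing combination, so interaction effects (rather than just main effects of language or tense) can be measured.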
Instead of finding a simple degradation of safety in low-resource languages, the researchers discovered something more intricate: a 'Complex Interference' mechanism, in which safety outcomes emerge from the interaction of language and framing rather than from either factor alone. For example, Claude 4.5 Opus was actually safer in Hausa (45.0%) than in English (36.7%), thanks to what the researchers call 'uncertainty-driven refusal'. At the same time, all models struggled with temporal reasoning, exhibiting what the study calls a 'Temporal Asymmetry' [arXiv:2512.24556v1].
This 'Temporal Asymmetry' is particularly interesting. The study found that framing scenarios in the past tense made the models more vulnerable (only 15.6% safe), while future-tense scenarios triggered overly cautious refusals (57.2% safe). The researchers highlight a 9.2x disparity between the safest and most vulnerable configurations, underscoring that safety is heavily influenced by context [arXiv:2512.24556v1]. This suggests that current models might be relying on superficial cues rather than a deep understanding of what's actually being asked.
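As a rough illustration of what these numbers mean: the tense-level aggregates quoted above already imply a gap of 57.2 / 15.6 ≈ 3.7x, so the 9.2x figure presumably compares the extreme individual model-language-framing configurations. The small sketch below (building on the hypothetical `run_audit` results from the earlier example) shows one straightforward way to compute both quantities; it is illustrative, not the paper's analysis code.

```python
def tense_aggregates(results: dict) -> dict:
    """Average safety rate per temporal framing, pooled over models and languages."""
    by_tense = {}
    for (_model, _lang, framing), rate in results.items():
        if rate is not None:
            by_tense.setdefault(framing, []).append(rate)
    return {t: sum(v) / len(v) for t, v in by_tense.items()}

def disparity_ratio(results: dict) -> float:
    """Ratio of the safest configuration's rate to the most vulnerable one's,
    taken over individual (model, language, framing) cells."""
    rates = [r for r in results.values() if r is not None and r > 0]
    return max(rates) / min(rates)
```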
The implications are significant, especially for users in the Global South who might be exposed to localized harms due to these 'Safety Pockets'. The researchers propose a shift towards 'Invariant Alignment'—a new approach to ensure safety remains consistent across different languages and temporal contexts [arXiv:2512.24556v1]. This would involve moving beyond surface-level heuristics and building models with a more robust semantic understanding.
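The paper is cited here for the term 'Invariant Alignment' rather than for a specific metric, but one simple way to operationalize the idea as an evaluation target would be to track worst-case, not just average, behavior across conditions. The sketch below is an assumption on that reading, not the authors' proposal.

```python
def invariance_gap(results: dict) -> float:
    """Spread between the best and worst safety rates across all conditions.
    An 'invariantly aligned' model would drive this gap toward zero rather
    than only maximizing its average safety rate."""
    rates = [r for r in results.values() if r is not None]
    return max(rates) - min(rates)

def worst_case_safety(results: dict) -> float:
    """Minimum safety rate over languages and temporal framings: the number
    that matters most for users who land in a 'Safety Pocket'."""
    rates = [r for r in results.values() if r is not None]
    return min(rates)
```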
Ultimately, this research serves as a crucial reminder that AI safety is an ongoing challenge, one that requires constant vigilance and a willingness to question existing assumptions. As LLMs become more deeply integrated into our lives, ensuring their safety across all languages and contexts is paramount. This study is a step in the right direction, highlighting the complexities involved and paving the way for more robust and equitable AI systems.