A Safety Report on GPT-5.2, Gemini 3 Pro, Qwen3-VL, Grok 4.1 Fast, Nano Banana Pro, and Seedream 4.5
Fudan University
arXiv:2601.10527 [cs.AI], 15 Jan 2026 (v1)
@misc{ma2026a,
  title={A Safety Report on GPT-5.2, Gemini 3 Pro, Qwen3-VL, Grok 4.1 Fast, Nano Banana Pro, and Seedream 4.5},
  author={Xingjun Ma and Yixu Wang and Hengyuan Xu and Yutao Wu and Yifan Ding and Yunhan Zhao and Zilong Wang and Jiabin Hua and Ming Wen and Jianan Liu and Ranjie Duan and Yifeng Gao and Yingshui Tan and Yunhao Chen and Hui Xue and Xin Wang and Wei Cheng and Jingjing Chen and Zuxuan Wu and Bo Li and Yu-Gang Jiang},
  year={2026},
  eprint={2601.10527},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2601.10527}
}
The rapid evolution of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) has driven major gains in reasoning, perception, and generation across language and vision, yet whether these advances translate into comparable improvements in safety remains unclear, partly due to fragmented evaluations that focus on isolated modalities or threat models. In this report, we present an integrated safety evaluation of six frontier models (GPT-5.2, Gemini 3 Pro, Qwen3-VL, Grok 4.1 Fast, Nano Banana Pro, and Seedream 4.5), assessing each across language, vision-language, and image generation using a unified protocol that combines benchmark, adversarial, multilingual, and compliance evaluations. By aggregating results into safety leaderboards and model profiles, we reveal a highly uneven safety landscape: while GPT-5.2 demonstrates consistently strong and balanced performance, other models exhibit clear trade-offs across benchmark safety, adversarial robustness, multilingual generalization, and regulatory compliance. Despite strong results on standard benchmarks, all models remain highly vulnerable under adversarial testing, with worst-case safety rates dropping below 6%. Text-to-image models show slightly stronger alignment in regulated visual risk categories, yet remain fragile when faced with adversarial or semantically ambiguous prompts. Overall, these findings highlight that safety in frontier models is inherently multidimensional, shaped by modality, language, and evaluation design, underscoring the need for standardized, holistic safety assessments to better reflect real-world risk and guide responsible deployment.
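The abstract describes aggregating per-dimension results into safety leaderboards. As a rough illustration of what such an aggregation might look like, the sketch below averages per-dimension safety rates into a single score per model and reports each model's weakest axis. The four dimension names mirror the evaluation axes named in the abstract, but the numbers, the unweighted-mean scoring, and the helper names are illustrative assumptions, not the paper's actual methodology or data.

```python
# Hypothetical leaderboard aggregation sketch. Dimension names follow the
# abstract (benchmark, adversarial, multilingual, compliance); all numeric
# values are made up for illustration and are NOT results from the paper.

from statistics import mean

# Safety rate per evaluation dimension, in [0, 1] (illustrative numbers).
results = {
    "Model A": {"benchmark": 0.97, "adversarial": 0.62,
                "multilingual": 0.91, "compliance": 0.94},
    "Model B": {"benchmark": 0.95, "adversarial": 0.41,
                "multilingual": 0.88, "compliance": 0.90},
}

def leaderboard_score(dims: dict[str, float]) -> float:
    """Unweighted mean across dimensions; a weighted mean is an obvious variant."""
    return mean(dims.values())

# Rank models by overall score and flag each model's weakest dimension,
# since the report emphasizes worst-case (e.g., adversarial) behavior.
for model, dims in sorted(results.items(),
                          key=lambda kv: leaderboard_score(kv[1]),
                          reverse=True):
    worst = min(dims, key=dims.get)
    print(f"{model}: overall={leaderboard_score(dims):.3f}, "
          f"weakest={worst} ({dims[worst]:.2f})")
```

An unweighted mean is only one design choice: weighting the adversarial axis more heavily, or ranking by the minimum across dimensions, would better reflect the worst-case fragility the report highlights.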
February 23, 2026 by hgpu