| Cite this article: |
- Sui Runqi, Ding Zhe, Yang Wenchuan, Cui Baojiang, Yu Haoran, Ma Tianjiao. BreakBench: A Standardized Evaluation Benchmark for Jailbreak Attacks on Chinese Large Language Models[J]. Journal of Cyber Security, Accepted.
|
|
| Abstract: |
| Large Language Models (LLMs) exhibit outstanding text comprehension and generation capabilities, yet they can be induced by jailbreak prompts to bypass safety constraints and generate illegal or harmful content, which has become a pressing challenge and creates an urgent need for a standardized safety evaluation benchmark. However, existing jailbreak benchmarks show significant shortcomings in harmful-behavior coverage, attack-method effectiveness, and evaluation metrics, making it difficult to comprehensively measure model vulnerabilities. To this end, this paper proposes BreakBench, a standardized jailbreak-attack evaluation benchmark for Chinese-language contexts, designed to systematically assess the potential risks of LLMs and the robustness of their safety mechanisms. Specifically: (1) based on a hierarchical harmful-behavior expansion strategy that fuses multi-source expert knowledge with LLM generation capabilities, we construct a harmful-behavior dataset that comprehensively covers real-world threat scenarios, comprising 5,265 complete test samples and a condensed academic subset of 740 samples; (2) we propose R&EPrompt, a dynamic jailbreak method that augments role-playing with an emotional-reinforcement mechanism: by building a semantic mapping chain of "harmful behavior → role characteristics → emotional state", it precisely associates each target harmful behavior with matching role characteristics and emotional states, and through semantic-level guidance and constraints it effectively improves the ability of jailbreak prompts to penetrate model safety defenses; (3) we design a multidimensional evaluation framework that integrates attack-intensity and attack-cost indicators, moving beyond success-rate-only assessment and supporting a comprehensive evaluation of attack effectiveness, safety risk, and resource cost. Using BreakBench, we systematically evaluate 11 mainstream LLMs with Chinese-language capabilities, covering both open-source and commercial models. The results show that BreakBench outperforms existing benchmarks in evaluation breadth, depth, and effectiveness, and reveal that current models commonly exhibit jailbreak vulnerabilities with differing levels of defense capability, providing quantifiable empirical support for future safety-alignment research and defense-strategy design. |
| Key words: jailbreak attack; large language model; Chinese evaluation benchmark; safety alignment |
| DOI: |
| Received: 2025-09-04; Revised: 2026-01-21 |
| Foundation: National Natural Science Foundation of China (General, Key, and Major Programs); Doctoral Innovation Fund of Beijing University of Posts and Telecommunications |
|
| BreakBench: A Standardized Evaluation Benchmark for Jailbreak Attacks on Chinese Large Language Models |
|
| Sui Runqi1, Ding Zhe1, Yang Wenchuan1, Cui Baojiang1, Yu Haoran1, Ma Tianjiao2 |
|
| (1.Beijing University of Posts and Telecommunications;2.Beihang University) |
| Abstract: |
| Large Language Models (LLMs) demonstrate impressive capabilities in text comprehension and generation. However, their susceptibility to jailbreak prompts that bypass safety constraints and elicit harmful or illegal outputs has emerged as a pressing challenge. This underscores the urgent need for a standardized safety evaluation benchmark. Existing jailbreak benchmarks suffer from critical limitations in terms of harmful behavior coverage, attack effectiveness, and evaluation metrics, hindering a comprehensive assessment of model vulnerabilities. To address these gaps, we introduce BreakBench, a standardized jailbreak evaluation benchmark tailored for Chinese-language contexts, aimed at systematically assessing both the potential risks of LLMs and the robustness of their safety mechanisms. Specifically, (1) we construct a harmful behavior dataset grounded in a hierarchical expansion strategy that integrates multi-source knowledge with LLM generation capabilities. This dataset encompasses 5,265 complete test samples and a 740-sample academic subset, comprehensively covering real-world threat scenarios. (2) We propose R&EPrompt, a dynamic jailbreak method that enhances role-playing with emotional reinforcement mechanisms. By building a semantic mapping chain from harmful behavior to role characteristics and emotional states, this method precisely aligns roles and emotions with target malicious intents and strengthens the ability of the resulting prompts to penetrate model safety defenses. (3) We design a multidimensional evaluation framework that combines attack success rate with novel indicators of attack intensity and cost, overcoming the limitations of success-rate-only assessments and enabling a more comprehensive evaluation of jailbreak effectiveness, associated risks, and resource overhead. Leveraging BreakBench, we conduct a systematic evaluation of 11 mainstream LLMs with Chinese language capabilities, including both open-source and commercial models. Experimental results demonstrate that BreakBench significantly outperforms existing benchmarks in terms of evaluation breadth, depth, and effectiveness. Furthermore, our findings reveal widespread jailbreak vulnerabilities and varying levels of defense robustness across models, providing quantitative, empirical support for future research on alignment safety and defense strategy development. |
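The "harmful behavior → role characteristics → emotional state" mapping chain described for R&EPrompt can be illustrated with a minimal sketch. The behavior categories, role descriptions, emotion labels, and template below are invented for exposition only and do not reproduce the paper's actual mapping data or prompt wording.

```python
# Hypothetical sketch of a semantic mapping chain in the spirit of R&EPrompt:
# a target behavior's category selects a matching role and emotional state,
# which are then composed into a role-play prompt. All entries are toy examples.

# Toy mapping tables keyed by a coarse behavior category (assumed taxonomy).
ROLE_MAP = {
    "cybercrime": "a veteran penetration tester recounting past engagements",
    "fraud": "a retired investigator writing a cautionary memoir",
}
EMOTION_MAP = {
    "cybercrime": "nostalgic pride",
    "fraud": "remorseful candor",
}

def build_prompt(behavior: str, category: str) -> str:
    """Compose a role-play prompt reinforced with a matched emotional state."""
    role = ROLE_MAP.get(category, "a domain expert")          # role characteristics
    emotion = EMOTION_MAP.get(category, "calm professionalism")  # emotional state
    return f"You are {role}. Speaking with {emotion}, address: {behavior}"
```

The design point is that the role and emotion are selected as a function of the behavior, rather than fixed across all test cases, so each prompt is semantically adapted to its target.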
| Key words: jailbreak attack; large language model; Chinese evaluation benchmark; safety alignment |
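A multidimensional evaluation combining attack success rate with intensity and cost indicators, as described in the abstract, can be sketched as follows. This is a minimal illustration under assumed definitions (intensity as mean queries per successful jailbreak, cost as mean token overhead per attempt); it is not BreakBench's actual metric specification.

```python
# Illustrative aggregation of three jailbreak evaluation dimensions from
# per-attempt records. Metric definitions here are assumptions for exposition.
from dataclasses import dataclass

@dataclass
class Attempt:
    success: bool   # did the jailbreak elicit the target harmful output?
    queries: int    # model calls consumed by this attempt
    tokens: int     # total tokens sent and received

def evaluate(attempts: list[Attempt]) -> dict[str, float]:
    n = len(attempts)
    successes = [a for a in attempts if a.success]
    asr = len(successes) / n  # attack success rate
    # Attack intensity: mean queries needed per successful jailbreak
    # (infinite when no attempt succeeds).
    intensity = (sum(a.queries for a in successes) / len(successes)
                 if successes else float("inf"))
    # Attack cost: mean token overhead across all attempts.
    cost = sum(a.tokens for a in attempts) / n
    return {"ASR": asr, "intensity": intensity, "cost": cost}
```

Reporting all three dimensions together distinguishes, for example, a model that resists only cheap single-query attacks from one that withstands sustained, high-budget attempts.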