Cite this article
  • Liu Shilong, Hu Minghao, Geng Guotong, Luo Wei, Liu Lin, Zhang Jun, Wang Yongjie, Wang Ming, Zhang Ziyang, Wang Shaohuang, Wu Chenglong, Liu Chang, Wang Jiaxin, Zhao Shike, Xu Xiantao. Chinese Safety Alignment and Defense of Large Models via Hybrid Iterative Strategy Optimization[J]. Journal of Cyber Security, accepted.


Chinese Safety Alignment and Defense of Large Models via Hybrid Iterative Strategy Optimization
Liu Shilong1, Hu Minghao1, Geng Guotong1, Luo Wei1, Liu Lin2, Zhang Jun1, Wang Yongjie1, Wang Ming1, Zhang Ziyang3, Wang Shaohuang1, Wu Chenglong4, Liu Chang2, Wang Jiaxin3, Zhao Shike3, Xu Xiantao1
(1. Center of Information Research, PLA Academy of Military Science; 2. College of Computer Science, National University of Defense Technology; 3. School of Information, North China University of Technology; 4. Laboratory for Big Data and Decision, National University of Defense Technology)
Abstract:
In recent years, generative large language models (LLMs) have attracted significant attention from researchers and enterprises due to their remarkable intelligence and vast application potential across various domains. However, as the universality, potential dominance, and the depth and breadth of applications of LLMs continue to increase, they also face unprecedented security risks that could undermine their effectiveness. Recently, certain Chinese “jailbreak attacks” have exploited malicious prompts to induce models to generate content that violates ethical and legal regulations, posing serious challenges to safe deployment. To counter these malicious attacks and enhance the safety of LLMs, the academic community has begun research on aligning LLMs with human safety preferences, but these efforts still face numerous limitations. First, current safety alignment research predominantly targets English scenarios; because of systematic differences in linguistic structure and cultural values between Chinese and English, directly transferring English alignment methods can lead to mis-triggered safety patterns and value-transmission biases in Chinese contexts. Furthermore, previous offline RLHF approaches suffer from insufficient coverage of the “prompt–response” space, resulting in out-of-distribution generalization issues and suboptimal alignment. To address these challenges, this paper introduces a novel Bootstrapped Amplification and Instruction Enhancement method to construct high-quality Chinese supervised fine-tuning (SFT) datasets, jailbreak-defense datasets, and safety preference datasets, and then trains a Chinese safety-enhanced reward model. We further propose a Hybrid Iterative Strategy Optimization method to achieve effective Chinese safety alignment of LLMs. Experimental results show that the trained reward model excels across the RewardBench metrics, ranking among the top models, especially on the “safety” dimension. The aligned model raises its Protection Success Rate from 43.7% to 99.4% and increases its Excessive Safety Level from 34% to 74%, outperforming models with larger parameter counts. Moreover, the alignment method preserves the model’s original capabilities and generalizes across different tasks and scenarios.
Key words:  Jailbreak attacks  Chinese safety alignment  supervised fine-tuning  reward model  strategy optimization
DOI:
Submitted: 2024-12-30; Revised: 2025-03-25
Funding: National Natural Science Foundation of China (General Program, Key Program, Major Program)
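The abstract describes, but does not detail, the hybrid iterative strategy optimization pipeline (generate responses with the current policy, score them with the safety-enhanced reward model, and use the resulting preferences for the next round of policy optimization). The minimal Python sketch below is offered only as an illustration of the generic shape of one such preference-collection round; it is not the authors' implementation, and `collect_preference_pairs`, `generate`, `reward`, and the toy scoring function are hypothetical stand-ins.

```python
# Purely illustrative sketch (not the paper's code): the generic shape of one
# online preference-collection round used by iterative alignment methods --
# sample several responses per prompt from the current policy, score them with
# a reward model, and keep the best/worst pair for the next policy update.
# `generate` and `reward` are hypothetical stand-ins supplied by the caller.

import random
from typing import Callable, List, Tuple


def collect_preference_pairs(
    prompts: List[str],
    generate: Callable[[str], str],       # current policy: prompt -> response
    reward: Callable[[str, str], float],  # reward model: (prompt, response) -> safety score
    num_samples: int = 4,
) -> List[Tuple[str, str, str]]:
    """Return (prompt, chosen, rejected) triples for one optimization round."""
    pairs = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(num_samples)]
        ranked = sorted(candidates, key=lambda resp: reward(prompt, resp))
        # Highest-scoring response becomes "chosen", lowest becomes "rejected".
        pairs.append((prompt, ranked[-1], ranked[0]))
    return pairs


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end without any model weights.
    toy_prompts = ["如何保护个人隐私?", "请介绍网络安全的基本概念。"]
    toy_generate = lambda p: f"{p} -> candidate #{random.randint(0, 99)}"
    toy_reward = lambda p, r: float(len(r) % 7)  # dummy score, not a real reward model
    for prompt, chosen, rejected in collect_preference_pairs(toy_prompts, toy_generate, toy_reward):
        print(prompt, "| chosen:", chosen, "| rejected:", rejected)
```

In a real pipeline the chosen/rejected pairs would feed a preference-optimization update before the next iteration; the paper's hybrid iterative method operates on a loop of this general kind, with the specifics deferred to the full text.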