Cite this article:
- Wu Zongru, Cheng Pengzhou, Zhang Zhuosheng, Liu Gongshen. A Survey on Textual Backdoor Defense for Language Models [J]. Journal of Cyber Security, Accepted.
DOI: |
Received: 2024-10-21; Revised: 2024-12-25
Funding: Joint Key Project of the National Natural Science Foundation of China (U21B2020); Science and Technology Innovation 2030 "New Generation Artificial Intelligence" Major Project (2022ZD0120304); Young Scientists Fund of the National Natural Science Foundation of China (62406188)
A Survey on Textual Backdoor Defense for Language Models
Wu Zongru, Cheng Pengzhou, Zhang Zhuosheng, Liu Gongshen
(School of Cyber Science and Engineering, Shanghai Jiao Tong University)

Abstract:
Language models (LMs) have seen rapid development and are widely deployed across diverse natural language processing (NLP) domains, consistently demonstrating state-of-the-art performance. However, the complex architecture and massive parameter scale of LMs limit their interpretability. Consequently, a range of security threats, particularly backdoor attacks, challenges the reliability and trustworthiness of LMs, impeding their wider deployment. While extensive research has aimed at defending against backdoor attacks on LMs, most existing methods remain confined to conventional training paradigms, making them ineffective for generative large language models (LLMs). Additionally, current classification standards for textual backdoor defenses are inconsistent, and existing reviews either lack comprehensive coverage of the literature or provide insufficient comparative analysis of defenses. To address these gaps and offer valuable insights for future research, this paper systematically reviews and compares a wide range of textual backdoor defenses. Based on the implementation stage and the defenders' objectives, we categorize mainstream textual backdoor defense methods into training-stage defenses (including trojan weight removal, regularized training, and dataset purification) and testing-stage defenses (including offline model inspection, online input inspection, and regularized decoding), and highlight representative works from each category. Furthermore, this paper summarizes the datasets and evaluation metrics commonly used for textual backdoor defense. Drawing on these metrics, we comprehensively analyze the capability requirements placed on defenders, the computational overhead, and the defense performance against prevalent textual backdoor attack methods, identifying key limitations of existing defenses. Lastly, we outline future research directions, including developing general defense frameworks, designing tailored defenses for generative LLMs, investigating multilingual defense, exploring the interpretability of textual backdoors, and establishing benchmarks for evaluating backdoor defenses.
Key words: textual backdoor defense; artificial intelligence security; natural language processing; language models; pre-trained language models; large language models