Cite this article
  • Jia Chengyu, Chen Jinyin, Xu Gan, Zhang Ao, Zhang He, Jin Haibo, Chen Ruoxi, Zheng Haibin. A Survey on Large Language Model’s Security Risks and Defense Methods[J]. Journal of Cyber Security, accepted

A Survey on Large Language Model’s Security Risks and Defense Methods
Jia Chengyu, Chen Jinyin, Xu Gan, Zhang Ao, Zhang He, Jin Haibo, Chen Ruoxi, Zheng Haibin
(College of Information Engineering, Zhejiang University of Technology)
Abstract:
The development of large language models (LLMs) such as ChatGPT and Bard has had a profound impact on the artificial intelligence community. These models possess excellent language understanding, human-like text generation, and powerful problem-solving capabilities, and show broad application prospects in search engines, finance, medical care, and autonomous driving. However, the use of LLMs has also revealed a series of security vulnerabilities, drawing researchers' attention to their security issues. This paper presents a comprehensive survey of LLM security from the perspectives of natural language processing (NLP) and security. First, the relevant frameworks and platforms of LLMs and their development history are outlined, and the current mainstream LLMs at home and abroad are classified from three angles: model architecture, training data source, and downstream alignment method. Next, existing surveys on LLM security are discussed; these works are grouped into three categories (evaluation dimensions, single security dimensions, and defense methods) and then summarized. The security threats that LLMs may face throughout their full usage lifecycle (corpus collection and data preprocessing, model pretraining, downstream alignment, and model inference) are then discussed. These threats are classified and summarized in detail: trustworthiness evaluation is divided into six aspects (hallucination, deception, toxicity, privacy, bias, and robustness), and three forms of attack against LLMs (jailbreak, backdoor, and adversarial attacks) are discussed and summarized. The ethical risks arising in the development and use of LLMs are also summarized. This paper further outlines a series of defense and detection measures against LLM security threats, with a focus on enhancing a model's ability to resist threats such as hallucination, privacy leakage, and bias. Finally, the main challenges and emerging opportunities in mitigating LLM security risks are discussed, providing guidance for researchers, practitioners, and policymakers navigating the complex application and research landscape of large language models.
Key words:  large language models, hallucination, deception, toxicity, jailbreak, backdoor, privacy, fairness, bias, robustness, defense, detection
DOI:
Submitted: 2024-01-10    Revised: 2024-06-27
Funding: National Natural Science Foundation of China (General Program, Key Program, Major Program); Natural Science Foundation of Zhejiang Province