大语言模型安全：风险分类、防御分析与评估方法综述

李方圆; 毛锐; 黄伟庆; 王妍; 刘力豪

引用本文：

李方圆,毛锐,黄伟庆,王妍,刘力豪.大语言模型安全：风险分类、防御分析与评估方法综述[J].信息安全学报,已采用 [点击复制]
Li Fangyuan,Mao Rui,Huang Weiqing,Wang Yan,Liu Lihao.Large Language Model Security: Overview of Risk Classification, Defense Analysis, and Evaluation Methods[J].Journal of Cyber Security,Accept [点击复制]

本文已被：浏览 54次下载 0次
大语言模型安全：风险分类、防御分析与评估方法综述
李方圆, 毛锐, 黄伟庆, 王妍, 刘力豪
0 字体:加大+\|默认\|缩小-
(中国科学院信息工程研究所)

摘要:

随着大语言模型在多样化智能系统中的快速部署，其潜在的安全隐患与不可控行为愈发凸显，已成为学术界与产业界共同关注的核心议题。本文围绕大语言模型的安全性问题，从风险识别、防御分析与安全评估三个方面展开研究。首先，提出了一个由“内生风险”与“外部利用风险”构成的双层分类框架。内生风险深入探究模型自身脆弱性，从根本上追溯风险成因；外部利用风险则关注模型在真实对抗环境下所暴露出的安全问题，揭示其在实际部署和应用中可能引发的现实危害。在此基础上，本文梳理并凝练了针对大语言模型安全风险的主流防御技术，涵盖数据治理、生成过程干预及对抗防护等多个层面。同时归纳了不同风险场景下的防护思路，对各类防御方法的有效性与内在局限性进行了批判性分析，以推动构建结构清晰、适应动态威胁的安全防护框架。随后，为了对大语言模型的安全性进行有效评估，本文引入安全对齐的三大核心评估维度，系统阐述了各维度在不同安全目标下的评估侧重点与实现方法，并详细介绍了TruthfulQA、AdvBench、ToxiGen等一系列主流安全基准数据集的特点与应用场景。最后，本文总结了当前大语言模型安全领域尚未解决的问题，并对未来研究方向进行了展望。整体而言，本研究以风险识别、防御分析与安全评估为主线，构建了系统性的研究脉络，为后续研究与工程实践提供结构化的分析框架与参考路径。

关键词: 大语言模型风险分析防御措施安全评估人工智能安全

DOI：

投稿时间：2025-12-30修订日期：2026-04-29

基金项目:

Large Language Model Security: Overview of Risk Classification, Defense Analysis, and Evaluation Methods

Li Fangyuan, Mao Rui, Huang Weiqing, Wang Yan, Liu Lihao

(Institute of Information Engineering，Chinese Academy of Sciences)

Abstract:

With the rapid deployment of large language models in diverse intelligent systems, their potential security risks and uncontrollable behaviors have become increasingly prominent, emerging as a core concern for both academia and industry. This paper investigates the security of large language models from three complementary perspectives: risk identification, defense analysis, and security evaluation. Firstly, we propose a dual-layer risk taxonomy composed of intrinsic risks and external exploitation risks. Intrinsic risks focus on the inherent vulnerabilities of the models themselves, tracing the root causes of security issues at a fundamental level, whereas external exploitation risks emphasize security problems exposed in realistic adversarial settings, highlighting the potential real-world harms that may arise during deployment and practical use.Building upon this taxonomy, the paper systematically reviews and distills mainstream defense techniques for mitigating security risks in large language models, covering multiple layers such as data governance, intervention in the generation process, and adversarial protection. Meanwhile, defense strategies across different risk scenarios are summarized, and the effectiveness as well as inherent limitations of existing defense methods are critically analyzed, with the aim of facilitating the construction of a clear and adaptable security protection framework capable of addressing dynamic threats.Furthermore, to enable effective security evaluation of large language models, this paper introduces three core evaluation dimensions grounded in safety alignment. It systematically elaborates on the evaluation emphases and implementation methods associated with different security objectives, and provides a detailed discussion of representative security benchmark datasets, including TruthfulQA, AdvBench, and ToxiGen, with respect to their characteristics and application scenarios. Finally, the paper summarizes the unresolved challenges in the field of large language model security and outlines directions for future research. Overall, by taking risk identification, defense analysis, and security evaluation as its central thread, this study establishes a systematic research trajectory and provides a structured analytical framework and reference path for subsequent research and engineering practice.

Key words: large language models risk analysis defense techniques security evaluation artificial intelligence safety