Cite this article
  • Chen Yingni, Zhang Weijuan, Liu Zeyi, Dang Shuai, Liu Chao. Safety Weight Identification and Protection Based on Large Language Models[J]. Journal of Cyber Security, accepted.


DOI:
Received: 2025-02-17; Revised: 2025-09-02
Funding: National Natural Science Foundation of China (General Program, Key Program, Major Program)
Safety Weight Identification and Protection Based on Large Language Models
Chen Yingni, Zhang Weijuan, Liu Zeyi, Dang Shuai, Liu Chao
(Institute of Information Engineering)
Abstract:
Large language models (LLMs) play an increasingly important role in a wide range of AI applications. To optimize for specific tasks or to cope with resource constraints, AI service providers often adjust the weights of these models to improve performance. However, researchers have shown that such optimizations can unintentionally weaken a model's safety alignment, making it more vulnerable to malicious inputs and more likely to generate harmful or dangerous content. How to improve model performance while preserving safety has therefore become a pressing problem. To this end, this paper proposes a safety weight identification and protection framework for large language models. The framework first introduces a sensitivity-based safety weight identification algorithm, which perturbs the model's weights and measures the resulting effect on safety-related tasks; combined with a top-p selection mechanism, the algorithm pinpoints the weights that are critical to model safety. The framework then designs protection strategies built on these identified safety weights, ensuring that they are not damaged or lost while the model is being adjusted, thereby preserving the model's safety. The strategies target two common weight-adjustment scenarios: safety-aware fine-tuning and safety-aware pruning. To validate the framework, experiments were conducted on three models: Llama-3-8B-Instruct, Qwen2-7B-Instruct, and Gemma-2-9b-it. The results show that the framework effectively identifies and protects the key safety-related weights, preserving model safety during fine-tuning and pruning without sacrificing performance. Specifically, the framework reduces the attack success rate of the conventionally pruned Qwen model from 69.39% to 13.94%, and that of the standard fine-tuned Llama model from 69.33% to 19.39%. These results indicate that the proposed framework can significantly strengthen model safety while improving performance, offering a balanced approach to model optimization.
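To make the identification step concrete, the following is a minimal sketch in PyTorch. The abstract does not specify the exact perturbation scheme, so the sketch assumes a first-order sensitivity score |w · ∂L/∂w| with respect to a safety loss; `safety_loss_fn`, `safety_batch`, and the per-parameter masking are illustrative choices, not the paper's actual implementation. The top-p step keeps the smallest set of weights whose cumulative sensitivity share reaches p.

```python
# Minimal sketch (not the paper's exact algorithm): score each weight by a
# first-order sensitivity estimate of a safety loss, then keep the smallest
# set of weights covering fraction p of the total sensitivity mass (top-p).
import torch


def identify_safety_weights(model, safety_batch, safety_loss_fn, p=0.9):
    """Return {param_name: bool mask} marking safety-critical weights."""
    model.zero_grad()
    loss = safety_loss_fn(model, safety_batch)  # loss on a safety-related task
    loss.backward()

    masks = {}
    for name, param in model.named_parameters():
        if param.grad is None:
            continue
        # First-order sensitivity: how much the safety loss moves if this
        # weight is perturbed, approximated by |w * dL/dw|.
        scores = (param.detach() * param.grad).abs().flatten()
        sorted_scores, order = torch.sort(scores, descending=True)
        cumulative = torch.cumsum(sorted_scores, dim=0) / sorted_scores.sum()
        # Top-p selection: smallest prefix whose cumulative share reaches p.
        k = int(torch.searchsorted(cumulative, p).item()) + 1
        mask = torch.zeros_like(scores, dtype=torch.bool)
        mask[order[:k]] = True
        masks[name] = mask.view_as(param)
    model.zero_grad()
    return masks
```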
Key words:  large language models; model safety; critical weights; fine-tuning; pruning
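The two protection scenarios can likewise be sketched under the same assumptions: during fine-tuning, gradients on the identified weights are masked so optimizer updates cannot alter them; during pruning, those weights are exempted from removal. Gradient masking and magnitude pruning are stand-in mechanisms chosen for illustration, and `masks` is the output of the hypothetical `identify_safety_weights` above.

```python
# Minimal sketch (one plausible realization, not the paper's method):
# freeze safety-critical weights during fine-tuning and exempt them
# from magnitude pruning. `masks` is the dict produced above.
import torch


def protect_during_finetuning(model, masks):
    """Zero the gradients of safety-critical weights so that optimizer
    steps leave them untouched."""
    for name, param in model.named_parameters():
        if name in masks:
            frozen = masks[name]
            param.register_hook(lambda g, f=frozen: g.masked_fill(f, 0.0))


def prune_with_protection(model, masks, sparsity=0.5):
    """Magnitude-prune each weight matrix to `sparsity`, never removing
    a safety-critical weight."""
    with torch.no_grad():
        for name, param in model.named_parameters():
            if param.dim() < 2:  # skip biases and norm parameters
                continue
            protected = masks.get(
                name, torch.zeros_like(param, dtype=torch.bool))
            magnitude = param.abs()
            # Protected weights get infinite magnitude so they never rank
            # among the smallest (prunable) weights.
            candidates = magnitude.masked_fill(
                protected, float("inf")).flatten()
            k = int(sparsity * candidates.numel())
            if k == 0:
                continue
            threshold = torch.kthvalue(candidates, k).values
            param.masked_fill_((magnitude <= threshold) & ~protected, 0.0)
```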