Cite this article
  • GU Jinbang, HONG Zheng, ZHANG Guomin, JIANG Chuan. Backdoor Mitigation Based on Feature Disentanglement with Maximum Mean Discrepancy [J]. Journal of Cyber Security, accepted.

Backdoor Mitigation Based on Feature Disentanglement with Maximum Mean Discrepancy
GU Jinbang1, HONG Zheng1, ZHANG Guomin2, JIANG Chuan2
(1. Command and Control Engineering College, Army Engineering University of PLA; 2. Command and Control Engineering College, Army Engineering University of PLA)
Abstract:
Backdoor attacks implant backdoors into neural networks by manipulating the training dataset or modifying model parameters, posing a serious threat to the security of neural networks. Backdoor mitigation methods aim to remove potential backdoors from a model and prevent input samples from triggering the backdoor mechanism and producing incorrect classification results, thereby ensuring the overall security of the neural network. To address the poor adaptability and suboptimal performance of existing backdoor mitigation methods, a backdoor mitigation method based on feature disentanglement with Maximum Mean Discrepancy is proposed. Backdoor attacks rely on specific neurons, whose distribution often differs across network layers; by applying a different mask ratio at each layer, the method breaks the latent association between the backdoor and the target label while preserving the model's normal feature extraction capability. Furthermore, since neural networks differ in their feature extraction modules but are usually highly similar in their classifier modules, feature disentanglement is applied to the classifier module: benign features and backdoor features are effectively separated in the feature space to weaken the influence of the backdoor features, and the association between benign features and their labels is reconstructed, thereby achieving backdoor mitigation. Experiments on the Datacon2023-AI security dataset show that the proposed method reduces the attack success rate of trigger-embedded samples by 88.65% while the classification accuracy on clean samples drops by only 4.08%, outperforming existing backdoor mitigation methods.
Key words:  Backdoor Attack  Model Security  Backdoor Mitigation  Feature Disentanglement
DOI:
Received: 2025-03-10; Revised: 2025-06-16
Funding: National Natural Science Foundation of China (General Program, Key Program, Major Program)
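
The first mitigation step in the abstract assigns a different mask ratio to each network layer, since backdoor-related neurons are distributed unevenly across depths. The abstract does not give the concrete mechanism, so the following is only a minimal sketch of one plausible realization using learnable per-channel masks in PyTorch; the class name ChannelMask, the clamping to [0, 1], and the example initialization ratios are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class ChannelMask(nn.Module):
    """Learnable per-channel mask appended after a convolutional layer.

    init_ratio is the fraction of channels whose mask starts at zero; a
    layer-dependent choice of init_ratio mirrors the abstract's idea of
    using different mask ratios at different depths. During mitigation the
    masks would be optimized so that channels carrying backdoor behaviour
    stay switched off while the rest preserve normal feature extraction.
    """
    def __init__(self, num_channels: int, init_ratio: float):
        super().__init__()
        mask = torch.ones(num_channels)
        k = int(init_ratio * num_channels)
        # Start a fraction of the channels switched off (illustrative choice).
        mask[torch.randperm(num_channels)[:k]] = 0.0
        self.mask = nn.Parameter(mask)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Broadcast the mask over (batch, channel, height, width) feature maps.
        return x * self.mask.clamp(0.0, 1.0).view(1, -1, 1, 1)

# Hypothetical layer-to-ratio assignment: deeper layers, where backdoor
# neurons often concentrate, get a larger initial mask ratio.
init_ratios = {"layer1": 0.05, "layer2": 0.10, "layer3": 0.20, "layer4": 0.30}
```

A mitigation loop would then fine-tune only these mask parameters on a small clean set, penalizing masks that change predictions on clean samples, so that only the backdoor-to-target-label association is severed.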
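
The second step separates benign and backdoor features in the classifier module using Maximum Mean Discrepancy (MMD), but the abstract does not state which estimator or kernel is used. The sketch below shows the standard biased empirical estimate of squared MMD with an RBF kernel, a common instantiation; the bandwidth sigma and the function names rbf_kernel and mmd2 are illustrative, not from the paper.

```python
import torch

def rbf_kernel(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    # Pairwise Gaussian kernel values between the rows of x and y.
    sq_dists = torch.cdist(x, y) ** 2
    return torch.exp(-sq_dists / (2.0 * sigma ** 2))

def mmd2(feats_a: torch.Tensor, feats_b: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Biased empirical estimate of squared Maximum Mean Discrepancy.

    feats_a and feats_b are (n, d) and (m, d) batches of features taken at
    the classifier input. A larger value means the two feature distributions
    are farther apart, so a disentanglement loss can maximize this term
    between benign and (suspected) backdoor features.
    """
    k_aa = rbf_kernel(feats_a, feats_a, sigma)
    k_bb = rbf_kernel(feats_b, feats_b, sigma)
    k_ab = rbf_kernel(feats_a, feats_b, sigma)
    return k_aa.mean() + k_bb.mean() - 2.0 * k_ab.mean()
```

In a full mitigation objective this term would plausibly be combined with a cross-entropy loss on clean samples, matching the abstract's description of reconstructing the association between benign features and their labels.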