An adversarial defense method based on diffusion purification for speech command recognition

Xu Chenglong; Yang Jibin; Zhang Xiongwei; Zhang Qiang

Cite this article:

Xu Chenglong, Yang Jibin, Zhang Xiongwei, Zhang Qiang. An adversarial defense method based on diffusion purification for speech command recognition[J]. Journal of Cyber Security, accepted.

(School of Command and Control Engineering, Army Engineering University)

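Abstract:

Speech command recognition (SCR) is widely used in smart homes, smart vehicles, and voice assistants. However, SCR systems based on deep neural networks are markedly vulnerable to adversarial example attacks. By adding small perturbations to the input speech, such attacks can cause a system to activate erroneously or execute the wrong command, which seriously threatens security in real deployments. To make SCR systems more robust to adversarial examples, this paper proposes AttenPure, a purification preprocessing method based on a diffusion model. The method preprocesses the input speech signal through a diffusion process to weaken the effect of adversarial perturbations and stationary noise. To improve performance, AttenPure uses a band attention mechanism to focus on the informative components of speech and to reduce the computational redundancy and interfering information introduced by global attention, which strengthens both the extraction of key features and the resistance to perturbations. It is combined with a multi-resolution diffusion-step gating mechanism that further improves the model's ability to restore speech details. In addition, a bimodal dynamic weighting function is designed to optimize learning in the two key diffusion stages of AttenPure training: perturbation destruction and speech recovery. Experiments on the Google Speech Commands dataset of real human speech and the Synthetic Speech Commands dataset of machine-synthesized speech show that, under FGSM, PGD, GAN, and other perturbation attacks, AttenPure reaches average robust accuracies of 86.21% and 85.52% on the two datasets, which is 4.10% and 3.14% higher, respectively, than the advanced baseline AudioPure, and reduces the average inference time per command by 16.4%. When the method is transferred to recognizers with different architectures and tested in low signal-to-noise-ratio over-the-air transmission scenarios, the results show that AttenPure significantly improves the robustness and accuracy of mainstream SCR systems across a range of application scenarios.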
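The purification idea summarized in the abstract (diffuse the input so that the adversarial perturbation is drowned in noise, then recover the speech) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the linear noise schedule, the names `linear_beta_schedule`, `alpha_bar`, `forward_diffuse`, and `purify`, and the `denoise_fn` callback, which stands in for a trained reverse-diffusion network such as the one AttenPure would provide.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Hypothetical DDPM-style linear noise schedule (not from the paper)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def alpha_bar(betas, t):
    """Cumulative product of (1 - beta) over the first t diffusion steps."""
    prod = 1.0
    for beta in betas[:t]:
        prod *= 1.0 - beta
    return prod


def forward_diffuse(x, betas, t, rng):
    """Sample x_t from q(x_t | x_0): shrink the signal and add Gaussian noise."""
    a_bar = alpha_bar(betas, t)
    scale = math.sqrt(a_bar)
    sigma = math.sqrt(1.0 - a_bar)
    return [scale * v + sigma * rng.gauss(0.0, 1.0) for v in x]


def purify(x_adv, betas, t_star, denoise_fn, rng):
    """Diffuse a (possibly adversarial) waveform to step t_star, then denoise it.

    denoise_fn(x_t, t) is a placeholder for a trained reverse process.
    """
    x_t = forward_diffuse(x_adv, betas, t_star, rng)
    return denoise_fn(x_t, t_star)


if __name__ == "__main__":
    rng = random.Random(0)
    betas = linear_beta_schedule(50)
    waveform = [0.1, -0.2, 0.3, 0.0]
    # Identity denoiser: only demonstrates the data flow, it restores nothing.
    purified = purify(waveform, betas, 5, lambda x_t, t: x_t, rng)
    print(len(purified))
```

The choice of `t_star` is the usual trade-off in this kind of defense: a larger value removes more of the perturbation but also destroys more of the speech that the denoiser then has to reconstruct.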

Keywords: speech command recognition; adversarial examples; speech purification; diffusion model; band attention

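DOI:

Submitted: 2025-01-11; Revised: 2025-06-23

Funding: Research on Speaker Tracing for Countering Voiceprint Fraud, No. 62371469, National Natural Science Foundation of China (General Program, Key Program, Major Program); Research on Scene-Adaptive Intelligent Speech Enhancement, No. 62071484, National Natural Science Foundation of China (General Program, Key Program, Major Program)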

An adversarial defense method based on diffusion purification for speech command recognition

Xu Chenglong, Yang Jibin, Zhang Xiongwei, Zhang Qiang

(School of Command and Control Engineering, Army Engineering University)

Abstract:

Speech command recognition (SCR) technology has been widely applied in fields such as smart homes, smart cars, and voice assistants. However, SCR systems based on deep neural networks are highly vulnerable to adversarial example attacks. These attacks introduce minor perturbations into the input speech, which may cause the system to mistakenly activate or execute incorrect commands, posing serious security threats in practical application scenarios. To enhance the robustness of SCR systems against adversarial examples, this paper proposes a purification preprocessing method based on diffusion models, named AttenPure. The method preprocesses input speech signals through a diffusion process to mitigate adversarial perturbations and stationary noise. To improve performance, AttenPure uses a band attention mechanism to focus on the effective components of the speech, reducing the computational redundancy and interference introduced by global attention and thereby improving both the extraction of key features and the resistance to perturbations. A multi-resolution diffusion-step gating mechanism further enhances the model's ability to restore speech details. In addition, a bimodal dynamic weighting function is designed to optimize learning in the two key diffusion stages of AttenPure training: perturbation destruction and speech recovery. Experimental results on the Google Speech Commands dataset of real human speech and the Synthetic Speech Commands dataset of machine-synthesized speech show that AttenPure is more effective than other methods under various perturbation attacks such as FGSM, PGD, and GAN. The average robust accuracy on the two datasets reaches 86.21% and 85.52%, which is 4.10% and 3.14% higher, respectively, than the state-of-the-art (SOTA) baseline AudioPure, and the average inference time for a single command is reduced by 16.4%.
We further transferred AttenPure to recognizers with different architectures and evaluated it under low signal-to-noise ratio (SNR) over-the-air transmission conditions. The results indicate that AttenPure significantly enhances both the robustness and the accuracy of mainstream SCR systems across diverse application scenarios.
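To make the contrast between band attention and global attention concrete, here is a minimal sketch. The abstract does not define the band, so this example assumes the common interpretation in which each frame attends only to frames within a fixed distance along the time axis; `band_mask`, `banded_attention_weights`, and the `bandwidth` parameter are hypothetical names introduced for illustration and do not describe the AttenPure implementation.

```python
import math


def band_mask(seq_len, bandwidth):
    """mask[i][j] is True when frame j lies within `bandwidth` frames of frame i."""
    return [[abs(i - j) <= bandwidth for j in range(seq_len)]
            for i in range(seq_len)]


def banded_attention_weights(scores, bandwidth):
    """Row-wise softmax over attention scores, restricted to a diagonal band.

    Positions outside the band receive weight 0, so each frame mixes
    information from at most 2 * bandwidth + 1 neighbours instead of
    the whole sequence.
    """
    n = len(scores)
    mask = band_mask(n, bandwidth)
    weights = []
    for i in range(n):
        kept = [scores[i][j] for j in range(n) if mask[i][j]]
        row_max = max(kept)  # subtract the max for numerical stability
        exps = [math.exp(scores[i][j] - row_max) if mask[i][j] else 0.0
                for j in range(n)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


if __name__ == "__main__":
    uniform_scores = [[0.0] * 5 for _ in range(5)]
    for row in banded_attention_weights(uniform_scores, bandwidth=1):
        print([round(w, 3) for w in row])
```

Under this assumption the number of non-zero attention entries grows linearly with sequence length rather than quadratically, which is one plausible source of the reduced computation the abstract attributes to band attention.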

Key words: speech command recognition; adversarial example; speech purification; diffusion model; band attention