Defense of Speaker Recognition Against Adversarial Examples Based on Noise Destruction and Waveform Reconstruction
WEI Chunyu, SUN Meng, ZHANG Xiongwei, ZOU Xia, YIN Jie
(College of Command and Control Engineering, Army Engineering University of PLA, Nanjing 210007, China; Jiangsu Police Institute, Nanjing 210031, China)
Abstract:
Speech is one of the most important means of human communication. Besides textual content, a speech signal carries rich information about the speaker, such as identity, race, age, gender, and emotion. Recognizing the speaker's identity, also known as voiceprint or speaker recognition, is a biometric identification technique. Voiceprints are convenient to acquire, easy to store, and simple to use, and advances in deep learning have greatly improved recognition accuracy, so speaker recognition has been applied in smart finance, smart homes, voice assistants, and judicial investigation. On the other hand, adversarial example attacks against deep learning models have attracted wide attention: adding imperceptible perturbations to an input signal can cause a model to make incorrect predictions. The emergence of adversarial examples thus poses a serious security threat to deep-learning-based speaker recognition as well. Existing defenses against speaker adversarial examples degrade the recognition of normal examples to varying degrees and are limited to specific attack methods or recognition models, so their robustness is poor. To let adversarial defense both correct erroneous outputs and preserve the accurate recognition of normal examples, this paper proposes a two-stage "destroy and reconstruct" defense method. In the first stage, Gaussian white noise whose amplitude is limited by a target signal-to-noise ratio is added to the adversarial example, destroying the structure of the adversarial perturbation and thus removing its adversarial effect. In the second stage, a proposed speech enhancement model named SCAT-Wave-U-Net reconstructs the original speech sample; by introducing Transformer global multi-head self-attention and interlayer cross-attention mechanisms into the Wave-U-Net architecture, the improved model becomes better suited to defending against speaker adversarial example attacks. Experiments show that the proposed defense does not depend on a specific speaker recognition system or adversarial attack method, and that under two typical speaker recognition systems it outperforms other preprocessing defenses against multiple types of adversarial example attacks.
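As a rough illustration of the first stage, the sketch below (not the authors' code; the 15 dB target SNR is an assumed, illustrative value) adds white Gaussian noise scaled to a chosen signal-to-noise ratio, which is enough to break the fine structure of an adversarial perturbation while keeping the underlying speech usable for reconstruction:

```python
import numpy as np

def add_noise_at_snr(waveform: np.ndarray, target_snr_db: float = 15.0) -> np.ndarray:
    """Add white Gaussian noise scaled so the noisy signal has the target SNR (dB)."""
    signal_power = np.mean(waveform ** 2)
    # SNR_dB = 10 * log10(P_signal / P_noise)  =>  P_noise = P_signal / 10^(SNR_dB / 10)
    noise_power = signal_power / (10.0 ** (target_snr_db / 10.0))
    noise = np.random.default_rng().normal(0.0, np.sqrt(noise_power), size=waveform.shape)
    return waveform + noise
```

The noisy waveform is then passed to the second-stage enhancement model, which reconstructs the clean speech before it reaches the speaker recognition system.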
Key words:  speaker recognition  noise destruction  speech enhancement  defense of adversarial examples
DOI:10.19363/J.cnki.cn10-1380/tn.2024.01.05
Received: 2022-05-08    Revised: 2022-07-06
Funding: This work was supported by the Natural Science Foundation of Jiangsu Province for Excellent Young Scholars (No. BK20180080) and the National Natural Science Foundation of China (No. 62371469, No. 62071484).
Defense of Speaker Recognition Against Adversarial Examples Based on Noise Destruction and Waveform Reconstruction
WEI Chunyu, SUN Meng, ZHANG Xiongwei, ZOU Xia, YIN Jie
College of Command and Control Engineering, Army Engineering University of PLA, Nanjing 210007, China; Jiangsu Police Institute, Nanjing 210031, China
Abstract:
Voice is one of the most important means of human communication. Besides text, voice signals also carry information about the speaker's identity, race, age, gender, and emotion; the recognition of speaker identity, also called speaker recognition, is a biometric technique. Since the human voice is easy to collect and store, and since advances in deep learning have greatly improved recognition accuracy, speaker recognition has been applied to financial app authentication, smart homes, voice assistants, and forensics. On the other hand, adversarial attacks against deep learning models have attracted great attention: adding imperceptible perturbations to input signals can make a model's predictions incorrect. The emergence of adversarial examples therefore poses an equally serious security threat to deep-learning-based speaker recognition. In this paper, a two-stage "destruct and reconstruct" method is proposed to defend speaker recognition against adversarial examples, overcoming the shortcomings of existing defenses: their inability to remove adversarial perturbations, their negative impact on the recognition of normal examples, and their poor robustness across models and attack methods. In the first stage, Gaussian white noise, bounded within a certain SNR range, is added to the input speech signal to destroy the structure of any potential adversarial perturbation and eliminate its adversarial effect. In the second stage, the proposed speech enhancement model, named SCAT-Wave-U-Net, reconstructs the original clean speech; Transformer global multi-head self-attention and interlayer cross-attention mechanisms are introduced into the Wave-U-Net structure, making the model better suited to defending against speaker adversarial examples. Experimental results show that the effectiveness of the proposed defense does not depend on a specific speaker recognition system or attack method. In extensive experiments on two state-of-the-art speaker recognition systems, i.e., i-vector and x-vector, the defense outperforms other preprocessing-based defenses against multiple types of adversarial examples.
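As a rough, hedged illustration of the second-stage idea, the PyTorch sketch below shows one way Transformer-style global multi-head self-attention can be attached to the bottleneck of a Wave-U-Net-like encoder-decoder. It is an illustrative stand-in under assumed hyperparameters (channel count, head count), not the authors' SCAT-Wave-U-Net, and the interlayer cross-attention between encoder and decoder layers is omitted for brevity:

```python
import torch
import torch.nn as nn

class AttentionBottleneck(nn.Module):
    """Global multi-head self-attention over the time axis of bottleneck features."""

    def __init__(self, channels: int = 128, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time) feature map from the Wave-U-Net encoder.
        seq = x.transpose(1, 2)                   # -> (batch, time, channels)
        attn_out, _ = self.attn(seq, seq, seq)    # every frame attends to all frames
        seq = self.norm(seq + attn_out)           # residual connection + layer norm
        return seq.transpose(1, 2)                # back to (batch, channels, time)
```

The global receptive field of such a block is what lets the enhancement network draw on long-range context when reconstructing a waveform whose local structure has been deliberately corrupted by the stage-one noise.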
Key words:  speaker recognition  noise destruction  speech enhancement  defense of adversarial examples