Deep Speech Forgery Based on Parameter Transformation and Threat Assessment to Voiceprint Authentication

苗晓孔; 孙蒙; 张雄伟; 李嘉康; 张星昱

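(Intelligent Information Processing Laboratory, College of Command and Control Engineering, Army Engineering University, Nanjing, Jiangsu 210007)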

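Abstract:

As a biometric authentication and identification mechanism, voiceprint authentication has been widely adopted in daily life. In practice, however, such systems are vulnerable to spoofing attacks and still carry a degree of risk. Voice conversion generally refers to techniques that modify and transform the parameters characterizing one speaker's voice so that it sounds like another speaker's voice, while the spoken content remains unchanged. Voice conversion can generate speech of a specific target speaker that is hard to distinguish perceptually from genuine target speech. For a voiceprint authentication system, however, perceptual similarity alone is sometimes not enough to deceive the system. This paper analyzes the mel-cepstrum, the feature vector that voice conversion and voiceprint authentication both extract, and converts mel-cepstra with joint dynamic features more accurately using a bidirectional long short-term memory network with improved deep residual connections. The loss function is also modified to optimize the conversion network, and a global mean filter is introduced to remove the cepstral clutter produced during conversion, improving the overall quality of the converted speech. The similarity of the converted speech is increased without degrading subjective perception, and the converted speech is used to spoof two widely adopted voiceprint authentication systems. The spoofing experiments show that the system can deceive these authentication systems with a high success rate.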

Keywords: voice conversion; voiceprint authentication; adversarial attack; deep learning

DOI：10.19363/J.cnki.cn10-1380/tn.2020.11.05

Received: December 30, 2019; Revised: March 05, 2020

Funding: This work was supported by the Natural Science Foundation of Jiangsu Province (No. BK20180080).

Deep Speech Forgery Based on Parameter Transformation and Threat Assessment to Voiceprint Authentication

MIAO Xiaokong,SUN Meng,ZHANG Xiongwei,LI Jiakang,ZHANG Xingyu

Intelligent Information Processing Laboratory, College of Command and Control Engineering, Army Engineering University, Nanjing 210007, China

Abstract:

Automatic speaker verification (ASV) systems, as a biometric authentication or recognition mechanism, have been widely used in daily life. In practical applications, however, these systems are vulnerable to spoofing attacks and still carry a degree of risk. Voice conversion (VC) usually refers to techniques that modify and transform one person's voice characteristics so that the speech sounds like another person's voice, while keeping the spoken content unchanged. VC can generate the voice of a specific target speaker, and the converted voice is difficult to distinguish from the target voice in auditory perception. For a speaker verification system, however, auditory similarity alone is not always enough to deceive the system. This paper analyzes the mel-cepstrum, a feature vector extracted in both voice conversion and speaker verification, and achieves more accurate conversion of mel-cepstra with joint dynamic features by using a bidirectional long short-term memory (BLSTM) network with improved deep residual connections. In addition, the loss function is modified to optimize the conversion network, and a global mean filter is introduced to remove the cepstral clutter generated during conversion, improving the overall quality of the converted voice. The similarity of the converted speech is improved without degrading subjective perception, and the converted voice is used to spoof two widely adopted speaker verification systems. Experiments show that the system can successfully deceive these verification systems with a high success rate.

Key words: voice conversion; voiceprint authentication; adversarial attack; deep learning
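
The two feature-level steps named in the abstract, joining dynamic features to the mel-cepstrum and mean-filtering the converted cepstra, might be sketched as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names `add_delta_features` and `mean_filter`, the three-frame delta window, the filter width of 5, and the reading of the "global mean filter" as a moving average along the time axis are all hypothetical choices that the abstract does not specify. The BLSTM conversion network itself is not shown.

```python
import numpy as np


def add_delta_features(mcep: np.ndarray) -> np.ndarray:
    """Append first-order dynamic (delta) features to static mel-cepstra.

    mcep has shape (frames, order). The delta uses the common
    0.5 * (c[t+1] - c[t-1]) window, which is an assumption here.
    """
    # Repeat the first and last frame so every frame has two neighbours.
    padded = np.pad(mcep, ((1, 1), (0, 0)), mode="edge")
    delta = 0.5 * (padded[2:] - padded[:-2])
    return np.concatenate([mcep, delta], axis=1)


def mean_filter(traj: np.ndarray, width: int = 5) -> np.ndarray:
    """Smooth each cepstral coefficient over time with a moving average.

    This is one possible reading of the abstract's "global mean filter";
    the window width is a hypothetical parameter.
    """
    if width < 1 or width % 2 == 0:
        raise ValueError("width must be a positive odd integer")
    half = width // 2
    # Edge padding keeps the number of frames unchanged after filtering.
    padded = np.pad(traj, ((half, half), (0, 0)), mode="edge")
    kernel = np.ones(width) / width
    columns = [
        np.convolve(padded[:, d], kernel, mode="valid")
        for d in range(traj.shape[1])
    ]
    return np.stack(columns, axis=1)
```

In such a pipeline, `add_delta_features` would be applied to source and target mel-cepstra before training the conversion network, and `mean_filter` to the static part of the network output before waveform synthesis. Whether the paper filters before or after recovering static features is not stated in the abstract.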