DOI:10.19363/J.cnki.cn10-1380/tn.2025.11.15
Received: 2025-06-30; Revised: 2025-11-15
Funding: This work was supported by the National Natural Science Foundation of China (No. 62272456).
A Deepfake Speech Detection Method Based on WavLM Feature Decorrelation
WANG Shuaibin,YI Xiaowei,LIU Changjun,SU Xiaosu,CAO Yun,LYU Meiyang
Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100085, China; School of Cyber Security, University of Chinese Academy of Sciences, Beijing 100049, China
Abstract:
With the maturation and increasing application of speech synthesis and voice conversion technologies, high-quality deepfake speech has become capable of deceiving both human auditory perception and automated speaker verification systems. The malicious use of deepfake speech technology therefore poses a substantial threat to personal financial security as well as social stability. In recent years, deepfake speech detection has attracted significant research attention and has achieved strong performance on certain benchmark datasets. However, existing detection approaches still exhibit limitations in extracting universal, domain-invariant forgery features. Furthermore, statistical correlations among multi-level speech features may mislead models into learning patterns that are irrelevant to the detection task, resulting in severe performance degradation in cross-domain scenarios. To address these challenges, this paper proposes a deepfake speech detection method based on WavLM feature decorrelation. First, we introduce a hybrid architecture termed WavLMAST, which integrates a self-supervised pre-trained WavLM model with a graph attention network. The WavLM component extracts acoustic-level, content-level, and semantic-level speech representations, which are then fed into a graph-attention-based backend that adaptively models the time-frequency characteristics of speech. This design enhances the model's capability to represent the subtle artifacts associated with deepfake speech. Subsequently, a dynamic feature decorrelation strategy is applied to the multi-layer representations extracted by WavLMAST: by adjusting the feature-correlation weights of training samples, the method reduces statistical dependencies among features and encourages the model to focus on attributes directly relevant to the deepfake speech detection task. This mechanism significantly improves the model's generalization ability in cross-domain detection scenarios. Experimental results demonstrate the effectiveness of the proposed approach: on the ASVspoof 2019 Logical Access (LA) and ASVspoof 2021 LA datasets, it reduces the equal error rate by 40.5% and 36.8%, respectively, compared with the state-of-the-art Mixture of Experts approach.
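The sample-reweighting decorrelation idea described above can be illustrated in isolation. The sketch below is an assumption about the general technique, not the authors' exact formulation: it learns per-sample weights that shrink the off-diagonal entries of a weighted feature-correlation matrix, so that spuriously correlated feature dimensions contribute less to training. All names, shapes, and hyperparameters are hypothetical.

```python
import torch

def weighted_corr_penalty(feats, logits):
    """Sum of squared off-diagonal weighted correlations.

    feats:  (n, d) feature matrix for a batch of training samples.
    logits: (n,) unconstrained per-sample weight parameters.
    """
    w = torch.softmax(logits, dim=0)                 # positive weights summing to 1
    mean = (w[:, None] * feats).sum(dim=0)           # weighted per-dimension mean
    centered = feats - mean
    cov = centered.T @ (w[:, None] * centered)       # weighted covariance (d, d)
    std = torch.sqrt(torch.diag(cov).clamp_min(1e-8))
    corr = cov / (std[:, None] * std[None, :])       # weighted correlation matrix
    off = corr * (1.0 - torch.eye(feats.shape[1]))   # zero out the diagonal
    return (off ** 2).sum()

torch.manual_seed(0)
n, d = 256, 8
feats = torch.randn(n, d)
# Inject a spurious statistical dependency between dimensions 0 and 1.
feats[:, 1] = 0.8 * feats[:, 0] + 0.6 * torch.randn(n)

logits = torch.zeros(n, requires_grad=True)          # start from uniform weights
before = weighted_corr_penalty(feats, logits).item()
opt = torch.optim.Adam([logits], lr=0.05)
for _ in range(300):
    opt.zero_grad()
    loss = weighted_corr_penalty(feats, logits)
    loss.backward()
    opt.step()
after = weighted_corr_penalty(feats, logits).item()
print(before, after)  # the penalty drops below its uniform-weight value
```

In a full detector, a penalty of this kind would be optimized jointly with (or alternately with) the classification loss, so that the learned sample weights steer the model away from correlation-driven shortcuts.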
Key words: deepfake speech; speech synthesis; deepfake speech detection; generalization; ASVspoof dataset
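The WavLMAST front end described in the abstract draws acoustic-, content-, and semantic-level representations from different WavLM layers. A common way to fuse such multi-layer hidden states is a learned convex combination over layers; the minimal sketch below illustrates that pattern with random stand-in tensors (the class name, layer count, and tensor shapes are illustrative assumptions, not the authors' implementation).

```python
import torch
import torch.nn as nn

class LayerAggregator(nn.Module):
    """Hypothetical weighted aggregation over per-layer hidden states.

    Lower transformer layers tend to carry acoustic detail and upper
    layers content/semantic information, so a learned softmax over
    layer weights lets the detector choose its own mixture.
    """
    def __init__(self, num_layers: int):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_layers))

    def forward(self, hidden_states):
        # hidden_states: list of (batch, time, dim) tensors, one per layer
        stacked = torch.stack(hidden_states, dim=0)   # (L, B, T, D)
        weights = torch.softmax(self.logits, dim=0)   # convex layer weights
        return (weights[:, None, None, None] * stacked).sum(dim=0)

# Stand-ins for WavLM-Large outputs: 25 hidden states (CNN front end + 24 layers).
torch.manual_seed(0)
states = [torch.randn(2, 50, 1024) for _ in range(25)]
agg = LayerAggregator(num_layers=25)
fused = agg(states)
print(fused.shape)  # torch.Size([2, 50, 1024])
```

With zero-initialized logits the combination starts as a plain average of the layers; training then sharpens the weights toward whichever layers carry the most forgery-relevant cues.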