Cite this article:
CAO Mingming, LEI Zhenchun, YANG Yingen, ZHOU Yong. Research on Multi-order GMM-ResNet Fusion for Speech Deepfake Detection[J]. Journal of Cyber Security, 2025, 10(2): 116-126
DOI: 10.19363/J.cnki.cn10-1380/tn.2025.03.08
Received: 2023-08-03; Revised: 2023-11-28
Funding: This work was supported by the National Natural Science Foundation of China (No. 62067004) and the Science and Technology Research Project of the Jiangxi Provincial Department of Education (No. GJJ2200331).
|
Research on Multi-order GMM-ResNet Fusion for Speech Deepfake Detection |
CAO Mingming, LEI Zhenchun, YANG Yingen, ZHOU Yong
(School of Computer and Information Engineering, Jiangxi Normal University, Nanchang 330022, China)
Abstract:
Automatic speaker verification technology has made remarkable progress in recent years, but it remains vulnerable to deepfake attacks using synthesized or converted speech; speech deepfake detection systems aim to address this problem. In this paper, we propose a multi-order GMM-ResNet fusion model for speech deepfake detection. The model exploits the correlation between the Gaussian components of GMMs of different orders and the feature maps produced by all residual blocks of a ResNet. It consists of two parts: multi-order Log Gaussian Probability (LGP) feature fusion and a Multi-scale Feature Aggregation ResNet (MFA-ResNet). A GMM describes the distribution of speech features in feature space, and GMMs of different orders form smooth approximations of that distribution with different descriptive power. Accordingly, LGP features computed from GMMs of different orders capture speech information at different scales. The multi-order LGP feature fusion module applies learned weights to the three LGP feature streams derived from GMMs of different orders and facilitates information exchange between them. In addition, the feature information extracted by the first and intermediate layers of a neural network is also useful for classification. Based on this observation, the MFA-ResNet module aggregates the outputs of all ResNet blocks, so every feature map contributes to the extraction of an accurate speech embedding. On the ASVspoof 2019 logical access task, the LFCC + multi-order GMM-ResNet fusion system achieves a minimum t-DCF of 0.0353 and an EER of 1.16%, relative reductions of 83.3% and 85.7% over the LFCC+GMM baseline.
On the ASVspoof 2021 logical access task, the system achieves a minimum t-DCF of 0.2459 and an EER of 2.50%, relative reductions of 57.3% and 87.1% over the LFCC+GMM baseline and of 28.6% and 73.0% over the LFCC+LCNN baseline. The proposed model is also competitive with current state-of-the-art models.
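As a rough illustration of the LGP features described above, the following sketch maps each acoustic frame to its vector of per-component log Gaussian probabilities under a trained diagonal-covariance GMM. This is an assumption-laden simplification, not the paper's implementation: GMM training, the exact LGP formulation, and all function and variable names here are hypothetical.

```python
import numpy as np

def lgp_features(frames, means, diag_covs):
    """Compute a (T, M) LGP feature map from a trained diagonal GMM.

    frames:    (T, D) frame-level acoustic features (e.g. LFCC vectors)
    means:     (M, D) component means of an M-order GMM
    diag_covs: (M, D) diagonal covariances of the components

    Each output column is the log-density of every frame under one
    Gaussian component; a GMM of a different order M yields an LGP map
    of a different width, which multi-order fusion can then combine.
    """
    D = frames.shape[1]
    # log N(x; mu, diag(s)) = -0.5 * (D*log(2*pi) + sum(log s) + sum((x-mu)^2 / s))
    diff = frames[:, None, :] - means[None, :, :]            # (T, M, D)
    quad = np.sum(diff ** 2 / diag_covs[None, :, :], axis=2)  # (T, M)
    logdet = np.sum(np.log(diag_covs), axis=1)                # (M,)
    return -0.5 * (D * np.log(2 * np.pi) + logdet[None, :] + quad)
```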
Key words: multi-order GMM-ResNet fusion; multi-order log Gaussian probability feature fusion; multi-scale feature aggregation; speech deepfake detection
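The two fusion mechanisms named in the keywords can be sketched in a few lines. This is an illustrative simplification only: the learned fusion weights, the projection of differently sized LGP maps to a common width, and the ResNet blocks themselves are assumed away, and the names below are hypothetical.

```python
import numpy as np

def softmax(w):
    """Numerically stable softmax over a 1-D weight vector."""
    e = np.exp(w - np.max(w))
    return e / e.sum()

def fuse_multi_order_lgp(lgp_maps, raw_weights):
    """Weighted fusion of multi-order LGP feature maps.

    Assumes the maps from the different-order GMMs have already been
    projected to a common (T, F) shape; the softmax-normalized weights
    stand in for the learned fusion weights described in the abstract.
    """
    w = softmax(np.asarray(raw_weights, dtype=float))
    return sum(wi * m for wi, m in zip(w, lgp_maps))

def mfa_aggregate(block_outputs):
    """Multi-scale feature aggregation: mean-pool each residual block's
    (T, C) output over time and concatenate the pooled vectors, so
    first- and intermediate-layer features reach the final embedding."""
    return np.concatenate([b.mean(axis=0) for b in block_outputs])
```

With equal raw weights the fusion reduces to a plain average of the LGP maps, while the aggregation step yields one embedding whose length is the sum of the blocks' channel widths.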