Cite this article
  • Liu Sihong, Lei Zhenchun, Qian Guangyuan, Zhou Yong, Liu Changhong. Cross-Layer Attention Injection from Pre-Trained Models for Speech Deepfake Detection[J]. Journal of Cyber Security, accepted.


Cross-Layer Attention Injection from Pre-Trained Models for Speech Deepfake Detection
Liu Sihong, Lei Zhenchun, Qian Guangyuan, Zhou Yong, Liu Changhong
(Jiangxi Normal University)
Abstract:
In recent years, the rapid progress of deep learning has driven the widespread adoption of pre-trained models, whose significance has become increasingly prominent, especially in scenarios with limited labeled data. Studies have shown that features derived from these models exhibit remarkable discriminative power in speech deepfake detection tasks, providing more robust and informative representations for downstream classification. However, most existing methods mainly utilize features from the topmost layer of pre-trained models, thereby neglecting potentially valuable information present in lower and intermediate layers. Although some recent works have explored propagating or fusing features across different layers, these approaches typically lack a dynamic mechanism to promote effective information flow between the top and bottom layers, thus limiting the full potential of pre-trained model representations. To address these limitations, we propose a novel Cross-Layer Attention Injection (CLAI) method to further enhance the discriminative capability of features derived from pre-trained models. Specifically, instead of simply using the topmost features or adopting a recursive fusion strategy, we introduce an attention-based mechanism. We first selectively transmit the topmost features to several lower layers, obtaining a series of semantically enhanced feature representations. To avoid the redundancy caused by indiscriminate transmission to all layers, we apply mean pooling to the topmost features to obtain a global semantic representation, which is then injected into these enhanced features via an attention mechanism for effective guidance. The final features, obtained by gated fusion of the topmost features and the global semantic representation, are then fed into a carefully designed backend classifier. In the backend, we propose the Des2Net_BiMamba model, which incorporates parallel multi-scale modeling through a multi-branch architecture inspired by Res2Net. The model consists of multiple Des2Net layers and bidirectional Mamba blocks, which are used for multi-scale feature extraction and for modeling global contextual information, respectively. This architecture is designed to capture both the fine-grained and holistic patterns necessary for distinguishing between bona fide and spoofed speech. We conduct comprehensive experiments on the widely used ASVspoof 2019 Logical Access, ASVspoof 2021 Logical Access, ASVspoof 2021 Deepfake, and IN-THE-WILD datasets to evaluate the effectiveness of our approach. Experimental results demonstrate that our method remains highly competitive on all four datasets. Ablation and comparative studies validate the contributions of the backend classifier and the CLAI mechanism, respectively, and visualization analyses further confirm the superiority of our method in enhancing the discriminability of feature representations. Overall, our findings underscore the great potential of cross-layer attention injection in fully leveraging pre-trained models, significantly improving the robustness and accuracy of speech deepfake detection and achieving outstanding performance on multiple benchmark datasets.
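To make the mechanism concrete, here is a minimal PyTorch sketch of the cross-layer attention injection and gated fusion described in the abstract. It is one plausible reading, not the paper's implementation: the module name, the use of the pooled global vector as the attention query, the broadcast residual injection, and the sigmoid gate over concatenated features are all assumptions.

```python
import torch
import torch.nn as nn
from typing import List


class CrossLayerAttentionInjection(nn.Module):
    """Sketch of CLAI: a global semantic vector mean-pooled from the topmost
    layer attends over each selected lower-layer feature sequence, the result
    is injected back residually, and the enhanced features are combined with
    the topmost features through a learned gate."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, top: torch.Tensor, lower: List[torch.Tensor]) -> torch.Tensor:
        # top and every element of lower: (batch, time, dim)
        g = top.mean(dim=1, keepdim=True)  # global semantic representation (B, 1, D)
        enhanced = []
        for feat in lower:
            # the global vector queries the layer's frames (softmax over time);
            # the resulting summary is injected back as a broadcast residual
            summary, _ = self.attn(query=g, key=feat, value=feat)
            enhanced.append(feat + summary)
        fused_lower = torch.stack(enhanced).mean(dim=0)  # aggregate selected layers
        # gated fusion of the topmost and the semantically enhanced features
        z = self.gate(torch.cat([top, fused_lower], dim=-1))
        return z * top + (1.0 - z) * fused_lower


# toy usage: topmost features plus three selected lower layers of an encoder
top = torch.randn(2, 100, 768)
lower = [torch.randn(2, 100, 768) for _ in range(3)]
out = CrossLayerAttentionInjection(768)(top, lower)  # (2, 100, 768)
```

In the full method, the fused frame-level features produced here would then be passed to the Des2Net_BiMamba backend.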
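The Des2Net layers are the paper's own design and their internals are not given in the abstract, so the following sketch covers only the bidirectional Mamba component of the backend: two Mamba state-space blocks scan the sequence in opposite directions and their outputs are merged inside a residual connection. The mamba-ssm package, the summation merge, and the pre-norm wiring are assumptions.

```python
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # pip install mamba-ssm (requires CUDA)


class BiMambaBlock(nn.Module):
    """One bidirectional Mamba block: a forward scan plus a time-reversed
    scan, merged by summation inside a residual connection. The merge rule
    (sum vs. concatenation + projection) is an assumption, as the abstract
    does not specify it."""

    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fwd = Mamba(d_model=dim)  # left-to-right selective scan
        self.bwd = Mamba(d_model=dim)  # right-to-left selective scan

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim)
        h = self.norm(x)
        rev = torch.flip(h, dims=[1])
        # run the reversed sequence, then flip back so frames align with x
        y = self.fwd(h) + torch.flip(self.bwd(rev), dims=[1])
        return x + y  # residual: the block refines, rather than replaces, its input
```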
Key words:  speech deepfake detection  cross-layer attention injection  pre-trained models  Mamba  Transformer
DOI:
Submitted: 2025-06-30  Revised: 2025-10-10
Funding: National Natural Science Foundation of China (General Program, Key Program, Major Program)