| Cite this article: |
- HE Chunhao, LU Tianliang, LI Wenzheng, WANG Wenhao, DENG Yuyang, PENG Shufan, ZHAO Kai. A Recording-Replay Speech Detection Method Combining Deformable Convolution and Max Feature Map[J]. Journal of Cyber Security (信息安全学报), accepted.
|
|
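| Abstract (translated from the Chinese): |
| Replay attack is a classic form of speech spoofing with low implementation cost and strong concealment, and it poses a real and persistent security threat to automatic speaker verification systems and voice authentication applications in physical access (PA) scenarios. To address the problem that spoofing traces in replayed speech are subtle in complex acoustic environments and easily affected by device and environmental factors, this paper proposes a Res2Net model improved with Deformable Convolutional Networks (DCNs) and the Max Feature Map (MFM), named Deformable MFM-based Res2Net (DeRes2MFM-Net), for replayed speech detection. The model uses the spatially adaptive sampling of deformable convolution to capture small local differences in feature maps, and combines it with the channel selection mechanism of MFM to filter and enhance multi-channel features. For multi-scale feature modeling, the model builds on a nested multi-scale structure and further introduces a channel shuffle strategy to strengthen cross-channel information interaction and feature fusion, improving its ability to identify spoofing cues hidden in replayed speech. Experiments on two public datasets, ASVspoof 2019 and ASVspoof 2021, show that the proposed model outperforms several comparison methods on both Equal Error Rate (EER) and minimum tandem Detection Cost Function (min t-DCF), demonstrating the effectiveness and robustness of DeRes2MFM-Net in detecting replay attacks. |
| Key words: replay attack; spoofed speech detection; multi-scale feature extraction; deformable convolution; improved residual network; channel shuffle |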
| DOI: |
| Submitted: 2025-10-25  Revised: 2026-01-16 |
| Funding: Beijing Social Science Foundation Planning Project, 2024 (24FXC017); Fundamental Research Funds for the Central Universities (No. 2025bsky024) |
|
| A Recording-Replay Speech Detection Method Combining Deformable Convolution and Max Feature Map |
|
HE Chunhao, LU Tianliang, LI Wenzheng, WANG Wenhao, DENG Yuyang, PENG Shufan, ZHAO Kai
|
| (People’s Public Security University of China) |
| Abstract: |
| Replay attack is a classical voice spoofing technique with low implementation cost and strong concealment, posing a realistic and persistent security threat to automatic speaker verification systems and voice-based authentication applications in physical access (PA) scenarios. Replayed speech in complex acoustic environments often carries only subtle spoofing traces, which are easily affected by playback and recording devices and by environmental factors. To address this, an improved Res2Net-based model that integrates Deformable Convolutional Networks (DCNs) and the Max Feature Map (MFM) mechanism is proposed for replayed speech detection, referred to as Deformable MFM-based Res2Net (DeRes2MFM-Net). The model exploits the spatially adaptive sampling of deformable convolution to capture fine-grained local differences in the feature maps, and uses the channel selection mechanism of MFM to filter and enhance the multi-channel features most relevant to replay artifacts. For multi-scale feature modeling, the model builds on a nested multi-scale architecture and adds a channel shuffle strategy, which strengthens cross-channel information interaction and feature fusion across scales and improves the network’s ability to identify subtle spoofing cues hidden in replayed speech.
Experiments on two public benchmark datasets, ASVspoof 2019 and ASVspoof 2021, show that DeRes2MFM-Net outperforms a range of baseline and state-of-the-art methods in both Equal Error Rate (EER) and minimum tandem Detection Cost Function (min t-DCF), supporting the effectiveness and robustness of the approach for detecting replay attacks in physical access scenarios. |
| Key words: replay attack; spoofed speech detection; multi-scale feature extraction; deformable convolution; improved residual network; channel shuffle |
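
The abstract names two channel-level operations, Max Feature Map (MFM) and channel shuffle, without giving their exact form in DeRes2MFM-Net. The sketch below is not the authors' implementation. It assumes the commonly used formulations: MFM splits the channels into two halves and keeps the element-wise maximum, and channel shuffle interleaves channels across groups. The names `Channel`, `max_feature_map`, and `channel_shuffle`, and the flattened-list layout, are hypothetical and chosen only for illustration.

```python
from typing import List

# Assumption: each channel is one feature map flattened to a list of floats.
Channel = List[float]


def max_feature_map(channels: List[Channel]) -> List[Channel]:
    """Assumed MFM variant: pair channel i with channel i + C/2 and keep
    the element-wise maximum, so the output has half as many channels."""
    if len(channels) % 2 != 0:
        raise ValueError("MFM needs an even number of channels")
    half = len(channels) // 2
    return [
        [max(a, b) for a, b in zip(channels[i], channels[i + half])]
        for i in range(half)
    ]


def channel_shuffle(channels: List[Channel], groups: int) -> List[Channel]:
    """Assumed shuffle: view channels as (groups, per_group), transpose,
    and flatten, so neighbouring outputs come from different groups."""
    if groups <= 0 or len(channels) % groups != 0:
        raise ValueError("channel count must be divisible by groups")
    per_group = len(channels) // groups
    return [
        channels[g * per_group + k]
        for k in range(per_group)
        for g in range(groups)
    ]


# Four toy channels with two values each.
features = [[1.0, 5.0], [2.0, 0.0], [3.0, 3.0], [0.0, 4.0]]
selected = max_feature_map(features)       # 4 channels -> 2 channels
shuffled = channel_shuffle(features, 2)    # channel order 0, 2, 1, 3
```

In this toy input, MFM keeps the larger response of each channel pair, which is the sense in which it acts as a channel selection mechanism, while the shuffle only reorders channels so that a later grouped operation sees channels from every group.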