Cite this article:
- Zhang Tong, Deng Junlong, Ren Yanzhen, Wang Lina. Robust Fake Audio Detection Algorithm based on Vision Transformer [J]. Journal of Cyber Security, Accepted.
DOI:
Received: 2024-08-26; Revised: 2024-12-24
Funding: National Natural Science Foundation of China (General Program, Key Program, Major Program)
Robust Fake Audio Detection Algorithm based on Vision Transformer
Zhang Tong, Deng Junlong, Ren Yanzhen, Wang Lina
(School of Cyber Science and Engineering, Wuhan University)
Abstract:
With the rapid development of deep learning and speech synthesis technology, speech deepfake technology can now produce fake speech that is realistic in naturalness and emotion, posing a great threat to social security. To resist the threats that these forgery technologies pose to security and privacy, fake speech detection based on deep learning has received great attention from researchers and has achieved good performance, but it still suffers from poor robustness and interpretability. Specifically, the performance of existing detection techniques degrades significantly when there are mismatches between the training data and the data actually encountered at detection time, and these techniques provide no analysis of their decisions, so they lack interpretability. To address the poor performance of existing fake speech detection techniques under various data mismatch scenarios, this paper proposes a robust speech deepfake detection scheme based on the Vision Transformer, which optimizes the entire detection pipeline from two aspects: frontend feature extraction and the backend neural network. For feature extraction, this paper introduces a frontend feature extractor based on self-supervised learning, which fine-tunes an existing generic pre-trained model on labeled data to learn more robust intermediate speech representations. For the backend neural network, this paper applies the Vision Transformer to the fake speech detection task, decomposing the original positional encoding into a time positional encoding and a frequency positional encoding; leveraging the representation capability of the Transformer architecture, the model learns more robust feature representations that capture artifacts in the speech under test. Experiments show that in various complex data mismatch scenarios, the proposed method reduces the detection EER (Equal Error Rate) by 1% to 20% compared with existing methods, exhibiting better robustness. In addition, this paper uses the attention mechanism of the Transformer model to provide an interpretability analysis of the detection model's decision-making process, demonstrating good practical value.
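As a rough illustration of the "fine-tune a generic self-supervised frontend on labeled data" idea described above, the following PyTorch sketch uses wav2vec 2.0 from torchaudio as an assumed stand-in; the abstract does not name the exact pre-trained model, head, or hyperparameters, so all of those here are hypothetical.

```python
import torch
import torchaudio

# Assumed stand-in for the paper's generic pre-trained frontend:
# wav2vec 2.0 (base) from torchaudio.
bundle = torchaudio.pipelines.WAV2VEC2_BASE
frontend = bundle.get_model()

# Keep the frontend trainable so labeled real/fake data can adapt
# the generic representation to the detection task.
for p in frontend.parameters():
    p.requires_grad = True

# Hypothetical binary head on top of the last intermediate layer.
head = torch.nn.Linear(768, 2)  # 768 = WAV2VEC2_BASE hidden size
opt = torch.optim.Adam(
    list(frontend.parameters()) + list(head.parameters()), lr=1e-5)

waveform = torch.randn(2, bundle.sample_rate)      # dummy 1 s clips
labels = torch.tensor([0, 1])                      # 0 = bona fide, 1 = fake
features, _ = frontend.extract_features(waveform)  # per-layer outputs
logits = head(features[-1].mean(dim=1))            # pool over frames
loss = torch.nn.functional.cross_entropy(logits, labels)
loss.backward()
opt.step()
```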
Key words: deepfake, fake audio detection, data mismatch, robust detection, interpretability |
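The backend's decomposed positional encoding can likewise be sketched. The minimal PyTorch module below splits the single flattened ViT positional embedding into a time-axis term and a frequency-axis term added to a 2-D grid of patch embeddings; all dimensions, names, and the patching scheme are illustrative assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

class TimeFreqViT(nn.Module):
    """Sketch: ViT-style encoder whose positional encoding is factorized
    into time and frequency components, per the abstract's description."""

    def __init__(self, n_time=100, n_freq=8, dim=256, depth=4, heads=4):
        super().__init__()
        # Separate learnable embeddings for the two axes; their broadcast
        # sum replaces the usual single flattened positional embedding.
        self.time_pos = nn.Parameter(torch.zeros(1, n_time, 1, dim))
        self.freq_pos = nn.Parameter(torch.zeros(1, 1, n_freq, dim))
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.classify = nn.Linear(dim, 2)

    def forward(self, patches):
        # patches: (batch, time, freq, dim) patch embeddings of a
        # time-frequency feature map (e.g., the frontend's outputs).
        b, t, f, d = patches.shape
        x = patches + self.time_pos[:, :t] + self.freq_pos[:, :, :f]
        x = x.reshape(b, t * f, d)
        x = torch.cat([self.cls.expand(b, -1, -1), x], dim=1)
        x = self.encoder(x)
        return self.classify(x[:, 0])  # classify from the CLS token
```

One common motivation for such a factorization, which may also apply here, is that it keeps positional parameters at O(T+F) rather than O(T·F) and ties patches that share a time frame or frequency band; the encoder's attention weights over the patch grid are also what an attention-based interpretability analysis, like the one the abstract mentions, would visualize.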