Robust Fake Audio Detection Algorithm based on Vision Transformer

ZHANG Tong; DENG Junlong; REN Yanzhen; WANG Lina

Cite this article:

ZHANG Tong, DENG Junlong, REN Yanzhen, WANG Lina. Robust Fake Audio Detection Algorithm based on Vision Transformer[J]. Journal of Cyber Security, 2026, 11(2): 21-31


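Abstract:

In the fast-moving information and data era, cyberattacks pose a serious threat to personal privacy, to work and daily life, and even to the safety of life and property. Hosts, as the key devices people use for everyday work and communication, entertainment, and data storage, have become a primary target of cyberattacks. Research on host attack discovery is therefore both urgent and necessary, and host events, which record all behavior on a host, have become a key object of study in cyber attack and defense. An attacker's malicious operations on a host are inevitably recorded as host events, but the malicious events are buried among a vast number of normal events and are hard to notice and filter out. This has prompted academic research on a series of questions: how to collect host events, how to identify and extract malicious events, how to reconstruct the attack process, and how to provide security protection. This paper broadly surveys and carefully summarizes research on host-event-based attack discovery and traces its development. It compares host-event-based attack discovery with two related research directions, intrusion detection and digital forensics, along four dimensions (object of analysis, method of analysis, time of action, and purpose of analysis), clarifies what distinguishes the problem studied here, and gives a definition. The paper then explains the key concepts involved, identifies the two major problems facing the field (dependency explosion and timeliness), and divides the research by stage into three categories: host event collection, host event processing, and host event analysis. For each category it presents results and progress across a total of 12 sub-directions organized around the two problems. Finally, based on the state of research, it proposes five possible future directions: the completeness and trustworthiness of host event records, the timeliness of attack discovery, cross-device attack discovery, the discovery of multi-step attacks, and the use of algorithms.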

Keywords: deepfake; fake audio detection; data mismatch; robust detection; interpretability

DOI：10.19363/J.cnki.cn10-1380/tn.2026.03.02

Received: 2024-08-26; Revised: 2024-12-24

Funding: This work was supported by the National Natural Science Foundation of China (NSFC) (No. 62572358, No. 62172306, No. 62372334).

Robust Fake Audio Detection Algorithm based on Vision Transformer

ZHANG Tong¹, DENG Junlong¹, REN Yanzhen¹,², WANG Lina¹,²

(1. School of Cyber Science and Engineering, Wuhan University, Wuhan 430072, China; 2. Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, Wuhan 430072, China)

Abstract:

With the rapid development of deep learning and speech synthesis, speech deepfake technology can now produce fake speech that is realistic in naturalness and emotion, posing a serious threat to social security. To counter the resulting security and privacy threats, deep-learning-based fake speech detection has attracted considerable research attention and achieved good performance, but it still suffers from limited robustness and poor interpretability: performance degrades significantly when the training data and the data encountered at detection time are mismatched, and existing detectors provide no analysis of their detection results. To address the poor performance of existing deepfake speech detectors under various data mismatch scenarios, this paper proposes a robust speech deepfake detection scheme based on Vision Transformer that optimizes the detection algorithm at both the front-end feature extraction stage and the back-end neural network. For feature extraction, the paper introduces a front-end feature extractor based on self-supervised learning, which fine-tunes existing general-purpose pre-trained models on labeled data to learn better intermediate speech representations. For the back-end network, the paper extends Vision Transformer to the deepfake speech detection task by decomposing the original positional encoding into a time positional encoding and a frequency positional encoding. Leveraging the strong representation capability of the Transformer architecture, the model learns better feature representations for capturing artifacts in the speech under test. Experiments indicate that, across a variety of complex data mismatch scenarios, the method reduces the detection equal error rate (EER) by 1% to 20% compared with existing methods, exhibiting improved robustness. In addition, the paper uses the attention mechanism of the Transformer to provide an interpretability analysis of the detection model's decision process, which gives the approach significant practical value.

Key words: deepfake; fake audio detection; data mismatch; robust detection; interpretability
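
The abstract describes decomposing the Vision Transformer positional encoding into a time component and a frequency component. The sketch below illustrates one plausible reading of that idea, not the authors' implementation: it assumes the two encodings are separate lookup tables (randomly initialized stand-ins for learnable parameters) that are summed and added to each patch embedding. The function names, grid sizes, and the additive combination are all assumptions made for illustration.

```python
import random


def make_table(n_positions, dim, rng):
    """Stand-in for a learnable embedding table: one dim-sized vector per position."""
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(n_positions)]


def add_factorized_positions(patches, time_table, freq_table):
    """Add time_table[t] + freq_table[f] to the patch embedding at grid cell (t, f).

    patches is a nested list of shape [n_time][n_freq][dim].
    The additive combination is an assumption, not taken from the paper.
    """
    out = []
    for t, row in enumerate(patches):
        out_row = []
        for f, vec in enumerate(row):
            out_row.append(
                [x + pt + pf for x, pt, pf in zip(vec, time_table[t], freq_table[f])]
            )
        out.append(out_row)
    return out


# Hypothetical sizes: 4 time steps x 3 frequency bands, 8-dimensional embeddings.
N_TIME, N_FREQ, DIM = 4, 3, 8
rng = random.Random(0)
time_table = make_table(N_TIME, DIM, rng)
freq_table = make_table(N_FREQ, DIM, rng)

# Zero patch embeddings make the positional contribution easy to inspect.
patches = [[[0.0] * DIM for _ in range(N_FREQ)] for _ in range(N_TIME)]
encoded = add_factorized_positions(patches, time_table, freq_table)
```

Under these assumptions the factorized form stores N_TIME + N_FREQ position vectors instead of one per grid cell (N_TIME * N_FREQ), and patches that share a time index share the same time component regardless of frequency band.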