| Cite this article: |
- 张桐,邓俊龙,任延珍,王丽娜.基于视觉Transformer的鲁棒伪造语音检测算法[J].信息安全学报,2026,11(2):21-31
- ZHANG Tong, DENG Junlong, REN Yanzhen, WANG Lina. Robust Fake Audio Detection Algorithm based on Vision Transformer[J]. Journal of Cyber Security, 2026, 11(2): 21-31
|
|
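| Abstract (translated from the Chinese): |
| In the rapidly developing information and data age, cyberattacks pose a serious threat to personal privacy, to work and daily life, and even to the safety of life and property. Hosts, as the main devices people use for everyday work and communication, entertainment, and data storage, have become a primary target of cyberattacks. Research on host attack discovery techniques is therefore urgent and necessary, and host events, which record all behavior on a host, have become a key object of study in cyber attack and defense. An attacker's malicious operations on a host are inevitably recorded as host events, but malicious events are hidden among a huge volume of normal events and are hard to notice and filter out. This has prompted academic research on a series of questions: how to collect host events, how to identify and extract malicious events, how to reconstruct the attack process, and how to provide security protection. This paper surveys and summarizes research on host-event-based attack discovery, traces its development, and compares it with two other research directions, intrusion detection and digital forensics, along four aspects: object of analysis, method of analysis, time of action, and purpose of analysis. The comparison clarifies what is distinctive about the problem studied here and provides a definition of it. The paper then explains the key concepts involved in host-event-based attack discovery, identifies the two major problems the field faces, dependency explosion and timeliness, and divides the research by stage into three categories: host event collection, host event processing, and host event analysis. For each category it presents the results and progress of research on the two problems, covering 12 sub-directions in total. Finally, based on the state of research, it proposes five possible future research directions: the integrity and trustworthiness of host event records, the timeliness of attack discovery, cross-device attack discovery, the discovery of multi-step attacks, and the application of algorithms. |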
| Keywords (translated from the Chinese): deepfake; fake speech detection; data mismatch; robust detection; interpretability |
| DOI:10.19363/J.cnki.cn10-1380/tn.2026.03.02 |
| Received: 2024-08-26; Revised: 2024-12-24 |
| Funding: This work was supported by the National Natural Science Foundation of China (NSFC) (No. 62572358, No. 62172306, No. 62372334). |
|
| Robust Fake Audio Detection Algorithm based on Vision Transformer |
|
ZHANG Tong1, DENG Junlong1, REN Yanzhen1,2, WANG Lina1,2
|
| (1. School of Cyber Science and Engineering, Wuhan University, Wuhan 430072, China; 2. Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, Wuhan 430072, China) |
| Abstract: |
| With the rapid development of deep learning and speech synthesis, speech deepfake technology can now produce fake speech that sounds realistic in both naturalness and emotion, posing a serious threat to social security. To counter the resulting security and privacy threats, deep-learning-based fake speech detection has attracted considerable research attention and achieves good performance, but it still suffers from poor robustness and poor interpretability: performance degrades significantly when the training data and the data seen at detection time are mismatched, and existing detection techniques provide no analysis of their detection results. To address the poor performance of existing deepfake speech detection techniques under various data mismatch scenarios, this paper proposes a robust speech deepfake detection scheme based on Vision Transformer that optimizes the detection algorithm at both the front-end feature extraction and the back-end neural network. For feature extraction, the paper introduces a front-end feature extractor based on self-supervised learning, which fine-tunes existing generic pre-trained models on labeled data to learn better intermediate speech representations. For the back-end network, the paper extends Vision Transformer to the deepfake speech detection task, decomposing the original positional encoding into a time positional encoding and a frequency positional encoding. Leveraging the representation capability of the Transformer architecture, the model learns better feature representations that capture artifacts in the speech under test. Experiments indicate that, across a range of complex data mismatch scenarios, the method reduces the detection EER (Equal Error Rate) by 1% to 20% compared with existing methods, showing improved robustness. Additionally, the paper uses the attention mechanism of the Transformer model to provide an interpretability analysis of the detection model's decision-making process, which gives the method significant practical value. |
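The abstract says the original positional encoding is decomposed into a time positional encoding and a frequency positional encoding, but it does not say how the two are parameterized or combined. The following is a minimal sketch under stated assumptions, not the authors' implementation: both tables are stand-ins for learnable parameters (randomly initialized here), patches are ordered time-major, and the two encodings are combined by addition. The names `make_table` and `factorized_positional_encoding` are hypothetical.

```python
import random


def make_table(n_positions, dim, rng):
    """Random stand-in for a learnable embedding table (assumption)."""
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(n_positions)]


def factorized_positional_encoding(n_time, n_freq, dim, seed=0):
    """Return one encoding vector per spectrogram patch.

    Assumed design: the vector for the patch at (time index ti,
    frequency index fi) is time_pe[ti] + freq_pe[fi], with patches
    flattened in time-major order.
    """
    rng = random.Random(seed)
    time_pe = make_table(n_time, dim, rng)  # one row per time step
    freq_pe = make_table(n_freq, dim, rng)  # one row per frequency band
    table = [
        [t + f for t, f in zip(time_pe[ti], freq_pe[fi])]
        for ti in range(n_time)
        for fi in range(n_freq)
    ]
    return table, time_pe, freq_pe


if __name__ == "__main__":
    table, _, _ = factorized_positional_encoding(n_time=4, n_freq=3, dim=8)
    print(len(table), len(table[0]))  # number of patches, embedding width
```

Under these assumptions the two tables hold `(n_time + n_freq) * dim` values, whereas a single table over every patch position would hold `n_time * n_freq * dim`.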
| Key words: deepfake; fake audio detection; data mismatch; robust detection; interpretability |
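The abstract reports results as EER (Equal Error Rate), the operating point at which the false acceptance rate equals the false rejection rate. As a rough illustration of the metric only, the sketch below sweeps the observed scores as thresholds and returns the rate where the two errors are closest. It assumes that higher scores mean "more likely genuine speech"; the function name `equal_error_rate` is hypothetical, and the paper's own evaluation procedure is not described in this record.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate the EER by sweeping every observed score as a threshold.

    Assumption: a trial is accepted as genuine when score >= threshold.
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, best_eer = None, None
    for thr in thresholds:
        # False rejection: genuine speech scored below the threshold.
        frr = sum(s < thr for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance: fake speech scored at or above the threshold.
        far = sum(s >= thr for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer


if __name__ == "__main__":
    genuine = [0.9, 0.8, 0.4, 0.7]
    fake = [0.1, 0.2, 0.6, 0.3]
    print(equal_error_rate(genuine, fake))
```

With a finite set of scores the two error rates may never be exactly equal, so this sketch averages them at the closest threshold. Evaluation toolkits often interpolate instead.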