Multi-scale Time-Spatial Domain Detection of Fabricated Face Video Based on 3D Convolution

BAO Han; FU Haocheng; CAO Yun; ZHAO Xianfeng; TANG Peng

Cite this article:

BAO Han, FU Haocheng, CAO Yun, ZHAO Xianfeng, TANG Peng. Multi-scale Time-Spatial Domain Detection of Fabricated Face Video Based on 3D Convolution[J]. Journal of Cyber Security, 2022, 7(5): 29-38

BAO Han^1,2, FU Haocheng^1,2, CAO Yun^1,2, ZHAO Xianfeng^1,2, TANG Peng^1,2
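(1. State Key Laboratory of Information Security, Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China; 2. School of Cyber Security, University of Chinese Academy of Sciences, Beijing 100093, China)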

Abstract:

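In recent years, video face-swapping technology has developed rapidly. It can be used to forge videos that influence political action or obtain improper gains, causing serious harm to society, and it has drawn wide attention from governments and the public. By analyzing the current mainstream face-swap video generation and detection techniques, this paper points out that mainstream generation methods leave forgery traces and generation losses in both the temporal and spatial domains. Most current neural-network-based algorithms for detecting synthesized face videos, however, consider only single-image features in the spatial domain and overfit noticeably in practical detection. To address these shortcomings, this paper proposes an efficient detection algorithm that combines the spatial and temporal domains. The method captures the forgery traces of face-swap results in both domains. For the spatial features of single frames, a fully convolutional network module is designed; it adopts a 3D convolution structure and can accurately extract the forgery traces of each frame in the video frame array. For the temporal features of the frame array, a convolutional long short-term memory network module is designed, which detects temporal forgery traces between the frames of a forged video. Finally, a feature pyramid network structure is designed for feature classification; it fuses spatio-temporal features of different sizes, improving classification through multi-scale fusion and reducing overfitting. Compared with existing methods, the proposed method shows clear advantages in training convergence and classification performance. In addition, it uses fewer parameters while maintaining detection accuracy, so it trains more efficiently than existing structures.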
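As a rough illustration of why a 3D convolution can respond to inconsistencies between frames as well as within them, the following NumPy sketch slides a kernel over a clip along time, height and width. It is a toy under stated assumptions, not the network described in the paper: the function `conv3d_valid`, the clip contents and the hand-written temporal-difference kernel are all hypothetical, whereas a trained network would learn its kernels from data.

```python
import numpy as np

def conv3d_valid(clip, kernel):
    """Naive 'valid' 3D cross-correlation of a (T, H, W) clip with a (kt, kh, kw) kernel."""
    T, H, W = clip.shape
    kt, kh, kw = kernel.shape
    out = np.zeros((T - kt + 1, H - kh + 1, W - kw + 1))
    for t in range(out.shape[0]):
        for y in range(out.shape[1]):
            for x in range(out.shape[2]):
                # Each output value sees a small block spanning several frames.
                out[t, y, x] = np.sum(clip[t:t + kt, y:y + kh, x:x + kw] * kernel)
    return out

# Hypothetical toy clip: 4 frames of 5x5 pixels, constant except one brighter frame.
clip = np.ones((4, 5, 5))
clip[2] += 0.5

# Hand-written kernel (an assumption for illustration): the 3x3 patch mean of the
# next frame minus the 3x3 patch mean of the current frame.
kernel = np.zeros((2, 3, 3))
kernel[0] = -1.0 / 9
kernel[1] = 1.0 / 9

response = conv3d_valid(clip, kernel)
print(response[:, 0, 0])  # one temporal response per pair of consecutive frames
```

The response is zero where consecutive frames agree and non-zero around the inconsistent frame, which a purely per-frame 2D kernel could not express.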

Keywords: video face swapping; neural network detection; convolutional long short-term memory network; feature pyramid network

DOI：10.19363/J.cnki.cn10-1380/tn.2022.09.03

Received: 2019-12-31; Revised: 2020-04-01

Funding: This work was supported by the National Key Research and Development Program of China (No. 2019QY2202, No. 2020AAA0140000).

Multi-scale Time-Spatial Domain Detection of Fabricated Face Video Based on 3D Convolution

BAO Han^1,2, FU Haocheng^1,2, CAO Yun^1,2, ZHAO Xianfeng^1,2, TANG Peng^1,2

(1. State Key Laboratory of Information Security, Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China; 2. School of Cyber Security, University of Chinese Academy of Sciences, Beijing 100093, China)

Abstract:

Deepfake technology, which can be used to forge videos, has developed rapidly in recent years. The abuse of such fake videos has caused serious harm to society and has attracted widespread attention from governments and the public. Based on a thorough investigation, this paper finds that the current mainstream generation methods leave forgery traces and generation losses in both the temporal and spatial domains. However, most current neural-network-based algorithms for detecting fabricated face videos consider only the features of a single image in the spatial domain and suffer from overfitting, resulting in low accuracy in actual detection. To address these shortcomings, this paper evaluates state-of-the-art Deepfake face detection algorithms and proposes an effective detection algorithm that combines spatial and temporal features. Our network considers both the spatial and the temporal features of fabricated face videos. For the single frames in a video, we present a fully convolutional network to extract spatial features. This module adopts a 3D convolution structure, which can accurately extract the forgery traces of each frame in the video frame array. For the frame array, we build a module based on a convolutional Long Short-Term Memory (LSTM) network for temporal feature extraction. This module is able to detect temporal forgery traces between fake video frames. Finally, we apply Feature Pyramid Networks (FPN) to improve the accuracy of face classification. This structure fuses spatio-temporal features of different sizes, improving classification through multi-scale fusion and reducing overfitting. Comparative experiments demonstrate that the proposed method is more effective in terms of training convergence and classification accuracy. In addition, we adopt fewer parameters while achieving high detection accuracy, resulting in higher training efficiency than existing methods.
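The multi-scale fusion mentioned above can be pictured with a small NumPy sketch of an FPN-style top-down pathway, in which each coarser feature map is upsampled and added to the next finer one. This is a simplified illustration under assumptions, not the paper's architecture: the helper names `upsample2x` and `top_down_fuse` are hypothetical, all levels are assumed to share a channel count and to halve in resolution, and the lateral 1x1 and smoothing 3x3 convolutions of a full FPN are omitted.

```python
import numpy as np

def upsample2x(feat):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return feat.repeat(2, axis=1).repeat(2, axis=2)

def top_down_fuse(features):
    """Fuse feature maps ordered fine -> coarse, FPN style.

    Assumes every level has the same channel count and exactly half the
    spatial resolution of the level before it.
    """
    fused = [features[-1]]  # start from the coarsest level
    for feat in reversed(features[:-1]):
        # Add upsampled coarse context to the finer map.
        fused.append(feat + upsample2x(fused[-1]))
    return fused[::-1]  # return in fine -> coarse order

# Hypothetical feature maps at three scales (constant values for readability).
c1 = np.full((2, 8, 8), 1.0)
c2 = np.full((2, 4, 4), 2.0)
c3 = np.full((2, 2, 2), 3.0)

p1, p2, p3 = top_down_fuse([c1, c2, c3])
print(p1.shape, p2.shape, p3.shape)
```

Each fused level keeps its own resolution while accumulating information from all coarser levels, so a classifier reading these maps sees features of several sizes at once.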

Key words: deepfake videos; neural network detection; convolutional long short-term memory; feature pyramid networks