Cite this article:
CHEN Peng, LIANG Tao, LIU Jin, DAI Jiao, HAN Jizhong. Forged Facial Video Detection Based on Global Temporal and Local Spatial Feature [J]. Journal of Cyber Security, 2020, 5(2): 73-83
DOI: 10.19363/J.cnki.cn10-1380/tn.2020.02.06
Received: 2020-01-15; Revised: 2020-02-29
Forged Facial Video Detection Based on Global Temporal and Local Spatial Feature
CHEN Peng 1,2, LIANG Tao 1,2, LIU Jin 1,2, DAI Jiao 1, HAN Jizhong 1
(1. Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China; 2. School of Cyber Security, University of Chinese Academy of Sciences, Beijing 100093, China)
Abstract:
Nowadays, deep learning has shown impressive performance in the field of artificial intelligence. Face generation and manipulation techniques based on deep learning can now synthesize sophisticated forged facial videos, also known as Deepfakes, that are indistinguishable to the human eye. However, these forged videos pose huge potential threats to society, such as being used to fabricate political fake news, which may incite political violence or disrupt elections. Therefore, there is an urgent need to develop effective methods for forged facial video detection. When producing a forged facial video, existing manipulation methods tend to leave subtle spatial and temporal traces, such as distortions of color and texture or temporal flickering of the face. The mainstream detection methods also adopt deep learning and can be divided into frame-based and clip-based approaches. The former exploits a Convolutional Neural Network (CNN) to find spatial traces in a single frame, while the latter additionally uses a Recurrent Neural Network (RNN) to capture temporal traces between frames. These methods make decisions based on the global information of the image, yet the subtle traces generally reside in local facial regions. Thus, we propose a unified detection framework for forged facial videos, which exploits global temporal and local spatial features to discover manipulated facial videos. It consists of an image feature extractor, a global temporal feature classifier, and a local spatial feature classifier. Experimental results on FaceForensics++ demonstrate that the proposed method achieves better performance than previous methods.
Key words: forged face; Deepfakes; face detection; video detection; temporal feature; spatial feature
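To make the three-module framework described in the abstract concrete, the following is a minimal sketch of how such a detector could be wired together in PyTorch. The backbone choice (ResNet-18), the GRU temporal classifier, the number of facial regions, and the averaged-logit fusion are all illustrative assumptions; the abstract does not specify the authors' exact architecture or fusion strategy.

# A minimal, hypothetical sketch of the framework described in the abstract:
# a shared image feature extractor, a global temporal classifier over frame
# sequences (CNN + RNN), and a local spatial classifier over facial-region
# crops. Module names, dimensions, and the fusion scheme are illustrative
# assumptions, not the authors' exact implementation.
import torch
import torch.nn as nn
from torchvision import models


class ForgedFaceVideoDetector(nn.Module):
    def __init__(self, feat_dim=512, hidden_dim=256, num_regions=4):
        super().__init__()
        # Image feature extractor: a ResNet-18 backbone with its
        # classification head removed (assumed backbone choice).
        backbone = models.resnet18(weights=None)
        self.extractor = nn.Sequential(*list(backbone.children())[:-1])

        # Global temporal branch: a GRU over per-frame features captures
        # inter-frame traces such as facial flickering.
        self.temporal_rnn = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.temporal_head = nn.Linear(hidden_dim, 2)

        # Local spatial branch: classifies features of cropped facial
        # regions (e.g. eyes, nose, mouth) where blending traces concentrate.
        self.spatial_head = nn.Linear(feat_dim * num_regions, 2)

    def forward(self, frames, regions):
        # frames:  (B, T, 3, H, W)  aligned face crops from a video clip
        # regions: (B, R, 3, H, W)  R local facial-region crops (R = num_regions)
        B, T = frames.shape[:2]
        R = regions.shape[1]

        frame_feat = self.extractor(frames.flatten(0, 1)).flatten(1)    # (B*T, D)
        region_feat = self.extractor(regions.flatten(0, 1)).flatten(1)  # (B*R, D)

        # Global temporal score from the last GRU hidden state.
        _, h_n = self.temporal_rnn(frame_feat.view(B, T, -1))
        temporal_logits = self.temporal_head(h_n[-1])                   # (B, 2)

        # Local spatial score from concatenated region features.
        spatial_logits = self.spatial_head(region_feat.view(B, R * frame_feat.size(1)))

        # Simple late fusion by averaging the two branches' logits
        # (assumed fusion strategy).
        return (temporal_logits + spatial_logits) / 2

In this sketch the same extractor is shared by both branches; whether the authors share weights between the temporal and spatial branches, and how the two scores are actually combined, are likewise assumptions made only for illustration.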