The Passive Forensics of Deep Video Inpainting

XIONG Yimao; DING Xiangling; GU Qing; YANG Gaobo; ZHAO Xianfeng

Cite this article:

XIONG Yimao, DING Xiangling, GU Qing, YANG Gaobo, ZHAO Xianfeng. The Passive Forensics of Deep Video Inpainting[J]. Journal of Cyber Security (信息安全学报), 2024, 9(4): 125-138

XIONG Yimao¹, DING Xiangling¹,²,³, GU Qing¹, YANG Gaobo⁴, ZHAO Xianfeng²,⁵
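(1. School of Computer Science and Engineering, Hunan University of Science and Technology, Xiangtan 411201, China; 2. State Key Laboratory of Information Security, Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China; 3. Zhengzhou Xinda Institute of Advanced Technology, Zhengzhou 450000, China; 4. College of Information Science and Engineering, Hunan University, Changsha 410082, China; 5. School of Cyber Security, University of Chinese Academy of Sciences, Beijing 100093, China)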

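Abstract:

Deep video inpainting uses deep learning to fill in missing regions of a video or to remove specific target objects. It can also be used to synthesize tampered videos that are hard to tell from genuine ones with the naked eye; in particular, maliciously inpainted videos that spread on social media can easily stir up negative public opinion. Passive detection of deep video inpainting forgery started late, and although it has attracted some attention, the research still falls far short in both depth and breadth. This paper therefore proposes a passive forensics technique based on cascaded ConvGRU and eight-direction local attention, which localizes deep-inpainted regions from a spatio-temporal perspective. First, to extract more features of the inpainted region, RGB frames and error level analysis (ELA) frames are fed into the encoder in parallel and fused at the channel feature level, producing multi-modal features at different scales. Second, in the decoder, the multi-scale features generated by the encoder are fused at the channel level with cascaded ConvGRUs to capture temporal discontinuities between video frames. Finally, an eight-direction local attention module is introduced after the last-level RGB features of the encoder; it attends to each pixel's neighborhood along eight directions and captures anomalies between pixels in the inpainted region. The experiments use four recent deep video inpainting methods, VI, OP, DSTT, and FGVC, and compare the proposed method with the existing deep video inpainting detection methods HPF and VIDNet. The proposed method outperforms HPF and achieves performance comparable to VIDNet with only one fifth of VIDNet's encoder parameters. The results show that, by using multi-scale dual-modal features and the eight-direction local attention module to attend to inter-pixel correlations, and ConvGRU to capture temporal anomalies, the proposed method achieves pixel-level localization of tampered regions with accurate results.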
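For readers unfamiliar with ConvGRU, the recurrence it commonly denotes can be sketched as follows. This is the standard textbook formulation, not necessarily the exact variant used in the paper, which the abstract does not specify. All symbols are assumptions introduced here: $x_t$ is the feature map for frame $t$, $h_t$ the hidden state, $*$ a 2D convolution, $\odot$ an element-wise product, and $\sigma$ the sigmoid function.

```latex
\begin{aligned}
z_t &= \sigma\left(W_z * x_t + U_z * h_{t-1}\right) \\
r_t &= \sigma\left(W_r * x_t + U_r * h_{t-1}\right) \\
\tilde{h}_t &= \tanh\left(W_h * x_t + U_h * (r_t \odot h_{t-1})\right) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
```

Because the hidden state $h_{t-1}$ carries information from earlier frames, a region whose content changes inconsistently over time can in principle show up in $h_t$. Some references swap the roles of $z_t$ and $1 - z_t$ in the last line; the two conventions are equivalent up to relabeling the gate.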

Keywords: deep video inpainting; video forgery detection; cascaded ConvGRU; local attention module; spatio-temporal prediction

DOI：10.19363/J.cnki.cn10-1380/tn.2024.07.08

Received: 2022-10-31; Revised: 2023-04-12

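Funding: This work was supported by the National Natural Science Foundation of China (No. 62272160), the Open Project of the State Key Laboratory of Information Security (No. 2021-ZD-07), and the Open Fund of the Henan Key Laboratory of Cyberspace Situation Awareness (No. HNTS2022025).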

The Passive Forensics of Deep Video Inpainting

XIONG Yimao¹, DING Xiangling¹,²,³, GU Qing¹, YANG Gaobo⁴, ZHAO Xianfeng²,⁵

(1.School of Computer Science and Engineering, Hunan University of Science and Technology, Xiangtan 411201, China;2.State Key Laboratory of Information Security, Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China;3.Zhengzhou Xinda Institute of Advanced Technology, Zhengzhou 450000, China;4.College of Computer and Communication, Hunan University, Changsha 410082, China;5.School of Cyber Security, University of Chinese Academy of Sciences, Beijing 100093, China)

Abstract:

Deep video inpainting uses deep learning to fill in missing areas of a video or to remove specific target objects. It can also be exploited to synthesize tampered videos that are difficult to identify with the naked eye, and maliciously inpainted videos spread on social media can easily provoke negative public opinion. Although the passive detection of deep video inpainting has received some attention, research on it started late and remains insufficient in both depth and breadth. This paper therefore proposes a passive forensics technique based on cascaded ConvGRU and eight-direction local attention that localizes the tampered regions in deep inpainted videos from the spatio-temporal domain. First, RGB frames and error level analysis (ELA) frames are fed into the encoder in parallel to extract more features of the inpainted area, and multi-modal features are generated at different scales through channel feature-level fusion. Second, in the decoder, the multi-scale features generated by the encoder are fused at the channel level with cascaded ConvGRUs to capture the temporal discontinuity between video frames. Finally, an eight-direction local attention module is introduced after the last-level RGB feature of the encoder; it attends to the neighborhood of each pixel along eight directions and captures anomalies between pixels in the inpainted area. In the experiments, four recent deep video inpainting methods, VI, OP, DSTT, and FGVC, are used, and the proposed method is compared with the existing deep video inpainting detection methods HPF and VIDNet. It outperforms HPF and achieves performance comparable to VIDNet with only one fifth of VIDNet's encoder parameters. The results show that the proposed method, which uses multi-scale dual-modal features and the eight-direction local attention module to attend to the correlation between pixels and ConvGRU to capture temporal anomalies, achieves accurate pixel-level localization of tampered regions.
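To give a feel for what looking at a pixel's neighborhood along eight directions means, here is a minimal sketch in plain Python. It is not the paper's attention module, which is learned and operates on encoder feature maps. The hand-written score, the function name `eight_direction_anomaly`, and the toy frame are all assumptions made for illustration.

```python
# Offsets (row, col) for the eight compass directions around a pixel.
EIGHT_DIRECTIONS = [(-1, -1), (-1, 0), (-1, 1),
                    (0, -1),           (0, 1),
                    (1, -1),  (1, 0),  (1, 1)]


def eight_direction_anomaly(frame):
    """Per pixel, return the mean absolute difference to its in-bounds 8-neighbors.

    `frame` is a list of equal-length rows of numbers (a single channel).
    This fixed score is a hypothetical stand-in for a learned attention module.
    """
    height, width = len(frame), len(frame[0])
    scores = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            diffs = []
            for dr, dc in EIGHT_DIRECTIONS:
                nr, nc = r + dr, c + dc
                # Skip directions that fall outside the frame.
                if 0 <= nr < height and 0 <= nc < width:
                    diffs.append(abs(frame[r][c] - frame[nr][nc]))
            scores[r][c] = sum(diffs) / len(diffs) if diffs else 0.0
    return scores


# Toy 3x3 frame: flat background with one outlier in the center.
toy = [[10, 10, 10],
       [10, 50, 10],
       [10, 10, 10]]
scores = eight_direction_anomaly(toy)
```

On the toy frame the center pixel differs from all eight neighbors and gets the highest score, while border pixels that touch it get smaller scores. A learned module would replace the fixed absolute difference with trainable weights per direction.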

Key words: deep video inpainting; video forgery detection; cascaded ConvGRU; local attention module; spatio-temporal prediction