Research on Adversarial Example Generation Method for Malicious PDF Document Classification

LIU Chao; LOU Chenzhe; YU Min; JIANG Jianguo; HUANG Weiqing

Cite this article:

LIU Chao, LOU Chenzhe, YU Min, JIANG Jianguo, HUANG Weiqing. Research on Adversarial Example Generation Method for Malicious PDF Document Classification[J]. Journal of Cyber Security, 2023, 8(5): 14-26

LIU Chao¹, LOU Chenzhe¹,², YU Min¹,², JIANG Jianguo¹, HUANG Weiqing¹
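(1. Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China; 2. School of Cyber Security, University of Chinese Academy of Sciences, Beijing 100093, China)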

Abstract:

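Spreading malware through malicious documents is very common on the modern Internet, and it is one of the highest risks that many organizations face. PDF is the most widely used document type in the world, so the attacks it gives rise to are countless. Detecting malicious documents with machine learning is a popular and effective approach, but when classifiers face samples carefully crafted by attackers, their robustness may show weaknesses. In computer vision, adversarial learning has been shown in many scenarios to be an effective way to improve classifier robustness. For malicious document detection, a comprehensive method for generating adversarial examples across various attack scenarios is still missing. In this paper, we introduce the basics of the PDF file format, along with effective malicious PDF document detectors and adversarial example generation techniques. We propose an adversarial learning model for malicious document detection that generates adversarial examples, and we use the generated examples to study detection effectiveness (and evasion effectiveness) in hypothetical multi-detector scenarios. The key operations of the model are associated feature extraction and feature modification: associated feature extraction finds the associations between different feature spaces, and feature modification maintains the stability of the samples. Finally, the attack algorithm uses the idea of momentum-based iterative gradients to improve the success rate and efficiency of adversarial example generation. We combined several convincing datasets, rigorously set up the experimental environment and metrics, and then ran adversarial example attack tests and robustness improvement tests. The experimental results show that the model maintains a high adversarial example generation rate and attack success rate. In addition, the model can be applied to other malware detectors and helps optimize detector robustness.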

Keywords: malicious PDF document|adversarial example|document classification|example generation|robustness

DOI: 10.19363/J.cnki.cn10-1380/tn.2023.09.02

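Received: 2020-03-22; Revised: 2020-04-11

Funding: This work was supported by the Youth Innovation Promotion Association of the Chinese Academy of Sciences (No. 2021155).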

Research on Adversarial Example Generation Method for Malicious PDF Document Classification

LIU Chao¹, LOU Chenzhe¹,², YU Min¹,², JIANG Jianguo¹, HUANG Weiqing¹

(1.Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China;2.School of Cyber Security, University of Chinese Academy of Sciences, Beijing 100093, China)

Abstract:

Spreading malware via malicious documents is very common on the modern Internet and is one of the highest risks faced by many organizations. PDF is the most widely used document type worldwide, and as a result, countless attacks exploit it. Using machine learning to detect malicious documents is a popular and effective approach, but machine learning classifiers can show robustness problems when faced with samples carefully crafted by attackers. In computer vision, adversarial learning has proven to be an effective way to improve classifier robustness in many scenarios. For malicious document detection, however, a comprehensive method for generating adversarial examples across various attack scenarios is still lacking. In this paper, we introduce the basics of the PDF file format, as well as effective malicious PDF document detectors and adversarial example generation techniques. We propose an adversarial learning model for malicious document detection that generates adversarial examples, and we use the generated examples to study detection effectiveness (and evasion effectiveness) in hypothetical scenarios with multiple detectors. The key operations of the model are associated feature extraction and feature modification: associated feature extraction finds the associations between different feature spaces, and feature modification maintains the stability of the examples. Finally, the attack algorithm uses the idea of momentum-based iterative gradients to improve the success rate and efficiency of adversarial example generation. We combined several convincing datasets, rigorously set up the experimental environment and metrics, and then ran adversarial example attack tests and robustness enhancement tests. The experimental results show that the proposed model maintains a high adversarial example generation rate and attack success rate. Moreover, the model can be applied to other malware detectors and can help optimize detector robustness.

Key words: malicious PDF document|adversarial example|document classification|example generation|robustness
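
The momentum-based iterative gradient idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' model: the toy detector `detector_score`, its weights, the add-only constraint (a stand-in for "feature modification that keeps the sample stable"), and the parameters `eps`, `steps`, and `mu` are all hypothetical choices made for illustration, and features are treated as continuous values for simplicity.

```python
import math

def detector_score(x):
    """Hypothetical toy detector: probability that feature vector x is malicious."""
    w = [0.9, -0.6, 0.4, -0.8]  # illustrative weights, not from the paper
    b = 0.5
    hidden = math.tanh(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return 1.0 / (1.0 + math.exp(-4.0 * hidden))

def numeric_grad(f, x, h=1e-5):
    """Central-difference gradient of f at x (x is a list of floats)."""
    grad = []
    for i in range(len(x)):
        up = x[:i] + [x[i] + h] + x[i + 1:]
        down = x[:i] + [x[i] - h] + x[i + 1:]
        grad.append((f(up) - f(down)) / (2 * h))
    return grad

def sign(v):
    return (v > 0) - (v < 0)

def momentum_attack(f, x0, eps=3.0, steps=10, mu=1.0, add_only=True):
    """Momentum iterative gradient descent on the malicious score f.

    Assumption: add_only=True forbids lowering any feature below its
    original value, mimicking edits that only add content to a file.
    """
    alpha = eps / steps          # per-step size
    x = list(x0)
    g = [0.0] * len(x0)          # accumulated (momentum) gradient
    for _ in range(steps):
        grad = numeric_grad(f, x)
        norm = sum(abs(v) for v in grad) or 1.0   # L1 normalisation
        g = [mu * gi + gr / norm for gi, gr in zip(g, grad)]
        for i in range(len(x)):
            xi = x[i] - alpha * sign(g[i])        # step that lowers the score
            xi = min(max(xi, x0[i] - eps), x0[i] + eps)  # stay within eps
            if add_only:
                xi = max(xi, x0[i])
            x[i] = xi
        if f(x) < 0.5:           # detector now labels the sample benign
            break
    return x

x0 = [2.0, 0.0, 1.0, 0.0]        # hypothetical malicious feature vector
adv = momentum_attack(detector_score, x0)
print(detector_score(x0), detector_score(adv))
```

In this sketch the accumulated gradient `g` plays the role of momentum: each step adds the L1-normalised current gradient to a decayed copy of the previous direction, which is the general technique the abstract names. A real attack on a PDF detector would also have to map the modified features back to a valid, still-functional file, which this example does not attempt.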