A Malicious Code Classification Method Based on API Call Graphs and External Knowledge

Xia Bing; Chu Shihao; Dong Yu; Tang Chongjun; Shan Zheng

Cite this article:

Xia Bing, Chu Shihao, Dong Yu, Tang Chongjun, Shan Zheng. A Malicious Code Classification Method Based on API Call Graphs and External Knowledge[J]. Journal of Cyber Security, accepted.

Xia Bing¹, Chu Shihao¹, Dong Yu¹, Tang Chongjun¹, Shan Zheng²
(1. Zhongyuan University of Technology; 2. Songshan Laboratory)

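Abstract:

Correctly identifying the category of malicious code has become a key component of digital defense mechanisms. Existing malicious code classification methods based on Application Programming Interface (API) calls lose structural relationships and express behavioral semantics only weakly. To address this problem, a novel deep-learning-based malicious code classification method is proposed. First, an API call subgraph is constructed for each function from the control flow graph of that binary function, and these subgraphs are embedded as nodes into the function call graph to build the API call graph of the binary file. Next, automated web crawling is used to enrich the contextual descriptions of the APIs and thereby strengthen the API feature representation. Finally, BERT and Graph Convolutional Networks (GCN) are used to mine the lexical semantics of sequences and the structural dependency semantics between graph nodes, and their fused result serves as the representation of the malicious behavior of the binary file. Experiments on a diverse set of malicious code samples show that the method significantly outperforms the baseline models on all performance metrics, with an F1 score of 97.8%. A final interpretability experiment also supports the effectiveness of the method.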
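The graph construction step in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the input layout (basic blocks listing their API calls in order), the function names `api_subgraph` and `build_api_call_graph`, and the rule that an edge links an API call to the next API call reachable along the control flow graph.

```python
from collections import defaultdict


def api_subgraph(blocks, cfg_edges):
    """Derive an API call subgraph from one function's control flow graph.

    blocks:    {block_id: [API names called in that block, in order]}
    cfg_edges: [(src_block, dst_block)]
    Hypothetical input layout, chosen only for this sketch.
    """
    succ = defaultdict(list)
    for src, dst in cfg_edges:
        succ[src].append(dst)

    def next_apis(block_id):
        # Collect the first API reachable through each successor path,
        # walking through blocks that contain no API call.
        found, stack, seen = set(), list(succ[block_id]), set()
        while stack:
            b = stack.pop()
            if b in seen:
                continue
            seen.add(b)
            if blocks.get(b):
                found.add(blocks[b][0])
            else:
                stack.extend(succ[b])
        return found

    edges = set()
    for block_id, apis in blocks.items():
        # Consecutive API calls inside one basic block.
        edges.update(zip(apis, apis[1:]))
        # Last API of this block -> first API of each reachable block.
        if apis:
            edges.update((apis[-1], nxt) for nxt in next_apis(block_id))
    nodes = {api for apis in blocks.values() for api in apis}
    return {"nodes": nodes, "edges": edges}


def build_api_call_graph(functions, call_edges):
    """Use each function's API subgraph as a node of the function call graph.

    functions:  {function_name: (blocks, cfg_edges)}
    call_edges: [(caller, callee)]
    """
    nodes = {name: api_subgraph(blocks, cfg)
             for name, (blocks, cfg) in functions.items()}
    return {"nodes": nodes, "edges": set(call_edges)}


# Toy binary with two functions (made-up control flow, real Windows API names).
functions = {
    "main": ({0: ["CreateFileW"], 1: [], 2: ["WriteFile", "CloseHandle"]},
             [(0, 1), (1, 2)]),
    "inject": ({0: ["OpenProcess"], 1: ["VirtualAllocEx"], 2: ["WriteProcessMemory"]},
               [(0, 1), (0, 2), (1, 2)]),
}
graph = build_api_call_graph(functions, [("main", "inject")])
```

In this toy input, block 1 of `main` calls no API, so the sketch links `CreateFileW` directly to `WriteFile`. A real pipeline would obtain the control flow and call graphs from a disassembler rather than from hand-written dictionaries.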

Keywords: malicious code; BERT pre-trained model; natural language processing; graph convolutional neural network; API call graph

DOI：

Submitted: 2024-08-13; Revised: 2024-11-28

Funding: National Natural Science Foundation of China (General Program, Key Program, Major Program)

A Malicious Code Classification Method Based on API Call Graphs and External Knowledge

Xia Bing¹, Chu Shihao¹, Dong Yu¹, Tang Chongjun¹, Shan Zheng²

(1. Zhongyuan University of Technology; 2. Songshan Laboratory)

Abstract:

Correctly identifying the category of malicious code has become a crucial component of digital defense mechanisms. Existing malicious code classification methods based on Application Programming Interface (API) calls suffer from missing structural relationships and weak expression of behavioral semantics. To address these issues, a novel deep-learning-based malicious code classification method is proposed. First, the control flow graphs of binary functions are used to construct subgraphs of the API calls made inside each function, which are then embedded as nodes in the function call graph to form the API call graph of the binary file. Next, automated web crawling enriches the contextual descriptions of the APIs and thereby enhances the API feature representation. Finally, BERT and Graph Convolutional Networks (GCN) are used to mine both the lexical semantics of sequences and the structural dependency semantics between graph nodes, and the fused result serves as the representation of the malicious behavior of the binary file. Experimental results on a diverse set of malicious code samples show that the method significantly outperforms the baseline models on all performance metrics, achieving an F1 score of 97.8%. The interpretability experiments also support the effectiveness of the approach.
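To make the final fusion step concrete, the following sketch runs one graph convolution over a tiny graph and joins the pooled result with a text vector. It rests on several assumptions the abstract does not state: the layer uses the common symmetric-normalized propagation rule ReLU(D^-1/2 (A + I) D^-1/2 H W), the graph is treated as undirected, node features are mean-pooled, and fusion is plain concatenation. The text vector is a stand-in for a BERT embedding of the API descriptions; no BERT model is run here, and all function names are hypothetical.

```python
import math


def normalize_adjacency(adj):
    """Return D^-1/2 (A + I) D^-1/2 for a symmetric, non-negative adjacency matrix."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]  # at least 1 because of the self-loops
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Plain list-of-lists matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def gcn_layer(adj, features, weights):
    """One graph convolution followed by ReLU."""
    propagated = matmul(matmul(normalize_adjacency(adj), features), weights)
    return [[max(0.0, value) for value in row] for row in propagated]


def fuse(node_features, text_vector):
    """Mean-pool node features and concatenate with a text embedding.

    Concatenation is an assumption of this sketch; the source does not
    specify the fusion operator.
    """
    n = len(node_features)
    pooled = [sum(col) / n for col in zip(*node_features)]
    return pooled + list(text_vector)


# Two connected nodes, one-hot node features, identity weights.
adj = [[0.0, 1.0], [1.0, 0.0]]
features = [[1.0, 0.0], [0.0, 1.0]]
weights = [[1.0, 0.0], [0.0, 1.0]]
text_vector = [0.1, 0.2]  # placeholder for a BERT sentence embedding

hidden = gcn_layer(adj, features, weights)
fused = fuse(hidden, text_vector)
```

After one layer, each node's features are an average over itself and its neighbor, so structure from the graph is mixed into every node before pooling. In practice the weights would be learned and the fused vector passed to a classifier.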

Key words: malware; BERT pre-trained model; natural language processing; graph convolutional networks; API call graph