Cite this article
  • Xia Bing, Chu Shihao, Dong Yu, Tang Chongjun, Shan Zheng. A Malicious Code Classification Method Based on API Call Graphs and External Knowledge[J]. Journal of Cyber Security (信息安全学报), accepted.

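A Malicious Code Classification Method Based on API Call Graphs and External Knowledge
Xia Bing¹, Chu Shihao¹, Dong Yu¹, Tang Chongjun¹, Shan Zheng²
(1. Zhongyuan University of Technology; 2. Songshan Laboratory)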
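Abstract:
Correctly identifying the category of malicious code has become a key component of digital defense mechanisms. Existing malicious code classification methods based on Application Programming Interface (API) calls lose the structural relationships among calls and express behavioral semantics only weakly. To address this problem, a novel deep-learning-based malicious code classification method is proposed. First, the API call subgraph used inside each function is constructed from the control flow graph of the binary function and embedded as a node into the function call graph, yielding an API call graph for the binary file. Next, an automated web crawler enriches the contextual description of each API to strengthen the API feature representation. Finally, BERT and Graph Convolutional Networks (GCN) are used to mine the lexical semantics of sequences and the structural dependency semantics between graph nodes, and the fused result serves as the representation of the binary file's malicious behavior. Experiments on a diverse set of malicious code samples show that the method significantly outperforms the baseline models on all performance metrics, reaching an F1 score of 97.8%. A final interpretability experiment also demonstrates the effectiveness of the method.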
Keywords:  malicious code  BERT pre-trained model  natural language processing  graph convolutional neural network  API call graph
DOI:
Submitted: 2024-08-13    Revised: 2024-11-28
Funding: National Natural Science Foundation of China (General Program, Key Program, Major Program)
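
The abstract describes building a per-function API call subgraph from the control flow graph and attaching it to the corresponding node of the function call graph, but it does not give the construction rules. The following is only a minimal sketch under stated assumptions: the function names `api_subgraph` and `build_api_call_graph`, the dictionary layout, and the rule that links an API call to the next API call reachable through call-free basic blocks are all illustrative choices, not the authors' implementation. The Windows API names are sample data.

```python
def api_subgraph(cfg, block_apis):
    """Return API-to-API edges for one function (assumed linking rule).

    cfg: dict mapping a basic-block id to its successor block ids.
    block_apis: dict mapping a basic-block id to the ordered API names it calls.
    """
    edges = set()
    for block, apis in block_apis.items():
        # Consecutive API calls inside one basic block are linked in order.
        edges.update(zip(apis, apis[1:]))
        if not apis:
            continue
        # Follow successors, passing through blocks that call no API.
        stack, seen = list(cfg.get(block, [])), set()
        while stack:
            nxt = stack.pop()
            if nxt in seen:
                continue
            seen.add(nxt)
            nxt_apis = block_apis.get(nxt, [])
            if nxt_apis:
                edges.add((apis[-1], nxt_apis[0]))
            else:
                stack.extend(cfg.get(nxt, []))
    return edges


def build_api_call_graph(functions, call_graph):
    """Attach each function's API subgraph to its node in the function call graph."""
    return {
        name: {
            "api_edges": api_subgraph(func["cfg"], func["block_apis"]),
            "callees": set(call_graph.get(name, [])),
        }
        for name, func in functions.items()
    }


# Hypothetical input: two functions with hand-written CFGs and API calls.
functions = {
    "main": {
        "cfg": {0: [1, 2], 1: [3], 2: [3], 3: []},
        "block_apis": {0: ["CreateFileW"], 1: [], 2: ["WriteFile"], 3: ["CloseHandle"]},
    },
    "helper": {
        "cfg": {0: []},
        "block_apis": {0: ["RegOpenKeyExW", "RegSetValueExW"]},
    },
}
call_graph = {"main": ["helper"], "helper": []}

graph = build_api_call_graph(functions, call_graph)
```

In this layout each function node carries its own set of API edges, while `callees` keeps the function-level structure, so both levels of the graph stay available to a downstream model.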
A Malicious Code Classification Method Based on API Call Graphs and External Knowledge
Xia Bing¹, Chu Shihao¹, Dong Yu¹, Tang Chongjun¹, Shan Zheng²
(1. Zhongyuan University of Technology; 2. Songshan Laboratory)
Abstract:
Correctly identifying the category of malicious code has become a crucial component of digital defense mechanisms. Existing malicious code classification methods based on Application Programming Interface (API) calls suffer from missing structural relationships and weak expression of behavioral semantics. To address these issues, a novel deep learning-based malicious code classification method is proposed. First, the control flow graphs of binary functions are used to construct subgraphs of the API calls made inside each function, which are then embedded as nodes in the function call graph to form an API call graph for the binary file. Next, automated web crawling enriches the contextual descriptions of the APIs to strengthen the API feature representation. Finally, BERT and Graph Convolutional Networks (GCN) are used to mine the lexical semantics of sequences and the structural dependency semantics between graph nodes, and the fused result serves as the representation of the malicious behavior of the binary file. Experimental results on a diverse set of malicious code samples show that the method significantly outperforms the baseline models, achieving an F1 score of 97.8%. The interpretability experiments also confirm the effectiveness of the approach.
Key words:  malware  BERT pre-trained model  natural language processing  graph convolutional networks  API call graph
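
To make the GCN step in the abstract concrete, the sketch below applies one graph-convolution layer in the widely used symmetric-normalised form, followed by a mean readout that turns node embeddings into a single graph-level vector. The abstract does not state which GCN variant, readout, or fusion strategy the method uses, so the helper names (`matmul`, `gcn_layer`, `mean_readout`), the toy graph, the identity weight matrix, and the mean readout are all assumptions made for illustration. The node features are placeholders standing in for text embeddings of API descriptions.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gcn_layer(adj, feats, weight):
    """One step of ReLU(D^-1/2 (A + I) D^-1/2 H W), with D the degrees of A + I."""
    n = len(adj)
    # Add self-loops so each node keeps its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    # Symmetric normalisation by node degree.
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    hidden = matmul(matmul(norm, feats), weight)
    # ReLU non-linearity.
    return [[max(0.0, v) for v in row] for row in hidden]


def mean_readout(node_embeddings):
    """Average node embeddings into one graph-level vector (assumed readout)."""
    n = len(node_embeddings)
    return [sum(col) / n for col in zip(*node_embeddings)]


# Toy graph: three nodes in a chain 0 - 1 - 2, treated as undirected for simplicity.
adj = [[0.0, 1.0, 0.0],
       [1.0, 0.0, 1.0],
       [0.0, 1.0, 0.0]]
# Placeholder 2-dimensional node features (stand-ins for text embeddings).
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Identity weights, chosen only to keep the numbers easy to follow.
weight = [[1.0, 0.0], [0.0, 1.0]]

node_embeddings = gcn_layer(adj, feats, weight)
graph_vector = mean_readout(node_embeddings)
```

A real pipeline would use learned weights, several layers, and a tensor library; the pure-Python form is used here only so the propagation rule can be read line by line.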