Malware Detection Method Based on Ensemble Learning Technology

LI Fang; ZHU Ziyuan; YAN Chao; MENG Dan

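(Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China; School of Cyber Security, University of Chinese Academy of Sciences, Beijing 100049, China)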

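Abstract:

In recent years, low-level microarchitecture features have been widely applied to malware detection. However, microarchitecture feature data usually contains a large amount of redundant information, and current detection methods do not effectively preprocess the input microarchitecture data, so malware detection has to rely on complex deep learning models to reach high detection performance. Deep learning detection models, however, have a large number of parameters and are difficult to apply in practice at the low level of a computer. To address these problems, this paper proposes a novel dynamic analysis method for detecting malware. First, the method builds an automatic microarchitecture feature collection system and randomly draws sub-samples from the collected General-Purpose Registers (GPRs) data to form the classification feature matrix. Compared with other microarchitecture features, GPRs features carry richer behavioral information but also contain more noise. The GPRs data therefore needs to be split into feature intervals to reduce data complexity and suppress noise. The paper then applies the Term Frequency-Inverse Document Frequency (TF-IDF) technique to select the most discriminative information from the extracted feature matrix for malware detection. TF-IDF effectively reduces the dimension of the feature matrix and thereby improves detection efficiency. To reduce model complexity while maintaining detection performance, the paper uses an ensemble learning model to identify malware. Experiments show that the ensemble learning model achieves a detection accuracy of 99.3% with a false positive rate of 3.7%, outperforming other existing methods while keeping model complexity low. In addition, the method can also be used to detect malicious behavior in real-world data.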

Keywords: malware detection; general-purpose registers; ensemble learning; term frequency-inverse document frequency

DOI: 10.19363/J.cnki.cn10-1380/tn.2024.01.10

Received: 2020-04-10; Revised: 2020-08-11

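Funding: This work was supported by the Strategic Priority Research Program of the Chinese Academy of Sciences (No. XDC02010400) and the National Key Research and Development Program of China (No. 2018YFB2202100).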

Malware Detection Method Based on Ensemble Learning Technology

LI Fang, ZHU Ziyuan, YAN Chao, MENG Dan

Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China; School of Cyber Security, University of Chinese Academy of Sciences, Beijing 100049, China

Abstract:

In recent years, low-level hardware microarchitecture features have been widely used for malware detection. However, most microarchitecture features contain a large amount of redundant information, and current detection methods do not effectively preprocess the input microarchitecture data, so they must rely on complex deep learning models to achieve high detection performance. Such models have a large number of parameters, which makes them difficult to deploy in the low-level hardware of computers. To solve these problems, we propose a novel dynamic analysis method to detect malware. First, the method builds an automatic microarchitecture feature collection system and randomly extracts sub-samples from the collected General-Purpose Registers (GPRs) data to form the classification feature matrix. Compared with other microarchitecture features, GPRs features carry much richer behavioral information but also contain noise. Therefore, we divide the GPRs data into intervals to reduce data complexity and suppress noise. We then adopt the Term Frequency-Inverse Document Frequency (TF-IDF) technique to select the most discriminative information from these matrices for malware detection. TF-IDF effectively reduces the dimension of the feature matrix and improves detection efficiency. To reduce model complexity while preserving detection performance, we use an ensemble learning model to identify malware. Experimental results indicate that the ensemble learning model achieves a detection accuracy of 99.3% with a 3.7% false positive rate, outperforming existing methods while having lower model complexity. In addition, our method also achieves a higher detection rate on real data.
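The interval division and TF-IDF selection steps mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the actual interval scheme, register width, or TF-IDF variant, so everything here is an assumption made for illustration: equal-width intervals over 32-bit values, 16 intervals, the plain weight tf × log(N/df), and a hypothetical `top_k_tokens` helper that ranks interval tokens by their highest weight in any sample. The register values are made up and are not data from the paper.

```python
import math
from collections import Counter

def to_interval_tokens(values, n_bins=16, bits=32):
    """Map each register value to the index of an equal-width interval (assumed scheme)."""
    width = (1 << bits) // n_bins
    return [min(v // width, n_bins - 1) for v in values]

def tf_idf(docs):
    """Return one {token: weight} dict per sample, using tf * log(N / df)."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each token once per sample
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return weights

def top_k_tokens(weights, k):
    """Keep the k tokens with the highest TF-IDF weight in any sample (hypothetical rule)."""
    best = {}
    for w in weights:
        for token, score in w.items():
            best[token] = max(best.get(token, 0.0), score)
    return sorted(best, key=lambda t: (-best[t], t))[:k]

# Toy register snapshots, one list per program sample (illustrative values only).
samples = [
    [0x00000010, 0x00000020, 0xF0000000, 0xF0000001],
    [0x00000011, 0x00000030, 0x80000000, 0x80000004],
    [0x00000012, 0x00000040, 0x00000050, 0x00000060],
]
docs = [to_interval_tokens(s) for s in samples]
weights = tf_idf(docs)
selected = top_k_tokens(weights, k=2)
print(docs)
print(selected)
```

In this toy input, interval 0 occurs in every sample, so its weight is zero and it is never selected. Intervals that appear in only one sample receive a positive weight, which is the sense in which TF-IDF favors discriminative features and shrinks the feature matrix.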

Key words: malware detection; general-purpose registers; ensemble learning; term frequency-inverse document frequency
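The abstract reports results as detection accuracy and false positive rate for an ensemble model but does not say which ensemble technique is used. The sketch below assumes simple majority voting over three hypothetical threshold classifiers, purely to show how an ensemble decision and the two reported metrics are computed. The feature vectors and labels are invented for the example.

```python
def majority_vote(classifiers, x):
    """Return 1 (malware) if more than half of the base classifiers predict 1."""
    votes = sum(clf(x) for clf in classifiers)
    return 1 if votes * 2 > len(classifiers) else 0

def accuracy_and_fpr(y_true, y_pred):
    """Compute accuracy and false positive rate, FPR = FP / (FP + TN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    accuracy = (tp + tn) / len(pairs)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return accuracy, fpr

# Hypothetical base classifiers, each thresholding a two-feature vector.
classifiers = [
    lambda x: int(x[0] > 0.5),
    lambda x: int(x[1] > 0.5),
    lambda x: int(x[0] + x[1] > 0.8),
]

# Toy feature vectors and labels (1 = malware, 0 = benign), illustrative only.
X = [(0.9, 0.8), (0.7, 0.2), (0.1, 0.9), (0.2, 0.1), (0.6, 0.1)]
y_true = [1, 1, 0, 0, 0]

y_pred = [majority_vote(classifiers, x) for x in X]
accuracy, fpr = accuracy_and_fpr(y_true, y_pred)
print(y_pred, accuracy, fpr)
```

The third sample is benign but receives two of three votes, so it counts as a false positive. The false positive rate is measured over benign samples only, which is why it can differ substantially from the overall error rate.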