SEMBeF: A Sensitive and Efficient Malware Behavior Detection Framework Based on Sliced Recurrent Neural Network

ZHAN Jing; FAN Xue; LIU Yifan; ZHANG Qian

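(School of Computer, Beijing University of Technology, Beijing 100124, China; Beijing Key Laboratory of Trusted Computing, Beijing 100124, China; National Engineering Laboratory of Key Technologies of Information Security Grade Protection, Beijing 100124, China)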

Abstract:

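Word vectors and Recurrent Neural Networks (RNNs) can capture semantic and temporal information and have achieved great success in natural language recognition. The API call sequence generated while code runs likewise reflects the code's real intent, so we apply these techniques to malicious code identification, aiming for high accuracy while reducing the manual work of extracting and analyzing code features. However, three problems remain: 1) much malware deliberately destroys the normal context by randomly mixing calls to sensitive and non-sensitive APIs, so treating the two kinds of API equally may cause false negatives; 2) to collect code behavior as comprehensively as possible, the API sequences generated at runtime are long, which makes RNN training take too long; 3) the softmax classification function commonly used with classical RNNs generalizes poorly, leaving room to improve accuracy. To solve these problems, this paper proposes SEMBeF, a sensitive and efficient malicious code behavior detection framework based on the Sliced Recurrent Neural Network (SRNN). In SEMBeF, we propose a sensitive word vector algorithm that enhances the weights of security-sensitive APIs, so that the code representation contains both context information and security-sensitive weight information. We also propose an SGRU-SVM network structure, which uses parallel computation to greatly reduce the long training time caused by overly long API call sequences and improves detection accuracy. Finally, we optimize for sample balance and the selection of network hyper-parameters, further improving detection accuracy. We also implement a SEMBeF verification system. Experiments show that, compared with other deep learning methods based on classical word vectors and RNNs and with common machine learning methods, SEMBeF not only achieves the highest detection accuracy but also significantly improves training efficiency. Its detection accuracy and training time are 99.40% and 210 minutes, respectively; compared with a traditional RNN, accuracy improves by 0.48% and training time falls by 96.6%.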

Keywords: malware behavior detection; API sequence; sensitive word vector model; sliced recurrent neural network (SRNN)

DOI：10.19363/J.cnki.cn10-1380/tn.2019.11.06

Submitted: 2017-12-13; Revised: 2018-03-07

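Funding: This work was supported by the National Key Research and Development Program of China (No. 2016YFB0800204), the Open Project of the National Defense Scientific Research and Test Information Security Laboratory (No. 2016XXAQ08), and the National High Technology Research and Development Program of China (No. 2015AA016002).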

SEMBeF: A Sensitive and Efficient Malware Behavior Detection Framework based on Sliced Recurrent Neural Network

ZHAN Jing, FAN Xue, LIU Yifan, ZHANG Qian

School of Computer, Beijing University of Technology, Beijing 100124, China; Beijing Key Laboratory of Trusted Computing, Beijing 100124, China; National Engineering Laboratory of Key Technologies of Information Security Grade Protection, Beijing 100124, China

Abstract:

Combined with a word vector space model, a Recurrent Neural Network (RNN) can capture semantic and temporal information, and this combination has achieved great success in natural language recognition. Similarly, the sequence of API calls that code generates at runtime reflects the code's real intent. We therefore apply word vectors and RNNs to malicious code detection, aiming for high accuracy while reducing the manual work of extracting and analyzing code features. However, three problems remain: 1) much malware intentionally destroys the normal context by randomly mixing sensitive and non-sensitive API calls; 2) to collect code behavior as comprehensively as possible, the API sequence generated while code is running can be very long, which leads to long training time; 3) softmax classification is commonly used with classical RNNs, and there is still room to improve accuracy. To solve these problems, this paper proposes a Sensitive and Efficient Malware Behavior detection Framework (SEMBeF) based on the Sliced Recurrent Neural Network (SRNN). In SEMBeF, we propose a sensitive word vector space algorithm that enhances the weights of security-sensitive APIs, so that the code representation contains both context information and security-sensitive weight information. We also propose an SGRU-SVM network structure, which greatly reduces the long training time caused by long API sequences and improves detection accuracy. Finally, we optimize SGRU-SVM for sample balance and hyper-parameter selection, which further improves detection accuracy. We also implement a SEMBeF PoC (Proof of Concept) system. Experiments show that, compared with other deep learning methods based on the classical word vector space model, machine learning methods, and other common deep neural network models, the SEMBeF system not only has the highest detection accuracy but also significantly improves training efficiency. The detection accuracy and training time of SEMBeF are 99.40% and 210 minutes, respectively. Compared with the traditional GRU model, accuracy increases by 0.48% and training time decreases by 96.6%.
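
The two preprocessing ideas named in the abstract, boosting security-sensitive APIs in the vector representation and slicing a long API sequence so that sub-sequences can be processed in parallel, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the weighting algorithm, so the scalar scaling used here, the names `SENSITIVE_APIS`, `SENSITIVE_WEIGHT`, `weight_embeddings`, and `slice_sequence`, and the example API names are all assumptions chosen for illustration.

```python
from typing import Dict, List, Sequence, Set

# Hypothetical sensitive-API set and boost factor (assumptions, not from the paper).
SENSITIVE_APIS: Set[str] = {"WriteProcessMemory", "CreateRemoteThread"}
SENSITIVE_WEIGHT: float = 2.0


def weight_embeddings(
    calls: Sequence[str],
    embeddings: Dict[str, List[float]],
    sensitive: Set[str] = SENSITIVE_APIS,
    weight: float = SENSITIVE_WEIGHT,
) -> List[List[float]]:
    """Look up each API call's vector and scale the vectors of sensitive APIs.

    Scalar scaling is an assumed stand-in for the paper's weighting scheme.
    """
    weighted = []
    for name in calls:
        factor = weight if name in sensitive else 1.0
        weighted.append([factor * x for x in embeddings[name]])
    return weighted


def slice_sequence(seq: Sequence, n: int, k: int) -> list:
    """Recursively split seq into n equal parts, k times (SRNN-style slicing).

    The innermost lists are the minimal sub-sequences; in a sliced RNN each of
    them could be fed to a recurrent unit independently of the others.
    """
    if k == 0:
        return list(seq)
    if len(seq) % n != 0:
        raise ValueError("sequence length must be divisible by n at every level")
    step = len(seq) // n
    return [slice_sequence(seq[i * step:(i + 1) * step], n, k - 1) for i in range(n)]


if __name__ == "__main__":
    toy_embeddings = {"ReadFile": [1.0, 2.0], "WriteProcessMemory": [0.5, 1.0]}
    print(weight_embeddings(["ReadFile", "WriteProcessMemory"], toy_embeddings))
    print(slice_sequence(list(range(8)), n=2, k=2))
```

Because the minimal sub-sequences do not depend on one another, a long sequence of length T only requires recurrent steps over chunks of length T / n^k at the lowest level, which is the general reason slicing can shorten training on long API traces.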

Key words: malware behavior detection; API sequence; sensitive word vector space model; sliced recurrent neural network (SRNN)