Research on Malicious URL Detection Based on Cost-sensitive Learning

CAI Qingmeng; WANG Jian; LI Pengbo

Cite this article:

CAI Qingmeng, WANG Jian, LI Pengbo. Research on Malicious URL Detection Based on Cost-sensitive Learning[J]. Journal of Cyber Security, 2023, 8(2): 54-65

CAI Qingmeng¹, WANG Jian¹, LI Pengbo²
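(1. School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China; 2. CETC Cyberspace Security Research Institute, Beijing 100085, China)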

Abstract:

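With the arrival of the big data era, malicious URLs, as a medium for Web attacks, increasingly threaten the security of users' information. Traditional malicious URL detection methods such as blacklist detection and signature matching are gradually exposing their shortcomings, so this paper proposes a malicious URL detection model based on a cost-sensitive learning strategy. To improve the performance of convolutional neural networks in malicious web page detection, the paper combines URL data with HTTP request information as the raw data samples for feature extraction, which addresses the difficulty of extracting features from URL data that is too simple on its own. Three encoding methods are compared experimentally, and the best character encoding method is selected according to the results, ensuring the effectiveness of the subsequent detection model. A convolutional neural network model suited to URL detection is then designed around the characteristics of URL character input. To extract deep features of the data, two convolutional layers are used for feature extraction. In the pooling layer, a BiLSTM is used to extract the temporal features of the data, and the output of the last unit of that network is taken to achieve the pooling effect, which avoids a large amount of model computation and preserves detection efficiency. To address the imbalance of data samples, different penalty factors are assigned to the samples during iteration, and the rule for assigning initial sample weights is improved and normalized, increasing the share of malicious samples in the overall error function. Experimental results show that the proposed model outperforms other mainstream detection models in accuracy, recall, and detection efficiency, and that it is fairly robust to imbalanced datasets.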
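The abstract does not say which three encodings were compared or which one was selected. As a rough illustration only, the following sketch shows one common option, character-level index encoding, applied to a sample built from a URL plus HTTP request information. The alphabet, the `build_sample` and `encode_chars` helpers, and the `max_len` value are all assumptions made for this example and are not taken from the paper.

```python
import string
from typing import List

# Assumed alphabet: printable ASCII characters that commonly appear in URLs
# and HTTP request lines. Index 0 is reserved for padding and index 1 for
# characters outside the alphabet.
ALPHABET = string.ascii_letters + string.digits + string.punctuation + " "
PAD, UNK = 0, 1
CHAR_TO_INDEX = {ch: i + 2 for i, ch in enumerate(ALPHABET)}


def build_sample(method: str, url: str, body: str = "") -> str:
    """Join the HTTP method, URL, and request body into one raw text sample."""
    return f"{method} {url} {body}".strip()


def encode_chars(text: str, max_len: int = 64) -> List[int]:
    """Map each character to an integer index, then truncate or pad to max_len."""
    indices = [CHAR_TO_INDEX.get(ch, UNK) for ch in text[:max_len]]
    return indices + [PAD] * (max_len - len(indices))


if __name__ == "__main__":
    sample = build_sample("GET", "http://example.com/a?id=1")
    print(encode_chars(sample, max_len=40))
```

A fixed-length integer sequence of this kind is what an embedding layer followed by convolutional layers would typically consume; the paper's actual preprocessing may differ.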

Keywords: deep learning; malicious web page; URL detection; cost-sensitive learning; neural network

DOI: 10.19363/J.cnki.cn10-1380/tn.2023.03.05

Received: 2021-07-21; Revised: 2021-11-15

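Funding: This work was supported by a key project of the Science and Technology Research and Development Program of China State Railway Group Co., Ltd. (No. N2020W005) and by a project of the National Secrecy Technology Evaluation Center (No. K20GY500010).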

Research on Malicious URL Detection Based on Cost-sensitive Learning

CAI Qingmeng¹, WANG Jian¹, LI Pengbo²

(1. School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China; 2. CETC Cyberspace Security Research Institute, Beijing 100085, China)

Abstract:

With the advent of the big data era, malicious URLs, as a medium for Web attacks, increasingly threaten the security of users' information. Traditional malicious URL detection methods, such as blacklist detection and signature matching, are exposing their intrinsic defects. To this end, this paper proposes a malicious URL detection model based on a cost-sensitive learning strategy. HTTP request parameters are combined with URL information as the original data samples for feature extraction, and the corresponding data processing is carried out to resolve the difficulty of extracting features from overly simple URL data. Three encoding methods are compared experimentally, and the best character encoding approach is selected, which ensures the effectiveness of the subsequent detection model. For the network itself, a convolutional neural network suited to URL detection is designed around the characteristics of URL character input. Two convolutional layers are used to extract the deep features of the data. In the pooling layer, a Bidirectional Long Short-Term Memory (BiLSTM) network extracts the temporal features of the data, and the output of its last unit is taken to achieve the pooling effect. This not only extracts the contextual information in the data effectively but also avoids a large amount of model computation and thus preserves detection efficiency. To address the problem of imbalanced data samples, the model assigns different penalty factors to the samples during the iterative process, and improves and normalizes the rule for assigning initial sample weights, which increases the weight of malicious samples in the overall error function. Experimental results show that the proposed model outperforms other mainstream detection models in accuracy, recall, and detection efficiency, and that it is more robust to imbalanced datasets.
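The abstract describes the cost-sensitive step only at a high level and does not give the weight-assignment rule or the penalty factors. The sketch below is therefore not the authors' method. It illustrates the general idea with a swapped-in, widely used scheme: per-sample weights inversely proportional to class frequency, normalized to sum to 1, and applied inside a weighted binary cross-entropy. The function names and the label convention (1 = malicious, 0 = benign) are assumptions for this example.

```python
import math
from typing import List


def normalized_class_weights(labels: List[int]) -> List[float]:
    """Weight each sample inversely to its class size, then normalize to sum to 1.

    This inverse-frequency rule is an assumed stand-in for the paper's rule.
    """
    counts = {c: labels.count(c) for c in set(labels)}
    raw = [1.0 / counts[y] for y in labels]
    total = sum(raw)
    return [w / total for w in raw]


def weighted_bce(probs: List[float], labels: List[int], weights: List[float]) -> float:
    """Weighted binary cross-entropy; probs are predicted P(malicious)."""
    eps = 1e-12  # guards against log(0)
    loss = 0.0
    for p, y, w in zip(probs, labels, weights):
        p = min(max(p, eps), 1.0 - eps)
        loss -= w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return loss


if __name__ == "__main__":
    labels = [0, 0, 0, 1]  # imbalanced: three benign samples, one malicious
    weights = normalized_class_weights(labels)
    missed_malicious = weighted_bce([0.1, 0.1, 0.1, 0.1], labels, weights)
    false_alarm = weighted_bce([0.9, 0.1, 0.1, 0.9], labels, weights)
    print(weights, missed_malicious, false_alarm)
```

With these weights, misclassifying the single malicious sample contributes more to the loss than misclassifying one benign sample, which is the effect the abstract attributes to increasing the weight of malicious samples in the error function.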

Key words: deep learning; malicious web page; URL detection; cost-sensitive learning; neural networks