Network Intrusion Detection Method Based on Cluster Oversampling and Auto-Encoder

JIAN Shijie; LIU Yue; JIANG Bo; LU Zhigang; LIU Yuling; LIU Baoxu

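(Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China; School of Cyber Security, University of Chinese Academy of Sciences, Beijing 100049, China)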

Abstract:

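In recent years, with the continuous development of Internet technology, intrusion detection has played an increasingly important role in maintaining cyberspace security. However, because network intrusion behaviors are sparse in the data, existing detection methods perform poorly on massive traffic data: model accuracy, F-measure and other metrics are low, and high-dimensional data is too costly to process. To solve these problems, this paper proposes a new deep neural network intrusion detection method for scenarios with sparse anomalous samples, which can effectively identify abnormal behaviors in imbalanced datasets. The method first applies the k-means synthetic minority over-sampling technique to the imbalanced traffic data, balancing the class distribution of the network traffic. It then uses an auto-encoder to process the massive high-dimensional data and train the detection model, improving the detection precision for abnormal behaviors in massive high-dimensional traffic. Extensive experiments were conducted on two typical real-world intrusion detection datasets. The results show that the proposed method achieves detection accuracies of 99.06% and 99.16% and F-measures of 99.15% and 98.22% on the two datasets, respectively. Compared with commonly used under-sampling and over-sampling methods, the k-means synthetic minority over-sampling technique effectively addresses the class imbalance of network traffic data and improves the model's detection of low-frequency attacks. Compared with existing network intrusion detection methods, the proposed method also shows clear improvements in accuracy, F-measure and detection performance, demonstrating that it offers high detection accuracy and good application prospects for massive network traffic data.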
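
As a rough illustration of the oversampling step described above, the following Python sketch clusters the data with k-means and then creates synthetic minority samples by interpolating between minority points that fall in the same cluster. It is a simplified stand-in, not the authors' implementation: the function names (`sqdist`, `kmeans`, `kmeans_smote`), the evenly spaced center initialization, the rule of keeping any cluster with at least two minority samples, and the toy data are all assumptions made for this example. The published k-means SMOTE algorithm additionally filters clusters by imbalance ratio and weights them by minority sparsity, which is omitted here.

```python
import random


def sqdist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    # Assumption: deterministic, evenly spaced initial centers (for reproducibility).
    centers = [list(points[i * len(points) // k]) for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest center.
        labels = [min(range(k), key=lambda c: sqdist(p, centers[c])) for p in points]
        # Move each center to the mean of its members (empty clusters stay put).
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def kmeans_smote(X, y, minority, k=2, seed=0):
    """Add synthetic minority samples until the minority matches the rest in size."""
    rng = random.Random(seed)
    labels = kmeans(X, k)
    n_min = sum(1 for t in y if t == minority)
    need = (len(y) - n_min) - n_min
    # Simplification: keep every cluster holding at least two minority samples.
    pools = []
    for c in range(k):
        mins = [x for x, t, lab in zip(X, y, labels) if lab == c and t == minority]
        if len(mins) >= 2:
            pools.append(mins)
    synthetic = []
    while pools and len(synthetic) < need:
        a, b = rng.sample(rng.choice(pools), 2)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([ai + gap * (bi - ai) for ai, bi in zip(a, b)])
    return X + synthetic, y + [minority] * len(synthetic)


# Toy data: six "normal" records (class 0) and two "attack" records (class 1).
X = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.1, 0.0], [0.2, 0.3],
     [5.0, 5.0], [5.2, 5.1]]
y = [0, 0, 0, 0, 0, 0, 1, 1]
X_bal, y_bal = kmeans_smote(X, y, minority=1)
print(y_bal.count(0), y_bal.count(1))
```

Because new points are generated only inside clusters that already contain minority samples, the synthetic records stay close to observed attack behavior instead of being scattered across the majority region.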

Keywords: intrusion detection; massive traffic data; class imbalance; auto-encoder; k-means synthetic minority over-sampling technique

DOI: 10.19363/J.cnki.cn10-1380/tn.2023.11.10

Received: 2020-04-14; Revised: 2020-06-08

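Funding: This work was supported by the National Key Research and Development Program of China (No. 2019QY1303, No. 2019QY1302, No. 2021YFC3300401), the Strategic Priority Research Program (Category C) of the Chinese Academy of Sciences (No. XDC02040100), and the Youth Innovation Promotion Association of the Chinese Academy of Sciences (No. 2021156).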

Network Intrusion Detection Using Cluster Oversampling and Auto-Encoder

JIAN Shijie, LIU Yue, JIANG Bo, LU Zhigang, LIU Yuling, LIU Baoxu

Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China; School of Cyber Security, University of Chinese Academy of Sciences, Beijing 100049, China

Abstract:

With the continuous development of Internet technology, intrusion detection has become increasingly important for safeguarding cyberspace security in recent years. However, because network intrusion behaviors are sparse in the data, existing detection methods perform poorly on massive traffic data: accuracy, F-measure and other indicators are relatively low, and the cost of processing high-dimensional data is too high. To address these issues, we propose a novel deep neural network intrusion detection method for scenarios with sparse abnormal samples, called K-means Sparse Anomaly Intrusion Detection System (KSAIDS), which effectively identifies abnormal behaviors in imbalanced datasets. Specifically, we first use the k-means Synthetic Minority Over-sampling Technique (k-means SMOTE) to balance the class distribution of the network traffic data. The proposed model then employs an auto-encoder to process the massive high-dimensional data and train the detection model, so as to improve the detection accuracy of abnormal behaviors in massive high-dimensional traffic. Extensive experiments are carried out on two typical real-world intrusion detection datasets. The results show that the proposed method achieves detection accuracies of 99.06% and 99.16% and F-measures of 99.15% and 98.22% on the two datasets, respectively. Compared with commonly used under-sampling and over-sampling methods, k-means SMOTE effectively addresses the class imbalance of network traffic data and improves the model's detection of low-frequency attack behaviors. Moreover, compared with state-of-the-art intrusion detection models, KSAIDS significantly improves accuracy, F-measure and detection performance, which demonstrates that it offers high detection accuracy and good application prospects for large-scale network traffic data.
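
The two metrics quoted in the abstract can be computed from confusion-matrix counts. The sketch below assumes a binary setting (attack vs. normal) and the standard F1 form of the F-measure; the paper may use a multi-class or weighted variant, and the function name `accuracy_and_f_measure` and the toy labels are hypothetical, chosen only to show why accuracy alone can mislead on imbalanced traffic.

```python
def accuracy_and_f_measure(y_true, y_pred, positive=1):
    """Return (accuracy, F1) for a binary task, treating `positive` as the attack class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return accuracy, f_measure


# Toy imbalanced data: 95 normal flows (0) and 5 attacks (1).
y_true = [0] * 95 + [1] * 5

# A detector that labels everything "normal" scores high accuracy
# but has a zero F-measure on the attack class.
acc_naive, f_naive = accuracy_and_f_measure(y_true, [0] * 100)

# A detector that finds 4 of the 5 attacks and raises 1 false alarm.
y_pred = [1] + [0] * 94 + [1] * 4 + [0]
acc_better, f_better = accuracy_and_f_measure(y_true, y_pred)

print(acc_naive, f_naive)
print(acc_better, f_better)
```

Reporting F-measure alongside accuracy, as the abstract does, makes it visible when a model ignores the rare attack class.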

Key words: intrusion detection; massive traffic data; class imbalance; auto-encoder; k-means synthetic minority over-sampling technique