Design and Implementation of a DGA Domain Name Detection System for Botnets Based on Machine Learning

YU Guangxi; ZHANG Yan; CUI Huajun; YANG Xinghua; LI Yang; LIU Chang

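(Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China; School of Cyber Security, University of Chinese Academy of Sciences, Beijing 100049, China)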

Abstract:

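Botnets widely use domain generation algorithms (DGAs) to generate large numbers of random domain names and thereby evade detection. To address the problem of botnet DGA domain names, this paper designs and implements a DGA domain name detection system. First, a lightweight classification module based on the random forest algorithm analyzes domain name character features to separate normal domain names from suspected malicious ones, which meets the fast-detection requirement of real-world network deployment. Then, building on the classification results, a clustering module based on the X-means algorithm examines the suspected malicious domain names further: using the character similarity and query-behavior similarity of DGA domain names, it applies clustering and set analysis to lower the system's false positive rate. A Spark-based deployment of the detection system processed and analyzed real DNS log data from an operator's production network for 20 consecutive days. On average the system uncovered about 2.5 million DGA domain names per day, and regular-expression matching showed that about 55% of them belong to 5 known DGA families. In the first two experimental days, 13,000 known DGA domain names belonging to 3 DGA families were found. The experimental results show that the detection system can effectively detect multiple kinds of DGA domain names and can also meet the fast-detection requirement of real-world network deployment.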
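As a rough illustration of what "domain name character features" can look like, the sketch below computes a few statistics that are commonly fed to a classifier such as a random forest. The abstract does not list the features the system actually uses, so the feature set here (length, Shannon entropy, digit ratio, vowel ratio), the function name `char_features`, and the choice to inspect only the leftmost label are all assumptions made for illustration.

```python
import math
from collections import Counter

VOWELS = set("aeiou")

def char_features(domain: str) -> dict:
    """Compute simple character statistics for the leftmost label of a domain.

    Hypothetical feature set: the paper's abstract does not specify which
    character features its random forest module uses.
    """
    # Simplification: look only at the leftmost label, e.g. "a1b2" in "a1b2.net".
    label = domain.lower().rstrip(".").split(".")[0]
    n = len(label)
    if n == 0:
        return {"length": 0, "entropy": 0.0, "digit_ratio": 0.0, "vowel_ratio": 0.0}
    counts = Counter(label)
    # Shannon entropy (in bits) of the character distribution within the label.
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return {
        "length": n,
        "entropy": entropy,
        "digit_ratio": sum(ch.isdigit() for ch in label) / n,
        "vowel_ratio": sum(ch in VOWELS for ch in label) / n,
    }
```

Algorithmically generated labels tend to have higher entropy and fewer vowels than human-chosen names, which is why statistics of this kind are cheap yet useful inputs for a fast first-stage classifier.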

Keywords: domain generation algorithm; machine learning; character analysis; query behavior analysis; distributed processing

DOI: 10.19363/J.cnki.cn10-1380/tn.2020.05.04

Submitted: 2018-07-19; Revised: 2018-12-19

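Funding: This work was supported by the Innovation Research Project of the Institute of Information Engineering, Chinese Academy of Sciences (No. J810091105) and the Excellent Young Talent Recruitment Program (No. Y6Z0011105).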

Design and Implementation of A DGA Domain Name Detection System by Machine Learning

YU Guangxi, ZHANG Yan, CUI Huajun, YANG Xinghua, LI Yang, LIU Chang

Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China; School of Cyber Security, University of Chinese Academy of Sciences, Beijing 100049, China

Abstract:

To avoid detection, botnets usually use domain generation algorithms (DGAs) to generate large numbers of random domain names. In this paper, we design and implement a DGA domain name detection system. We first design a lightweight classification module based on random forest, which uses domain name character features to distinguish suspicious domain names from normal ones and meets the demand for fast detection in real networks. Then, based on the classification results, we design an X-means clustering module, which applies clustering and set analysis to the query-behavior and character features of the suspicious domain names in order to analyze them further and reduce the false positive rate. The system is implemented on Spark. Processing and analyzing real ISP network DNS logs over 20 days, the system detected about 2.5 million DGA domain names per day on average. Regular-expression matching showed that about 55% of them belong to 5 known DGA families. In the first two experimental days, more than 13,000 regex-matched domain names belonging to 3 DGA families hit known DGA domain names. Overall, the experimental results show that the system can effectively detect DGA domain names from multiple families and can also meet the demand for fast detection in real networks.
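To make the idea of query-behavior similarity and set analysis more concrete, here is a minimal sketch that groups DNS query records by domain and compares the sets of clients that queried them. The abstract does not say how the system measures this similarity, so the use of Jaccard similarity, the `(client, domain)` record layout, and the function names `clients_by_domain` and `jaccard` are assumptions for illustration only.

```python
from collections import defaultdict

def clients_by_domain(records):
    """Group (client, domain) query records into a set of clients per domain.

    The record layout is hypothetical; real DNS logs carry more fields.
    """
    groups = defaultdict(set)
    for client, domain in records:
        groups[domain.lower()].add(client)
    return dict(groups)

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a & b| / |a | b|, defined as 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Example: two random-looking domains queried by the same pair of hosts.
records = [
    ("10.0.0.1", "qxkzpt.example"),
    ("10.0.0.2", "qxkzpt.example"),
    ("10.0.0.1", "vbnwrl.example"),
    ("10.0.0.2", "vbnwrl.example"),
    ("10.0.0.9", "news.example"),
]
groups = clients_by_domain(records)
similarity = jaccard(groups["qxkzpt.example"], groups["vbnwrl.example"])
```

Domains generated by the same DGA are typically queried by the same infected hosts, so a high overlap between client sets is one plausible signal that suspicious domains belong together.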

Keywords: domain generation algorithm; machine learning; character analysis; querying behavior analysis; distributed processing