Hidden Service Access Activity Recognition Based on Key Sequences Characteristics

Wang Xuebin; Li Zeyu; Wang Meiqi; Huang Wentao; Shi Jinqiao; Tan Qingfeng; Liu Jie; Fang Binxing

Cite this article:

Wang Xuebin, Li Zeyu, Wang Meiqi, Huang Wentao, Shi Jinqiao, Tan Qingfeng, Liu Jie, Fang Binxing. Hidden Service Access Activity Recognition Based on Key Sequences Characteristics[J]. Journal of Cyber Security, accepted.

Wang Xuebin¹, Li Zeyu², Wang Meiqi¹, Huang Wentao², Shi Jinqiao², Tan Qingfeng³, Liu Jie¹, Fang Binxing⁴
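(1. Institute of Information Engineering, Chinese Academy of Sciences; 2. School of Cyber Security, Beijing University of Posts and Telecommunications; 3. Cyberspace Institute of Advanced Technology, Guangzhou University; 4. Institute of Electronic and Information Engineering of UESTC in Guangdong)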

Abstract:

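Recognizing access to Tor hidden services is an effective way to trace Tor darknet users. Existing recognition algorithms take the sequence of traffic packets generated during a user's access as input, extract features from it, and build a classification model, and they achieve good recognition results. In practical deployments, however, where the volume of traffic to be inspected is large and online recognition must be timely, existing methods depend on long packet sequences, cannot recognize the behavior early, and consume substantial memory and computing resources to sustain detection. To address this problem, this paper analyzes in detail the circuit semantics and packet distribution of the Tor protocol and proposes a method for recognizing Tor hidden service access behavior based on key sequence features. The method uses as input only a specific interval of the packet sequence that lies in the early stage of the access and carries important semantic discriminative power, extracts features from it, and builds a classification model that can effectively recognize the access behavior. Compared with existing methods, it relies on a shorter packet sequence, which improves recognition timeliness and reduces hardware resource cost. To validate the method, we build an experimental dataset from traffic captured in a variety of real access scenarios and finely annotate the circuit packet sequence intervals with Tor protocol-level semantics. Experimental results show that, across 6 access scenarios in 2 categories (direct connection to the Tor network and obfuscated connection), the network packets that contribute most to distinguishing hidden service access from other behaviors overlap heavily with the key packet sequence extracted in this paper, confirming that this sequence carries important semantic discriminative power. Compared with prior work, a classification model built on features extracted from this key packet sequence improves precision and F1 score by 2%-3%, improves recognition timeliness by 27%-51%, and reduces the input feature sequence length of the recognition model by 78%-95%.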
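For reference, the precision and F1 figures reported in the abstract are conventionally defined as follows. The abstract does not restate these definitions, so the notation here ($TP$, $FP$, $FN$ for true positives, false positives, and false negatives of the hidden-service-access class) is an assumption of standard usage rather than something taken from the paper.

```latex
\[
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
\]
```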

Keywords: anonymity network; hidden service; encrypted traffic classification; deep learning; traffic obfuscation

DOI：10.19363/J.cnki.cn10-1380/tn.2024.04.01

Received: 2021-12-31; Revised: 2022-03-10

Funding:

Hidden Service Access Activity Recognition Based on Key Sequences Characteristics

Wang Xuebin¹, Li Zeyu², Wang Meiqi¹, Huang Wentao², Shi Jinqiao², Tan Qingfeng³, Liu Jie¹, Fang Binxing⁴

(1. Institute of Information Engineering, Chinese Academy of Sciences; 2. School of Cyber Security, Beijing University of Posts and Telecommunications, Beijing 100876, China; 3. Cyberspace Institute of Advanced Technology, Guangzhou University, Guangzhou 510006, China; 4. Institute of Electronic and Information Engineering of UESTC in Guangdong, Dongguan Guangdong 523808, China)

Abstract:

Identifying hidden service access behavior on the Tor darknet is an effective way to deanonymize Tor darknet users. Existing identification algorithms take the TCP packet sequence of the user's access process as input, extract features from it, and build a classification model, and they achieve good recognition results. However, in practical scenarios with large volumes of traffic to inspect and strict timeliness requirements for online identification, existing methods rely on long packet sequences, cannot identify the behavior early, and consume substantial memory and computing resources to sustain detection. To solve this problem, based on a detailed analysis of the circuit semantics and packet distribution of the Tor protocol, this paper proposes a method for recognizing Tor hidden service access behavior based on key sequence characteristics. It uses as input only a specific interval of the TCP packet sequence, located in the early stage of the access behavior and carrying important semantic distinction, to extract features and build the classification model. Compared with existing methods, it relies on a shorter TCP packet sequence, which improves identification timeliness and reduces hardware resource cost. To verify the effectiveness of the method, this paper builds an experimental dataset from traffic captured in a variety of real access scenarios and finely labels the circuit packet sequence intervals with Tor protocol-level semantics. The experimental results show that, across six access scenarios in two categories (direct connection to the Tor network and obfuscated connection), the network packets that contribute most to distinguishing hidden service access from other behaviors coincide closely with the key TCP sequence extracted in this paper, which verifies that this sequence has important semantic discrimination. Compared with existing work, the classification model built on features extracted from this key TCP sequence improves recognition precision and F1 score by 2%-3%, improves recognition timeliness by 27%-51%, and reduces the input feature sequence length of the recognition model by 78%-95%.
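The idea of feeding a classifier only a short, early interval of the packet sequence can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the trace encoding (signed integers, positive for outgoing and negative for incoming packet sizes), the helper names `key_interval` and `length_reduction`, and the interval bounds 4 to 24. The abstract does not give the actual interval or feature encoding, so this is not the authors' implementation.

```python
from typing import List, Sequence


def key_interval(packets: Sequence[int], start: int, end: int, pad: int = 0) -> List[int]:
    """Return packets[start:end], right-padded with `pad` to length end - start.

    `start` and `end` are hypothetical interval bounds, not values from the paper.
    """
    if not 0 <= start <= end:
        raise ValueError("need 0 <= start <= end")
    window = list(packets[start:end])
    # Short traces are padded so every sample has the same input length.
    window.extend([pad] * ((end - start) - len(window)))
    return window


def length_reduction(full_len: int, key_len: int) -> float:
    """Fraction by which the classifier input shrinks when only the key interval is used."""
    if full_len <= 0:
        raise ValueError("full_len must be positive")
    return 1.0 - key_len / full_len


# Toy trace of 100 packets (assumed encoding: sign = direction, magnitude = size).
trace = [586, -586, 586, -1172, 586, -586, -586, 586, -1758, 586] * 10
features = key_interval(trace, start=4, end=24)
print(len(features), length_reduction(len(trace), len(features)))
```

The fixed-length window is what a downstream classifier would consume; because the window ends early in the trace, a decision can be made before the rest of the flow has been observed or buffered.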

Key words: anonymity network; hidden service; encrypted traffic classification; deep learning; traffic obfuscation