A High Performance Tor Web Content Monitoring System Based on Distributed Crawlers

Zheng Xianchun; Wang Rui; Yan Haonan; Zhao Xingwen; Li Hui; Li Fenghua

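(School of Cyber Engineering, Xidian University, Xi'an 710126, China; School of Cyber Engineering, Xidian University, Xi'an 710126, China; State Key Laboratory of Integrated Services Networks, Xidian University, Xi'an 710071, China; State Key Laboratory of Information Security, Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China; School of Cyber Security, University of Chinese Academy of Sciences, Beijing 100049, China)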

Abstract:

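As networks have developed and spread, demand for information-security properties such as security, anonymity, and censorship resistance has grown rapidly, and more and more people have begun to follow and study the Tor anonymous communication network. Most existing work on Tor network content monitoring offers limited functionality and weak performance: for example, it lacks crawlers designed specifically for the dark web and suffers from slow network connections. This paper designs and develops a comprehensive system for dynamic awareness of Tor network content and intelligence collection, consisting of two parts: a data collection crawler and web page content classification. The crawler uses a distributed architecture comprising task management, crawler scheduling, web page download, page parsing, and data storage modules, and it optimizes the Tor connection circuit to improve crawling speed and stability. The classification part uses natural language processing techniques to train a model and classify the crawled information accurately and efficiently, addressing classification accuracy and complexity; the results are then used to analyze the content structure and sensitive information of the dark web. To keep the system running, we also designed a fault-tolerance module and an early-warning module that monitor the current state of every system component in real time and aggregate, collect, and display the system's status data. Finally, we tested the system in the real Tor network, evaluated and analyzed it in three respects (web page crawling results, content classification results, and system performance), and compared its functionality with seven existing frameworks from China and abroad. The comparison shows that the proposed scheme performs best in both the volume and the speed of crawling dark web domain names, web pages, and data.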

Key words: onion routing; dark web; crawler; natural language processing

DOI: 10.19363/J.cnki.cn10-1380/tn.2023.01.11

Received: 2021-09-24; Revised: 2022-01-18

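Funding: This work was supported by the Key Program of the National Natural Science Foundation of China (No. 61732022), the Technical Research Program of the Ministry of Public Security (No. 2019JSYJA01), the Natural Science Foundation of Shaanxi Province (No. 2019ZDLGY12-02), and the Shaanxi Innovation Team Project (No. 2018TD-007).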

A High Performance Tor Web Content Monitoring System Based on Distributed Crawlers

ZHENG Xianchun, WANG Rui, YAN Haonan, ZHAO Xingwen, LI Hui, LI Fenghua

School of Cyber Engineering, Xidian University, Xi'an 710126, China; School of Cyber Engineering, Xidian University, Xi'an 710126, China; State Key Laboratory of Integrated Services Networks, Xidian University, Xi'an 710071, China; State Key Laboratory of Information Security, Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China; School of Cyber Security, University of Chinese Academy of Sciences, Beijing 100049, China

Abstract:

With the development and spread of networks, demand for information-security properties such as security, anonymity, and censorship resistance is increasing rapidly, and more and more people are paying attention to and studying the Tor anonymous communication network. Most existing research on Tor network content monitoring offers few functions and weak performance: for example, it lacks a dedicated crawler designed for the dark web, and its network connections are slow. This paper designs and develops a comprehensive system for dynamic awareness of Tor network content and intelligence collection, comprising a data collection crawler and web content classification. The crawler uses a distributed architecture that includes a task management module, a crawler scheduling module, a web page download module, a page parsing module, and a data storage module, and it optimizes the Tor connection circuit to improve crawling speed and stability. The web content classification part uses natural language processing to train a model and classify the crawled information accurately and efficiently, addressing classification accuracy and complexity, and the results are then used to analyze the content structure and sensitive information of the dark web. We also designed a fault-tolerance module and an early-warning module to keep the system running; they monitor the current status of each system component in real time and aggregate, collect, and display the system's status data. Finally, we tested the system in the real Tor network, evaluated it in three respects (web page crawling results, content classification results, and system performance), and compared its functions with seven existing frameworks from China and abroad. The comparison shows that the proposed scheme performs best in both the volume and the speed of crawling dark web domain names, web pages, and data.

Key words: Tor; dark net; crawler; natural language processing
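
The module chain named in the abstract (task management, scheduling, download, parsing, storage) can be illustrated with a minimal single-process sketch. This is not the authors' implementation: the names `crawl`, `LinkParser`, `is_onion`, and the injected `fetch` downloader are hypothetical, the distributed aspect is omitted, and the example host `exampleaaa.onion` is made up. A real downloader would presumably send requests through a Tor SOCKS proxy; here a dictionary lookup stands in for it so the sketch runs offline.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkParser(HTMLParser):
    """Page parsing step: collects href targets and visible text."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())


def is_onion(url):
    """Return True when the URL's host is a Tor hidden service (.onion)."""
    host = urlparse(url).hostname or ""
    return host.endswith(".onion")


def crawl(seeds, fetch, max_pages=100):
    """Run schedule -> download -> parse -> store until the queue drains.

    `fetch` is a hypothetical downloader (an assumption of this sketch):
    it takes a URL and returns HTML text, or None on failure.
    """
    seeds = list(seeds)
    queue = deque(seeds)   # task management: pending URLs
    seen = set(seeds)      # scheduling: never enqueue a URL twice
    store = {}             # data storage: url -> extracted page text

    while queue and len(store) < max_pages:
        url = queue.popleft()
        html = fetch(url)  # web page download
        if html is None:
            continue       # crude fault tolerance: skip failed downloads
        parser = LinkParser()
        parser.feed(html)  # page parsing
        store[url] = " ".join(parser.text)
        for href in parser.links:
            link = urljoin(url, href)
            if is_onion(link) and link not in seen:
                seen.add(link)
                queue.append(link)
    return store


# Offline stand-in for the Tor network: two fake pages on a made-up host.
FAKE_PAGES = {
    "http://exampleaaa.onion/": '<a href="/market">Market</a> '
                                '<a href="http://clearnet.example.com/">out</a>',
    "http://exampleaaa.onion/market": "<p>Listings</p>",
}
pages = crawl(["http://exampleaaa.onion/"], FAKE_PAGES.get)
```

Injecting `fetch` keeps the network layer separate from scheduling and parsing, so the same loop could in principle be driven by a proxy-backed downloader or by several workers sharing one queue; the stored text is what a downstream classifier would consume.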