Cite this article
  • ZHENG Xianchun, WANG Rui, YAN Haonan, ZHAO Xingwen, LI Hui, LI Fenghua. A High Performance Tor Web Content Monitoring System Based on Distributed Crawlers[J]. Journal of Cyber Security, accepted.
A High Performance Tor Web Content Monitoring System Based on Distributed Crawlers
ZHENG Xianchun1, WANG Rui1, YAN Haonan1, ZHAO Xingwen1, LI Hui1, LI Fenghua2
(1. School of Cyber Engineering, Xidian University, Xi’an, China; 2. State Key Laboratory of Information Security, Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China)
Abstract:
As the Internet has developed and spread, demand for information security properties such as security, anonymity, and censorship resistance has grown rapidly, and more and more people are paying attention to and studying the Tor anonymous communication network. Most existing work on Tor network content monitoring offers limited functionality and weak performance: it lacks crawlers designed specifically for the dark web, does not optimize Tor network connections, and relies on system designs that cannot guarantee stable and fast operation. In this paper, we design and develop a comprehensive system for dynamic content awareness and intelligence collection on the Tor network. The system consists of two parts, a data collection crawler and a web content classifier. The crawler uses a distributed architecture and innovatively optimizes the Tor connection link to improve crawling speed and stability, while the classifier uses natural language processing to classify the crawled web pages efficiently. We then tested the system in a real Tor network environment to evaluate and analyze its performance. Finally, a comparison with other existing frameworks shows that the proposed solution performs best in both the volume and the speed of crawling dark web domain names, web pages, and data.
Key words: Tor (onion routing), dark net, crawler, natural language processing
DOI:
Received: 2021-09-24; Revised: 2022-01-18
Funding: Key Program of the National Natural Science Foundation of China (No. 61732022), Technology Research Program of the Ministry of Public Security (No. 2019JSYJA01), Natural Science Foundation of Shaanxi Province (No. 2019ZDLGY12-02), Shaanxi Innovation Team (No. 2018TD-007)