基于组件分割的钓鱼URL检测方法

钟文康; 王添; 张功萱

Cite this article:

钟文康,王添,张功萱.基于组件分割的钓鱼URL检测方法[J].信息安全学报,2025,10(1):130-142
ZHONG Wenkang, WANG Tian, ZHANG Gongxuan. Phishing URL Detection Method Based on Component Segmentation[J]. Journal of Cyber Security, 2025, 10(1): 130-142

钟文康¹, 王添², 张功萱³
(1. 网络空间安全学院, 南京理工大学, 南京 210094, 中国; 2. 信息工程学院, 江苏财会职业学院, 连云港 222061, 中国; 3. 计算机科学与工程学院, 南京理工大学, 南京 210094, 中国)

摘要:

URL作为钓鱼网站最直接也是最重要的特征,利用深度学习的方法对分词后的URL字符序列进行特征提取,可以极大的提升基于URL的钓鱼网站识别的准确率。将URL按照不同组件进行分割是URL常见的分词手段,该方法能够对不同组件进行多粒度的特征判别,但是这一方法未能在钓鱼网站的URL检测中得到有效应用,尚缺乏深入的研究。此外,现有的基于深度学习的钓鱼网站URL检测方法由于实验数据以及模型训练方法上的局限性,在泛化能力和误报率方面仍存在不足,难以满足真实环境中复杂的识别需求。为解决上述问题,本文提出了一种基于组件分割的钓鱼URL检测方法:(1)该方法首先对URL的不同组件进行分割,并对各组件依次进行字符级分词、截断填充及编码,使得深度学习模型能够对不同组件采取不同层级的管理从而进行细粒度的特征判别。(2)为了避免卷积神经网络中采用的池化策略过于关注局部特征而忽视特征整体空间结构的问题,本文所提方法将对融合后的各组件特征利用胶囊网络进一步提取。(3)在模型训练方法中引入对抗训练机制,对多嵌入层进行独立对抗训练,以满足模型对各组件的差异化处理,从而进一步提升模型的泛化能力。最后,在百万级的样本数据集中,与现有的最先进的同类方法相比,所提方法在钓鱼URL的识别准确率上提升0.86%,误报率降低1.08%,F1-Score提升0.95%。

关键词: 钓鱼URL检测; 胶囊网络; 对抗训练; 数据处理; 深度学习

DOI：10.19363/J.cnki.cn10-1380/tn.2025.01.10

Received: 2023-02-27; Revised: 2023-04-25

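Funding: This work was supported by the National Natural Science Foundation of China (No.62272232), the Natural Science Foundation of Jiangsu Province, Youth Fund (No.SBK2024041254), the Natural Science Research of Jiangsu Higher Education Institutions, General Program (No.24KJB520002), the Lianyungang Science and Technology Program (No.JCYJ2328), and the Research Start-up Fund of Jiangsu College of Finance & Accounting (No.2023GC06).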

Phishing URL Detection Method Based on Component Segmentation

ZHONG Wenkang¹, WANG Tian², ZHANG Gongxuan³

(1.School of Cyber Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China;2.School of Information Engineering, Jiangsu College of Finance & Accounting, Lianyungang 222061, China;3.School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China)

Abstract:

The URL is the most direct and most important feature of a phishing website, and applying deep learning to extract features from tokenized URL character sequences can greatly improve the accuracy of URL-based phishing website detection. Splitting a URL into its components is a common URL tokenization approach that lets a model discriminate the features of different components at multiple granularities, but it has not been applied effectively to phishing URL detection and has not yet been studied in depth. In addition, because of limitations in experimental data and model training methods, existing deep learning-based phishing URL detection methods still fall short in generalization ability and false alarm rate, and struggle to meet the complex detection needs of real-world environments. To address these problems, this paper proposes a phishing URL detection method based on component segmentation: (1) The method first segments the URL into its components and then applies character-level tokenization, truncation and padding, and encoding to each component in turn, so that the deep learning model can treat different components at different levels and discriminate their features at a fine granularity. (2) To avoid the problem that the pooling strategy used in convolutional neural networks (CNNs) focuses too much on local features and ignores the overall spatial structure of the features, the proposed method uses a capsule network (CapsNet) to further extract features from the fused component features. (3) An adversarial training mechanism is introduced into model training, with independent adversarial training applied to the multiple embedding layers so that the model can handle each component differently, further improving the generalization ability of the model.
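The preprocessing in step (1) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not state which components are used, the per-component lengths, or the character vocabulary, so `MAX_LEN`, `ALPHABET`, the `PAD`/`UNK` ids, and all function names below are assumptions chosen for the example. Component splitting relies on the standard-library `urllib.parse.urlsplit`.

```python
from typing import Dict, List
from urllib.parse import urlsplit

# Hypothetical settings: the abstract gives neither lengths nor the alphabet.
MAX_LEN = {"scheme": 8, "host": 32, "path": 48, "query": 32, "fragment": 16}
PAD, UNK = 0, 1  # ids reserved for padding and out-of-vocabulary characters
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-._~:/?#[]@!$&'()*+,;=%"
CHAR_TO_ID = {ch: i + 2 for i, ch in enumerate(ALPHABET)}


def split_components(url: str) -> Dict[str, str]:
    """Split a URL into named components (an assumed five-way split)."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "host": parts.netloc,
        "path": parts.path,
        "query": parts.query,
        "fragment": parts.fragment,
    }


def encode_component(text: str, max_len: int) -> List[int]:
    """Character-level tokenization, truncation, encoding, then right-padding."""
    ids = [CHAR_TO_ID.get(ch, UNK) for ch in text.lower()[:max_len]]
    return ids + [PAD] * (max_len - len(ids))


def encode_url(url: str) -> Dict[str, List[int]]:
    """Encode every component to a fixed-length id sequence of its own length."""
    return {
        name: encode_component(text, MAX_LEN[name])
        for name, text in split_components(url).items()
    }
```

Each component ends up as its own fixed-length integer sequence, so a model could feed each one to a separate embedding layer, which is the kind of per-component treatment that steps (1) and (3) describe.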
Experiments on a dataset of millions of samples show that, compared with existing state-of-the-art methods of the same kind, the proposed method improves phishing URL detection accuracy by 0.86%, reduces the false alarm rate by 1.08%, and improves the F1-Score by 0.95%.
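For reference, the three reported metrics can be computed from a binary confusion matrix as sketched below. This assumes the common definitions, with "false alarm rate" taken to mean the false positive rate FP / (FP + TN) and phishing treated as the positive class; the abstract does not spell these out. The function name is hypothetical and any counts passed to it are illustrative, not the paper's results.

```python
from typing import Dict


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Accuracy, false alarm rate (assumed to be FPR), and F1 from counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    # Share of benign URLs that are wrongly flagged as phishing.
    false_alarm_rate = fp / (fp + tn) if (fp + tn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "false_alarm_rate": false_alarm_rate, "f1": f1}
```

Calling `binary_metrics` once per method on the same test split gives values whose differences can be compared directly with the improvements quoted above.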

Key words: phishing URL detection; capsule networks; adversarial training; data processing; deep learning