Third-party Library Traffic Identification Based on Federated Learning

Cui Huajun; Meng Guozhu; Li Yueqi; Zhang Yan; Dai Yueyue; Yang Huiran; Zhu Dali; Wang Weiping

Cite this article:

Cui Huajun, Meng Guozhu, Li Yueqi, Zhang Yan, Dai Yueyue, Yang Huiran, Zhu Dali, Wang Weiping. A Third-party Library Traffic Identification Framework Using Federated Learning[J]. Journal of Cyber Security, 2023, 8(3): 128-145

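Cui Huajun^1,2, Meng Guozhu^1,2, Li Yueqi^1,2, Zhang Yan^1,2, Dai Yueyue^3, Yang Huiran^1,2, Zhu Dali^1,2, Wang Weiping^1,2
(1. Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China; 2. School of Cyber Security, University of Chinese Academy of Sciences, Beijing 100049, China; 3. School of Cyber Science and Engineering, Huazhong University of Science and Technology, Hubei 430074, China)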

Abstract:

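Third-party libraries (TPLs) have become an important part of mobile application development. Developers typically integrate TPLs into their apps to implement specific functions such as advertising, message push, and mobile payment, which improves development efficiency and reduces development costs. However, because a TPL shares the same system permissions as the mobile application that hosts it (the host app), and developers know little about the security risks of TPLs themselves, security incidents caused by TPLs have become frequent in recent years and have seriously troubled the public with information and privacy security problems. TPL traffic identification is important for fine-grained traffic management and security threat detection. It is a key capability for determining security responsibility between host apps and TPLs, and an important detection method for promoting the secure and compliant development of TPLs. However, existing research on TPLs focuses mainly on TPL detection and TPL-induced privacy leakage, and studies on TPL traffic identification are very rare. To this end, this paper proposes and implements LibCapture, a framework for TPL traffic identification. First, the framework designs a method for automatically generating TPL encrypted-traffic datasets based on dynamic hooking and TPL detection techniques. Second, to address privacy protection and data sharing, it builds a federated learning model based on convolutional neural networks to identify TPL traffic. Finally, traffic tests on 2327 real-world apps show that the proposed framework achieves high traffic identification accuracy. In addition, this paper analyzes the specific impact that differences in the local sample data of federated learning participants have on global model aggregation, and points out directions for further research in different scenarios.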

Key words: encrypted traffic identification|third-party library|federated learning|dynamic hooking

DOI: 10.19363/J.cnki.cn10-1380/tn.2023.05.09

Received: 2022-09-06; Revised: 2022-11-18

Funding: This work was supported by the National Key Research and Development Program of China (No. 2019YFB1005205).

A Third-party Library Traffic Identification Framework Using Federated Learning

Cui Huajun^1,2, Meng Guozhu^1,2, Li Yueqi^1,2, Zhang Yan^1,2, Dai Yueyue^3, Yang Huiran^1,2, Zhu Dali^1,2, Wang Weiping^1,2

(1. Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China; 2. School of Cyber Security, University of Chinese Academy of Sciences, Beijing 100049, China; 3. School of Cyber Science and Engineering, Huazhong University of Science and Technology, Hubei 430074, China)

Abstract:

Third-party libraries (TPLs) have become a vital component of mobile app development. Developers usually integrate TPLs into apps to implement specific functions such as advertising, message pushing, and mobile payment, thereby improving development efficiency and reducing research and development costs. However, because a TPL shares the same system permissions as its host application, and developers often lack an understanding of the security risks of TPLs, security incidents caused by TPLs have occurred frequently in recent years, creating serious information security and privacy leakage problems for the public. TPL traffic identification is of great significance for fine-grained traffic management and security threat detection. It is an important capability for determining security responsibilities between host apps and TPLs, and a critical detection method for promoting the secure and compliant development of TPLs. Unfortunately, existing studies on TPLs mainly focus on TPL detection and privacy leakage caused by TPLs. To the best of our knowledge, there is little research on TPL traffic identification. To this end, we propose a new framework, named LibCapture, to identify TPL traffic. First, the framework provides a method to automatically generate TPL encrypted traffic datasets based on dynamic hooking and TPL detection techniques. Second, to address privacy protection and data sharing, we propose a CNN-based federated learning (FL) model to identify TPL traffic. Finally, we apply our framework to 2327 real-world apps. The results show that it achieves high TPL traffic identification accuracy and that FL achieves accuracy similar to that of the non-FL method. In addition, to study how participants' local datasets influence the global model during model aggregation, we analyze the impact on the global model when the participants in FL have different local datasets, and point out further research directions for different scenarios.

Key words: encrypted traffic identification|third-party library|federated learning|dynamic hooking
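
Illustrative note: the abstract refers to global model aggregation in federated learning but does not state which aggregation rule LibCapture uses. The sketch below assumes plain federated averaging (FedAvg), in which each participant's parameters are weighted by its share of the total training samples. The function name `fed_avg`, the flattened parameter vectors, and the sample counts are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def fed_avg(client_weights: Sequence[Sequence[float]],
            client_sizes: Sequence[int]) -> List[float]:
    """Sample-size-weighted average of flattened client parameter vectors.

    Assumption: FedAvg-style aggregation; the paper's actual rule may differ.
    """
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need at least one client and one sample count per client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total number of samples must be positive")
    dim = len(client_weights[0])
    if any(len(w) != dim for w in client_weights):
        raise ValueError("all clients must report vectors of the same length")

    global_weights = [0.0] * dim
    for weights, n in zip(client_weights, client_sizes):
        share = n / total  # this client's fraction of all training samples
        for i, w in enumerate(weights):
            global_weights[i] += share * w
    return global_weights


if __name__ == "__main__":
    # Two hypothetical participants with unequal amounts of local traffic data.
    clients = [[1.0, 0.0], [0.0, 1.0]]
    print(fed_avg(clients, [300, 100]))  # [0.75, 0.25]
```

Under this assumed rule, a participant holding more local samples pulls the global model further toward its own parameters, which is one way that differences between participants' local datasets can shape the aggregated model.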