Differentially Private Data Sharing and Publishing in Machine Learning: Techniques, Applications, and Challenges

HU Aoting; HU Aiqun; HU Yun; LI Guyue; HAN Jinguang

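(School of Cyber Science and Engineering, Southeast University, Nanjing 211189, China; School of Information Science and Engineering, Southeast University, Nanjing 210096, China; Purple Mountain Laboratories for Network and Communication Security, Nanjing 211111, China; School of Cyber Science and Engineering, Southeast University, Nanjing 211189, China; Purple Mountain Laboratories for Network and Communication Security, Nanjing 211111, China; Key Laboratory of Electronic Commerce of Jiangsu Province, Nanjing University of Finance and Economics, Nanjing 210023, China)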

Abstract:

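In recent years, machine-learning-based data analysis and data publishing have become active research topics. Compared with traditional data analysis techniques, machine learning can accurately analyze the structure and patterns of large-scale data. However, the privacy and security problems of machine-learning-based data analysis are increasingly prominent, and incidents in which machine learning models leak private information from users' training data occur frequently. For example, membership inference attacks reveal whether a given record was part of the training data, and attribute inference attacks reveal private attributes of the training set. Differential privacy, a common technique in traditional data privacy protection, is being integrated into machine learning to protect user privacy. However, studies that span privacy protection, machine learning, and attacks on machine learning remain scarce. This paper makes the following contributions. First, it surveys the development of differential privacy, including the definitions, properties, and implementation mechanisms of its common variants, and illustrates the application scenarios of several mechanisms with examples. In addition, it discusses in detail the recent Rényi differential privacy definition and the Moment Accountant technique for privacy composition. Second, it summarizes the common privacy threat models in machine learning, concrete privacy attacks, and how well differential privacy resists each kind of attack. Third, taking the common discriminative and generative models of machine learning as examples, it explains how differential privacy is applied to protect machine learning models, including differentially private stochastic gradient descent (DP-SGD) and differentially private knowledge transfer (PATE). Finally, it discusses several research directions and open problems for differential privacy mechanisms in machine learning.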

Keywords: privacy preservation; differential privacy; machine learning; data sharing

DOI：10.19363/J.cnki.cn10-1380/tn.2022.07.01

Received: 2021-04-27; Revised: 2021-06-02

Funding: This work was supported by the Young Scientists Fund of the National Natural Science Foundation of China (No. 61801115).

Differentially Private Data Sharing and Publishing in Machine Learning: Techniques, Applications, and Challenges

HU Aoting,HU Aiqun,HU Yun,LI Guyue,HAN Jinguang

School of Cyber Science and Engineering, Southeast University, Nanjing 211189, China;School of Information Science and Engineering, Southeast University, Nanjing 210096, China;Purple Mountain Laboratories for Network and Communication Security, Nanjing 211111, China;School of Cyber Science and Engineering, Southeast University, Nanjing 211189, China;Purple Mountain Laboratories for Network and Communication Security, Nanjing 211111, China;Key Laboratory of Electronic Commerce of Jiangsu Province, Nanjing University of Finance and Economics, Nanjing 210023, China

Abstract:

Recently, machine learning (ML)-based data analysis and data publishing techniques have become hot research topics. Compared with traditional data analysis techniques, machine learning can accurately analyze the structure and patterns of data. However, the privacy leakage issues of machine learning have become increasingly prominent: incidents in which the output predictions of machine learning models leak private information about users in the training data happen frequently. For instance, a Membership Inference Attack (MIA) reveals whether a record participated in the training data merely by observing the model's predictions. With the same information, an Attribute Inference Attack (AI) reveals private attributes of the training data. Differential Privacy (DP), a de facto standard for achieving privacy, is being incorporated into machine learning to protect user privacy. However, comprehensive studies at the intersection of privacy-preserving technology, machine learning, and machine learning attacks are relatively rare. This paper makes the following contributions. First, we conduct an in-depth investigation and analysis of the development of differential privacy, including the definitions, properties, and implementation mechanisms of its common variants, followed by concrete examples that illustrate the scenarios in which different variants of differential privacy are implemented. The analysis also covers state-of-the-art variants and techniques, namely Rényi Differential Privacy (RDP) and the Moment Accountant (MA) privacy composition technique. Second, we discuss in detail the threat models, the common privacy-related attacks, and differential privacy defenses in the field of machine learning. Third, taking the common discriminative and generative models of machine learning as examples, we explain how differential privacy is applied to protect machine learning models, including Differentially Private Stochastic Gradient Descent (DP-SGD) and Private Aggregation of Teacher Ensembles (PATE). Finally, we identify open problems and research directions with respect to leveraging differential privacy techniques to protect the privacy of deployed machine learning models.

Key words: privacy-preserving; differential privacy; machine learning; data sharing