Research on an Ensemble Classification Algorithm under Differential Privacy

JIA Junjie; QIU Wanyong; MA Huifang

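(College of Computer Science and Engineering, Northwest Normal University, Lanzhou 730070, China; College of Computer Science and Engineering, Northwest Normal University, Lanzhou 730070, China; Guangxi Key Laboratory of Trusted Software, Guilin University of Electronic Technology, Guilin 541004, China)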

Abstract:

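Privacy protection in machine learning is currently one of the active research topics in information security. To address classification under privacy protection, this paper proposes an AdaBoost ensemble classification algorithm based on differential privacy: CART-DPsAdaBoost (CART-Differential Privacy structure of AdaBoost). The algorithm incorporates the basic idea of Bagging into the Boosting process to increase the diversity of the sampled data. In the feature perturbation step, which is based on the random subspace algorithm, it uses the exponential mechanism to select split points for continuous features and the Gini index to select the best discrete feature, builds CART boosting trees as the base classifiers of the ensemble, and adds noise according to the Laplace mechanism. The privacy budget is allocated across the whole algorithm so that the differential privacy requirement is met. The experiments analyze how the privacy level affects the ensemble classification model at different tree depths and derive the optimal tree depth and privacy budget range. Unlike comparable algorithms, the method does not require discretizing the data in preprocessing. Experimental results on two datasets, Adult and Census Income, show that the model achieves good classification accuracy while balancing privacy and utility. In addition, the two randomization schemes, sample perturbation and feature perturbation, make the method effective for large-scale, high-dimensional classification problems.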
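The exponential-mechanism step for continuous split points can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `gini_utility` and `choose_split`, the count-weighted Gini utility, the fixed candidate grid, and the sensitivity value passed in the usage lines are all assumptions made for illustration. Candidate thresholds are taken from an assumed public value range rather than from the data, so that the output range of the mechanism does not depend on the records.

```python
import numpy as np


def gini_utility(x, y, threshold):
    """Negative count-weighted Gini impurity of the split x <= threshold.

    Higher is better; a perfectly pure split scores 0.
    (Illustrative utility function, assumed for this sketch.)
    """
    total = 0.0
    for mask in (x <= threshold, x > threshold):
        n = int(mask.sum())
        if n == 0:
            continue  # an empty side contributes no impurity
        _, counts = np.unique(y[mask], return_counts=True)
        p = counts / n
        total += n * (1.0 - np.sum(p ** 2))
    return -total


def choose_split(x, y, candidates, epsilon, sensitivity, rng):
    """Sample a threshold with probability proportional to
    exp(epsilon * utility / (2 * sensitivity))."""
    utilities = np.array([gini_utility(x, y, t) for t in candidates])
    scores = epsilon * utilities / (2.0 * sensitivity)
    scores -= scores.max()  # shift for numerical stability; probabilities are unchanged
    probs = np.exp(scores)
    probs /= probs.sum()
    return rng.choice(candidates, p=probs), probs


# Toy data: the two classes separate anywhere between 3 and 10.
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0, 0, 0, 1, 1, 1])
# Candidate thresholds come from an assumed public value range, not from the data.
candidates = np.arange(0.5, 13.0, 1.0)
rng = np.random.default_rng(0)
# sensitivity=2.0 is a placeholder assumption for this utility, not a value from the paper.
threshold, probs = choose_split(x, y, candidates, epsilon=1.0, sensitivity=2.0, rng=rng)
```

A smaller `epsilon` flattens `probs` toward a uniform choice over the candidates, while a larger one concentrates the probability on the thresholds with the lowest impurity.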

Keywords: privacy protection; differential privacy; machine learning; AdaBoost; CART classification tree

DOI: 10.19363/J.cnki.cn10-1380/tn.2021.07.07

Submitted: 2020-10-18; Revised: 2020-12-18

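Funding: This work was supported by the National Natural Science Foundation of China (No. 61967013) and the Gansu Province Higher Education Innovation Ability Improvement Project (No. 2019A-006).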

Research on an Ensemble Classification Algorithm under Differential Privacy

JIA Junjie, QIU Wanyong, MA Huifang

College of Computer Science and Engineering, Northwest Normal University, Lanzhou 730070, China; College of Computer Science and Engineering, Northwest Normal University, Lanzhou 730070, China; Guangxi Key Laboratory of Trusted Software, Guilin University of Electronic Technology, Guilin 541004, China

Abstract:

Privacy protection in machine learning is currently an active research topic in information security. To address classification under privacy protection, this paper proposes an AdaBoost ensemble classification algorithm based on differential privacy: CART-DPsAdaBoost (CART-Differential Privacy structure of AdaBoost). The algorithm incorporates the idea of Bagging into the Boosting process to increase the diversity of the sampled data. In the feature perturbation step, which is based on the random subspace algorithm, the exponential mechanism selects split points for continuous attributes and the Gini index selects the best discrete attribute; the resulting CART boosting trees serve as base classifiers for ensemble learning, and noise is added according to the Laplace mechanism. The privacy budget is allocated across the whole algorithm so that the differential privacy requirement is met. The experiments analyze the impact of the privacy level on the ensemble classification model at different tree depths and identify the optimal tree depth and privacy budget range. Unlike similar algorithms, this method does not require discretizing the data in preprocessing. Experimental results on the Adult and Census Income datasets show that the model achieves good classification accuracy while balancing privacy and utility. Moreover, the two randomization schemes, sample perturbation and feature perturbation, allow the method to handle large-scale, high-dimensional classification problems effectively.
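The Laplace noise step and the budget allocation can be sketched in Python as follows. The abstract does not state how the budget is divided or where exactly the noise is applied, so everything here is an assumption for illustration: the helper names `split_budget` and `noisy_class_counts` are hypothetical, the budget is split evenly across trees and across tree levels, and the noise is added to per-leaf class counts, whose sensitivity is 1 when one record is added or removed.

```python
import numpy as np


def split_budget(total_epsilon, n_trees, max_depth):
    """Even budget split (an assumption, not the paper's scheme):
    trees compose sequentially, and each tree spreads its share
    over max_depth + 1 levels (internal levels plus the leaves)."""
    eps_tree = total_epsilon / n_trees
    eps_level = eps_tree / (max_depth + 1)
    return eps_tree, eps_level


def noisy_class_counts(labels, n_classes, epsilon, rng):
    """Laplace mechanism on a leaf's class histogram.

    Adding or removing one record changes one count by 1,
    so the noise scale is 1 / epsilon.
    """
    counts = np.bincount(labels, minlength=n_classes).astype(float)
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon, size=n_classes)
    return counts + noise


# Labels of the records that reached one leaf (toy data).
labels = np.array([0, 0, 1, 0, 1, 1, 1])
rng = np.random.default_rng(0)
eps_tree, eps_level = split_budget(total_epsilon=1.0, n_trees=10, max_depth=4)
noisy = noisy_class_counts(labels, n_classes=2, epsilon=eps_level, rng=rng)
predicted = int(np.argmax(noisy))  # the leaf predicts the class with the largest noisy count
```

With many trees or deep trees, the per-level budget becomes small and the noise large, which is consistent with the abstract's point that tree depth and privacy budget must be tuned together.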

Key words: privacy protection; differential privacy; machine learning; AdaBoost; CART classification tree