基于Tsallis熵的近似差分隐私K-means算法

杨舒丹; 李男; 郑文娟; 杜启明

引用本文：

杨舒丹,李男,郑文娟,杜启明.基于Tsallis熵的近似差分隐私K-means算法[J].信息安全学报,2023,8(4):113-125 [点击复制]
YANG Shudan,LI Nan,ZHENG WenJuan,DU Qiming.An approximate differential privacy K-means algorithm based on Tsallis entropy[J].Journal of Cyber Security,2023,8(4):113-125 [点击复制]

本文已被：浏览 2367次下载 2093次	码上扫一扫！
基于Tsallis熵的近似差分隐私K-means算法
杨舒丹^1,2, 李男^1,2, 郑文娟³, 杜启明¹
0 字体:加大+\|默认\|缩小-
(1.战略支援部队信息工程大学网络空间安全学院郑州中国 450000;2.数学工程与先进计算国家重点实验室郑州中国 450000;32142 部队保定中国 071000)

摘要:

利用K-means算法对用户信息进行聚类时,存在隐私泄露的风险。差分隐私保护技术可提供严格的隐私保护,但目前大多数满足差分隐私的K-means算法在处理多维数据时,存在随机选择质心和噪声添加不均衡的问题,因而导致聚类结果不理想。为此,本文提出一种基于Tsallis熵的近似差分隐私K-means算法。针对质心选择的随机性问题,提出Tsallis熵对属性赋权的策略来优化对象间的欧氏距离,然后对比各对象到唯一随机初始质心的赋权欧式距离来确定其余初始质心,使算法在减少随机选择初始质心的同时,提高模型准确率;在此基础上,针对噪声添加不均衡的问题,提出一种能够平衡信噪比的隐私预算分配策略,然后对迭代质心加入高斯扰动,使算法在不增加计算复杂度的情况下满足(ε,δ)- 差分隐私保护,同时提升扰动结果的准确性;最后在四个真实数据集上对算法进行有效性评价。实验结果表明,所提出的算法能够在保证用户隐私安全的同时实现高效用的聚类。

关键词: 近似差分隐私|高斯机制|Tsallis熵|K-means聚类|数据挖掘

DOI：10.19363/J.cnki.cn10-1380/tn.2023.07.08

投稿时间：2022-01-01修订日期：2022-03-08

基金项目:本课题得到国家自然科学基金资助项目(No. 62472447)资助。

An approximate differential privacy K-means algorithm based on Tsallis entropy

YANG Shudan^1,2, LI Nan^1,2, ZHENG WenJuan³, DU Qiming¹

(1.Department of Cyberspace Security Academy, Strategic Support Force Information Engineering University, Zhengzhou 450000, China;2.State Key Laboratory of Mathematical Engineering and Advanced Computing, Zhengzhou 450000, China;3.People's Liberation Army 32142 Unit, Baoding 071000, China)

Abstract:

When using the K-means algorithm to cluster user information, there is a risk of privacy leakage. Differential privacy protection technology can provide strict privacy protection. However, most of the current K-means algorithms that meet differential privacy have problems of random selection of centroids and imbalanced noise addition when processing multi-dimensional data, which leads to unsatisfactory clustering results. To this end, this paper proposes an approximate differential privacy K-means algorithm based on Tsallis entropy. Aiming at the randomness problem of centroid selection, a strategy of attribute weighting by Tsallis entropy is proposed to optimize the Euclidean distance between objects, and then the weighted Euclidean distance between each object and the unique random initial centroid is compared to determine the remaining initial centroids. This allows the algorithm to reduce the random selection of the initial centroid while improving the accuracy of the model; on this basis, a privacy budget allocation strategy that can balance the signal-to-noise ratio is proposed for the problem of noise imbalance, and then add Gaussian perturbation to the iterative centroid, so that the algorithm satisfies (ε,δ)- differential privacy protection without increasing the time complexity, and at the same time improves the accuracy of the perturbation results; finally, the effectiveness of the algorithm is evaluated on four real data sets. Experimental results show that the proposed algorithm can achieve efficient clustering while ensuring user privacy.

Key words: approximate differential privacy|Gaussian mechanism|Tsallis entropy|K-means clustering|data mining