Cite this article:
- Wang Xiaojuan, Chen Jinyin, Zheng Haibin, Chen Jingwen. Contrastive Learning and Self-Distillation based Black-box Watermark Removal Attack for Deep Models[J]. Journal of Cyber Security, Accepted
Abstract:
With the wide application of deep neural networks in natural language processing, computer vision, and related fields, the development cost and commercial value of models have grown considerably. Protecting the intellectual property of deep models, that is, preventing them from being illegally copied, tampered with, or stolen, has therefore become a key concern for companies and researchers. Watermarking is an effective copyright-protection technique: by embedding a unique identity signature into a model, it keeps the model's ownership and provenance traceable. However, existing studies have mounted watermark removal attacks to enable model piracy, and these attacks still face practical challenges. In black-box settings they contend with limited data, missing labels, and weak attack performance; in particular, without sufficient labeled data, existing methods struggle to remove the watermark effectively while preserving the model's primary-task performance. To address these challenges, this paper proposes a black-box watermark removal attack that combines contrastive learning with a self-distillation mechanism. The method constructs pseudo-labeled samples from the model's output confidence and reconstructs watermark triggers via feature disentanglement, effectively separating normal samples from watermark features. It further introduces module-level feature self-distillation and label self-distillation strategies, using multi-level knowledge alignment to weaken the influence of watermark features while maintaining primary-task performance. Experimental results show that, under four mainstream watermarking schemes, the proposed method reduces the watermark verification success rate by more than 80% on average while keeping the primary-task performance loss within 3%. Compared with existing watermark removal methods, it achieves a higher attack success rate and stronger robustness, especially under limited-data and label-scarce conditions.
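The abstract describes constructing pseudo-labeled samples from the black-box model's output confidence. The sketch below shows one minimal way such selection could look in PyTorch; `query_fn`, the 0.9 threshold, and the loader interface are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def build_pseudo_labels(query_fn, unlabeled_loader, threshold=0.9):
    # Query the black-box model on unlabeled data and keep only the
    # samples whose top-1 confidence clears the threshold, using the
    # model's own prediction as the pseudo label.
    kept_x, kept_y = [], []
    for x in unlabeled_loader:
        probs = F.softmax(query_fn(x), dim=1)  # query_fn: remote model returning logits
        conf, pred = probs.max(dim=1)
        mask = conf >= threshold
        kept_x.append(x[mask])
        kept_y.append(pred[mask])
    return torch.cat(kept_x), torch.cat(kept_y)
```

High-confidence predictions are the ones most likely to reflect the main task rather than watermark behavior, which makes confidence filtering a natural selection rule in this setting.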
Key words: watermark attack, contrastive learning, self-distillation, black-box model
DOI:
Received: 2025-09-14  Revised: 2025-12-07
Foundation item: National Natural Science Foundation of China (General Program, Key Program, Major Program)
| Contrastive Learning and Self-Distillation based Black-box Watermark Removal Attack for Deep Models |
Wang Xiaojuan, Chen Jinyin, Zheng Haibin, Chen Jingwen
(Zhejiang University of Technology)
Abstract:
With the rapid proliferation of deep neural networks (DNNs) across natural language processing, computer vision, and a broad spectrum of intelligent applications, the development cost, engineering effort, and commercial value associated with trained models have risen dramatically. Consequently, ensuring effective intellectual property (IP) protection, particularly preventing unauthorized replication, redistribution, or malicious tampering, has become a central concern for both industry practitioners and academic researchers. Model watermarking has emerged as a promising and practical paradigm in this context, enabling the embedding of unique ownership signatures into model parameters so that provenance, copyright, and accountability can be verified even after deployment. Despite these advantages, a growing body of work has demonstrated that watermarking techniques remain vulnerable to removal attacks. In realistic black-box scenarios, attackers must often operate with limited data and missing labels, conditions under which existing removal methods perform poorly and are difficult to apply effectively. Moreover, many current approaches suffer from a severe trade-off: attempts to erase a watermark frequently lead to substantial degradation of the model's primary task performance, which undermines their practical usability. To address these challenges, this paper introduces a novel black-box watermark removal framework that integrates contrastive learning, feature disentanglement, and multi-level self-distillation to achieve effective and reliable watermark erasure under limited supervision. Our method first constructs pseudo-labeled samples based on model output confidence and employs a feature decoupling mechanism to reconstruct and isolate watermark-related triggers from normal task features. Building on this separation, we further design module-level feature self-distillation and label self-distillation strategies that align multi-scale representations and output semantics, progressively suppressing watermark-induced characteristics while preserving the model's core functional knowledge. Extensive experiments conducted on four mainstream watermarking schemes demonstrate the effectiveness of the proposed approach. The watermark verification success rate is reduced by more than 80% on average, while the primary task accuracy remains within a 3% performance drop. Compared with state-of-the-art baselines, our method exhibits higher attack success rates, superior robustness in data- and label-scarce environments, and enhanced practical applicability in real-world black-box settings.
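The framework's contrastive component separates watermark-related triggers from normal task features. As a generic illustration of such a separation objective, here is an InfoNCE-style loss over two augmented views of a batch; the function name, the two-view setup, and the temperature value are assumptions made for this sketch, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def contrastive_separation_loss(feat_a, feat_b, temperature=0.5):
    # InfoNCE over two views of the same batch: row i of feat_a and
    # row i of feat_b form a positive pair; every other row in the
    # batch acts as a negative. Pulling views of the same clean sample
    # together pushes dissimilar (e.g., watermark-like) features apart
    # in the embedding space.
    z_a = F.normalize(feat_a, dim=1)
    z_b = F.normalize(feat_b, dim=1)
    logits = z_a @ z_b.t() / temperature  # pairwise cosine similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return F.cross_entropy(logits, targets)
```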
Key words: Watermark Attack, Contrastive Learning, Self-Distillation, Black-Box Model
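The abstract names two self-distillation signals, module-level feature alignment and label (output) alignment. The combined sketch below assumes the attacker fine-tunes a local surrogate whose forward pass can also return intermediate module features; `return_features=True` is a hypothetical interface, and `alpha`, `beta`, and the temperature `tau` are illustrative weights rather than values from the paper.

```python
import torch
import torch.nn.functional as F

def self_distillation_loss(student, teacher, x, alpha=1.0, beta=1.0, tau=4.0):
    # `teacher` is a frozen snapshot of the model taken before the
    # removal fine-tuning starts; `student` is the model being updated.
    with torch.no_grad():
        t_feats, t_logits = teacher(x, return_features=True)  # assumed API
    s_feats, s_logits = student(x, return_features=True)

    # Module-level feature self-distillation: align each intermediate
    # module's output with the frozen snapshot's.
    feat_loss = sum(F.mse_loss(s, t) for s, t in zip(s_feats, t_feats))

    # Label self-distillation: match softened output distributions,
    # with the standard tau^2 scaling on the KL term.
    label_loss = F.kl_div(
        F.log_softmax(s_logits / tau, dim=1),
        F.softmax(t_logits / tau, dim=1),
        reduction="batchmean",
    ) * (tau * tau)

    return alpha * feat_loss + beta * label_loss
```

In this reading, both terms transfer knowledge on clean pseudo-labeled data only, so main-task behavior is reinforced while watermark-specific behavior is not; the actual weighting between the two terms is a design choice the abstract does not specify.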