Cite this article
  • Zhang Xinze, Li Shuai, Miao Li. Adaptive Policy Learning for Large Language Model Jailbreaking via Deep Q-Networks and Dynamic Strategy Retrieval [J]. Journal of Cyber Security, accepted.


DOI:
Received: 2025-06-11; Revised: 2025-11-11
Funding:
Adaptive Policy Learning for Large Language Model Jailbreaking via Deep Q-Networks and Dynamic Strategy Retrieval
Zhang Xinze, Li Shuai, Miao Li
(Ningxia University)
Abstract:
The rapid advancement of large language models (LLMs) has made them increasingly susceptible to jailbreak attacks, in which malicious users manipulate the models into generating harmful, biased, or otherwise restricted content. Current defense paradigms are primarily challenged by two types of attacks: single-step attacks that rely on triggering predefined safety responses, and inefficient multi-round attack methods that use iterative prompts generated by other LLMs. Both approaches exhibit significant limitations in efficiency and adaptability. To address these challenges, we introduce the Deep Q-Networks and Dynamic Strategy Retrieval (DDSR) black-box attack. The core innovation of our framework lies in formulating the LLM jailbreaking process as a sequential decision-making problem. Leveraging a trained deep reinforcement learning agent, DDSR automatically selects the optimal attack strategy from a dynamic repository according to the malicious query at hand and continuously evolves within the interactive jailbreaking environment. This mechanism establishes a closed loop of automated learning, evaluation, and optimization of attack strategies, substantially reducing reliance on manually engineered attack prompts; this continuous adaptation effectively compromises the security boundaries of large language models. Furthermore, we propose a novel quality-assessment framework for malicious prompts that moves beyond the conventional single metric of attack success rate (ASR) by incorporating multi-dimensional quality scores, providing a more comprehensive and nuanced evaluation of the authenticity and potential harm of the generated content. Empirical results show that our method achieves a 100% attack success rate on models such as Qwen and Deepseek-r1.
We conducted extensive evaluations across a suite of open-source and closed-source models, confirming the robust effectiveness and generalizability of the proposed attack. Our work integrates state-of-the-art Deep Q-Network reinforcement learning into the domain of LLM security, highlighting the potential of autonomous agents to orchestrate sophisticated jailbreaking attacks through interactive learning. This study thus offers a new perspective on automated attacks for current LLM defense measures, suggesting that safeguarding mechanisms must evolve beyond static prompt filtering toward more adaptive and intelligent defensive strategies.
Key words:  Reinforcement learning, jailbreak attacks, large language models, artificial intelligence security
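The abstract describes the core loop: a Q-learning agent picks an attack strategy from a repository for a given malicious query, observes a reward from the evaluation step, and updates its value estimates. The following is a minimal, hypothetical sketch of that loop, not the authors' implementation: a tabular Q-learner stands in for the paper's Deep Q-Network, and all names (`STRATEGIES`, `select_strategy`, `update`, the query types and reward values) are illustrative assumptions.

```python
import random

# Illustrative strategy repository (hypothetical entries).
STRATEGIES = ["role_play", "encoding", "hypothetical_framing"]

class StrategyAgent:
    """Tabular Q-learning stand-in for a DQN-based strategy selector."""

    def __init__(self, epsilon=0.1, alpha=0.5, gamma=0.9):
        self.q = {}             # (query_type, strategy) -> estimated value
        self.epsilon = epsilon  # exploration rate
        self.alpha = alpha      # learning rate
        self.gamma = gamma      # discount factor

    def select_strategy(self, query_type):
        # Epsilon-greedy retrieval over the strategy repository.
        if random.random() < self.epsilon:
            return random.choice(STRATEGIES)
        return max(STRATEGIES, key=lambda s: self.q.get((query_type, s), 0.0))

    def update(self, query_type, strategy, reward, next_query_type):
        # Standard Q-learning target: r + gamma * max_a' Q(s', a').
        best_next = max(self.q.get((next_query_type, s), 0.0) for s in STRATEGIES)
        key = (query_type, strategy)
        old = self.q.get(key, 0.0)
        self.q[key] = old + self.alpha * (reward + self.gamma * best_next - old)

# Toy one-state episode: pretend "role_play" reliably succeeds for this
# query type, so its value estimate grows and the agent keeps retrieving it.
agent = StrategyAgent(epsilon=0.0)  # greedy for a deterministic demo
for _ in range(20):
    s = agent.select_strategy("harmful_instruction")
    agent.update("harmful_instruction", s,
                 reward=1.0 if s == "role_play" else 0.0,
                 next_query_type="harmful_instruction")
```

In the actual DDSR framework, per the abstract, the value function is a deep network rather than a table, the state would encode the malicious query, and the reward would come from the multi-dimensional quality score rather than a binary success flag.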