Cite this article
  • Li Yanshu, Wang Yan, Yang Haitian, Zhu He, Liu Wen, Wang Tong, Huang Weiqing. A Survey of Jailbreak Attacks and Defenses Against Large Language Models [J]. Journal of Cyber Security, accepted.


A Survey of Jailbreak Attacks and Defenses Against Large Language Models
Li Yanshu, Wang Yan, Yang Haitian, Zhu He, Liu Wen, Wang Tong, Huang Weiqing
(Institute of Information Engineering, Chinese Academy of Sciences)
Abstract:
With the rapid development and widespread application of large language models, their security problems have become increasingly prominent, and jailbreak attacks have gradually emerged as a new and severe security challenge. Such attacks use specially crafted jailbreak prompts to bypass the safety mechanisms of large language models, leading to the leakage of sensitive information and user privacy and prompting the models to generate and spread harmful malicious content. This not only poses a serious threat to applications built on large language models but also has adverse effects on society and individuals. In response to the continuous evolution and diversification of jailbreak attack techniques, corresponding security defense mechanisms are developing in parallel, building a stronger line of defense for large language models. This paper first explains the concepts related to jailbreak attacks and their underlying principles, and then proposes a comprehensive and detailed classification framework for jailbreak attack and defense methods. The framework systematically organizes existing jailbreak attack methods, introducing representative attacks of recent years from two perspectives, human-designed attacks and model-generated attacks, and comparatively analyzing their principles, implementations, advantages, and disadvantages. It then focuses on reviewing existing defenses against jailbreak attacks, summarizing techniques that mitigate such attacks and improve the security of large language models; according to their scope of implementation, these are divided into external security mechanisms and internal security mechanisms, each further subdivided into subcategories, with a detailed summary and comparison of the characteristics and effectiveness of each class of techniques. Finally, the paper discusses future research directions for jailbreak attack and defense technologies, offering valuable reference and inspiration for researchers and developers in related fields.
Key words:  generative artificial intelligence  large language model  jailbreak attack  jailbreak defense  cyber security
DOI:
Received: 2024-08-18  Revised: 2024-12-19
Funding:
A Survey of Jailbreak Attacks and Defenses Against Large Language Models
Li Yanshu, Wang Yan, Yang Haitian, Zhu He, Liu Wen, Wang Tong, Huang Weiqing
(Institute of Information Engineering, Chinese Academy of Sciences)
Abstract:
As large language models rapidly evolve and find widespread application, their security issues have become increasingly prominent, with jailbreak attacks emerging as a new and severe security challenge. These attacks bypass the safety mechanisms of large language models through specially crafted jailbreak prompts, leading to the leakage of sensitive information and user privacy and prompting the generation and dissemination of harmful malicious content. This not only poses a serious threat to applications built on large language models but also has adverse impacts on society and individuals. In response to the continuous evolution and diversification of jailbreak attack techniques, corresponding security defense mechanisms are being developed in parallel to strengthen the security of large language models. This paper begins by elucidating the concepts related to jailbreak attacks and their underlying principles. It then proposes a comprehensive and detailed classification framework for jailbreak attack and defense methods. This framework systematically organizes existing jailbreak attack techniques, introducing representative methods from two perspectives, human-designed attacks and model-generated attacks, and provides a comparative analysis of their principles, implementations, and advantages and disadvantages. The paper then pays particular attention to reviewing current defense mechanisms against jailbreak attacks, summarizing technologies that mitigate these attacks and enhance the security of large language models. Based on their scope of implementation, these technologies are categorized into external and internal security mechanisms and further subdivided into subcategories, with a detailed summary, synthesis, and comparison of the characteristics and effectiveness of each technique. Finally, the paper discusses future research directions in jailbreak attack and defense technologies, providing valuable insights and guidance for researchers and developers in the field.
Key words:  generative artificial intelligence  large language model  jailbreak attack  jailbreak defense  cyber security