融入意图匹配的间接提示注入攻击防御方法

赵旭升; 刘书赫; 李运鹏; 孙浩

引用本文：

赵旭升,刘书赫,李运鹏,孙浩.融入意图匹配的间接提示注入攻击防御方法[J].信息安全学报,已采用 [点击复制]
zhaoxusheng,liushuhe,liyunpeng,sunhao.Defense Method Against Indirect Prompt Injection Attacks Incorporating Intent Matching[J].Journal of Cyber Security,Accept [点击复制]

本文已被：浏览 499次下载 0次
融入意图匹配的间接提示注入攻击防御方法
赵旭升¹, 刘书赫¹, 李运鹏², 孙浩²
0 字体:加大+\|默认\|缩小-
(1.中国电子科技集团公司第三十研究所;2.中国科学院信息工程研究所)

摘要:

基于大语言模型（LLM）的智能体通过集成外部工具与环境交互，广泛应用于多个领域。然而，该机制引入了间接提示注入（IPI）攻击风险。IPI通过向工具访问的外部数据注入恶意指令，操纵工具响应，诱导智能体执行未授权操作。现有防御通常依赖预设规则和语义分析拦截恶意指令，但仅关注响应内容，缺乏明确的判定依据，难以应对复杂攻击。为此，本文提出一种基于LLM检测器的防御方法，结合响应与用户指令之间的意图偏差，利用两阶段提示工程识别IPI攻击。第一阶段以语义密度优先级等作为筛选准则，采用内容提取提示模板引导检测器定位工具响应中的高风险文本片段；第二阶段对提取的关键片段实施安全校验提示，依据意图匹配和操作合法性等规则判定威胁。实验基于INJECAGENT基准及两个衍生数据集，测试本文方法与三类主流基线在传统和自适应IPI攻击下的防御效果。结果表明，本文方法仅依托智能体内置的基础模型，即可在关键指标上超越最佳基线：以基准数据集上的IPI攻击为例，检测准确率相对提升约39%，攻击成功率相对下降约48%。

关键词: 大模型智能体间接提示注入攻击提示工程意图匹配

DOI：

投稿时间：2025-11-27修订日期：2026-03-18

基金项目:

Defense Method Against Indirect Prompt Injection Attacks Incorporating Intent Matching

zhaoxusheng¹, liushuhe¹, liyunpeng², sunhao²

(1.The Thirtieth Research Institute of CETC;2.Institute of Information Engineering, CAS)

Abstract:

Agents powered by large language models (LLMs) interact with the environment by integrating external tools, enabling widespread applications across several domains. However, this mechanism introduces the risk of indirect prompt injection (IPI) attacks. IPI attacks inject malicious instructions into tool-accessed external data, thereby manipulating tool responses and steering agents toward unauthorized actions. Existing defenses generally use predefined rules and semantic analysis to filter malicious instructions. However, by focusing solely on responses and lacking a well-defined criterion for identifying truly malicious anomalies, these defenses fall short when confronting more intricate attacks. To overcome this, we propose a defense method based on an LLM-based detector, which detects IPI attacks by assessing the intent deviation between re-sponses and user instructions, using a two-stage prompt engineering. In the first stage, the detector is guided by criteria like semantic density priority to locate high-risk segments within tool responses using a content extraction prompt template. In the second stage, a security validation prompt is applied to the extracted key segments, which assesses threats based on criteria such as intent alignment and operational legality. Experiments were conducted using the INJECAGENT benchmark and two derived datasets to evaluate the performance of our method and three mainstream baseline families in defending against both traditional and adaptive IPI attacks. The results demonstrate that our method, relying solely on the base model in the agent, exceeds the best baseline in key metrics. For example, in the IPI attack on the benchmark dataset, detection ac-curacy improves by about 39% and attack success rate drops by approximately 48%, compared to the baseline.

Key words: large language model agent indirect prompt injection attack prompt engineering intent matching