Indirect Prompt Injections: Are Firewalls All You Need, or Stronger Benchmarks?

A simple, modular, model-agnostic firewall defense can achieve near-perfect security across existing public agent-security benchmarks, while also revealing that many current benchmarks are too weak or flawed to measure real progress.

Rishika Bhagwatkar*, Kevin Kasa*, Abhay Puri, Gabriel Huang, Irina Rish, Graham W. Taylor, Krishnamurthy Dj Dvijotham†, Alexandre Lacoste†
ServiceNow · Mila - Quebec AI Institute · Université de Montréal · University of Guelph · Vector Institute
TL;DR

Current agent-security benchmarks are easily saturated by basic firewalls.

Indirect prompt injection attacks hide malicious instructions inside tool outputs such as emails, databases, or external APIs. We show that a simple Minimizer + Sanitizer defense, placed at the agent–tool interface, blocks these attacks extremely effectively while maintaining high utility.

Firewalls are a very strong baseline today, but stronger adaptive attacks and better benchmark design are still needed to measure true robustness.
Key Contributions
  • Introduces a lightweight tool-input firewall (Minimizer) and tool-output firewall (Sanitizer).
  • Shows the best security–utility tradeoff across AgentDojo, Agent Security Bench, InjecAgent, and τ-Bench.
  • Identifies major flaws in current benchmarks and proposes corrected versions and design guidelines.
  • Introduces a stronger three-stage cascade attack with semantic and adaptive attacks.

Abstract

AI agents are vulnerable to indirect prompt injection attacks, where malicious instructions embedded in external content or tool outputs cause unintended or harmful behavior. Inspired by the well-established concept of firewalls, we show that a simple, modular, and model-agnostic defense operating at the agent–tool interface achieves perfect security with high utility across all four public benchmarks: AgentDojo, Agent Security Bench, InjecAgent and $\tau$-Bench, while achieving a state-of-the-art security-utility tradeoff compared to prior results. Specifically, we employ two firewalls: a Tool-Input Firewall Minimizer and a Tool-Output Firewall Sanitizer. Unlike prior complex approaches, this defense makes minimal assumptions about the agent and can be deployed out of the box. This makes it highly generalizable while maintaining strong performance without compromising utility. Our analysis also reveals critical limitations in these existing benchmarks, including flawed success metrics, implementation bugs, and most importantly, weak attacks, hindering progress. To address this, we present targeted fixes to these issues for AgentDojo and Agent Security Bench, and propose best practices for more robust benchmark design. Moreover, we introduce a three-stage attack strategy that cascades standard prompt injection attacks, second-order attacks, and adaptive attacks to evaluate the robustness beyond existing attacks. Overall, our work shows that existing agentic security benchmarks are easily saturated by a simple approach and highlights the need for stronger benchmarks with carefully chosen evaluation metrics and strong adaptive attacks.

Overview

The threat model assumes a benign user, but potentially malicious tool outputs. The goal is to stop the attacker's objective without harming completion of the user's original task.

Indirect Prompt Injection Attack
  • What is attacked? The model reads external tool content that secretly contains instructions for the agent.
  • Where does it come from? Compromised databases, malicious emails, unsafe APIs, or poisoned documents.
  • What can happen? The agent may leak private data, make unintended payments, or follow attacker-defined actions.
  • Our focus: User tasks are safe, but tool outputs may be adversarial.
Minimize & Sanitize Pipeline
Poster figure illustrating the minimize and sanitize firewall pipeline for indirect prompt injection defense.
Figure demonstrating the firewall pipeline: the Minimizer filters sensitive or unnecessary content before tool use, and the Sanitizer removes malicious instructions from tool outputs before they reach the agent.

Main Results

Across four public benchmarks, the Sanitizer alone provides the strongest overall tradeoff between security and utility.

0%
ASR on AgentDojo with Sanitizer (GPT-4o)
0%
ASR on τ-Bench with Sanitizer (GPT-4o)
~2.5×
Faster than CaMeL on the latency analysis
99.82%
Sanitizer accuracy in false-positive / false-negative analysis

AgentDojo

Defense Benign Utility Utility under Attack Attack Success Rate
None 83.02 ± 5.33 50.01 ± 2.59 57.69 ± 3.07
PI Detector 41.49 ± 3.9 21.14 ± 3.2 7.95 ± 2.1
Repeat prompt 85.53 ± 2.8 67.25 ± 3.7 27.82 ± 3.5
Spotlighting 71.66 ± 3.5 55.64 ± 3.9 41.65 ± 3.9
Toolfilter 73.13 ± 3.5 56.28 ± 3.9 6.84 ± 2.0
Melon* 68.04 32.91 0.95
PromptArmor* 68.68 0.47
Melon-Aug* 76.29 52.46 1.27
CaMeL* 53.60 ± 9.9 54.50 ± 3.9 0.00 ± 0.00
Minimizer 70.01 ± 7.76 50.03 ± 2.90 18.15 ± 1.94
Sanitizer 74.09 ± 5.75 69.15 ± 2.24 0.00 ± 0.00
Combined 58.41 ± 2.61 58.59 ± 1.74 0.00 ± 0.00
* denotes non-reproducible baselines, matching the paper.

Agent Security Bench

Defense Benign Utility Utility under Attack Attack Success Rate
None 72.83 ± 0.58 68.75 ± 1.00 68.75 ± 1.00
Instr. Prevention 73.58 ± 0.52 60.25 ± 1.50 59.33 ± 0.88
Repeat prompt 73.67 ± 0.38 67.12 ± 3.01 69.12 ± 0.53
Spotlighting 70.08 ± 0.38 70.08 ± 1.04 71.17 ± 0.14
Sanitizer 64.25 ± 0.90 63.42 ± 1.46 16.33 ± 1.70

InjecAgent

Defense Base ASR Enhanced ASR
None 8.30 ± 0.3 3.80 ± 0.0
PI Detector 3.10 ± 0.5 0.00 ± 0.0
Spotlighting 5.10 ± 0.1 1.50 ± 0.1
Prompt sandwich 1.00 ± 0.3 2.00 ± 1.4
Sanitizer 0.30 ± 0.0 0.00 ± 0.0

τ-Bench

Defense Benign Utility Utility under Attack Attack Success Rate
None 51.73 ± 0.44 47.40 ± 0.44 56.09 ± 0.42
Spotlighting 51.74 ± 2.17 46.74 ± 2.19 52.60 ± 1.30
Repeat prompt 52.17 ± 2.61 46.09 ± 2.63 52.67 ± 0.37
PI Detector 6.90 ± 0.00 5.65 ± 0.00 0.00 ± 0.00
Sanitizer 59.09 ± 0.22 63.91 ± 0.30 0.00 ± 0.00
Generalization

The same sanitizer prompt generalizes across GPT-4o, Llama 3.3 70B, Qwen3-32B, and Qwen3-8B without benchmark-specific tuning.

Efficiency

Compared to heavier system-level defenses, the firewall approach gives similar or better protection with substantially lower latency and token overhead.

Reliability

The sanitizer achieves very high recall, zero false positives in the reported analysis, and only a handful of benign failures attributable to redaction.

What Is Wrong with Current Benchmarks?

The paper argues that strong defense numbers on existing public benchmarks do not necessarily imply robust real-world security, because many benchmarks contain design flaws that inflate success.

Benchmark Limitations

AgentDojo

  • Injection vectors sometimes overwrite task-critical content, making tasks unsolvable even if the attack is ignored.
  • Utility metrics can be brittle and fail to reflect semantic task completion.

Agent Security Bench

  • Forced injection of attack-tools inflates measured ASR.
  • Utility is weakly defined and can be gamed by brute-force tool use.

InjecAgent

  • No benign-utility or utility-under-attack metrics, so security–utility tradeoffs cannot be measured.
Benchmark Guidelines
  • No forced tool injection: let the agent make authentic planning decisions.
  • Preserve task-critical content: attacks should not delete information needed to solve the task.
  • Semantic utility metrics: evaluate whether the goal was completed, not just state deltas or tool counts.
  • Report all three metrics: Benign Utility, Utility under Attack, and Attack Success Rate.
  • Use stronger attacks: include semantic obfuscation and adaptive defense-aware attacks.
Revised benchmarks substantially change the picture: AgentDojo utility under attack improves by about 18%, and ASB attack success drops by roughly after correcting the evaluation setup.
Revised Benchmark Result Original Revised Interpretation
AgentDojo Utility under Attack 60.87 72.19 Correcting injection placement and utility scoring reveals higher true task performance.
ASB Attack Success Rate 73.58 9.25 Much of the original ASR was caused by benchmark-induced control flow rather than genuine vulnerability.

Cascade Attack Strategy

Existing attacks are often rigid and easy to detect. The paper introduces a stronger three-stage cascade that escalates only when earlier attacks fail.

Stage 1

Formatted Injections

Standard benchmark attacks with explicit malicious instructions or recognizable patterns.

Stage 2

Second-Order Semantic Attacks

Deceptive attacks that make malicious instructions appear benign, trusted, or secondary to other content.

Stage 3

Adaptive Mutator

A defense-aware attacker iteratively refines injections using feedback from previous successes and failures.

Takeaway

PromptArmor is substantially compromised once semantic attacks are introduced, whereas the Sanitizer remains much more robust and only shows non-trivial failures under the strongest adaptive setting.

This suggests that future evaluations should treat defense-aware adaptive attacks as standard rather than optional.

Firewalls can achieve perfect security across all publicly available benchmarks, matching or exceeding more complex defenses, but this also reveals that the benchmarks themselves need to become harder and more realistic.

Analysis Highlights

Beyond headline benchmark scores, the Tool-Output Firewall is also strong on runtime, token efficiency, and reliability.

Latency

2.5× faster than CaMeL

On the AgentDojo Banking and Slack suites under attack, CaMeL takes 8,417.25s, while the Firewall takes 3,347.93s. That is about a 2.5× speed-up under the same attack setting.

Mean runtime per task also drops from 65.04s with CaMeL to 27.67s with the Firewall under attack.

Token Overhead

Lower overhead than system-level defenses

Relative to the no-defense baseline, CaMeL uses 2.73× input tokens and 2.82× output tokens, while the Sanitizer uses 0.51× input tokens and 2.67× output tokens.

This means the Sanitizer cuts input-token usage by about 81% compared with CaMeL, while preserving strong security.

Reliability

High precision, near-perfect recall

The Sanitizer reports 0 false positives, 0.47% false negatives, 100% precision, and 99.82% accuracy.

In the benign-failure analysis, only 2 of 9 failures were caused by sanitizer redaction, while the remaining 7 of 9 were due to agent errors.

Uncertainty Ablation

Injected tool outputs do not destabilize the Sanitizer

On the Slack suite of AgentDojo, the Sanitizer's token entropy stays almost unchanged: 0.318 ± 0.001 on clean inputs versus 0.321 ± 0.001 on injected inputs. Perplexity is similarly stable at 1.221 ± 0.002 versus 1.215 ± 0.002.

Cross-Model Sanitizer

Works even when swapped to other LLMs

On AgentDojo with a GPT-4o assistant, using GPT-4o as sanitizer gives 67.68 BU / 69.17 UA / 0.02 ASR. Replacing the sanitizer with Llama 3.3 70B gives 70.86 BU / 62.20 UA / 0.68 ASR, while GPT-4-turbo gives 69.32 BU / 65.87 UA / 0.62 ASR.

BibTeX

Update this entry with the final publication metadata if needed.

@inproceedings{bhagwatkar2025firewalls,
  title     = {Indirect Prompt Injections: Are Firewalls All You Need, or Stronger Benchmarks?},
  author    = {Bhagwatkar, Rishika and Kasa, Kevin and Puri, Abhay and Huang, Gabriel and Rish, Irina and Taylor, Graham W. and Dvijotham, Krishnamurthy Dj and Lacoste, Alexandre},
  journal = {arXiv preprint arXiv:2510.05244},
  year      = {2025}
}