A simple, modular, model-agnostic firewall defense can achieve near-perfect security across existing public agent-security benchmarks, while also revealing that many current benchmarks are too weak or flawed to measure real progress.
AI agents are vulnerable to indirect prompt injection attacks, where malicious instructions embedded in external content or tool outputs cause unintended or harmful behavior. Inspired by the well-established concept of firewalls, we show that a simple, modular, and model-agnostic defense operating at the agent–tool interface achieves perfect security with high utility across all four public benchmarks: AgentDojo, Agent Security Bench, InjecAgent and $\tau$-Bench, while achieving a state-of-the-art security-utility tradeoff compared to prior results. Specifically, we employ two firewalls: a Tool-Input Firewall Minimizer and a Tool-Output Firewall Sanitizer. Unlike prior complex approaches, this defense makes minimal assumptions about the agent and can be deployed out of the box. This makes it highly generalizable while maintaining strong performance without compromising utility. Our analysis also reveals critical limitations in these existing benchmarks, including flawed success metrics, implementation bugs, and most importantly, weak attacks, hindering progress. To address this, we present targeted fixes to these issues for AgentDojo and Agent Security Bench, and propose best practices for more robust benchmark design. Moreover, we introduce a three-stage attack strategy that cascades standard prompt injection attacks, second-order attacks, and adaptive attacks to evaluate the robustness beyond existing attacks. Overall, our work shows that existing agentic security benchmarks are easily saturated by a simple approach and highlights the need for stronger benchmarks with carefully chosen evaluation metrics and strong adaptive attacks.
The threat model assumes a benign user, but potentially malicious tool outputs. The goal is to stop the attacker's objective without harming completion of the user's original task.
Across four public benchmarks, the Sanitizer alone provides the strongest overall tradeoff between security and utility.
| Defense | Benign Utility | Utility under Attack | Attack Success Rate |
|---|---|---|---|
| None | 83.02 ± 5.33 | 50.01 ± 2.59 | 57.69 ± 3.07 |
| PI Detector | 41.49 ± 3.9 | 21.14 ± 3.2 | 7.95 ± 2.1 |
| Repeat prompt | 85.53 ± 2.8 | 67.25 ± 3.7 | 27.82 ± 3.5 |
| Spotlighting | 71.66 ± 3.5 | 55.64 ± 3.9 | 41.65 ± 3.9 |
| Toolfilter | 73.13 ± 3.5 | 56.28 ± 3.9 | 6.84 ± 2.0 |
| Melon* | 68.04 | 32.91 | 0.95 |
| PromptArmor* | – | 68.68 | 0.47 |
| Melon-Aug* | 76.29 | 52.46 | 1.27 |
| CaMeL* | 53.60 ± 9.9 | 54.50 ± 3.9 | 0.00 ± 0.00 |
| Minimizer | 70.01 ± 7.76 | 50.03 ± 2.90 | 18.15 ± 1.94 |
| Sanitizer | 74.09 ± 5.75 | 69.15 ± 2.24 | 0.00 ± 0.00 |
| Combined | 58.41 ± 2.61 | 58.59 ± 1.74 | 0.00 ± 0.00 |
| Defense | Benign Utility | Utility under Attack | Attack Success Rate |
|---|---|---|---|
| None | 72.83 ± 0.58 | 68.75 ± 1.00 | 68.75 ± 1.00 |
| Instr. Prevention | 73.58 ± 0.52 | 60.25 ± 1.50 | 59.33 ± 0.88 |
| Repeat prompt | 73.67 ± 0.38 | 67.12 ± 3.01 | 69.12 ± 0.53 |
| Spotlighting | 70.08 ± 0.38 | 70.08 ± 1.04 | 71.17 ± 0.14 |
| Sanitizer | 64.25 ± 0.90 | 63.42 ± 1.46 | 16.33 ± 1.70 |
| Defense | Base ASR | Enhanced ASR |
|---|---|---|
| None | 8.30 ± 0.3 | 3.80 ± 0.0 |
| PI Detector | 3.10 ± 0.5 | 0.00 ± 0.0 |
| Spotlighting | 5.10 ± 0.1 | 1.50 ± 0.1 |
| Prompt sandwich | 1.00 ± 0.3 | 2.00 ± 1.4 |
| Sanitizer | 0.30 ± 0.0 | 0.00 ± 0.0 |
| Defense | Benign Utility | Utility under Attack | Attack Success Rate |
|---|---|---|---|
| None | 51.73 ± 0.44 | 47.40 ± 0.44 | 56.09 ± 0.42 |
| Spotlighting | 51.74 ± 2.17 | 46.74 ± 2.19 | 52.60 ± 1.30 |
| Repeat prompt | 52.17 ± 2.61 | 46.09 ± 2.63 | 52.67 ± 0.37 |
| PI Detector | 6.90 ± 0.00 | 5.65 ± 0.00 | 0.00 ± 0.00 |
| Sanitizer | 59.09 ± 0.22 | 63.91 ± 0.30 | 0.00 ± 0.00 |
The same sanitizer prompt generalizes across GPT-4o, Llama 3.3 70B, Qwen3-32B, and Qwen3-8B without benchmark-specific tuning.
Compared to heavier system-level defenses, the firewall approach gives similar or better protection with substantially lower latency and token overhead.
The sanitizer achieves very high recall, zero false positives in the reported analysis, and only a handful of benign failures attributable to redaction.
The paper argues that strong defense numbers on existing public benchmarks do not necessarily imply robust real-world security, because many benchmarks contain design flaws that inflate success.
| Revised Benchmark Result | Original | Revised | Interpretation |
|---|---|---|---|
| AgentDojo Utility under Attack | 60.87 | 72.19 | Correcting injection placement and utility scoring reveals higher true task performance. |
| ASB Attack Success Rate | 73.58 | 9.25 | Much of the original ASR was caused by benchmark-induced control flow rather than genuine vulnerability. |
Existing attacks are often rigid and easy to detect. The paper introduces a stronger three-stage cascade that escalates only when earlier attacks fail.
Standard benchmark attacks with explicit malicious instructions or recognizable patterns.
Deceptive attacks that make malicious instructions appear benign, trusted, or secondary to other content.
A defense-aware attacker iteratively refines injections using feedback from previous successes and failures.
PromptArmor is substantially compromised once semantic attacks are introduced, whereas the Sanitizer remains much more robust and only shows non-trivial failures under the strongest adaptive setting.
This suggests that future evaluations should treat defense-aware adaptive attacks as standard rather than optional.
Beyond headline benchmark scores, the Tool-Output Firewall is also strong on runtime, token efficiency, and reliability.
On the AgentDojo Banking and Slack suites under attack, CaMeL takes 8,417.25s, while the Firewall takes 3,347.93s. That is about a 2.5× speed-up under the same attack setting.
Mean runtime per task also drops from 65.04s with CaMeL to 27.67s with the Firewall under attack.
Relative to the no-defense baseline, CaMeL uses 2.73× input tokens and 2.82× output tokens, while the Sanitizer uses 0.51× input tokens and 2.67× output tokens.
This means the Sanitizer cuts input-token usage by about 81% compared with CaMeL, while preserving strong security.
The Sanitizer reports 0 false positives, 0.47% false negatives, 100% precision, and 99.82% accuracy.
In the benign-failure analysis, only 2 of 9 failures were caused by sanitizer redaction, while the remaining 7 of 9 were due to agent errors.
On the Slack suite of AgentDojo, the Sanitizer's token entropy stays almost unchanged: 0.318 ± 0.001 on clean inputs versus 0.321 ± 0.001 on injected inputs. Perplexity is similarly stable at 1.221 ± 0.002 versus 1.215 ± 0.002.
On AgentDojo with a GPT-4o assistant, using GPT-4o as sanitizer gives 67.68 BU / 69.17 UA / 0.02 ASR. Replacing the sanitizer with Llama 3.3 70B gives 70.86 BU / 62.20 UA / 0.68 ASR, while GPT-4-turbo gives 69.32 BU / 65.87 UA / 0.62 ASR.
Update this entry with the final publication metadata if needed.
@inproceedings{bhagwatkar2025firewalls,
title = {Indirect Prompt Injections: Are Firewalls All You Need, or Stronger Benchmarks?},
author = {Bhagwatkar, Rishika and Kasa, Kevin and Puri, Abhay and Huang, Gabriel and Rish, Irina and Taylor, Graham W. and Dvijotham, Krishnamurthy Dj and Lacoste, Alexandre},
journal = {arXiv preprint arXiv:2510.05244},
year = {2025}
}