How to prevent prompt injections in AI agents

Oct 1, 2025

TL;DR

Prompt injection spans direct overrides, indirect injections from retrieved content, tool argument tampering, and memory poisoning. The article recommends six structural patterns like plan-then-approve-then-execute, action routers, isolation and verification, dual-LLM separation, sandboxed execution, and context minimization, backed by runtime policies and explainable blocks. Testing should align to OWASP and MITRE ATLAS, with MCP-aware enforcement at plan, tool, and output stages.

Prompt injection turns plain text into a control surface. In agentic systems, that surface spans planning, tools, retrieval, and long-lived memory. Prevention is not a single filter. It is an architectural choice, a set of runtime policies, and a repeatable test program aligned to public frameworks so claims are measurable. OWASP ranks prompt injection as a top LLM risk, which gives buyers a common language for requirements and proofs.

What prompt injection looks like in agents today

Prompt injection commonly appears in four forms: direct instruction override, indirect injection through retrieved or pasted content, tool argument manipulation, and memory poisoning that shapes later steps. OWASP’s LLM Top 10 describes these patterns at the input and output layers and is the right baseline for engineering teams to name and test scenarios consistently.

Agentic workflows amplify the impact because they can act. A single injected instruction can trigger tools, move data, modify files, or escalate privileges. OWASP’s agentic guidance calls for controls across planning, tool dispatch, and multi-step workflows, not only single prompts. Treat prevention as a systems problem, not a prompt-crafting exercise.

A shared threat model for prevention work

Use MITRE ATLAS to frame adversary objectives and technique chains that link data seeding, instruction smuggling, tool abuse, and exfiltration. Treat each chain as a testable scenario with clear pass or fail outcomes. ATLAS is a living knowledge base, updated as new attacks are observed, which makes it suitable for guardrail design and tabletop exercises.

Then map governance and measurement to NIST AI RMF. The framework provides outcomes and evidence expectations for govern, map, measure, and manage functions. This vocabulary helps security and platform teams agree on what “good” looks like, how to measure risk reduction, and how to present audit-ready evidence.

Placement matters. Intercept and enforce at planner boundaries, tool dispatch, retrieval inputs, memory writes, and outputs. Record reasons for every block. Export traceable artifacts that your SOC and auditors can follow without vendor consoles. OWASP and NIST both stress traceability and accountability, not just detection.

Prevention patterns you can implement now

Recent research proposes six structural design patterns that constrain capability so untrusted language cannot take control. These patterns trade a little flexibility for a large safety gain. They are architecture choices, not clever prompts, and they align naturally with OWASP LLM01.

Action router with a fixed dispatcher

Constrain the agent to a predefined set of allowed actions. The planner maps inputs to that list and never invents new operations. This sharply limits the blast radius of injected instructions and simplifies policy.

Plan then approve then execute

Freeze a plan before any untrusted content is parsed. Execution follows approved steps only. You gain control-flow integrity and predictable reviews, at the cost of a short approval phase and some orchestration overhead.

Map isolate, reduce verify

Process untrusted inputs in isolation, then aggregate through a verifier that filters or rejects tainted results. Isolation contains compromises to one shard. The reduce stage becomes your guardian.

Dual runtime LLM separation

A privileged planner never touches raw untrusted text. A quarantined model processes untrusted content and returns symbolic outputs under strict substitution rules. The trust boundary is explicit and testable.

Code then execute in a sandbox

Have the model generate a formal program that encodes the workflow, then run it inside a restricted environment with network and resource limits. This turns language risk into code review and sandbox policy.

Context minimization and memory pruning

Continuously remove or segment untrusted prompts and retrieved text so prior instructions cannot steer future steps. You trade some recall for resilience, and you keep context small and relevant.

Why these work: they shift the problem from trying to detect every malicious sentence to limiting what language can trigger at all. That approach is more reliable on current models and is consistent with OWASP’s emphasis on structural controls for LLM01.

Tip: Treat these patterns as building blocks. Many teams start with plan-then-execute plus context minimization, then add dual LLM separation for tasks that handle sensitive inputs.

Runtime guardrails that make patterns hold under pressure

Patterns need enforcement. Add runtime policies that catch hostile inputs, contain outputs, and produce evidence.

Instruction separation and structural input policies. Separate system rules from user goals. Validate tool arguments against typed schemas. Reject control tokens embedded in retrieved content. Map all rules to OWASP LLM01 so coverage is inspectable rather than implied.

Output shaping, DLP, and secret scanning. Scan responses and tool outputs for PII, secrets, and policy violations. Redact or block, then log the rule and evidence. Align categories with OWASP phrasing to make reviews consistent across teams and vendors.

Explainable blocks. Each denial should include a human-readable reason, the violated rule, a trace ID, and remediation guidance. NIST AI RMF expects traceability and response workflows that produce auditable artifacts, not silent drops.

Testing and hardening program

Prevention is an engineering discipline, not a one-time prompt tweak. Build a program that shows progress and stops regressions.

Red team suites tied to OWASP and ATLAS. Use tests that reference LLM01 variants and ATLAS techniques, including indirect injection, tool abuse, and exfiltration chains. Score results by scenario category so improvements are objective and repeatable.

CI or CD release gates. Set pass or fail thresholds for injection resistance, DLP effectiveness, false-block rates, and latency budgets. Fail builds that do not meet the bar and publish the report to security stakeholders. Use the same thresholds across vendors so comparisons are fair.

Telemetry, alerting, and operator playbooks. Stream normalized alerts to your SIEM with rule IDs, tool names, tenant, user, and severity. Maintain runbooks for containment, token revocation, policy rollback, and forensics. This is how SOC teams handle incidents without vendor consoles. NIST’s manage and respond outcomes are a helpful checklist.

Where to place controls in MCP and similar stacks

If you use the Model Context Protocol, place mediation at three points:

Plan validation before dispatch, with explicit reasons for any denial and a trace ID that follows the plan.
Tool-call interception with argument-level policies, per-tool scopes, and time-bound tokens.
Post-action scanning for data leakage and policy violations, with structured reasons and redactions.

Use the current MCP specification so policy engines and tests operate on the same message format. Favor versioned contracts and stable trace identifiers across steps so audits are simple.

For non-MCP stacks, apply the same logic at your orchestration layer. The goal is one policy model enforced across frameworks and models with uniform logs your SOC can read.

How to measure real progress

Teams often ask, “How do we know we are safer?” Use three lenses:

Coverage lens. What percentage of LLM01 and ATLAS scenarios do we test and block? Track by category. Expand scope every quarter.
Quality lens. Are false blocks dropping while true blocks rise? Track with precision and recall, and review misclassified events in post-incident reports.
Operational lens. How long from alert to containment to rollback? Do playbooks run without the vendor console? NIST encourages measurable manage and respond practices that shorten this timeline.

Engineer’s checklist: 18 quick wins

Inputs

Separate system instructions from user goals with hard boundaries.
Validate tool arguments against typed schemas and length limits.
Treat retrieved text as data, not instructions; sanitize control tokens.
Deny external prompts that request model identity changes or policy overrides.
Record every denial with a rule ID and remediation hint.

Tools

Enforce least-privilege scopes per tool and per tenant.
Use time-bound tokens and automatic revocation after each call.
Rate-limit tool invocations and set spend ceilings per workflow.
Block network egress by default in sandboxes and data planes.
Require allow lists for destinations and file types.

Memory and retrieval

Partition memory by tenant and task, with strict TTLs for untrusted content.
Prune conversation history to remove untrusted prompts before future steps.
Log every memory write that originates from untrusted sources.

Outputs

Scan for PII, secrets, and regulated terms before responses leave the platform.
Redact and explain, do not silently drop.
Attach trace IDs that link outputs to inputs, tools, and policies.

Tests and ops

Run OWASP and ATLAS-aligned suites for each release.
Set pass thresholds and fail builds below the bar.
Stream alerts to SIEM with tenant, tool, severity, and rule.
Provide global kill switch and policy rollback.
Export full traces for audits and tabletop exercises.

Frequently asked questions

Do these patterns slow agents down?

Yes, a little. Most latency comes from plan approval and sandbox checks. Set budgets and test p50 and p95 added overhead before rollout. Publish the numbers so product teams can plan.

Can we rely only on the model to resist injections?

Not today. The research consensus is that structural controls are more reliable than content-only detection. The six design patterns exist to constrain the problem space rather than wish it away.

How does this fit with our governance program?

Map controls and logs to NIST AI RMF outcomes. Use OWASP for scenario naming and ATLAS for threat chains. This gives auditors the evidence they expect and lets engineering and risk speak the same language.

Conclusion

Prompt injection is not only a content problem. It is a control problem. The fastest path to resilience combines structural design patterns that limit what language can trigger, runtime policies that provide explainable blocks, and a test program tied to OWASP and MITRE catalogs. If you use MCP, place mediation at plan, tool, and output stages with stable trace IDs. Measure progress with coverage, quality, and operational metrics, and publish results so stakeholders see improvement over time. That is how agent programs resist injection at scale.