Securing Agentic AI: A Layered Defense Playbook
Agents aren't chatbots with extra steps — they read untrusted text, hold credentials, call tools, write to memory, and reach the public internet. Securing one means securing seven layers at once. Here's how to do it, with reference architectures for AWS Bedrock AgentCore, Azure AI Foundry Agents, and Gemini Enterprise / Vertex Agent Engine, plus the open-source stack that fills the gaps.
A 2024-era chatbot had one attack surface: the prompt. A 2026-era agent has at least seven. It authenticates as a workload identity, reads documents an attacker may have written, decides which tools to call, mutates a long-lived memory store, talks to external APIs, runs sandboxed code, and leaves an audit trail you'll either trust in court or won't. Every one of those is a separate trust boundary, and they fail in ways the classic AppSec playbook doesn't cover.
This post is the playbook we wish we'd had on day one of shipping agents to enterprises. We'll move top-down through the layers, name the threats at each, list the controls that close them, then translate the abstract picture into concrete reference architectures on the three platforms most readers are deploying on: AWS Bedrock AgentCore, Azure AI Foundry Agent Service, and Google Vertex AI Agent Engine / Gemini Enterprise. We'll end with the open-source and third-party stack that picks up where managed services stop.
If you ship to one cloud, skim the other two — the patterns translate. If you're an early-stage team, jump to the Layered defense section and the Open-source stack at the end. If you're an enterprise architect, the reference diagrams are designed to drop into a threat model.
Why agents are a new attack surface
Three properties make agentic systems different — and harder — from a security standpoint:
- They mix trust levels in the same context window. A system prompt (trusted), a user message (semi-trusted), retrieved documents (often untrusted), tool outputs (untrusted), and conversation memory (variable) all become one flat string the model reasons over. The model has no built-in mechanism to keep them apart.
- They hold credentials and act on them. Unlike a stateless chatbot, an agent calls APIs, writes to databases, sends email, executes code. A successful injection isn't just a wrong answer — it's an unauthorized transaction.
- They learn, remember, and self-modify. Long-term memory, skills, sub-agent spawning and self-evaluation mean today's safe agent can quietly drift into tomorrow's compromised one without a code change.
Simon Willison crystallized the worst-case as the lethal trifecta: any agent that simultaneously has (1) access to private data, (2) exposure to untrusted content, and (3) the ability to communicate externally can be turned into a data-exfiltration tool by a single well-crafted document. The whole layered model below is, in essence, a discipline for never letting all three line up at once — or, if they must, ringing them with so many tripwires the attack still fails.
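A trifecta check is easy to automate at design time. The sketch below assumes you can describe each agent with three booleans; `AgentProfile` and its field names are hypothetical, not part of any platform SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentProfile:
    """Hypothetical capability summary for one agent (names are illustrative)."""
    name: str
    reads_private_data: bool
    sees_untrusted_content: bool
    can_communicate_externally: bool


def trifecta_legs(agent: AgentProfile) -> list[str]:
    """Return the trifecta legs this agent currently has."""
    legs = {
        "private data": agent.reads_private_data,
        "untrusted content": agent.sees_untrusted_content,
        "external communication": agent.can_communicate_externally,
    }
    return [leg for leg, present in legs.items() if present]


def has_lethal_trifecta(agent: AgentProfile) -> bool:
    """True only when all three legs line up on the same agent."""
    return len(trifecta_legs(agent)) == 3


support = AgentProfile("support", True, True, True)
summarizer = AgentProfile("summarizer", False, True, False)
```

Running a check like this over every agent definition in CI gives you a list of agents that need the extra tripwires, before any of them ships.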
We covered this from the architecture side in Production System Design for Agentic AI and from the failure-mode side in 7 Failure Modes That Kill Multi-Agent Systems. This post is the security-first companion.
The seven layers of agent security
Each layer has its own dominant threats and a set of controls that earn their keep; the sections below drill into each layer in detail. For the Prompt & Input layer, for example:
- Threats: prompt injection (direct and indirect), jailbreaks, data poisoning via RAG sources.
- Controls: input classifiers / tripwires, trust tagging of sources, stripping HTML, hidden Unicode, and embedded instructions from retrieved docs.
Mapping to OWASP LLM Top 10 (2025)
If you owe an auditor a checklist, the OWASP LLM Top 10 is the lingua franca. Here's how the layered controls map onto it.
| OWASP | Threat | Primary control |
|---|---|---|
| LLM01 | Prompt Injection | Input guardrail + trust tags + structured tool schemas |
| LLM02 | Sensitive Information Disclosure | Output PII redactor + memory tenant scoping |
| LLM03 | Supply-Chain (models, plugins, MCP) | Signed artifacts, SBOM scan, MCP allowlist registry |
| LLM04 | Data & Model Poisoning | Source provenance + RAG dedup + canary evals |
| LLM05 | Improper Output Handling | Render as data, never as code; sanitize HTML/SQL |
| LLM06 | Excessive Agency | Least-privilege tools + human-in-the-loop for high-blast actions |
| LLM07 | System Prompt Leakage | Treat prompt as non-secret; gate secrets via runtime fetch |
| LLM08 | Vector & Embedding Weaknesses | Per-tenant namespaces, embedding-attack tests |
| LLM09 | Misinformation / Hallucination | Grounding + citation requirement + LLM-judge eval gate |
| LLM10 | Unbounded Consumption | Per-agent budgets, loop detector, request quotas |
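The LLM10 controls in the last row are among the cheapest to prototype. Here is a minimal sketch of a per-agent token budget with a repeated-call loop detector; `AgentBudget`, its thresholds, and the idea of keying the detector on identical `(tool, args)` pairs are our assumptions for illustration.

```python
from collections import Counter


class BudgetExceeded(Exception):
    """Raised when an agent overspends or appears stuck in a loop."""


class AgentBudget:
    """Hypothetical per-agent guard: token budget plus repeated-call detector."""

    def __init__(self, max_tokens: int, max_repeats: int = 3):
        self.max_tokens = max_tokens
        self.max_repeats = max_repeats
        self.tokens_used = 0
        self._calls = Counter()

    def charge(self, tokens: int) -> None:
        """Record token spend; raise once the budget is exhausted."""
        self.tokens_used += tokens
        if self.tokens_used > self.max_tokens:
            raise BudgetExceeded(
                f"token budget exceeded: {self.tokens_used}/{self.max_tokens}"
            )

    def record_tool_call(self, tool: str, args: str) -> None:
        """Count identical (tool, args) calls; raise when the agent looks stuck."""
        self._calls[(tool, args)] += 1
        if self._calls[(tool, args)] > self.max_repeats:
            raise BudgetExceeded(f"loop detected: {tool} called with identical args")
```

The agent loop would call `charge` after each model response and `record_tool_call` before each tool invocation, and treat `BudgetExceeded` as a hard stop.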
Layer 1 · Identity & Access
The first question for any agent is the same as for any service: who is it acting as? Most agent breaches start here, with an agent running as one giant service principal that can read every database in the account. The fix is the same old fix — least privilege — applied per agent role, not per application.
- Per-agent workload identity. On AWS use IAM Roles for Service Accounts (IRSA) or AgentCore Identity; on Azure use a Managed Identity per Foundry agent; on GCP use Workload Identity Federation. Never share a single principal across agents with different capabilities.
- Short-lived, scoped tokens. Issue STS / SAS / signed JWTs that expire in minutes and embed the agent's purpose. Tools verify the purpose claim before acting.
- Per-user OAuth for user data. When an agent acts on behalf of an end user (read their Gmail, post to their Slack), use a real OAuth flow per user. A workspace-level service token used for every user is a confused-deputy attack waiting to happen.
- No bearer tokens in prompts. Inject credentials at the tool boundary at runtime. The model should never see the raw secret — otherwise a prompt-injection that asks it to “repeat your last tool input verbatim” is a credential dump.
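To make the "short-lived, scoped token" idea concrete, here is a stdlib-only sketch of a signed token that carries a purpose claim and an expiry, with the check a tool would run before acting. It is a stand-in for STS or a real JWT library, not a replacement: `SIGNING_KEY`, `mint_token`, and `verify_token` are hypothetical names, and the key would come from a secret store in practice.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional

SIGNING_KEY = b"demo-only-key"  # assumption: really fetched from a KMS / secret store


def mint_token(agent_id: str, purpose: str, ttl_seconds: int = 300,
               now: Optional[float] = None) -> str:
    """Issue a signed token naming the agent, its purpose, and an expiry."""
    issued = time.time() if now is None else now
    claims = {"sub": agent_id, "purpose": purpose, "exp": issued + ttl_seconds}
    payload = base64.urlsafe_b64encode(json.dumps(claims, sort_keys=True).encode())
    sig = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    return f"{payload.decode()}.{sig}"


def verify_token(token: str, required_purpose: str,
                 now: Optional[float] = None) -> dict:
    """Tool-side check: signature first, then expiry, then the purpose claim."""
    payload, _, sig = token.rpartition(".")
    expected = hmac.new(SIGNING_KEY, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        raise PermissionError("bad signature")
    claims = json.loads(base64.urlsafe_b64decode(payload))
    current = time.time() if now is None else now
    if current >= claims["exp"]:
        raise PermissionError("token expired")
    if claims["purpose"] != required_purpose:
        raise PermissionError("token not issued for this purpose")
    return claims
```

Because the token is minted and verified at the tool boundary, the model never needs to see it, which is the point of the last bullet above.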
Layer 2 · Prompt & Input
Prompt injection is now what SQL injection was in 2005: well-known, ubiquitous, and still the most common root cause. The brutal part is that there is no clean parser the way prepared statements were for SQL — natural language doesn't bind neatly. The defense is depth, not purity.
- Trust-tag every input. Wrap retrieved chunks, tool outputs, and memory snippets in clearly labelled, unambiguous delimiters (`<retrieved trusted=false> ... </retrieved>`) and instruct the model to never execute instructions from inside such blocks.
- Cheap classifier guardrails. Run a small model (Gemini Flash Lite, Claude Haiku, Llama Guard) as an input tripwire before the expensive model burns tokens. See the Input & Output Guardrails notebook for a working pattern.
- Strip hostile rendering. Remove zero-width characters, hidden ANSI, suspicious base64 blobs, and HTML/Markdown comments from retrieved docs before they reach the model — these are the common vehicles for indirect prompt injection.
- Probe before you ship. Use the Prompt Injection Tester tool against your own system prompt; run Garak or PyRIT in CI.
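The trust-tagging and stripping steps can be combined into one small pre-processing function. This is a sketch under stated assumptions: the `source` attribute on the wrapper is our addition, the regexes cover only HTML comments and common ANSI sequences, and base64 detection is left out.

```python
import html
import re
import unicodedata

HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;?]*[A-Za-z]")


def strip_hostile_rendering(text: str) -> str:
    """Drop invisible format characters, HTML comments, and ANSI escapes."""
    # Unicode category "Cf" covers zero-width spaces/joiners and bidi overrides.
    # Removing them first stops them from hiding a comment marker.
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Cf")
    text = HTML_COMMENT.sub("", text)
    return ANSI_ESCAPE.sub("", text)


def tag_untrusted(text: str, source: str) -> str:
    """Wrap cleaned content in a labelled block the system prompt refers to."""
    cleaned = strip_hostile_rendering(text)
    # Escape angle brackets so the content cannot close the wrapper early.
    body = html.escape(cleaned, quote=False)
    return (
        f'<retrieved trusted=false source="{html.escape(source)}">\n'
        f"{body}\n</retrieved>"
    )
```

One trade-off to be aware of: stripping every `Cf` character also removes zero-width joiners that some scripts and emoji sequences rely on, so multilingual corpora may need a narrower filter.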
Layer 3 · Model & Reasoning
Models leak through their outputs, not just their inputs. Four patterns matter:
- Structured outputs by default. Force the model to emit JSON matching a Pydantic / Zod schema. A schema rejects three classes of attack — malformed tool calls, unexpected fields used to smuggle instructions, and ‘free-form’ replies that bypass downstream parsers. See Pydantic — The Contract Layer of Agentic AI.
- Hidden chain-of-thought. Never surface the model's reasoning text to the caller. CoT routinely contains intermediate secrets (database rows the model considered then discarded, raw API responses, etc.). Strip it server-side.
- Use safety-tuned variants where they exist. Bedrock Guardrails, Azure Content Safety + Prompt Shields, and Gemini Safety Settings catch obvious-bad without you having to write a classifier.
- Pin model versions. Don't let a silent upgrade of `gpt-4-latest` revert your jailbreak fixes. Pin per environment and ship version bumps through the same eval gate as code.
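To show what "structured outputs by default" buys you, here is a strict parser for a model-emitted tool call. It uses only the standard library as a stand-in for a Pydantic or Zod model; the schema, field names, and tool names are hypothetical.

```python
import json

# Hypothetical tool-call schema: field name -> required Python type.
TOOL_CALL_SCHEMA = {"tool": str, "ticket_id": int, "reason": str}
ALLOWED_TOOLS = {"lookup_ticket", "close_ticket"}


def parse_tool_call(raw: str) -> dict:
    """Parse model output and reject anything that is not exactly the schema."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    extra = set(data) - set(TOOL_CALL_SCHEMA)
    missing = set(TOOL_CALL_SCHEMA) - set(data)
    if extra or missing:
        raise ValueError(
            f"unexpected fields {sorted(extra)}, missing fields {sorted(missing)}"
        )
    for name, expected in TOOL_CALL_SCHEMA.items():
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(data[name], expected) or isinstance(data[name], bool):
            raise ValueError(f"{name} must be {expected.__name__}")
    if data["tool"] not in ALLOWED_TOOLS:
        raise ValueError(f"unknown tool {data['tool']!r}")
    return data
```

The three rejections map to the three attack classes named above: free-form replies fail JSON parsing, smuggled fields fail the exact-key check, and malformed calls fail the type and tool-name checks.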
Layer 4 · Tools / MCP
Tools are where intent becomes action. Four things separate a well-secured tool layer from a disaster waiting to happen:
- Capability scoping per agent role. A `summarizer` agent doesn't get the `send_email` tool, period. Don't pass “all available tools” into every agent; the LLM will eventually find a creative use for the one you forgot to remove.
- Tool-broker as policy point. Put a thin server between the agent and the tool that re-validates inputs against a schema, checks the calling agent's identity, applies per-tool rate limits, and writes an audit row. The model can lie about its intent; the broker can't be talked out of its checks.
- MCP servers behind an allowlist registry. The MCP ecosystem is exploding and packages get yanked, replaced, or quietly compromised. Maintain an internal registry of pinned, signed MCP servers — see MCP Production Playbook 2026.
- Human-in-the-loop for high-blast actions. Any action that costs money, sends a customer message, or mutates production data should pause for explicit approval. Build it into the agent loop from day one; bolting it on later is expensive.
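A tool broker does not have to be elaborate to be useful. The sketch below covers capability scoping, human approval, rate limiting, and the audit row in one class; input schema validation is left out for brevity. `ToolBroker`, the policy tables, and the limits are all hypothetical, and a real broker would run as a separate service rather than in-process.

```python
import time
from collections import defaultdict, deque

# Hypothetical policy tables; in practice these would live in config, not code.
ROLE_TOOLS = {"summarizer": {"fetch_doc"}, "support": {"fetch_doc", "send_email"}}
NEEDS_APPROVAL = {"send_email"}
RATE_LIMIT = 5          # max calls per (role, tool) ...
RATE_WINDOW_S = 60.0    # ... within this many seconds


class ToolBroker:
    def __init__(self, tools, clock=time.monotonic):
        self.tools = tools                 # tool name -> callable
        self.clock = clock
        self.audit = []                    # stand-in for an append-only store
        self._recent = defaultdict(deque)  # (role, tool) -> recent call times

    def call(self, role, tool, args, approved=False):
        """Decide, write the audit row, then either run the tool or refuse."""
        decision = self._decide(role, tool, approved)
        self.audit.append(
            {"role": role, "tool": tool, "args": args, "decision": decision}
        )
        if decision != "allow":
            raise PermissionError(decision)
        return self.tools[tool](**args)

    def _decide(self, role, tool, approved):
        if tool not in ROLE_TOOLS.get(role, set()):
            return "deny: tool not granted to role"
        if tool in NEEDS_APPROVAL and not approved:
            return "deny: human approval required"
        now, window = self.clock(), self._recent[(role, tool)]
        while window and now - window[0] > RATE_WINDOW_S:
            window.popleft()
        if len(window) >= RATE_LIMIT:
            return "deny: rate limit"
        window.append(now)
        return "allow"
```

Note that the audit row is written for denials as well as successes; refused calls are often the most interesting ones to review.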
If your agent can read sensitive data, can be exposed to untrusted text, and can talk to the outside world — and you cannot remove one of those legs — assume an exfiltration channel exists. Add an egress proxy, an output guardrail that scans for known-secret patterns, and rate-limit external calls. The architecture in Production System Design walks through breaking the trifecta in detail.
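The output guardrail mentioned above can start as a handful of regular expressions. This sketch redacts a few well-known secret shapes before a response leaves the agent; the pattern set is illustrative and deliberately small, and `scrub_output` is a hypothetical name.

```python
import re

# Illustrative patterns only; a real deployment would add its own secret formats.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "bearer_token": re.compile(r"\bBearer\s+[A-Za-z0-9._~+/-]{20,}=*"),
    "private_key": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def scrub_output(text: str) -> tuple[str, list[str]]:
    """Redact known-secret patterns; return the cleaned text and what was hit."""
    hits = []
    for name, pattern in SECRET_PATTERNS.items():
        text, count = pattern.subn(f"[REDACTED:{name}]", text)
        if count:
            hits.append(name)
    return text, hits
```

A non-empty `hits` list is worth alerting on even when redaction succeeds, since it suggests something upstream put a secret where the model could reach it.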
Layer 5 · Memory & Data
Long-term memory is what makes agents useful and what makes them dangerous. A poisoned memory entry written on Monday silently steers every conversation that week. A multi-tenant agent that mixes one customer's notes into another's response is a breach with regulatory consequences.
- Tenant-scoped namespaces. Every memory write/read passes through a tenant ID; the vector store enforces partition isolation, not the application code.
- Provenance on every memory. Store who wrote this, when, from which session. When something looks off, you can trace the source — and revoke its successors.
- Encrypt at rest with CMK. KMS / Key Vault / CMEK with customer-managed keys gives you a kill switch. Drop the key, the data is unreadable, even by the platform.
- RAG source vetting. Treat indexed documents as part of your supply chain. Hash content, watch for unexpected diffs, and apply doc-staleness controls so the index doesn't drift into a poisoned state without you noticing.
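Tenant scoping and provenance fit naturally into the same record. The in-memory class below is a toy stand-in for the store itself (in production the partitioning would be enforced by the vector store or database, as noted above); `TenantMemory` and `MemoryEntry` are hypothetical names.

```python
import time
from dataclasses import dataclass, field


@dataclass(frozen=True)
class MemoryEntry:
    text: str
    author: str        # who wrote it (agent or user id)
    session_id: str    # which session produced it
    written_at: float = field(default_factory=time.time)


class TenantMemory:
    """Hypothetical store: every read and write is keyed by tenant."""

    def __init__(self):
        self._partitions: dict[str, list[MemoryEntry]] = {}

    def write(self, tenant_id: str, entry: MemoryEntry) -> None:
        self._partitions.setdefault(tenant_id, []).append(entry)

    def read(self, tenant_id: str) -> list[MemoryEntry]:
        # Only the caller's partition is ever touched.
        return list(self._partitions.get(tenant_id, []))

    def revoke_session(self, tenant_id: str, session_id: str) -> int:
        """Drop everything a suspicious session wrote; return the count removed."""
        before = self._partitions.get(tenant_id, [])
        kept = [e for e in before if e.session_id != session_id]
        self._partitions[tenant_id] = kept
        return len(before) - len(kept)
```

Because each entry records its session, a poisoned write can be traced and removed without wiping the tenant's whole memory.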
Layer 6 · Network & Runtime
Code that an LLM generated and an LLM decided to run is, by definition, not something you reviewed. Run it like you'd run user-supplied code: in a sandbox, in a private network, with egress controls.
- Sandboxed code execution. Bedrock AgentCore's Code Interpreter, Vertex's sandboxed exec, or self-hosted E2B / Firecracker / gVisor microVMs. No persistent filesystem, no network unless explicitly enabled, hard CPU and wall-clock limits.
- Egress allowlist. Force all outbound calls through a proxy that allowlists destinations. An injection that tries to POST a secret to `attacker.example.com` should fail at the network, not at the model.
- Signed images + SBOM scanning. Sign every container with Cosign, scan with Snyk / Trivy on every build, refuse to deploy unsigned or critical-CVE images.
- Private VPC, no public ingress to internals. Tools, memory stores, and vector DBs live in private subnets. The only public surface is the agent's API gateway.
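The decision an egress proxy makes is simple enough to sketch. The function below assumes exact-match hostnames over HTTPS only; `ALLOWED_HOSTS` and its entries are hypothetical, and a real proxy would enforce this at the network layer rather than in agent code.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist; exact hosts only, no wildcard or substring matching.
ALLOWED_HOSTS = {"api.internal.example.com", "api.stripe.com"}


def egress_allowed(url: str) -> bool:
    """Allow only HTTPS requests to an exact-match host on the default port."""
    parts = urlsplit(url)
    if parts.scheme != "https":
        return False
    try:
        port = parts.port
    except ValueError:  # malformed port such as ":abc"
        return False
    if port not in (None, 443):
        return False
    host = (parts.hostname or "").rstrip(".").lower()
    return host in ALLOWED_HOSTS
```

Matching on the parsed hostname rather than the raw string matters: lookalike URLs such as `https://api.stripe.com.attacker.example.com/` or `https://api.stripe.com@attacker.example.com/` pass a naive substring check but fail this one.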
Layer 7 · Observability & Governance
If you can't see what your agents did, you can't prove they did it correctly — and you can't catch the day they stop. Observability is the layer that makes every other control enforceable.
- Trace every step. OpenTelemetry GenAI conventions are now stable; emit one span per model call, per tool call, per guardrail decision. Hash inputs/outputs so traces are searchable without dumping PII into logs.
- Tamper-evident audit log. For regulated workloads, write tool-call audit rows to an append-only store (S3 Object Lock, Azure immutable blob, GCS bucket lock).
- Continuous eval + red-team in CI. Every prompt or tool change goes through an eval suite that includes injection attempts and known jailbreaks. Block the deploy if quality or safety regresses.
- Per-agent budgets + anomaly alerts. Cost spikes are often the first signal of a runaway loop or a compromise. See Cost Control in Multi-Agent Systems.
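Object-lock storage makes deletion hard; a hash chain makes edits detectable. Here is a minimal sketch of the second idea, in which each audit row commits to the one before it. The function names and row layout are ours, not a standard.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first row


def _digest(prev_hash: str, record: dict) -> str:
    body = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode()).hexdigest()


def append_record(log: list[dict], record: dict) -> None:
    """Append a row whose hash covers both its content and the previous row."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append(
        {"record": record, "prev": prev_hash, "hash": _digest(prev_hash, record)}
    )


def verify_chain(log: list[dict]) -> bool:
    """Recompute every link; an edited, reordered, or mid-chain deleted row fails."""
    prev_hash = GENESIS
    for row in log:
        if row["prev"] != prev_hash or row["hash"] != _digest(prev_hash, row["record"]):
            return False
        prev_hash = row["hash"]
    return True
```

A chain on its own cannot detect rows trimmed from the tail, so the latest hash should be anchored somewhere the writer cannot alter, such as the append-only store described above.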
Reference architecture · AWS Bedrock AgentCore
Bedrock AgentCore (GA late 2025) is AWS's purpose-built agent runtime: session-isolated microVMs, an Identity service, a managed Memory store, a Gateway that exposes tools over MCP, and built-in Observability. It is opinionated about isolation, which is good for security.
Minimal IaC sketch
# Terraform: production-shape AgentCore agent (a sketch, not a complete module)
resource "aws_iam_role" "agent_runtime" {
  name               = "swarm-support-agent"
  assume_role_policy = data.aws_iam_policy_document.bedrock_trust.json # trust policy defined elsewhere
}

# Per-agent role, scoped to ONE knowledge base + ONE Lambda tool
data "aws_iam_policy_document" "agent_perms" {
  statement {
    actions   = ["bedrock:Retrieve", "bedrock:InvokeModel"]
    resources = [aws_bedrockagent_knowledge_base.support.arn, var.model_arn]
  }
  statement {
    actions   = ["lambda:InvokeFunction"]
    resources = [aws_lambda_function.crm_lookup.arn]
  }
}

# Attach the scoped policy; without this the role carries no permissions.
resource "aws_iam_role_policy" "agent_perms" {
  name   = "agent-perms"
  role   = aws_iam_role.agent_runtime.id
  policy = data.aws_iam_policy_document.agent_perms.json
}

resource "aws_bedrock_guardrail" "support" {
  name = "support-guardrail"
  # The provider requires both messages; the wording here is a placeholder.
  blocked_input_messaging   = "Sorry, I can't help with that request."
  blocked_outputs_messaging = "Sorry, I can't share that response."

  topic_policy_config {
    topics_config {
      name       = "competitors"
      definition = "Questions about or comparisons with competitor products." # placeholder
      type       = "DENY"
    }
  }
  sensitive_information_policy_config {
    pii_entities_config {
      type   = "EMAIL"
      action = "BLOCK"
    }
    pii_entities_config {
      type   = "CREDIT_DEBIT_CARD_NUMBER"
      action = "BLOCK"
    }
  }
  contextual_grounding_policy_config {
    filters_config {
      type      = "GROUNDING"
      threshold = 0.7
    }
  }
}
# Code interpreter / browser run in isolated microVMs by default;
# session isolation is enforced by AgentCore, not by your code.

Session isolation is enforced by the runtime, not by your application code: a single agent process never sees two sessions' state. Bedrock Guardrails apply both to the model output and to retrieved context (contextual grounding); use both. The CloudTrail integration gives you an audit log with no extra work.
Reference architecture · Azure AI Foundry Agents
Azure AI Foundry Agent Service is Microsoft's hosted agent runtime, paired with Entra-based identity, Content Safety (including Prompt Shields), and tight integration into Azure AI Search and the broader Azure data plane. If your data lives in Microsoft 365 or Azure SQL, Foundry's per-user On-Behalf-Of (OBO) flow is the cleanest path to per-user ACLs.
# Azure AI Foundry: agent with Content Safety + Prompt Shields enabled.
# Illustrative sketch: the content_safety, tracing_enabled, and on_behalf_of
# arguments show intent and may not match the current SDK signatures; check
# the azure-ai-projects reference before copying.
import os

from azure.ai.projects import AIProjectClient
from azure.ai.projects.models import AzureAISearchTool  # import path varies by SDK version
from azure.identity import DefaultAzureCredential

project = AIProjectClient.from_connection_string(
    credential=DefaultAzureCredential(),      # managed identity, NOT a key
    conn_str=os.environ["FOUNDRY_CONN"],
)

# Tools that touch USER data use OBO so RBAC is enforced as that user,
# not as the agent's managed identity. Define tools before the agent uses them.
search_tool = AzureAISearchTool(
    index="kb-prod",
    on_behalf_of=user_token,                  # end user's token, obtained elsewhere
)

agent = project.agents.create_agent(
    model="gpt-4o-2025-04",
    name="support-agent",
    instructions=SYSTEM_PROMPT,               # defined elsewhere; never contains secrets
    tools=[search_tool, ticket_tool],         # least-privilege tool set; ticket_tool defined elsewhere
    content_safety={                          # Prompt Shields ON
        "prompt_shield": {"mode": "block"},
        "protected_material": {"mode": "block"},
        "groundedness": {"mode": "warn", "threshold": 0.75},
    },
    tracing_enabled=True,                     # App Insights
)

Prompt Shields catches direct and indirect injection inline. Connected Agents over A2A let you keep specialist agents in separate Foundry projects with independent permissions, instead of one mega-agent that holds every capability. Defender for Cloud surfaces agent-specific findings (over-permissive identity, missing Content Safety) without extra wiring.
Reference architecture · Gemini Enterprise / Vertex Agent Engine
Google's stack splits across two products: Vertex AI Agent Engine (a managed runtime for agents you build with ADK, LangChain, or LangGraph) and Gemini Enterprise (a search + assistant layer over your connected data sources, with per-user ACL filtering). Both sit inside VPC Service Controls and benefit from Google's CMEK, IAM, and Model Armor primitives.
# Vertex Agent Engine: deploy an ADK agent with Model Armor + safety settings.
# Illustrative sketch: the safety_settings and model_armor arguments show intent
# and may not match the current SDK signatures; check the ADK and Agent Engine
# references before copying.
import requests

from vertexai import agent_engines
from google.adk.agents import Agent
from google.adk.tools import google_search

agent = Agent(
    name="research-agent",
    model="gemini-2.5-pro",
    instruction=SYSTEM_PROMPT,                # ADK's parameter is `instruction` (singular)
    tools=[google_search],
    safety_settings=STRICT_SAFETY,            # block HARM_CATEGORY_*; defined elsewhere
)

deployed = agent_engines.create(
    agent_engine=agent,
    display_name="research-prod",
    service_account="research-agent@proj.iam.gserviceaccount.com",  # least-priv SA
    # Model Armor scans BOTH inbound prompts and outbound responses
    model_armor={"prompt_template_id": "armor-prod-strict"},
    # CMEK + VPC-SC inherited from the project perimeter
)

# Gemini Enterprise streamAssist: per-user ACLs enforced by Discovery Engine
# so the assistant only retrieves docs the END USER can already see.
# Auth = the END USER's Google OAuth access token (not a service-account token),
# so Discovery Engine can filter results by that user's Drive / Workspace ACLs.
# `assistant`, `user_google_access_token`, and `question` are supplied by the caller.
response = requests.post(
    f"https://discoveryengine.googleapis.com/v1alpha/{assistant}:streamAssist",
    headers={"Authorization": f"Bearer {user_google_access_token}"},
    json={"query": {"text": question},
          "toolsSpec": {"vertexAiSearchSpec": {}}},
)

VPC Service Controls is the strongest data-exfiltration boundary of the three clouds: it blocks even authenticated API calls that would move data outside your perimeter. Discovery Engine ACL inheritance means a Gemini Enterprise assistant cannot return a document the calling user couldn't already open in Drive. That's per-document authorization for free.
Open-source & 3rd-party stack
Managed services give you a strong baseline, but real production agents lean on open-source and third-party tools for the parts the platforms don't cover well — model-aware red-teaming, runtime AI firewalls, multi-cloud observability, deeper sandboxing.
- Guardrails: NVIDIA NeMo Guardrails, Guardrails AI, Llama Guard 3 / Prompt Guard, Lakera Guard
- Red-teaming: Garak (NVIDIA), PyRIT (Microsoft), promptfoo, Giskard LLM scan
- AI firewalls: Protect AI Layer, Robust Intelligence AI Firewall, Cloudflare Firewall for AI, HiddenLayer AISec
- Observability: OpenTelemetry GenAI, Langfuse, Arize Phoenix, Helicone
- Sandboxing: E2B / Firecracker, gVisor, Daytona Sandboxes, Modal sandboxes
- Supply chain: Sigstore / Cosign, Snyk + Dependabot, Protect AI ModelScan, Anchore SBOM
What we recommend by maturity stage
1. Day 1 (prototype): Add a cheap input guardrail (Llama Guard 3 or a Flash-Lite classifier). Wire OpenTelemetry GenAI traces to Langfuse or Arize Phoenix. Use Pydantic / Zod for every tool input.
2. First production deploy: Add an output guardrail with Guardrails AI or NeMo Guardrails. Move code execution into E2B / Firecracker microVMs. Stand up an egress proxy with a small allowlist.
3. Scaling out: Add red-team in CI with Garak or PyRIT. Sign artifacts with Sigstore / Cosign. Run an AI Firewall (Protect AI, Lakera, HiddenLayer) at the edge.
4. Regulated / enterprise: Add tamper-evident audit logs, per-tenant CMK, formal red-team via Microsoft PyRIT or NVIDIA AI Red Team, and continuous evaluations as deploy gates.
Most of the patterns above are runnable in the AgentSwarms notebook lab: the Guardrails (Tripwires) notebook builds an OpenAI-Agents-style input/output guard, the PII Sanitizer notebook is a working middleware shim, and the Failure Modes lab reproduces lethal-trifecta exfil and lets you patch it. The Prompt Injection Tester tool runs a battery of known-bad inputs against any system prompt in your browser.
A checklist you can take to a threat-modelling session
- Every agent has its own workload identity and a least-privilege policy.
- All retrieved / tool / memory content is wrapped in trust-tagged delimiters.
- An input guardrail runs before the expensive model on every call.
- Tool inputs are validated against a strict schema by a tool broker, not by the model.
- Egress is allowlisted; no agent can POST to an arbitrary domain.
- Code execution is sandboxed; sandboxes have no persistent storage.
- Memory and vector stores are tenant-partitioned at the storage layer.
- An output guardrail scrubs PII / known secrets before the response leaves.
- Every model call, tool call, and guardrail decision emits an OTel span.
- Audit log is append-only and survives a malicious deletion attempt.
- Red-team and eval suites run in CI; deploys block on regression.
- Per-agent cost / latency / refusal-rate alerts are wired to on-call.
If you can answer yes to every line on a given agent, you're ahead of the median enterprise deployment in 2026. If you can't — that's your roadmap.
Going deeper
Security is a layer of every other concern, not a separate concern. The posts in the related list go one level deeper into each of the patterns we touched here — production architecture, failure modes, MCP, RAG freshness, cost control. The Explore section links to the tools in AgentSwarms you can use to validate each control on your own agent today.
And if you're hiring for or interviewing into a senior agentic-AI role, Agentic AI Interview Questions 2026 now leads with security questions. That's not a coincidence — it's how the market is pricing this discipline.
Further reading & references
- OWASP Top 10 for LLM Applications (2025)
- MITRE ATLAS — Adversarial threat landscape for AI systems
- Simon Willison — The lethal trifecta for AI agents
- AWS — Bedrock AgentCore documentation
- Microsoft — Azure AI Foundry Agent Service
- Microsoft — Prompt Shields in Azure AI Content Safety
- Google Cloud — Vertex AI Agent Engine
- Google Cloud — Model Armor
- NVIDIA NeMo Guardrails
- NVIDIA Garak — LLM vulnerability scanner
- Microsoft PyRIT — Python Risk Identification Tool
- OpenTelemetry — GenAI semantic conventions