All posts
ProductionArchitectureSecurity

Designing Agentic AI for Production: The Six Pillars

A demo agent needs a good prompt. A production agent needs an identity, a threat model, and a pager. Here's the system-design checklist that separates the two — ending with a real LangGraph swarm deployed on AWS Bedrock AgentCore.

AS
AgentSwarms Authors
June 4, 2026· 18 min read·
ProductionArchitectureSecurity

Here's the moment that humbles every team building with agents: the demo is flawless. The agent researches, reasons, calls its tools, writes a beautiful answer. You ship it. And then real traffic arrives — concurrent users, hostile inputs, a model provider having a bad afternoon, a reflection loop that won't quit — and the thing that looked like magic starts behaving like what it actually is: a distributed system that happens to think.

That's the reframe this whole post rests on. An agent in production is not a prompt. It's a distributed system. The prompt is maybe 10% of the work. The other 90% — the part nobody films for the launch video — is identity, security, scale, failover, observability, and cost. Get those wrong and it doesn't matter how clever your prompt was; you've shipped a liability with a chat interface.

We're going to walk six pillars, one at a time, with the specific failure each one prevents. Then we'll do the thing most articles skip: take a concrete LangChain/LangGraph multi-agent system and actually deploy it on AWS Bedrock AgentCore, mapping each pillar to a real service you can provision. Let's start with the map.

Who is this agent, and what is it allowed to be?
Skip itAgents act as the user with god-mode keys. When one is hijacked, you can't tell agent from human in the audit log — and the blast radius is everything.
The six pillars of a production agent. Click each one for the question it answers — and the failure that shows up if you skip it. None of these are optional once real users arrive.

Pillar 1 — Identity: an agent is not its user

The first mistake almost everyone makes is letting the agent borrow the human's identity. The user is logged in, the agent runs 'as them', and it inherits every permission that person has. It feels convenient. It's the single most dangerous shortcut in the stack.

Agents are a new class of actor — non-human identities — and they're multiplying faster than the humans they serve. Each one needs its own identity: a workload credential, a narrowly scoped role, and short-lived tokens it can't hoard. When something goes wrong, you need the audit log to say which agent did what, distinct from any human. And when an agent is inevitably compromised, the blast radius should be the two tools it was granted — not everything its operator could touch.

🤖 Research Agentthe user's credentials
read_kbsend_emailissue_refunddelete_userexport_db
Blast radius if hijacked: everything the human can do — including deleting users and exporting the database.
Toggle between an agent that borrows the user's credentials and one with its own scoped identity. The scoped agent simply doesn't hold the keys to the dangerous tools — so a hijack can't reach them.
The identity checklist

One identity per agent (or per agent role), not per human. Scope permissions to the specific tools it needs. Issue short-lived, automatically-rotated tokens. Support delegation/on-behalf-of so you can prove the chain of 'the user asked → this agent acted'. And log the agent identity on every tool call.

Pillar 2 — Security: assume every input is hostile

Traditional apps trust their own code and distrust user input. Agents blow that model up, because the 'input' now includes the contents of every document, web page, and tool result the agent reads — any of which can contain instructions. Prompt injection isn't an edge case; it's the default condition of an agent that touches the outside world.

The clearest way to reason about the worst case is Simon Willison's lethal trifecta: an agent becomes capable of leaking your data the moment it simultaneously has (1) access to private data, (2) exposure to untrusted content, and (3) a way to communicate externally. Any two are survivable. All three together means a single poisoned document can read your secrets and ship them out the door.

⚠ All three present → data exfiltration is possible. Break any one leg to contain it.
Toggle the three conditions. Exfiltration only becomes possible when all three are present at once — so the defensive move is to break at least one leg for any given agent.
  • Least privilege, enforced server-side — the model can ask to call any tool; your server decides whether it's allowed, validates the arguments against a strict schema, and refuses anything out of scope.
  • Guardrails on both ends — filter inputs (injection, jailbreaks, PII) and outputs (leaked secrets, unsafe content) with a dedicated layer, not vibes in the system prompt.
  • Sandbox anything that executes — code interpreters and browsers run in isolated, ephemeral environments with no standing access to your network.
  • Treat tool results as untrusted — a web page or a retrieved chunk is data, not instructions. Keep it out of the privileged instruction channel.
  • Human approval for irreversible actions — refunds, deletes, sends, payments. The agent proposes; a policy (or a person) disposes.

Pillar 3 — Scalability: keep agents stateless

The fastest way to build an agent that can't scale is to keep its state — conversation history, scratchpad, plan — in memory on the process that's serving it. It works beautifully for one user. Then traffic arrives, you try to add a second instance, and you discover every session is glued to the box that started it.

Production agents are stateless compute over externalized state. The agent process holds nothing durable; conversation and working memory live in a shared store (a database, a cache, a managed memory service). Any worker can resume any session. Long-running tasks go on a queue and run asynchronously instead of holding a request open for ten minutes. Now scaling is just adding workers.

Concurrent sessions3000
⚙️
node 1
⚙️
node 2
⚙️
node 3
⚙️
node 4
Stateless agents keep session state in a shared store, so any node can pick up any request. Add nodes, absorb load.
Drag the load up. Stateless workers backed by a shared store absorb it — any node serves any session. In-memory state pins each session to one node, and that node is your bottleneck.
Don't forget the model is a resource too

You can scale your own compute infinitely and still hit a wall: provider rate limits and token throughput. Budget for them — request-level rate limiting, queue backpressure, and caching of repeated calls — or your 'scalable' system just moves the bottleneck to the model API.

Pillar 4 — High availability: plan for the bad afternoon

Models fail. Providers rate-limit, regions degrade, a deploy goes sideways. The question isn't whether your dependencies will have a bad afternoon — it's what your agents do when they inevitably do. A system with no answer to that question is a system that goes fully dark the first time a single upstream hiccups.

🌐 incoming traffic routes to the nearest healthy region
Traffic spread across 3 healthy regions. Retries + idempotency keys mean in-flight agents resume safely.
Click a region to take it down and watch traffic reroute to the healthy ones. The same idea applies one layer up: a primary model that 429s should fail over to a secondary, not stall the whole fleet.
  • Model failover — a prioritized list of models/providers, so a 429 or outage on the primary degrades to a secondary instead of failing the request.
  • Retries with backoff + idempotency — transient errors get retried, but tool calls carry idempotency keys so a retry doesn't double-charge a card or send two emails.
  • Circuit breakers — when a dependency is clearly down, stop hammering it; fail fast and shed load rather than pile up timeouts.
  • Checkpoint long tasks — a multi-step agent should persist its state between steps so a crash resumes instead of restarting from zero.
  • Graceful degradation — when the fancy path is unavailable, return a smaller, honest answer rather than an error page.

Pillar 5 — Observability: you can't debug what you can't see

Agent bugs are almost never visible in the final output and almost always obvious in the trace. 'Why did it call that tool?' 'Why did the answer drift?' 'Where did the cost come from?' — these have answers you can read only if you captured every Thought, Action, and Observation along the way. Skip instrumentation and your debugging strategy becomes re-running the agent and hoping.

span
researcher.synthesize
model / tool
claude-sonnet
tokens
4.1k
cost · latency
$0.012 · 520ms
Click any span. Every step carries model, tokens, cost, and latency — that's what makes "why did it do that?" answerable.
A real agent trace is a waterfall of spans. Click any one: it carries the model, tokens, cost, and latency. This is the difference between 'the agent was slow' and 'the researcher's synthesis step burned 4k tokens on a 520ms call'.

Lean on the emerging standard rather than rolling your own: the OpenTelemetry GenAI semantic conventions define how to trace LLM and agent calls, so your traces speak the same language as the rest of your infra. Capture spans per step, attach token/cost/latency as attributes, and — crucially — run evaluations in production, not just pre-launch. Quality drifts silently; an eval gate on a sample of live traffic is how you catch it before a customer does.

Pillar 6 — Cost control: bound the loops before the bill does

The unique financial risk of agents is that they decide how much work to do. A chatbot answers once. An agent can loop, fan out to sub-agents, and re-read a growing context on every turn — each iteration a fresh round of token spend. The failure mode isn't a crash; it's a quietly enormous invoice.

  • Hard caps on every loop — a max-iteration limit is the safety net; an explicit stop condition (a DONE token, a passing eval) is the intended exit. Ship both.
  • Model routing — use a cheap, fast model for routing, classification, and simple steps; reserve the expensive model for the work that needs it.
  • Cache aggressively — identical sub-calls, repeated retrievals, and prompt prefixes are free money left on the table.
  • Per-tenant attribution + budgets — tag every call with who it was for, and alert (or cut off) when a tenant blows past their budget.
  • Bound the context — summarize history instead of letting it grow unbounded; a context window that only grows is a cost curve that only grows.
Estimate before you ship

Model the cost before launch: iterations × agents × calls-per-step × token price × volume. AgentSwarms' Multi-Agent Token Cost Calculator does this in a few clicks — and the number it spits out often changes the architecture you were about to build.


Putting it together: a LangGraph swarm on AWS Bedrock AgentCore

Theory is cheap. Let's deploy something. Our system is a classic LangGraph multi-agent pipeline: a supervisor routes work to a researcher (which searches the web and reads documents), an analyst (which reasons over the findings), and a writer (which produces the final brief). It has tools, it has memory, and it can loop. In other words, it has every production concern we just listed.

AWS released Bedrock AgentCore to handle exactly this gap — the infrastructure between a working agent and a production one. The key thing to understand is that it's framework-agnostic and model-agnostic: AgentCore doesn't replace LangGraph, it hosts it. Your LangGraph code runs unchanged inside a managed runtime, and you opt into the surrounding services pillar by pillar.

Identity
AgentCore Identity
Workload identity for each agent + a token vault for outbound OAuth. The agent gets its own scoped identity, not the user's keys.
Each pillar maps to a concrete AgentCore service. Click a pillar to see which one covers it. The point: you don't hand-build identity, sandboxing, and tracing — you compose them.

Step 1 — Wrap the LangGraph app for the Runtime

AgentCore Runtime is the serverless host. It gives each session an isolated microVM (so one user's agent can't touch another's), scales from zero to many, and supports long-running tasks up to eight hours — which matters the moment your agent does real multi-step work. You don't rewrite your agent; you wrap its entrypoint:

# app.py — your existing LangGraph swarm, wrapped for AgentCore Runtime.
from bedrock_agentcore import BedrockAgentCoreApp
from my_swarm import build_graph   # your LangGraph StateGraph, unchanged

app = BedrockAgentCoreApp()
graph = build_graph()              # supervisor -> researcher / analyst / writer

@app.entrypoint
def invoke(payload, context):
    # 'context' carries the session + agent identity injected by the Runtime.
    result = graph.invoke(
        {"messages": [("user", payload["prompt"])]},
        config={"configurable": {"session_id": context.session_id}},
    )
    return {"output": result["messages"][-1].content}

if __name__ == "__main__":
    app.run()   # local dev; in prod the Runtime calls the entrypoint

That's the whole adapter. The Runtime handles the HTTP surface, session isolation, scaling, and the long-running execution — Pillars 3 and 4 (scalability and availability) are now largely AWS's problem, not yours.

Step 2 — Give the agent its own identity

AgentCore Identity issues the agent a workload identity and provides a credential vault for outbound auth (OAuth tokens for the SaaS tools it calls). Your researcher agent gets its own identity to call the web-search API — not your personal key baked into an env var. That's Pillar 1, handled by the platform instead of by you copy-pasting tokens.

Step 3 — Expose tools through the Gateway

AgentCore Gateway turns your existing APIs and Lambda functions into MCP-compatible tools with authentication and authorization built in. Instead of wiring raw API keys into the agent, you register the tool once, and the Gateway enforces who can call what. Combined with Bedrock Guardrails on the model's input and output and the sandboxed Code Interpreter / Browser tools, that's Pillar 2 (security) composed from managed pieces:

# Register an existing Lambda as a governed, MCP-compatible tool.
agentcore gateway create-target \
  --gateway-id my-swarm-gw \
  --name web_search \
  --target-type lambda \
  --lambda-arn arn:aws:lambda:us-east-1:123456789012:function:web-search \
  --auth-type oauth        # the Gateway enforces auth on every call

# The agent now sees 'web_search' as a tool — but can only invoke it
# within the scopes its workload identity was granted.

Step 4 — Persist memory, then turn on the lights

AgentCore Memory gives you managed short-term (within-session) and long-term (across-session) memory, so your agents stay stateless while their memory lives in a durable service — the externalized-state pattern from Pillar 3, as a managed dependency. And AgentCore Observability emits OpenTelemetry traces straight into CloudWatch: every step, tool call, token count, and latency, with no custom instrumentation. That's Pillars 5 and 6 — you can finally see what the swarm did and what it cost.

# Configure and deploy. The toolkit provisions the execution role,
# container, and wiring; observability is on by default.
agentcore configure --entrypoint app.py --name my-swarm
agentcore launch        # builds, deploys to the Runtime, returns an ARN

# Invoke the deployed swarm
agentcore invoke '{"prompt": "Brief me on the EU AI Act timeline"}'
What you composed

Identity → AgentCore Identity. Security → Gateway + Guardrails + sandboxed tools. Scalability & availability → Runtime. State → Memory. Observability & cost visibility → Observability. Your LangGraph code barely changed; the production concerns became configuration.

What AgentCore does NOT do for you

It hosts and secures the agent; it does not make the agent good. You still own your evals, your guardrail policies and prompts, your cost budgets and alerts, and the actual reasoning quality of the swarm. The platform removes the undifferentiated heavy lifting — not the thinking.

The pre-production checklist

  1. 1Each agent has its own scoped identity and short-lived credentials — not the user's keys.
  2. 2Every tool validates arguments server-side; irreversible actions require approval.
  3. 3You've broken the lethal trifecta for any agent that reads untrusted content.
  4. 4State is externalized; agents are stateless and horizontally scalable.
  5. 5Model failover, retries with idempotency, and loop caps are in place.
  6. 6Every run emits a trace with per-step tokens, cost, and latency.
  7. 7An eval runs on live traffic and alerts on quality drift.
  8. 8Spend is attributed per tenant with budgets and alerts.

None of these pillars is exotic. They're the same disciplines we've always applied to distributed systems — identity, least privilege, statelessness, redundancy, observability, cost governance — pointed at a new kind of component that reasons and acts. The teams whose agents survive production aren't the ones with the cleverest prompts. They're the ones who treated the agent like the system it actually is.

Practice the failure modes first

AgentSwarms is a learning and prototyping platform: design your swarm on the visual canvas, watch it fail in the Failure-Mode Labs, and export it to LangGraph when the shape is right. Get the architecture correct here, then deploy it on a runtime like AgentCore with the six pillars already in mind.


Was this useful?

Comments

Sign in to join the discussion.

Loading comments…