Designing Agentic AI for Production: The Six Pillars
A demo agent needs a good prompt. A production agent needs an identity, a threat model, and a pager. Here's the system-design checklist that separates the two — ending with a real LangGraph swarm deployed on AWS Bedrock AgentCore.
Here's the moment that humbles every team building with agents: the demo is flawless. The agent researches, reasons, calls its tools, writes a beautiful answer. You ship it. And then real traffic arrives — concurrent users, hostile inputs, a model provider having a bad afternoon, a reflection loop that won't quit — and the thing that looked like magic starts behaving like what it actually is: a distributed system that happens to think.
That's the reframe this whole post rests on. An agent in production is not a prompt. It's a distributed system. The prompt is maybe 10% of the work. The other 90% — the part nobody films for the launch video — is identity, security, scale, failover, observability, and cost. Get those wrong and it doesn't matter how clever your prompt was; you've shipped a liability with a chat interface.
We're going to walk six pillars, one at a time, with the specific failure each one prevents. Then we'll do the thing most articles skip: take a concrete LangChain/LangGraph multi-agent system and actually deploy it on AWS Bedrock AgentCore, mapping each pillar to a real service you can provision. Let's start with the map.
Pillar 1 — Identity: an agent is not its user
The first mistake almost everyone makes is letting the agent borrow the human's identity. The user is logged in, the agent runs 'as them', and it inherits every permission that person has. It feels convenient. It's the single most dangerous shortcut in the stack.
Agents are a new class of actor — non-human identities — and they're multiplying faster than the humans they serve. Each one needs its own identity: a workload credential, a narrowly scoped role, and short-lived tokens it can't hoard. When something goes wrong, you need the audit log to say which agent did what, distinct from any human. And when an agent is inevitably compromised, the blast radius should be the two tools it was granted — not everything its operator could touch.
One identity per agent (or per agent role), not per human. Scope permissions to the specific tools it needs. Issue short-lived, automatically-rotated tokens. Support delegation/on-behalf-of so you can prove the chain of 'the user asked → this agent acted'. And log the agent identity on every tool call.
Pillar 2 — Security: assume every input is hostile
Traditional apps trust their own code and distrust user input. Agents blow that model up, because the 'input' now includes the contents of every document, web page, and tool result the agent reads — any of which can contain instructions. Prompt injection isn't an edge case; it's the default condition of an agent that touches the outside world.
The clearest way to reason about the worst case is Simon Willison's lethal trifecta: an agent becomes capable of leaking your data the moment it simultaneously has (1) access to private data, (2) exposure to untrusted content, and (3) a way to communicate externally. Any two are survivable. All three together means a single poisoned document can read your secrets and ship them out the door.
- Least privilege, enforced server-side — the model can ask to call any tool; your server decides whether it's allowed, validates the arguments against a strict schema, and refuses anything out of scope.
- Guardrails on both ends — filter inputs (injection, jailbreaks, PII) and outputs (leaked secrets, unsafe content) with a dedicated layer, not vibes in the system prompt.
- Sandbox anything that executes — code interpreters and browsers run in isolated, ephemeral environments with no standing access to your network.
- Treat tool results as untrusted — a web page or a retrieved chunk is data, not instructions. Keep it out of the privileged instruction channel.
- Human approval for irreversible actions — refunds, deletes, sends, payments. The agent proposes; a policy (or a person) disposes.
Pillar 3 — Scalability: keep agents stateless
The fastest way to build an agent that can't scale is to keep its state — conversation history, scratchpad, plan — in memory on the process that's serving it. It works beautifully for one user. Then traffic arrives, you try to add a second instance, and you discover every session is glued to the box that started it.
Production agents are stateless compute over externalized state. The agent process holds nothing durable; conversation and working memory live in a shared store (a database, a cache, a managed memory service). Any worker can resume any session. Long-running tasks go on a queue and run asynchronously instead of holding a request open for ten minutes. Now scaling is just adding workers.
You can scale your own compute infinitely and still hit a wall: provider rate limits and token throughput. Budget for them — request-level rate limiting, queue backpressure, and caching of repeated calls — or your 'scalable' system just moves the bottleneck to the model API.
Pillar 4 — High availability: plan for the bad afternoon
Models fail. Providers rate-limit, regions degrade, a deploy goes sideways. The question isn't whether your dependencies will have a bad afternoon — it's what your agents do when they inevitably do. A system with no answer to that question is a system that goes fully dark the first time a single upstream hiccups.
- Model failover — a prioritized list of models/providers, so a 429 or outage on the primary degrades to a secondary instead of failing the request.
- Retries with backoff + idempotency — transient errors get retried, but tool calls carry idempotency keys so a retry doesn't double-charge a card or send two emails.
- Circuit breakers — when a dependency is clearly down, stop hammering it; fail fast and shed load rather than pile up timeouts.
- Checkpoint long tasks — a multi-step agent should persist its state between steps so a crash resumes instead of restarting from zero.
- Graceful degradation — when the fancy path is unavailable, return a smaller, honest answer rather than an error page.
Pillar 5 — Observability: you can't debug what you can't see
Agent bugs are almost never visible in the final output and almost always obvious in the trace. 'Why did it call that tool?' 'Why did the answer drift?' 'Where did the cost come from?' — these have answers you can read only if you captured every Thought, Action, and Observation along the way. Skip instrumentation and your debugging strategy becomes re-running the agent and hoping.
Lean on the emerging standard rather than rolling your own: the OpenTelemetry GenAI semantic conventions define how to trace LLM and agent calls, so your traces speak the same language as the rest of your infra. Capture spans per step, attach token/cost/latency as attributes, and — crucially — run evaluations in production, not just pre-launch. Quality drifts silently; an eval gate on a sample of live traffic is how you catch it before a customer does.
Pillar 6 — Cost control: bound the loops before the bill does
The unique financial risk of agents is that they decide how much work to do. A chatbot answers once. An agent can loop, fan out to sub-agents, and re-read a growing context on every turn — each iteration a fresh round of token spend. The failure mode isn't a crash; it's a quietly enormous invoice.
- Hard caps on every loop — a max-iteration limit is the safety net; an explicit stop condition (a DONE token, a passing eval) is the intended exit. Ship both.
- Model routing — use a cheap, fast model for routing, classification, and simple steps; reserve the expensive model for the work that needs it.
- Cache aggressively — identical sub-calls, repeated retrievals, and prompt prefixes are free money left on the table.
- Per-tenant attribution + budgets — tag every call with who it was for, and alert (or cut off) when a tenant blows past their budget.
- Bound the context — summarize history instead of letting it grow unbounded; a context window that only grows is a cost curve that only grows.
Model the cost before launch: iterations × agents × calls-per-step × token price × volume. AgentSwarms' Multi-Agent Token Cost Calculator does this in a few clicks — and the number it spits out often changes the architecture you were about to build.
Putting it together: a LangGraph swarm on AWS Bedrock AgentCore
Theory is cheap. Let's deploy something. Our system is a classic LangGraph multi-agent pipeline: a supervisor routes work to a researcher (which searches the web and reads documents), an analyst (which reasons over the findings), and a writer (which produces the final brief). It has tools, it has memory, and it can loop. In other words, it has every production concern we just listed.
AWS released Bedrock AgentCore to handle exactly this gap — the infrastructure between a working agent and a production one. The key thing to understand is that it's framework-agnostic and model-agnostic: AgentCore doesn't replace LangGraph, it hosts it. Your LangGraph code runs unchanged inside a managed runtime, and you opt into the surrounding services pillar by pillar.
Step 1 — Wrap the LangGraph app for the Runtime
AgentCore Runtime is the serverless host. It gives each session an isolated microVM (so one user's agent can't touch another's), scales from zero to many, and supports long-running tasks up to eight hours — which matters the moment your agent does real multi-step work. You don't rewrite your agent; you wrap its entrypoint:
# app.py — your existing LangGraph swarm, wrapped for AgentCore Runtime.
from bedrock_agentcore import BedrockAgentCoreApp
from my_swarm import build_graph # your LangGraph StateGraph, unchanged
app = BedrockAgentCoreApp()
graph = build_graph() # supervisor -> researcher / analyst / writer
@app.entrypoint
def invoke(payload, context):
# 'context' carries the session + agent identity injected by the Runtime.
result = graph.invoke(
{"messages": [("user", payload["prompt"])]},
config={"configurable": {"session_id": context.session_id}},
)
return {"output": result["messages"][-1].content}
if __name__ == "__main__":
app.run() # local dev; in prod the Runtime calls the entrypointThat's the whole adapter. The Runtime handles the HTTP surface, session isolation, scaling, and the long-running execution — Pillars 3 and 4 (scalability and availability) are now largely AWS's problem, not yours.
Step 2 — Give the agent its own identity
AgentCore Identity issues the agent a workload identity and provides a credential vault for outbound auth (OAuth tokens for the SaaS tools it calls). Your researcher agent gets its own identity to call the web-search API — not your personal key baked into an env var. That's Pillar 1, handled by the platform instead of by you copy-pasting tokens.
Step 3 — Expose tools through the Gateway
AgentCore Gateway turns your existing APIs and Lambda functions into MCP-compatible tools with authentication and authorization built in. Instead of wiring raw API keys into the agent, you register the tool once, and the Gateway enforces who can call what. Combined with Bedrock Guardrails on the model's input and output and the sandboxed Code Interpreter / Browser tools, that's Pillar 2 (security) composed from managed pieces:
# Register an existing Lambda as a governed, MCP-compatible tool.
agentcore gateway create-target \
--gateway-id my-swarm-gw \
--name web_search \
--target-type lambda \
--lambda-arn arn:aws:lambda:us-east-1:123456789012:function:web-search \
--auth-type oauth # the Gateway enforces auth on every call
# The agent now sees 'web_search' as a tool — but can only invoke it
# within the scopes its workload identity was granted.Step 4 — Persist memory, then turn on the lights
AgentCore Memory gives you managed short-term (within-session) and long-term (across-session) memory, so your agents stay stateless while their memory lives in a durable service — the externalized-state pattern from Pillar 3, as a managed dependency. And AgentCore Observability emits OpenTelemetry traces straight into CloudWatch: every step, tool call, token count, and latency, with no custom instrumentation. That's Pillars 5 and 6 — you can finally see what the swarm did and what it cost.
# Configure and deploy. The toolkit provisions the execution role,
# container, and wiring; observability is on by default.
agentcore configure --entrypoint app.py --name my-swarm
agentcore launch # builds, deploys to the Runtime, returns an ARN
# Invoke the deployed swarm
agentcore invoke '{"prompt": "Brief me on the EU AI Act timeline"}'Identity → AgentCore Identity. Security → Gateway + Guardrails + sandboxed tools. Scalability & availability → Runtime. State → Memory. Observability & cost visibility → Observability. Your LangGraph code barely changed; the production concerns became configuration.
It hosts and secures the agent; it does not make the agent good. You still own your evals, your guardrail policies and prompts, your cost budgets and alerts, and the actual reasoning quality of the swarm. The platform removes the undifferentiated heavy lifting — not the thinking.
The pre-production checklist
- 1Each agent has its own scoped identity and short-lived credentials — not the user's keys.
- 2Every tool validates arguments server-side; irreversible actions require approval.
- 3You've broken the lethal trifecta for any agent that reads untrusted content.
- 4State is externalized; agents are stateless and horizontally scalable.
- 5Model failover, retries with idempotency, and loop caps are in place.
- 6Every run emits a trace with per-step tokens, cost, and latency.
- 7An eval runs on live traffic and alerts on quality drift.
- 8Spend is attributed per tenant with budgets and alerts.
None of these pillars is exotic. They're the same disciplines we've always applied to distributed systems — identity, least privilege, statelessness, redundancy, observability, cost governance — pointed at a new kind of component that reasons and acts. The teams whose agents survive production aren't the ones with the cleverest prompts. They're the ones who treated the agent like the system it actually is.
AgentSwarms is a learning and prototyping platform: design your swarm on the visual canvas, watch it fail in the Failure-Mode Labs, and export it to LangGraph when the shape is right. Get the architecture correct here, then deploy it on a runtime like AgentCore with the six pillars already in mind.
Further reading & references
Was this useful?
Comments
Loading comments…