Human-in-the-Loop Agents on LangGraph, Deployed to Amazon Bedrock AgentCore

An institutional-grade walkthrough of building, instrumenting, and shipping a HITL agent: from interrupt_before in LangGraph, to durable checkpointers, to a one-command deploy on Bedrock AgentCore Runtime with IAM, Guardrails, and audit-grade observability.

AgentSwarms Authors

July 2, 2026 · 26 min read

A capable agent without a human in the loop is a confident intern with a corporate credit card. It will, eventually, refund the wrong customer, email the wrong list, or push the wrong migration — not because the model failed, but because nothing in the system was allowed to say wait. This is the long version of building that wait into a LangGraph agent, persisting it durably, and then shipping the whole thing onto Amazon Bedrock AgentCore Runtime so a real operator can approve, edit, or kill an action from Slack while the graph holds its breath in the background.

Human-in-the-loop (HITL) is the cheapest production safety net you can buy. Done well, it costs you a few seconds of operator time per high-risk action and saves you the one bad headline that ends the project. Done badly, bolted on as a confirm() popup at the end of a chain, it's worse than nothing: it teaches operators to rubber-stamp, and it has no idea what to do when the human says “change the amount to $99 and retry.” The difference is architectural. The good version treats the human as another node in the graph, with its own state, its own retry semantics, and its own audit trail.

TL;DR — what we're building

A LangGraph agent that drafts a customer-refund action, suspends itself before executing, persists its full state to a checkpointer, and waits for an operator verdict (approve / edit / reject) delivered over an HTTPS callback. The whole thing runs on Bedrock AgentCore Runtime — serverless, session-isolated, up to 8-hour sessions — with Guardrails on the model, IAM on every boundary, and traces in CloudWatch / X-Ray. End to end, deploy is one CLI command.

When does an agent actually need a human?

The first design move is not interrupt_before(every_node). That's HITL-as-theatre — it slows the agent to walking pace and trains operators to click through without reading. The right move is to draw a 2×2 of action risk vs model confidence and only interrupt where you have to.

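[Figure: 2×2 of action risk vs model confidence. High risk / low confidence: HARD STOP. High risk / high confidence: Approve. Low risk / low confidence: Ask + log. Low risk / high confidence: Auto-run. High-risk actions (refunds, prod writes, customer emails) always route through HITL, even when the model is sure.]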

Four quadrants, four very different defaults. High-risk actions go through HITL no matter how confident the model is. Low-risk + high-confidence work runs straight through and gets sampled later for offline review.

High risk, any confidence — refunds, prod database writes, customer-facing emails, anything regulated. Always interrupt. Confidence is irrelevant: the cost of being wrong is asymmetric.
Low risk, low confidence — fuzzy classification, ambiguous routing. Don't interrupt; log and sample. Send 1–5% to humans for offline labelling and use those labels to retrain or to tune the router.
Low risk, high confidence — auto-run. This is where the agent earns its keep. If you find yourself wanting HITL here, you usually want better evals, not more humans.
Outside policy — even high confidence shouldn't override a policy ‘no’. Encode policy as a tool the agent must call, and have that tool be the thing that interrupts.
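
The routing rules above can be sketched as a small pure function. This is only an illustration under stated assumptions: the names Route and route_action and the 0.8 confidence threshold are hypothetical and do not come from the agent built later in this post.

```python
from enum import Enum


class Route(Enum):
    INTERRUPT = "interrupt"      # suspend and wait for an operator
    LOG_AND_SAMPLE = "sample"    # run, log, send a small share to offline review
    AUTO_RUN = "auto"            # run straight through


def route_action(high_risk: bool, confidence: float, policy_allows: bool,
                 threshold: float = 0.8) -> Route:
    """Pick a default route for a drafted action (hypothetical helper)."""
    # A policy "no" or a high-risk action interrupts regardless of confidence.
    if not policy_allows or high_risk:
        return Route.INTERRUPT
    # Low risk: confidence only decides whether the run is sampled for review.
    if confidence < threshold:
        return Route.LOG_AND_SAMPLE
    return Route.AUTO_RUN
```

Keeping the decision in one function makes the interrupt policy testable on its own, separately from the graph that enforces it.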

The rubber-stamp trap

If operators approve >95% of HITL prompts without edits, your interrupt threshold is too low and you're burning attention. If they reject >40%, the agent is making proposals you've never aligned it on — fix the prompt or the tools, not the human step. Track this ratio explicitly; it's the single best health metric for a HITL system.
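
One way to track that ratio is a small summary over recorded verdicts. The sketch below assumes verdicts are stored as the strings "approved", "edited", and "rejected" (the labels used by the graph state later in this post); the function name hitl_health and the flag wording are hypothetical.

```python
from collections import Counter


def hitl_health(verdicts: list[str],
                rubber_stamp_above: float = 0.95,
                misaligned_above: float = 0.40) -> dict:
    """Summarise operator verdicts and flag the two failure modes."""
    counts = Counter(verdicts)
    total = len(verdicts)
    if total == 0:
        return {"approve_rate": 0.0, "reject_rate": 0.0, "flags": []}
    approve_rate = counts["approved"] / total   # approved with no edits
    reject_rate = counts["rejected"] / total
    flags = []
    if approve_rate > rubber_stamp_above:
        flags.append("interrupt threshold too low")
    if reject_rate > misaligned_above:
        flags.append("agent proposals misaligned")
    return {"approve_rate": approve_rate, "reject_rate": reject_rate, "flags": flags}
```

Running this over a rolling window (for example, the last few hundred verdicts) is usually more informative than an all-time average, because the ratios drift as prompts and tools change.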

Why LangGraph is the right runtime for HITL

HITL needs three primitives that vanilla chains don't have: suspension, durable state, and typed resumption. LangGraph gives you all three out of the box, because every graph is just a state machine with an explicit checkpointer. Suspending an agent isn't a special feature: it's interrupt_before=["node_name"] on compile(). Persisting its state isn't a special feature: it's a BaseCheckpointSaver you wire to DynamoDB, Postgres, or Redis. Resuming it isn't a special feature: it's graph.invoke(None, config) to continue as drafted, or graph.update_state(config, patch) followed by graph.invoke(None, config) to continue with operator edits. (Command(resume=...) is the counterpart for graphs that pause with the dynamic interrupt() function rather than interrupt_before.)

[Figure: plan → tool: search → draft action → ⏸ interrupt → execute, with state persisted to DynamoDB. interrupt_before suspends the graph, persists state, waits for a human verdict, then resumes.]

The graph runs plan → search → draft_action → ⏸ interrupt → execute. The interrupt is a real node boundary: state is checkpointed, the process can die, and a different process can resume from the saved state hours later.

The minimal HITL graph

Below is the smallest LangGraph that meaningfully demonstrates HITL. It takes a customer message, plans a refund action, drafts the tool call, suspends before execution, and — once resumed — either runs the tool or records a rejection. The shape generalises directly to provisioning scripts, marketing sends, migrations, anything risky.

from typing import TypedDict, Literal, Optional
from langgraph.graph import StateGraph, END
from langgraph.checkpoint.dynamodb import DynamoDBSaver  # adjust to the import path of the DynamoDB checkpointer package you install
from langchain_aws import ChatBedrockConverse
from pydantic import BaseModel, Field

# ── 1. Typed state ────────────────────────────────────────────────
class RefundDraft(BaseModel):
    order_id: str
    amount: float = Field(gt=0, le=10_000)   # invariant the model can't break
    reason: str

class HitlState(TypedDict):
    user_message: str
    plan: Optional[str]
    draft: Optional[RefundDraft]
    verdict: Optional[Literal["approved", "rejected", "edited"]]
    result: Optional[str]

llm = ChatBedrockConverse(
    model="us.anthropic.claude-sonnet-4-5-20250929-v1:0",  # check the exact ID enabled in your account
    temperature=0,
    guardrail_config={"guardrailIdentifier": "refund-guard-v1", "guardrailVersion": "1"},
)

# ── 2. Nodes ──────────────────────────────────────────────────────
def plan_node(state: HitlState) -> HitlState:
    prompt = f"Customer says: {state['user_message']}\nPlan the refund in 2 lines."
    return {"plan": llm.invoke(prompt).content}

def draft_node(state: HitlState) -> HitlState:
    structured = llm.with_structured_output(RefundDraft)
    draft = structured.invoke(f"Draft a refund for: {state['plan']}")
    return {"draft": draft}

def execute_node(state: HitlState) -> HitlState:
    # Use .get(): "verdict" is absent from state on the plain approve path.
    if state.get("verdict") == "rejected":
        return {"result": "cancelled by operator"}
    d = state["draft"]
    # real call would hit the payments service; we stub it for clarity
    return {"result": f"refunded ${d.amount:.2f} for {d.order_id}"}

# ── 3. Graph + interrupt + durable checkpointer ───────────────────
graph = StateGraph(HitlState)
# Node names must not collide with state keys ("plan", "draft"), or add_node raises.
graph.add_node("plan_action", plan_node)
graph.add_node("draft_action", draft_node)
graph.add_node("execute", execute_node)
graph.set_entry_point("plan_action")
graph.add_edge("plan_action", "draft_action")
graph.add_edge("draft_action", "execute")
graph.add_edge("execute", END)

checkpointer = DynamoDBSaver(table_name="hitl-checkpoints")
agent = graph.compile(
    checkpointer=checkpointer,
    interrupt_before=["execute"],   # ← the HITL pause
)

Three details in that snippet matter more than they look. First, `RefundDraft` is a Pydantic model with a real upper bound (`le=10_000`) — even if the model goes off the rails, the structured-output call will refuse to materialise a $9,999,999 refund. Second, `interrupt_before=["execute"]` is what gives us our HITL pause; the graph runs plan and draft, then stops cleanly, with everything persisted. Third, `DynamoDBSaver` is a real durable checkpointer — the process can die, the container can recycle, the AgentCore session can swap underneath us, and on resume the state is still there.
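
The first of those details is easy to check in isolation. The sketch below repeats the RefundDraft model from the snippet and adds a hypothetical helper, try_draft, that reports which field fails validation; it assumes Pydantic is installed and exercises the model directly rather than through the LLM call.

```python
from pydantic import BaseModel, Field, ValidationError


class RefundDraft(BaseModel):
    order_id: str
    amount: float = Field(gt=0, le=10_000)
    reason: str


def try_draft(amount: float) -> str:
    """Return "ok" if the draft validates, else the name of the failing field."""
    try:
        RefundDraft(order_id="A-9133", amount=amount, reason="arrived broken")
        return "ok"
    except ValidationError as exc:
        # errors() lists one entry per failed constraint; loc names the field.
        return str(exc.errors()[0]["loc"][0])
```

An out-of-range amount never becomes a RefundDraft instance, so nothing downstream of the draft step can receive one.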

Suspend, inspect, edit, resume

Once a graph is suspended, four things can happen to it. Three of them resume the graph; one halts it. A clean HITL system models all four explicitly — operators get an approve button, an edit form, a reject button, and a request more info escape hatch that loops back to the agent for clarification.

Example pending draft action: refund_customer(order_id="A-9133", amount=149.00)

Graph is suspended. State checkpoint holds tool args, messages, and pending node id.

Each verdict maps to a different runtime action. Approve resumes verbatim. Edit resumes with a patched state. Reject sends a Command(goto=...) that fires a compensating node. Pending keeps the checkpoint warm.

The resumption API in code

from langgraph.types import Command

# Start a session — this will run until interrupt_before(execute) fires.
config = {"configurable": {"thread_id": "ticket-A-9133"}}
result = agent.invoke({"user_message": "I want a refund for order A-9133"}, config)

# At this point the graph is SUSPENDED. The draft is in state, ready to inspect.
snapshot = agent.get_state(config)
draft = snapshot.values["draft"]              # RefundDraft(order_id=..., amount=149.0, ...)
print("Pending action:", draft.model_dump())

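# The three branches below are alternatives: run exactly one per suspended thread.

# === Operator clicks "Approve" ===
agent.invoke(None, config)                    # resumes from the checkpoint, runs execute_node

# === Operator clicks "Edit": patch the amount, re-validate, then resume ===
# model_validate re-checks the Field bounds; model_copy(update=...) would skip them.
patched = RefundDraft.model_validate({**draft.model_dump(), "amount": 99.0})
agent.update_state(config, {"draft": patched})
agent.invoke(None, config)

# === Operator clicks "Reject": record the verdict; execute_node returns the cancellation ===
agent.invoke(Command(goto="execute", update={"verdict": "rejected"}), config)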

Two API choices are doing all the work here. agent.get_state(config) lets the operator UI render the exact tool args the agent would call — no re-running the planner to ‘re-derive what it was about to do’. And agent.update_state(...) then agent.invoke(None, config) is the canonical edit-and-resume pattern — the patched values flow through execute_node as if the model had drafted them itself, but with the operator's edits attached to the checkpoint history.

Why `thread_id` matters

Every checkpointer keys state by thread_id. Use the business identifier (ticket ID, order ID, conversation ID), not a random UUID. This is what lets the operator pick up the conversation from anywhere, lets you audit every action against the customer record, and lets you call agent.get_state_history(config) later to see every step the graph took, including the human's edits.

Why Bedrock AgentCore Runtime (and not just a Lambda)

You can deploy a LangGraph HITL agent on a plain Lambda or an ECS task. You probably shouldn't. HITL workloads have an unusual shape: most of the wall-clock time is spent waiting for a human, but you still need session affinity, encrypted persistent state, and the ability to resume hours later from a different container. Amazon Bedrock AgentCore Runtime is purpose-built for this profile: serverless, session-isolated containers with sessions that can live up to 8 hours, with the rest of the AWS agent stack (Identity, Memory, Gateway, Browser, Code Interpreter, Observability) wired in as siblings.

Operator console: Slack / web, approve / edit / reject
API Gateway + Lambda: callback that posts the verdict to AgentCore
Bedrock AgentCore Runtime: serverless, session-isolated container (up to 8 hrs)
LangGraph app + checkpointer: interrupt_before + DynamoDB / Postgres saver
Bedrock models + Guardrails: Claude / Nova + content filters + PII redaction
Identity, KMS, CloudWatch, X-Ray: IAM auth, encrypted state, traces + audit log

The full deployment surface. AgentCore Runtime hosts the LangGraph app; AgentCore Identity handles inbound auth (Cognito / OIDC); Bedrock Guardrails sit in front of every model call; CloudWatch + X-Ray capture the trace tree end to end.

Framework-agnostic — AgentCore Runtime hosts any Python or Node agent (LangGraph, CrewAI, Strands, AutoGen, your own). The platform doesn't care; you ship a container.
Long-running sessions — up to 8 hours per session, perfect for HITL where the human might take 20 minutes. Most FaaS platforms cap you at 15 minutes.
Session isolation by default — each session runs in its own micro-VM. No noisy-neighbour state leakage between tickets.
Native model + guardrail integration — Claude, Nova, Llama via Bedrock are a single IAM hop; content filters and PII redaction live next to the runtime, not somewhere else.
Observability built in — every invocation streams to CloudWatch Logs and X-Ray traces, including the LangGraph node boundaries when you wire up OpenTelemetry.

Deploying the agent: a real, working pipeline

AgentCore deployment is unusually short. There's a starter toolkit (bedrock-agentcore-starter-toolkit, which also installs the agentcore CLI) that takes a Python entrypoint, builds an ARM64 container, pushes it to ECR, registers the runtime, and gives you back an invocation ARN. Everything below is the whole deploy script: not a slide, the actual code.

1. Build: uv pip + Dockerfile (linux/arm64)
2. Push: ECR repo + image scan
3. Configure: agentcore_runtime.configure(...)
4. Launch: agentcore_runtime.launch()
5. Invoke: boto3 bedrock-agentcore.invoke_agent_runtime

Five steps, one CLI tool, one boto3 call. The container is built locally (or in CodeBuild), and AgentCore handles the runtime.

Step 1 — wrap the graph with the AgentCore entrypoint

# app.py
from bedrock_agentcore.runtime import BedrockAgentCoreApp
from langgraph.types import Command
from my_agent import agent, RefundDraft   # the compiled LangGraph and draft model from earlier

app = BedrockAgentCoreApp()

@app.entrypoint
async def invoke(payload: dict, context):
    """
    payload examples:
      {"action": "start",  "thread_id": "ticket-A-9133", "user_message": "..."}
      {"action": "approve","thread_id": "ticket-A-9133"}
      {"action": "edit",   "thread_id": "ticket-A-9133", "patch": {"draft": {"amount": 99}}}
      {"action": "reject", "thread_id": "ticket-A-9133"}
    """
    cfg = {"configurable": {"thread_id": payload["thread_id"]}}
    action = payload["action"]

    if action == "start":
        return await agent.ainvoke({"user_message": payload["user_message"]}, cfg)

    if action == "approve":
        return await agent.ainvoke(None, cfg)

    if action == "edit":
        # Merge the operator's fields into the stored draft and re-validate, so the
        # patched draft is still a RefundDraft and still respects the amount bound.
        snapshot = await agent.aget_state(cfg)
        merged = {**snapshot.values["draft"].model_dump(), **payload["patch"]["draft"]}
        await agent.aupdate_state(cfg, {"draft": RefundDraft.model_validate(merged)})
        return await agent.ainvoke(None, cfg)

    if action == "reject":
        return await agent.ainvoke(
            Command(goto="execute", update={"verdict": "rejected"}), cfg,
        )

    raise ValueError(f"unknown action {action}")

if __name__ == "__main__":
    app.run()

Step 2 — configure + launch

# deploy.py — runs locally or in CI
from bedrock_agentcore_starter_toolkit import Runtime

runtime = Runtime()

# (a) Generate the Dockerfile (linux/arm64), .dockerignore, and a default IAM role
runtime.configure(
    entrypoint="app.py",
    auto_create_execution_role=True,
    auto_create_ecr=True,
    requirements_file="requirements.txt",
    region="us-west-2",
    agent_name="refund_hitl_agent",
)

# (b) Build the image with CodeBuild, push to ECR, register the runtime,
#     wait for ENDPOINT_STATUS=READY. Returns the agent ARN.
result = runtime.launch()
print("Agent ARN:", result.agent_arn)
print("Endpoint :", result.agent_endpoint_arn)

Pin the platform

AgentCore Runtime requires linux/arm64 images. The toolkit handles this for you, but if you write your own Dockerfile, set FROM --platform=linux/arm64 public.ecr.aws/docker/library/python:3.12-slim or your local x86 image will fail to start with a useless error.

Step 3 — invoke from anywhere

import boto3, json, os

client = boto3.client("bedrock-agentcore", region_name="us-west-2")
AGENT_ARN = os.environ["AGENT_ARN"]            # the agent ARN printed by deploy.py

THREAD_ID = "ticket-A-9133"
# AgentCore rejects session IDs shorter than 33 characters, so derive a
# deterministic, padded session ID from the business ID.
SESSION_ID = f"refund-hitl-session-{THREAD_ID}".ljust(33, "0")

# 1. Start: graph runs until interrupt_before(execute)
resp = client.invoke_agent_runtime(
    agentRuntimeArn=AGENT_ARN,
    runtimeSessionId=SESSION_ID,               # always derived from thread_id
    payload=json.dumps({
        "action": "start",
        "thread_id": THREAD_ID,
        "user_message": "Please refund order A-9133, the item arrived broken.",
    }).encode(),
)
body = b"".join(resp["response"].iter_chunks())
print(json.loads(body))   # contains the suspended draft for operator review

# 2. Operator approves later (could be 30 minutes later, from a Slack action)
client.invoke_agent_runtime(
    agentRuntimeArn=AGENT_ARN,
    runtimeSessionId=SESSION_ID,
    payload=json.dumps({"action": "approve", "thread_id": THREAD_ID}).encode(),
)

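Notice that runtimeSessionId and our LangGraph thread_id carry the same business identifier. AgentCore requires session IDs of at least 33 characters, so a short ID such as ticket-A-9133 has to be padded or prefixed deterministically before it is used as the session ID. That alignment is the trick: AgentCore routes both invocations to the same isolated session, which keeps any in-memory caches warm; the durable state in DynamoDB makes the cold case (a different container picks up the resume call) work identically.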
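
A deterministic mapping from thread_id to session ID keeps that pairing intact for identifiers of any length. The helper below is a hypothetical sketch, not part of the AgentCore SDK: it leaves long IDs untouched and appends a stable hash suffix to short ones so they clear the 33-character minimum that the InvokeAgentRuntime API places on runtimeSessionId.

```python
import hashlib

MIN_SESSION_ID_LEN = 33  # minimum length accepted for runtimeSessionId


def session_id_for(thread_id: str) -> str:
    """Derive a stable session ID of at least 33 characters from a thread_id."""
    if len(thread_id) >= MIN_SESSION_ID_LEN:
        return thread_id
    # The suffix depends only on thread_id, so every caller (the start script,
    # the Slack callback) computes the same session ID for the same ticket.
    digest = hashlib.sha256(thread_id.encode("utf-8")).hexdigest()
    return f"{thread_id}-{digest}"[:64]
```

Because the function is pure, the start call and every later resume call can compute the session ID independently without storing it anywhere.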

Wiring the operator UI: Slack approvals in one Lambda

The operator interface deserves the same care as the agent. Most teams over-build a custom approval console; what you actually want, in v1, is a Slack message with three buttons (Approve / Edit / Reject) backed by a tiny callback Lambda. The Lambda's only job is: verify the signature, translate the button into an invoke_agent_runtime payload, post the result back to the thread.

# slack_callback.py — fronted by API Gateway, ~50 lines
import json, os, boto3
from urllib.parse import parse_qs

agentcore = boto3.client("bedrock-agentcore")
AGENT_ARN = os.environ["AGENT_ARN"]

def lambda_handler(event, _ctx):
    payload = parse_qs(event["body"])
    action_payload = json.loads(payload["payload"][0])  # Slack interactivity envelope
    action = action_payload["actions"][0]               # approve | reject | edit
    thread_id = action["value"]                         # we encoded the thread_id in value

    body = {"thread_id": thread_id}
    if action["action_id"] == "approve":
        body["action"] = "approve"
    elif action["action_id"] == "reject":
        body["action"] = "reject"
    else:  # "edit" opens a Slack modal; the modal-submit handler resubmits with patch
        return {"statusCode": 200, "body": ""}

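    agentcore.invoke_agent_runtime(
        agentRuntimeArn=AGENT_ARN,
        # Session IDs need 33+ characters; derive it the same way as in the start call.
        runtimeSessionId=f"refund-hitl-session-{thread_id}".ljust(33, "0"),
        payload=json.dumps(body).encode(),
    )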
    return {"statusCode": 200, "body": json.dumps({"text": f"Action {body['action']} sent."})}

Verify Slack signatures

The 50-line version above skips signature verification for brevity. Don't skip it in production — every Slack request must be authenticated against your signing secret, or anyone on the internet can approve refunds for you. Use slack_sdk.signature.SignatureVerifier.
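
For readers who want to see what that check involves, here is a standard-library sketch of Slack's v0 signing scheme. It is not a replacement for SignatureVerifier; the function name and the now parameter (added so the check can be tested with a fixed clock) are assumptions made for this example.

```python
import hashlib
import hmac
import time
from typing import Optional


def verify_slack_signature(signing_secret: str, timestamp: str, body: str,
                           signature: str, now: Optional[float] = None,
                           max_age_s: int = 300) -> bool:
    """Check a request against Slack's v0 signature (X-Slack-Signature header)."""
    now = time.time() if now is None else now
    # Reject stale or malformed timestamps to limit replay attacks.
    try:
        if abs(now - int(timestamp)) > max_age_s:
            return False
    except ValueError:
        return False
    base = f"v0:{timestamp}:{body}".encode("utf-8")
    expected = "v0=" + hmac.new(signing_secret.encode("utf-8"), base,
                                hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected, signature)
```

In the Lambda, timestamp would come from the X-Slack-Request-Timestamp header and body must be the raw request body, read before parse_qs touches it. Return a 401 and skip invoke_agent_runtime when the check fails.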

Security: the boring layers that make HITL actually safe

IAM execution role — the AgentCore runtime role should have only bedrock:InvokeModel, the DynamoDB rights for the checkpoints table, and any tool-specific actions the agent legitimately needs. No * policies. Use the IAM policy simulator (aws iam simulate-principal-policy) in CI to assert no extra grants leak in.
Bedrock Guardrails — attach a guardrail to the model that filters denied topics, redacts PII before logging, and blocks prompt-injection patterns on the input side. Guardrails run inside Bedrock — the model never sees the unsafe content.
Operator authn/authz — use Cognito or your corporate OIDC to gate the Slack-callback Lambda. Pin which Slack workspace + which channel; reject everything else.
State encryption — DynamoDB tables encrypted with a customer-managed KMS key. The checkpointer payload contains full user messages and tool args, which is your most sensitive blob in the system.
Audit trail — every invoke_agent_runtime call writes to CloudTrail; every LangGraph node transition writes to X-Ray via OpenTelemetry. You should be able to reconstruct, for any past action, both what the agent proposed and which human approved it.
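
The "no * policies" rule from the list above can also be enforced with a cheap static check that runs before any simulator call. The sketch below is a plain lint over the policy JSON, not the IAM policy simulator; the function name is hypothetical, and the ARNs in EXAMPLE_POLICY (account ID, model ID) are placeholders.

```python
def wildcard_grants(policy: dict) -> list[str]:
    """Return every Allow action or resource that contains a wildcard."""
    found = []
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):        # a single statement may be a bare object
        statements = [statements]
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        for key in ("Action", "Resource"):
            values = stmt.get(key, [])
            if isinstance(values, str):
                values = [values]
            found.extend(f"{key}: {v}" for v in values if "*" in v)
    return found


# Placeholder policy: the account ID and model ID are illustrative only.
EXAMPLE_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "bedrock:InvokeModel",
         "Resource": "arn:aws:bedrock:us-west-2::foundation-model/example-model-id"},
        {"Effect": "Allow",
         "Action": ["dynamodb:GetItem", "dynamodb:PutItem", "dynamodb:Query"],
         "Resource": "arn:aws:dynamodb:us-west-2:123456789012:table/hitl-checkpoints"},
    ],
}
```

A CI step can fail the build whenever wildcard_grants returns a non-empty list for the execution role's policy document.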

The lethal trifecta — recap

Untrusted input + private data + ability to act. HITL is the strongest mitigation for the third edge: the agent doesn't act until a human signs the action. That makes the other two edges (input filtering, data scoping) much more forgiving in practice.

Observability: what to log and what to ignore

Once HITL is live, the metrics that matter are not token counts. The four numbers you want on a single dashboard are: approval rate (how often operators say yes), edit rate (how often they patch the draft before approving), time-to-decision (median + p95 latency from interrupt to verdict), and rejection root-cause distribution (why operators say no — bad plan, wrong tool, policy violation, hallucinated entity). Drift in any of these is your earliest warning that the agent is decaying.

# Instrument LangGraph nodes with OpenTelemetry → X-Ray
from langchain_core.runnables import RunnableConfig
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))  # AgentCore exports to X-Ray
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("refund-hitl")

def draft_node(state, config: RunnableConfig):
    with tracer.start_as_current_span("draft_node") as span:
        # thread_id lives in the run config, not in the graph state
        thread_id = config.get("configurable", {}).get("thread_id", "")
        span.set_attribute("user.thread_id", thread_id)
        draft = llm.with_structured_output(RefundDraft).invoke(state["plan"])
        span.set_attribute("draft.amount", draft.amount)
        span.set_attribute("draft.order_id", draft.order_id)
        return {"draft": draft}
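
Two of the four dashboard numbers, time-to-decision and the rejection root-cause distribution, reduce to a few lines once verdict events are exported. The sketch below assumes latencies are already collected as seconds and that causes are free-form labels like the ones listed earlier; both function names are hypothetical.

```python
from collections import Counter
from statistics import median


def decision_latency(seconds: list[float]) -> dict:
    """Median and nearest-rank p95 of interrupt-to-verdict latencies."""
    if not seconds:
        raise ValueError("no decisions recorded")
    ordered = sorted(seconds)
    # Nearest-rank percentile: ceil(0.95 * n), computed with integers
    # to avoid floating-point rounding at the boundary.
    rank = (95 * len(ordered) + 99) // 100
    return {"median": median(ordered), "p95": ordered[rank - 1]}


def rejection_causes(causes: list[str]) -> dict:
    """Share of each rejection root cause, e.g. "bad plan" or "wrong tool"."""
    counts = Counter(causes)
    total = sum(counts.values())
    return {cause: n / total for cause, n in counts.items()}
```

Nearest-rank is one of several percentile definitions; a dashboard tool may interpolate instead, so small differences between this sketch and a managed metric are expected.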

Best practices, distilled

Make the human a node, not a popup. Model it in the graph. It composes with retries, branches, and compensations; popups don't.
Always use a durable checkpointer in production. MemorySaver is a dev-only convenience. Use DynamoDB or Postgres; key by business ID.
Encode invariants in Pydantic, not prompts. amount: float = Field(le=10_000) is worth more than any safety paragraph in a system prompt.
Match `runtimeSessionId` to `thread_id`. AgentCore session affinity + LangGraph durable state = warm path and cold path both work.
Interrupt at action boundaries, not at thinking boundaries. Suspend before tool calls that mutate the world; don't suspend mid-reasoning.
Render the exact tool args to the operator. Get them from agent.get_state(config), not a re-summarisation. Operators are approving the call that will actually fire.
Make edits cheap. A 3-second Slack modal beats a 30-minute back-and-forth. Edit > Reject + Restart, always.
Pin guardrails on both sides. Bedrock Guardrails on the model; structured outputs on the agent → tool boundary; policy tools that can vote no.
Watch approval/edit/reject ratios, not token counts. Ratios drift first; tokens drift later.
Test the resume path in CI. A test that kills the process between interrupt and resume, then re-launches and resumes from DynamoDB, catches most HITL regressions.

Cost shape: HITL is almost free

A HITL agent on AgentCore costs the operator's time plus pennies. AgentCore Runtime bills per CPU-second and memory-second of active execution — while the graph is suspended waiting for a human, the container's idle and you're paying close to nothing. The model calls (Claude / Nova) are your largest line item, just like in any agent, and they're not affected by the interrupt. DynamoDB checkpoints are kilobyte-sized and cost fractions of a cent per write.

Where AgentSwarms fits

The Notebooks lab has a runnable LangGraph + checkpointer build-along you can clone, plus a HITL failure-mode lab that lets you watch what happens when you skip the interrupt or the durable state. When you're ready to ship, the swarm exports cleanly to LangGraph (and the deploy story in this post applies as-is to Bedrock AgentCore). Everything's free during beta.

Putting it together

Done right, HITL is invisible most of the time. The agent runs. The graph drafts. The high-risk action pauses cleanly, a human glances at it for three seconds, presses approve, and the graph resumes — from a different container, hours later, on a continent across the ocean if it needs to. The pieces aren't exotic: a typed graph in LangGraph, a durable checkpointer in DynamoDB, an interrupt_before on the action node, a Slack callback for the verdict, and Bedrock AgentCore holding the session. Every one of those pieces is replaceable. Put together, they're the difference between an agent you can demo and an agent you can ship.
