DevOps & Infrastructure · Frameworks

DevOps for Agentic AI: An Open-Source Playbook

Shipping a prompt change is a deploy. Shipping a model swap is a deploy. Even rebuilding a knowledge base is a deploy. Here's how to do all of that the way you'd ship any production system — with eval gates, canaries, traces, cost caps, and a one-click rollback — using only open-source tools.

AgentSwarms Authors

May 28, 2026 · 22 min read


Most teams shipping agentic AI in 2026 are shipping it the way we shipped websites in 2008: a clever person changes a prompt, eyeballs an output, and pushes to prod. It works until it doesn't — until the day a one-line edit drops accuracy across the long tail, or a stuck reflection loop quintuples your invoice, or an unannounced model deprecation breaks the whole product over a weekend. This post is the playbook we wish we'd had earlier: how to give your agentic system the same discipline we give any other piece of production software, using only open-source pieces, and how to do it without the ceremony killing your velocity.

If you want the verdict in one breath: a prompt change is a deploy, a model swap is a deploy, even rebuilding a knowledge base is a deploy — so treat every one of them with versioning, an eval gate, a canary, traces, a cost budget, and a one-click rollback. Skip any of those and you don't have DevOps for agents; you have a wish.

Why agentic systems need their own brand of DevOps

Classic CI/CD assumes deterministic software. You change a function, the tests either pass or they don't, and the same input gives the same output forever. Agentic systems break every one of those assumptions at once — and the standard pipeline silently lets every problem through because none of them throw a 500.

Non-determinism is the default. The same input takes a different path through the swarm, costs a different number of tokens, and can produce a different answer.
Failures are silent. A confident, wrong answer doesn't trip a health check — your logs stay green while quality erodes.
Behaviour is defined in places git has never seen. A prompt edited live in a UI, a model swap in a config file, a re-chunked KB — all change behaviour without a code commit.
Cost is unbounded by default. A loop you forgot to cap, multiplied by a viral post, is a money fire that nothing in your normal stack alerts on.
The inputs aren't all yours. Retrieved documents, tool outputs, and user messages can carry injections that change what your agent does.

The bottom line

Classic DevOps measures whether your code is correct. Agentic DevOps has to measure whether your system is useful — and 'useful' is fuzzy, drifting, and only visible if you instrument for it. The whole pipeline is built around closing that gap.

The loop, end to end

[Interactive diagram: the six-stage loop that turns agentic prototypes into production systems. The first stage shown is Plan: define success, write a golden eval dataset, and agree on budgets and risk gates before any code. Every stage has a uniquely agentic twist: a golden eval dataset isn't optional, shadow + canary isn't optional, and a 100% deploy isn't an option.]

We'll walk each stage in turn, but the order matters: plan dictates what you build, build dictates what you evaluate, evals decide what you deploy, observation decides what you improve, and improvement reshapes the plan. Skip the plan stage and you'll spend the next year retrofitting the others.

Stage 1 · plan, before you write a line of code

The single most common reason agentic projects stall isn't model quality; it's that nobody decided what 'good' meant before they started building. The plan stage is unglamorous, takes about a week, and saves you six months of going in circles. Start with these four things:

1. Write the success rubric. What does a great answer look like? What's a non-negotiable failure? Faithfulness, latency, cost-per-resolved-task, refusal correctness: pick the small set that matters.
2. Build the golden dataset. 40–200 representative inputs with known-good answers (or known-good behaviour). This becomes the bar every deploy clears. It costs you a person-week and it's worth ten of them.
3. Set the budgets. Per-agent cost per request, per-tenant rate limits, max iterations on every loop, model fallback tiers. Budgets are how you survive a viral day.
4. Decide who owns what. Every agent, prompt, tool, and knowledge base needs a name on it. Ownership ambiguity is the failure mode behind every untraceable regression.
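To make steps 2 and 4 a little more concrete, here is a minimal sketch of a check you might run on the golden set before trusting it as a gate. The field names (`id`, `question`, `expected`, `owner`) are assumptions made up for this example, not a standard format; the 40-record floor is simply the lower end of the range above.

```python
import json

REQUIRED_FIELDS = {"id", "question", "expected", "owner"}  # hypothetical schema
MIN_RECORDS = 40  # lower end of the 40-200 range suggested above


def validate_golden_set(lines, min_records=MIN_RECORDS):
    """Return a list of human-readable problems; an empty list means the set is usable."""
    problems = []
    seen_ids = set()
    count = 0
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines in the JSONL file
        count += 1
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {lineno}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(record, dict):
            problems.append(f"line {lineno}: expected a JSON object")
            continue
        missing = REQUIRED_FIELDS - record.keys()
        if missing:
            problems.append(f"line {lineno}: missing fields {sorted(missing)}")
        record_id = record.get("id")
        if record_id is not None:
            if record_id in seen_ids:
                problems.append(f"line {lineno}: duplicate id {record_id!r}")
            seen_ids.add(record_id)
    if count < min_records:
        problems.append(f"only {count} records; expected at least {min_records}")
    return problems
```

Running this as the first CI step keeps a half-finished dataset from quietly lowering the bar, and the `owner` field gives every test case a name to ask when it starts failing.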

Borrow from MLOps, not DevOps

Classic DevOps treats code as the asset. Agentic DevOps has to treat the agent — prompt + model + tools + guardrails + KB — as the asset. Mentally cross out 'application' and write 'agent' on every pipeline diagram, and a lot of the planning falls into place.

Stage 2 · what to actually version

If a piece of your system can change the agent's behaviour and it isn't in git, you don't have DevOps — you have folklore. Every bug report you can't reproduce is a thing that wasn't versioned. The goal is one SHA you can roll back to that fully defines how the agent behaved.

[Interactive checklist: a reproducibility score that starts at 0% and rises as you tick each artifact you keep under version control.] Reproducibility climbs as you lock down more of the behaviour-defining surface. Anything left unlocked is a back-channel that can change behaviour without a commit.

The biggest single gap we see

Most teams version prompts and forget the agent definition — which model, which tools, which temperatures, which guardrail settings. A prompt without its agent definition is half the picture. Make the agent a single declarative artifact (a YAML or JSON file) checked into the same repo as the rest of the code.

# agents/refund-triage.v3.yaml — one file, one agent, in git.
agent:
  name: refund-triage
  version: 3
  model:
    provider: ollama          # or openai, anthropic, vertex, …
    name: llama3.1:70b
    temperature: 0.1
    max_tokens: 512
  system_prompt: prompts/refund-triage.v3.md
  tools:
    - read_orders             # scoped: read-only on orders table
    - send_email              # behind a human-approval gate
  guardrails:
    pii_redaction: true
    output_schema: schemas/triage_output.json
  retrieval:
    knowledge_base: kb/support-policies@2026-05-21
  budgets:
    max_iterations: 3
    max_tokens_per_run: 4000
    daily_usd: 50
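One caveat with a file like the one above: it points at other files (the prompt, the output schema), so the YAML's own hash doesn't change when the prompt does. A rough way to get a single identifier is to hash the definition together with the contents of everything it references. The sketch below assumes the YAML has already been parsed into a dict; `agent_fingerprint` is a name invented for this example.

```python
import hashlib
import json


def agent_fingerprint(agent_def: dict, file_contents: dict) -> str:
    """Hash the agent definition plus the bytes of every file it points to."""
    payload = {
        "agent": agent_def,
        "files": {
            path: hashlib.sha256(data).hexdigest()
            for path, data in file_contents.items()
        },
    }
    # Canonical JSON (sorted keys, fixed separators) so key order can't change the hash.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Example values mirroring a subset of the definition above.
agent = {
    "name": "refund-triage",
    "version": 3,
    "model": {"provider": "ollama", "name": "llama3.1:70b", "temperature": 0.1},
    "system_prompt": "prompts/refund-triage.v3.md",
    "budgets": {"max_iterations": 3},
}
files = {"prompts/refund-triage.v3.md": b"You triage refund requests..."}
print(agent_fingerprint(agent, files)[:12])
```

Stamping this fingerprint on every trace is one way to answer "which agent produced this run?" without digging through deploy logs. A git commit SHA does the same job when everything lives in one repo.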

Stage 3 · build the pipeline (the gamified version)

Here's the question every team should be able to answer in one breath: what stages does a change pass through before it touches a user? If the answer is 'a prompt swap and a deploy', you've already lost. Play with the toggles below — the score is hand-tuned, but the direction is honest: an eval gate plus a canary plus a rollback dwarfs any single stage alone.

[Interactive widget: a pipeline safety score that starts at 0 / 100 ("cowboy mode") and climbs as you toggle on the stages you actually have in your pipeline today.] No single stage saves you; the gain is in stacking them. Stop adding stages when the score gained per week of effort flattens out: the goal is enough discipline to ship safely, not the maximum possible discipline.

A reasonable open-source pipeline for a smallish team looks like this: a GitHub Actions workflow runs unit tests on tool adapters, then runs the agent against the golden set, then opens a PR comment with the eval scorecard. On merge to main, a shadow deploy runs the new agent on real traffic without serving its output. A few hours later, a canary deploy promotes it to 5–10%. KPIs stay healthy for a day, it promotes to 100%. KPIs slip, it rolls back automatically. None of this requires a vendor — every piece below is open source.

# .github/workflows/agent-ci.yml
name: agent-ci
on: [pull_request]

jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with: { python-version: "3.12" }
      - run: pip install -r requirements.txt
      - name: Unit tests (tools + adapters)
        run: pytest tests/ -q
      - name: Eval gate (golden set, RAGAS + custom rubric)
        run: |
          python evals/run.py \
            --agent agents/refund-triage.v3.yaml \
            --dataset evals/golden/refund-triage.jsonl \
            --judge ollama:llama3.1:70b \
            --threshold faithfulness=0.85,answer_relevancy=0.8
      - name: Cost gate
        run: python evals/cost_gate.py --max-per-run 0.05
      - name: Post scorecard as PR comment
        if: always()
        run: python evals/post_pr_comment.py
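The workflow calls `evals/cost_gate.py` without showing it. As an illustration only, the core of such a script might look like the sketch below: price each eval run from its token counts and fail when the average exceeds the cap. The record shape (`input_tokens`, `output_tokens`) and the per-1k-token prices are assumptions for this example; substitute whatever your gateway or trace tool actually reports.

```python
def run_cost_usd(run, usd_per_1k_input, usd_per_1k_output):
    """Cost of one agent run, from its token counts and per-1k-token prices."""
    return (
        run["input_tokens"] / 1000 * usd_per_1k_input
        + run["output_tokens"] / 1000 * usd_per_1k_output
    )


def cost_gate(runs, max_per_run, usd_per_1k_input, usd_per_1k_output):
    """Summarise eval-run costs and decide whether the change passes the gate."""
    if not runs:
        raise ValueError("cost gate needs at least one run")
    costs = [run_cost_usd(r, usd_per_1k_input, usd_per_1k_output) for r in runs]
    mean_usd = sum(costs) / len(costs)
    return {
        "mean_usd": mean_usd,
        "worst_usd": max(costs),  # worth surfacing: one runaway loop hides in a mean
        "passed": mean_usd <= max_per_run,
    }


# Hypothetical token counts and prices, checked against the 0.05 cap used in the workflow.
report = cost_gate(
    runs=[{"input_tokens": 1000, "output_tokens": 500}],
    max_per_run=0.05,
    usd_per_1k_input=0.01,
    usd_per_1k_output=0.03,
)
print(report["passed"])
```

In CI the script would exit non-zero when `passed` is false, the same way the eval gate does.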

Stage 4 · the eval gate (and why it has to be more than vibes)

The eval gate is the single highest-leverage thing in this whole pipeline. Without it, a prompt change is just a deploy with your fingers crossed. With it, you catch the regressions that look fine in spot-checks but tank on the long tail.

[Interactive animation: a change moving through the pipeline. Stages shown: eval gate (golden set), shadow / canary 5%, live KPIs healthy, promoted to 100%.]

Three classes of change, one pipeline. A healthy change sails through. A quality regression dies at the eval gate (before any user sees it). A subtle cost regression sneaks past the gate but is caught by the canary and auto-rolled back. The trick is having both layers.

Reference-free metrics (faithfulness, answer relevance) catch grounding regressions even when you don't have a single right answer.
Reference-based metrics (exact-match, similarity vs. a known-good answer) catch the cases where you do.
LLM-as-judge is great for nuance but needs to be calibrated against human labels first — otherwise you're trusting one black box to grade another.
Live evals sample a percentage of production traffic and re-grade it, catching drift the offline set will never see.
Don't ship a single number. Score per dimension, with a threshold each. 'Overall 4.2/5' hides the fact that faithfulness dropped from 4.8 to 3.6 while tone improved.
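For the calibration point, one common way to check an LLM judge against human labels is Cohen's kappa, which measures agreement beyond what chance alone would produce. The sketch below is a plain-Python version for categorical labels; the pass/fail labels are made-up sample data, and where you set the acceptance bar is a judgment call rather than a rule.

```python
from collections import Counter


def cohens_kappa(human, judge):
    """Agreement between two label lists, corrected for chance agreement."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    human_counts, judge_counts = Counter(human), Counter(judge)
    # Probability that both raters pick the same label if they labelled independently.
    expected = sum(human_counts[label] * judge_counts[label] for label in human_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical sample: four runs labelled by a person and by the judge model.
human_labels = ["pass", "pass", "pass", "fail"]
judge_labels = ["pass", "pass", "fail", "fail"]
print(cohens_kappa(human_labels, judge_labels))  # 0.5
```

Raw agreement here is 75%, but kappa is only 0.5, because two raters who mostly say "pass" would agree fairly often by accident. That gap is the reason to calibrate before trusting the judge in a gate.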

Open-source evals worth knowing

RAGAS for RAG metrics (faithfulness, context recall/precision, answer relevance). DeepEval for unit-test-style assertions. promptfoo for prompt regression tests and side-by-side comparisons. Pair them with Langfuse (self-hosted, OSS) or LangSmith (hosted) for traces and online evals. You can stand a full eval pipeline up in a weekend.

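# evals/run.py: a tiny but real eval gate.
# Assumes RAGAS 0.1-style column names (question, contexts, answer, ground_truth);
# later releases rename them, so pin the ragas version you validate against.
import json
import sys

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# Hypothetical helper: your own wrapper around the agent runtime, returning
# {"answer": str, "contexts": [str, ...]} for one question.
from agent_runtime import run_agent

with open("evals/golden/refund-triage.jsonl") as f:
    dataset = [json.loads(line) for line in f if line.strip()]

# Run the agent on every golden input, collect (q, contexts, answer).
runs = [run_agent(d["question"]) for d in dataset]

# evaluate() uses its default judge LLM unless you pass llm=...
result = evaluate(
    dataset=Dataset.from_list([{**d, **r} for d, r in zip(dataset, runs)]),
    metrics=[faithfulness, answer_relevancy, context_precision],
)

# Hard fail if any metric drops below the threshold for THIS change.
THRESHOLDS = {"faithfulness": 0.85, "answer_relevancy": 0.80, "context_precision": 0.75}
scores = result.to_pandas()[list(THRESHOLDS)].mean().to_dict()  # mean score per metric
failed = {k: scores[k] for k, t in THRESHOLDS.items() if not scores[k] >= t}  # NaN fails too
if failed:
    print("EVAL GATE FAILED:", failed)
    sys.exit(1)
print("EVAL GATE PASSED:", scores)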

Stage 5 · deploy — shadow, then canary, then promote

The 100% flip is the great original sin of agentic deploys. The model you tested in CI isn't the model you're shipping — it's the same weights interacting with real distribution, real load, real noise. Stage the rollout and you get to find that out at low blast radius.

1. Shadow (0% serving). The new agent runs on real traffic, but its output is logged and graded, never returned to the user. Catches the issues the golden set missed.
2. Canary (5–10%). A slice of real users gets the new agent. KPIs are watched in real time. A regression triggers an automatic rollback.
3. Promote (100%). Only after both phases pass. The previous version stays warm for an instant rollback for the next 24 hours.
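If you're curious what the canary step boils down to underneath a tool like Argo Rollouts or Flagger, the sketch below shows the two decisions involved: which users land in the canary slice, and when the canary's KPIs are bad enough to roll back. It is an illustration, not how either tool is implemented; the salt string, KPI names, and 5% tolerance are all assumptions.

```python
import hashlib


def in_canary(user_id: str, percent: float, salt: str = "refund-triage.v3") -> bool:
    """Deterministically place a user in the canary slice (same user, same answer)."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # buckets 0..9999
    return bucket < percent * 100  # percent=5 selects buckets 0..499


def kpis_to_roll_back(baseline: dict, canary: dict, max_relative_drop: float = 0.05) -> list:
    """List the higher-is-better KPIs where the canary trails baseline beyond tolerance."""
    return [
        name
        for name, base in baseline.items()
        if canary.get(name, 0.0) < base * (1 - max_relative_drop)
    ]


# Hypothetical KPI snapshot: resolution rate dropped too far, faithfulness did not.
regressed = kpis_to_roll_back(
    baseline={"resolution_rate": 0.80, "faithfulness": 0.90},
    canary={"resolution_rate": 0.70, "faithfulness": 0.89},
)
print(regressed)  # ['resolution_rate']
```

Hashing the user ID rather than rolling a random number per request keeps each user on one version for the whole rollout, which makes their complaints (and their traces) attributable to a single agent. Changing the salt per rollout reshuffles who gets picked.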

The OSS rollout stack

Argo Rollouts and Flagger both do canary / blue-green for any service, including agent runtimes. Pair with OpenFeature for per-user / per-tenant flag-gated rollouts when you need finer control than a percentage.

Stage 6 · observe (the four signals)

Once an agent is live, observability stops being optional and starts being the only way you'll ever debug it. You need four kinds of signal, and the agentic ones aren't optional:

Logs — the raw event records you grep when something's weird.
Metrics — the aggregated numbers you alert on (latency, cost, error rate, refusal rate).
Traces — every span of every run: which agent ran, which tool was called, with what args, what came back, how many tokens. Non-negotiable for agents.
Live evals — a sample of production runs re-graded automatically against your rubric. The only way to catch a quality drift before users do.

The open-source observability stack

Langfuse (self-hostable, MIT-licensed core) and Phoenix from Arize (OSS) cover traces + online evals. OpenLLMetry / OpenTelemetry-LLM exports spans to whatever backend you already run (Grafana, Tempo, Jaeger, anything). Helicone proxies LLM calls and gives you metrics + cost out of the box. Pick one trace tool and one cost/metric tool; you don't need all of them.

Stage 7 · improve (the loop closes)

The single highest-leverage improvement loop is this: every real failure in production should automatically become a candidate test case in your golden set. A user complaint, a low live-eval score, a manual flag — all funnel into a triage queue, get reviewed once a week, and the chosen ones get a known-good answer and join the gate. Within months, your eval set is no longer a hand-picked sample; it's a living record of what your system has been wrong about. That's when the pipeline starts learning.
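As a sketch of the funnel described above, the snippet below samples production runs for live grading and turns flagged or low-scoring runs into review candidates. The record fields (`live_eval_score`, `user_flagged`), the 0.7 cutoff, and the 5% sample rate are assumptions for illustration. Note that `expected` stays empty until a human supplies the known-good answer.

```python
import hashlib


def sampled_for_live_eval(run_id: str, rate: float = 0.05) -> bool:
    """Deterministically pick roughly `rate` of production runs for re-grading."""
    digest = hashlib.sha256(run_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < rate


def failure_candidates(runs, min_score: float = 0.7):
    """Yield one review candidate per distinct failing question."""
    seen = set()
    for run in runs:
        flagged = run.get("user_flagged", False) or run.get("live_eval_score", 1.0) < min_score
        if not flagged:
            continue
        # Normalise the question so trivial variants don't flood the queue.
        normalised = run["question"].strip().lower()
        key = hashlib.sha256(normalised.encode("utf-8")).hexdigest()[:16]
        if key in seen:
            continue
        seen.add(key)
        yield {
            "id": key,
            "question": run["question"],
            "observed_answer": run["answer"],
            "expected": None,  # filled in by a reviewer at the weekly triage
            "status": "needs_review",
        }
```

Only candidates a reviewer has completed should be appended to the golden JSONL file. An unreviewed failure with a guessed answer would make the gate less trustworthy, not more.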

Best practices, in plain English

One agent = one declarative artifact. Model, prompt path, tools, guardrails, budgets — all in one file under git. The file's SHA is the version.
Three environments minimum. Dev (fast iteration, no real users), staging (golden-set evals + shadow on real traffic), prod. Same agent definition flows through all three.
Secrets live in a vault. Never in the prompt, never on the client, never in the trace. Vault, AWS/GCP Secret Manager, Doppler, Infisical — pick one and use it.
Trace IDs everywhere. A request ID that propagates from the user's click through the gateway, orchestrator, every tool call, and every LLM call — so 'what did it do?' takes seconds, not hours.
Cost is a metric. Track $/resolved-task, not $/request. A swarm that takes 6 calls to get one resolution is cheaper than one that takes 2 and is wrong.
Pin model versions. claude-3-5-sonnet-20240620, not claude-3-5-sonnet-latest. The implicit upgrade is the bug you'll spend the longest debugging.
Multi-provider fallback in CI. Run a small slice of evals against your backup model once a week. The day you actually need it is the worst day to discover it doesn't work.
Treat retrieval as code. The chunker, the embedder, the index — all versioned. KB rebuilds go through the same eval gate as code changes.
Human-in-the-loop on risky actions. Refunds, deletes, sends — an Approval node, not a hope.
Document the model of failure. When you fix something, write down what broke, how you caught it, and what test would have caught it earlier. That document is your real on-call runbook.
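The $/resolved-task point from the list above is easier to see with numbers. The figures in this sketch are invented for illustration (a flat $0.01 per call, a 90% vs. 25% resolution rate); the point is only that dividing by the resolution rate can flip which design looks cheaper.

```python
def cost_per_resolved_task(calls_per_task: int, usd_per_call: float, resolution_rate: float) -> float:
    """Average spend needed to get one task actually resolved."""
    if not 0 < resolution_rate <= 1:
        raise ValueError("resolution_rate must be in (0, 1]")
    return calls_per_task * usd_per_call / resolution_rate


# Hypothetical comparison: a careful 6-call swarm vs. a fast 2-call agent that is often wrong.
careful = cost_per_resolved_task(calls_per_task=6, usd_per_call=0.01, resolution_rate=0.90)
hasty = cost_per_resolved_task(calls_per_task=2, usd_per_call=0.01, resolution_rate=0.25)
print(f"careful: ${careful:.3f} per resolved task")  # about $0.067
print(f"hasty:   ${hasty:.3f} per resolved task")    # about $0.080
```

Per request, the hasty agent costs a third as much. Per resolved task it costs more, and that is before counting the support tickets its wrong answers generate.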

Failure modes & gotchas

These are the ones that have actually taken teams down. Every one has a fix that, in hindsight, was a one-week project nobody had time for.

[Interactive cards: eight DevOps gotchas specific to agentic systems, each with a symptom and a fix. The first card is reproduced below.]

⚠ Silent regression

Symptom: A prompt or model change passes manual smoke tests but quietly tanks quality across long-tail inputs.

Fix: An eval gate on a broad golden set; a few hand-picked tests are not enough.

Almost every one of these gotchas is invisible in a normal CI run. They only show up if you've added the agentic-specific guard for them.

The gotcha behind half of them

Most of these failures share a root cause: the agent's behaviour depends on something that wasn't under version control. A prompt edited in a UI, a model that auto-upgraded, a KB re-ingested without an SHA. Make every behaviour-defining piece an artifact in git and you've already prevented most of the list.

Cost & sustainability

Cost isn't a finance problem; it's a reliability problem in disguise. An agent that costs $0.02 per request at 1k requests/day ($20/day) will quietly cost $400/day when traffic grows 20×. Without a cap, that day is also the day the model gets rate-limited and your latency triples. Build the cost gate before you ship, not after the bill arrives.

[Interactive calculator: drag requests per day and average loop iterations to see the day's bill against a daily budget, then toggle the cost cap to see how much it denies on a bad day. Default values shown: 5,000 requests/day, 3 average loop iterations, a $60 bill against a $200 daily budget.]

Cost caps aren't just a budget tool; they're how you survive a Reddit hug of death without paging an engineer.

A daily $ budget per agent, with an alert at 70% and a hard stop at 100%.
Per-tenant rate limits so one customer can't burn the budget for everyone.
Cheaper models for the easy cases. A small model triages; the strong model only handles what the small one flags.
Cache aggressively. Embeddings, identical prompts, prefix caches — anything stable.
Bound every loop. Three iterations is usually enough; ten is usually a bug.
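The first and last items in that list fit in a few lines of code. The sketch below is a minimal in-memory version: a budget that alerts once at 70% and refuses charges past 100%, plus a loop that cannot run more than a fixed number of iterations. The class and function names are made up for this example. A production version would also need shared storage and a daily reset, which are left out here.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a charge would push spend past the hard limit."""


class DailyBudget:
    def __init__(self, limit_usd, alert_fraction=0.7, on_alert=print):
        self.limit_usd = limit_usd
        self.alert_fraction = alert_fraction
        self.on_alert = on_alert
        self.spent_usd = 0.0
        self._alerted = False

    def charge(self, usd):
        """Record spend; alert once at the soft threshold, refuse past the hard limit."""
        if self.spent_usd + usd > self.limit_usd:
            raise BudgetExceeded(f"charge of ${usd:.2f} would exceed ${self.limit_usd:.2f}")
        self.spent_usd += usd
        if not self._alerted and self.spent_usd >= self.alert_fraction * self.limit_usd:
            self._alerted = True
            self.on_alert(f"budget at {self.spent_usd / self.limit_usd:.0%} of daily limit")


def run_bounded(step, budget, cost_per_step_usd, max_iterations=3):
    """Call step(i) until it returns a result, the loop bound is hit, or the budget refuses."""
    for i in range(1, max_iterations + 1):
        budget.charge(cost_per_step_usd)  # pay before each iteration, not after
        result = step(i)
        if result is not None:
            return result
    return None  # gave up: surface this as a failure instead of looping on
```

Charging before each iteration matters: a loop that only accounts for spend when it finishes is exactly the stuck reflection loop the cap is supposed to stop.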

The open-source stack we'd reach for

You can do all of this with open-source primitives. Here's the minimal stack that earns its keep:

Source of truth — Git (your repo) + GitHub Actions / GitLab CI / Argo Workflows for the pipeline runner.
Agent framework — LangGraph (state machines), CrewAI (role-based), or AutoGen/AG2 (conversational). Each is OSS; pick by control vs. speed.
Evals in CI — RAGAS for RAG metrics, DeepEval for assertions, promptfoo for prompt regression. Add Inspect (UK AI Safety Institute) for safety evals.
Tracing & online evals — Langfuse (self-hostable) or Arize Phoenix. OpenLLMetry/OpenTelemetry to ship spans to your existing backend.
Cost & gateway — LiteLLM as a model router with budgets; Helicone as a logging proxy; or Portkey for both.
Rollouts — Argo Rollouts or Flagger for canary/blue-green, OpenFeature for per-tenant flags.
Secrets & policy — Vault / Infisical for secrets; OPA / Cedar for authorization policies on tool calls.
Vector store & KB — Qdrant, Weaviate, or pgvector. Version the ingest pipeline (chunker + embedder + index name); rebuilds get an SHA.
Local serving — vLLM, SGLang, or Ollama for self-hosted inference; pair with a fallback to a hosted API for headroom.

Pick one of each, not all of them

The temptation is to stack every tool. Don't. One framework, one trace tool, one evals tool, one router. The maturity gain comes from running the loop, not from running it on more vendors.

The maturity ladder (an honest self-assessment)

Most teams don't sit at one tier — they're advanced on tracing and primitive on rollouts, or vice versa. This is the self-check we use. Tick what you actually do today, not what you mean to do.

[Interactive checklist: ten practices, scored into an honest tier from Crawl → Walk → Run → Fly, starting at 0 / 10.] The next-tier move usually isn't a tool; it's a practice you keep skipping. Pick one unchecked box, fix it this sprint, re-score next month.

A reasonable 30 / 60 / 90-day plan

1. Days 1–30: get to honest. Write the success rubric, build a 40-question golden set, put every prompt and agent definition under git. Add a basic trace for every run. You're not deploying anything new; you're making the current system visible.
2. Days 31–60: build the gate. A CI job that runs the agent against the golden set on every PR and blocks regressions. A cost cap. A shadow deploy that runs the new agent on real traffic without serving it. You can now ship changes safely, even if slowly.
3. Days 61–90: automate the loop. Canary + auto-rollback. Live evals on a sample of production traffic. A weekly triage of low-scoring runs that promotes the best ones into the golden set. The pipeline now improves on its own; your job becomes watching the trend lines.

What Microsoft got right (and where we go open-source)

Microsoft's recent piece on CI/CD for AI agents on Foundry frames the problem the same way we do — agents are deployable artifacts that need versioning, evals, and staged rollouts. The mechanics they describe (an agent definition, an eval gate before promotion, an environment progression from dev to prod) are exactly the right primitives. Where their playbook leans on Foundry, Azure DevOps, and Bicep, this one leans on git, GitHub Actions, Langfuse, RAGAS, Argo Rollouts, vLLM, and friends — same loop, different substrate. Read theirs alongside this one; the overlap is the part that's actually load-bearing.

The shortest possible summary

Version everything that defines behaviour. Gate every change on an eval. Roll out in stages. Trace every run. Cap every cost. Feed real failures back into the gate. That loop, run boringly for six months, is the whole game.

DevOps for agents isn't a different discipline — it's the same discipline that turned web apps from artisanal to reliable, applied to a kind of software that's harder to test and easier to break. The teams shipping reliable agentic systems in 2026 aren't smarter; they're just running this loop. Pick the one unchecked box that scared you most on the maturity widget above, and fix it this week. Then the next one. That's the entire path from cowboy mode to production-grade.
