Production · Observability

When Agents Burn Money: Cost Control in Multi-Agent Systems

A single agent stuck between tool calls can quietly spend more in a weekend than your whole team does in a month. Here's why multi-agent costs spiral, what a runaway actually looks like, and the guardrails — and tools — that stop the bleeding.

AgentSwarms Authors

May 31, 2026 · 21 min read

Someone on a team we know shipped a tidy little multi-agent workflow on a Friday. Four agents, a couple of tools, a plan that looked airtight in the demo. They went home. The agents did not. One of them hit a verification step that never quite passed, asked a teammate agent to help, got handed the task back, and the two of them settled into a polite, infinite conversation. Eleven days later someone opened the billing console. The number was $47,000.

That story has been told enough times now that it's practically folklore, but the details barely matter because the shape is always the same. Nothing crashed. No alert that mattered fired in time. The system did exactly what it was told to do — keep working until the task is done — and the task was never done. The only thing that grew was the bill.

We've spent a lot of words on this site teaching people how to make agents work. This post is about the other half of the job, the half nobody demos: making them work affordably, and making sure that when they go wrong — and they will — they fail cheap instead of failing expensive. If you take one idea away, make it this: in an agentic system, cost is not a billing concern you reconcile at month-end. It's a reliability property you design in from the first line.

Agent costs don't add up. They multiply.

Here's the mental model that gets everyone in trouble. We spent two years with chatbots, and chatbots taught us that cost scales linearly: one user message, one model response, a predictable handful of tokens. Double the users, double the bill. Easy to reason about, easy to budget.

Agents broke that intuition. An agent doesn't answer once — it reasons, calls a tool, reads the result, reasons again, maybe spawns a helper, maybe retries. Every one of those decision points can branch, and every branch can carry the entire conversation so far with it. So spend doesn't accumulate one message at a time. It compounds at every step.

Figure: cost versus number of decision points (shown at depth 6). A chatbot's cost grows linearly with interactions (blue). An agentic workflow grows geometrically (red), because each decision can spawn sub-agents, recursive calls, and branching logic that compound at every step.

The numbers people report bear this out. A multi-step agent routinely consumes 5× to 50× the tokens of a single chatbot turn for the same nominal task, and a single careless top-level request can trigger a workflow that burns 100× what you'd expect. The reason is structural, not a bug you can patch: autonomy means the system decides how much work to do, and a confused system decides to do a lot.

Why this is suddenly everyone's problem

The State of FinOps 2026 found that 98% of FinOps practices now manage some form of AI spend — up from 31% just two years earlier. Inference, not training, is now the dominant line item. Agentic workloads are the reason the curve bent.

The anatomy of a spiral: stuck between tool calls

The classic blow-up isn't a sudden explosion — it's a slow, steady drip that never stops. An agent operates in a loop: think, act, observe, repeat, until some condition says done. The danger lives in that last clause. If the stop condition is never satisfied — a tool keeps returning something the model doesn't accept, a verification step keeps failing, a goal is subtly impossible — there's nothing else in the loop that says give up. The agent just keeps calling.

Figure: an agent loop with no guardrail. Each iteration re-reads the history, calls another tool, and adds to the bill. There is no point where it stops on its own, and in production it doesn't.

What makes it expensive rather than merely annoying is the thing that makes agents work at all: they send the whole conversation back to the model on every step. The transcript of past reasoning and tool results is the agent's working memory, so it travels with each call. That means iteration 30 isn't paying for one step's worth of tokens — it's paying to re-read everything that happened in steps 1 through 29, again.

You don't pay for 3K once — you pay for the whole growing transcript on every call. After 8 tool calls you've been billed for ~108K input tokens, not 24K.

The context tax. The agent doesn't pay for each step in isolation — it re-sends the growing transcript every single call, so your real input-token bill is the cumulative area under this curve, not the height of the last bar.
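The arithmetic behind that 108K figure is easy to check. Here is a minimal sketch, assuming every step adds exactly 3K tokens to the transcript and every call re-sends the transcript in full. It ignores output tokens, system prompts, and any prompt-caching discount, and the function name is ours.

```python
def cumulative_input_tokens(tokens_per_step: int, calls: int) -> int:
    """Total input tokens billed when every call re-sends the full transcript."""
    total = 0
    transcript = 0
    for _ in range(calls):
        transcript += tokens_per_step  # the transcript grows by one step
        total += transcript            # and the whole transcript is billed again
    return total


naive = 3_000 * 8                            # what you'd guess: 8 steps x 3K
actual = cumulative_input_tokens(3_000, 8)   # what the meter actually records
print(naive, actual)  # 24000 108000
```

The total grows with the square of the step count, so under the same assumptions fifteen iterations come to 360K input tokens.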

The part people miss

A loop that adds 3K tokens of context per step looks cheap per step. But because you re-pay for the whole transcript each time, fifteen quiet iterations can cost more than the entire successful run you were budgeting for. Slow drips drown you.

Four ways a multi-agent system bleeds money

“It got stuck” hides several distinct failure modes, and they don't all have the same fix. It's worth being able to name them, because the guardrail that stops one won't always stop another.

Stuck loop: a reasoning step keeps failing verification, so the agent retries forever and no stop condition is ever met.

Each of the common cost-spiral failure modes bleeds money differently: a stuck loop needs an iteration cap, a retry storm needs backoff and a budget, fan-out needs depth limits.

The fan-out problem

Multi-agent architectures love delegation: an orchestrator breaks a job into pieces and hands each to a sub-agent. That's powerful, and it's also where geometric cost lives. If the orchestrator spawns three workers, and each of those is itself allowed to spawn three more, you're two levels deep and already running thirteen agents. Let that recursion go one level further unchecked and you're funding an army.

Figure: a delegation tree in which each agent spawns 3 sub-agents, shown at orchestration depth 2 (52K tokens consumed). Each agent that spawns sub-agents, which spawn their own, multiplies token consumption exponentially. Unbounded delegation depth is one careless request away from dozens of workers.
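To put numbers on the fan-out, here is a small sketch that counts the agents in a delegation tree and shows the kind of depth check that bounds it. It assumes a uniform tree in which every agent spawns the same number of workers. The names `agents_in_tree`, `may_delegate`, and `MAX_DEPTH` are ours, not part of any framework.

```python
def agents_in_tree(branching: int, depth: int) -> int:
    """Count agents in a uniform delegation tree: the orchestrator plus every level below it."""
    return sum(branching ** level for level in range(depth + 1))


MAX_DEPTH = 2  # hypothetical cap on delegation depth


def may_delegate(current_depth: int, max_depth: int = MAX_DEPTH) -> bool:
    """Allow an agent to spawn sub-agents only while it sits above the depth cap."""
    return current_depth < max_depth


for depth in range(4):
    print(depth, agents_in_tree(3, depth))
# 0 1
# 1 4
# 2 13
# 3 40
```

With the cap at 2, the orchestrator (depth 0) and its workers (depth 1) can delegate, but agents at depth 2 cannot, so the tree tops out at thirteen agents instead of forty.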

The retry storm and the ping-pong

Two cheaper-looking failures round out the set. A retry storm happens when a flaky tool or a rate limit triggers naïve retries — and because each retry re-sends the full context, ten retries cost ten full calls, not ten cheap pings. Tool ping-pong is the multi-agent version of a stuck loop: two agents hand the same subtask back and forth, each politely deferring to the other, neither ever converging. Both look like progress in the logs. Neither is.
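For the retry storm, the usual remedy is a small retry budget with exponential backoff and jitter. The sketch below is one way to write it, not a drop-in for any particular client library. `call_with_backoff` and `RetryBudgetExceeded` are names we made up, and in real code you would catch only the errors you know to be transient rather than every `Exception`.

```python
import random
import time


class RetryBudgetExceeded(Exception):
    """Raised when a call has used up its retry budget."""


def call_with_backoff(fn, max_retries=3, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Call fn(); on failure, retry with exponential backoff and jitter.

    At most max_retries retries are made, so fn runs at most max_retries + 1 times.
    """
    attempt = 0
    while True:
        try:
            return fn()
        except Exception as exc:  # assumption: narrow this to transient errors in practice
            if attempt >= max_retries:
                raise RetryBudgetExceeded(f"gave up after {attempt} retries") from exc
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay * random.uniform(0.5, 1.0))  # jitter spreads retries out
            attempt += 1
```

Because every retry re-sends the full context, the retry count belongs inside the run's dollar budget as well: three retries of a large call can cost more than the rest of the run.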

Why your budget alert won't save you

Most teams' first instinct is to add a billing alert: “email me when we cross $500.” It feels responsible. It is almost useless for this problem. An alert is a detection mechanism, and it fires after the money is spent. By the time a human reads it — overnight, over a weekend, during a long meeting — the runaway has had hours to keep running.

⚠️ Budget alert: fires after the money is already spent.
    spend $200 → alert
    spend $2,000 → alert
    you read it Monday → already $47,000

🛑 Budget enforcement: checked before the next call, and blocks it.
    next call would exceed cap? → refuse, halt run, escalate
    max loss = the cap you set

The distinction that matters most. A budget alert tells you about spend after it happens. Budget enforcement is checked before the next call is made and refuses the call if it would exceed the cap. One bounds your worst case; the other just narrates it.

“A token budget alert is not budget enforcement. The first tells you the house is on fire. The second is the sprinkler.” (paraphrasing the now-famous “$47,000 agent loop” write-up)

The fix is to move the check before the call, not after the invoice. Before the agent makes its next model request, you ask: would this push us past the limit we set? If yes, the call doesn't happen. The run halts, escalates, or returns its best partial answer. Your maximum loss becomes the ceiling you chose — a number you decided on purpose, instead of one the agent discovered for you.

Building the guardrails

There's no single switch for this. Cost control in agents is defense in depth: a small stack of independent limits, any one of which can stop a runaway, layered so that the failure of one doesn't mean the failure of all. Here's the stack we reach for, roughly in order of how much grief each one saves you.

You don't need all of these guardrails, but you need at least one hard limit that can actually halt the loop. Alerts and dashboards are not on this list because they can't stop anything.

1. A hard iteration cap

The simplest and most important guardrail: every agent run gets a maximum number of steps. If it hasn't finished in, say, 15 iterations, it stops — with an error, a partial result, or an escalation to a human, but it stops. This single limit turns “infinite” into “bounded,” which is most of the battle.

MAX_ITERS = 15
MAX_USD = 0.50            # hard ceiling per run

def run_agent(messages, model):
    # estimate_cost, halt, run_tool and append_tool_result are helpers
    # assumed to be defined elsewhere in your agent code.
    spent = 0.0

    for step in range(MAX_ITERS):
        # estimate the cost of the NEXT call before making it
        projected = spent + estimate_cost(messages, model)
        if projected > MAX_USD:
            return halt(reason="cost_ceiling", spent=spent)

        resp = model.call(messages)
        spent += resp.usage.cost_usd        # track real spend per step

        if resp.is_final_answer:
            return resp
        messages = append_tool_result(messages, run_tool(resp))

    # loop exhausted without finishing: fail cheap, don't fail expensive
    return halt(reason="max_iters", spent=spent)

2. Token and dollar budgets, enforced pre-call

An iteration cap bounds steps, but steps aren't all equal — one call with a 200K-token context costs far more than ten small ones. So pair the step cap with a token budget (prompt + completion summed across the whole run) and, better still, a dollar ceiling checked before each call, exactly as in the snippet above. Gateways make this easy to enforce centrally rather than re-implementing it in every agent.
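One way to carry both limits through a run is a small budget object that is checked before each call and updated after it. The sketch below is an illustration under our own naming (`RunBudget`, `BudgetExceeded`, `estimate_usd`). The per-million-token prices are parameters you supply from your provider's price list, not values we are asserting.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised before a call that would push the run past a limit."""


@dataclass
class RunBudget:
    max_tokens: int          # prompt + completion, summed across the whole run
    max_usd: float           # dollar ceiling for the whole run
    tokens_used: int = 0
    usd_spent: float = 0.0

    def check(self, projected_tokens: int, projected_usd: float) -> None:
        """Call BEFORE the model request; raises if it would exceed either limit."""
        if self.tokens_used + projected_tokens > self.max_tokens:
            raise BudgetExceeded("token_budget")
        if self.usd_spent + projected_usd > self.max_usd:
            raise BudgetExceeded("cost_ceiling")

    def record(self, tokens: int, usd: float) -> None:
        """Call AFTER the model request with the usage it actually reported."""
        self.tokens_used += tokens
        self.usd_spent += usd


def estimate_usd(input_tokens, max_output_tokens, usd_per_mtok_in, usd_per_mtok_out):
    """Worst-case cost of one call, assuming the reply uses its full output allowance."""
    return (input_tokens * usd_per_mtok_in + max_output_tokens * usd_per_mtok_out) / 1_000_000
```

Estimating with the maximum output length is deliberately pessimistic: the check may refuse a call that would have fit, but it should not approve one that overshoots.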

3. A global timeout

Wall-clock time is a backstop for everything you didn't anticipate. A strict global timer — kill the entire chain after N seconds — catches the slow hang, the tool that never returns, the recursion you didn't bound. It's blunt, and that's the point: it doesn't need to understand why things went wrong to stop them.
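A global timer can be as simple as a deadline object created once at the start of the run and consulted at every step. This is a sketch under our own naming (`Deadline`). It uses `time.monotonic` so that system clock adjustments can't extend the run.

```python
import time


class Deadline:
    """A wall-clock deadline for one whole agent run."""

    def __init__(self, seconds: float, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + seconds

    def remaining(self) -> float:
        """Seconds left, never negative."""
        return max(0.0, self._expires_at - self._clock())

    def expired(self) -> bool:
        return self._clock() >= self._expires_at


deadline = Deadline(seconds=300)
print(deadline.expired())  # False
```

Check `deadline.expired()` at the top of each iteration, and pass `deadline.remaining()` as the timeout to any model or tool call that accepts one. A check like this is cooperative: it can't interrupt a call that is already blocked, so the hard backstop is still a per-call timeout or a worker process you can kill.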

4. Repeat-action detection

A loop has a tell: the agent keeps doing the same thing. Before executing an action, compare it against the last few steps. If it's about to call the same tool with the same arguments it used two steps ago, it isn't making progress — it's spinning. Block the duplicate, inject a nudge, or terminate. This catches stuck loops semantically, often well before the iteration cap would.

step 1: search('refund policy')  ✓ allowed
step 2: search('refund window')  ✓ allowed
step 3: search('refund policy')  🛑 blocked, seen before
step 4: search('refund policy')  🛑 blocked, seen before

A simple dedup layer in action: repeated identical tool calls get blocked instead of billed. Comparing each proposed action against a short history is cheap and catches the most common loop signature.
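A dedup layer like this fits in a dozen lines. The sketch below keeps a short window of recent calls and flags exact repeats. `RepeatDetector` is our name, and it assumes tool arguments are JSON-serializable so they can be normalized into a comparable key.

```python
import json
from collections import deque


class RepeatDetector:
    """Flags a tool call that exactly matches one seen in the last few steps."""

    def __init__(self, window: int = 5):
        self._recent = deque(maxlen=window)  # oldest entries fall off automatically

    def is_repeat(self, tool: str, args: dict) -> bool:
        """Return True if this exact call is in the window; otherwise remember it."""
        key = (tool, json.dumps(args, sort_keys=True))  # key order must not matter
        if key in self._recent:
            return True
        self._recent.append(key)
        return False


detector = RepeatDetector()
queries = ["refund policy", "refund window", "refund policy", "refund policy"]
print([detector.is_repeat("search", {"q": q}) for q in queries])
# [False, False, True, True]
```

Exact matching only catches literal repeats. An agent that rephrases the same query each time slips past it, which is why this sits alongside the iteration cap rather than replacing it.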

5. Budget pressure: tell the model it's running out

A subtler technique that works surprisingly well: don't just cut the agent off silently — warn it as it approaches the limit. Inject a system message a few iterations before the cap (“you have 3 steps left, wrap up and give your best answer; do not call more tools”). Models respond to this. It turns a hard, wasteful termination into a graceful landing, and often produces a usable answer instead of an error.

# inside the agent loop, before the next model call
remaining = MAX_ITERS - step
if remaining <= 3:
    messages.append({
        "role": "system",
        "content": (
            f"You have {remaining} steps left. Stop calling tools. "
            "Give your best final answer now with what you already know."
        ),
    })

6. Context management

Because the transcript is what makes long runs expensive, managing it is a cost lever, not just a quality one. After a few iterations, summarize or drop stale tool results so the context stops growing without bound. Retrieve only what the next step needs instead of carrying everything forward. And cap delegation depth so the fan-out tree can't recurse forever. None of these are exotic — they're hygiene — but skipping them is what turns a working agent into an expensive one.
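The cheapest form of context management is stubbing out stale tool results. The sketch below assumes chat-style messages stored as dicts with `role` and `content` keys, with tool output under the role `"tool"`. That shape is an assumption about your message format, and `trim_tool_results` is our own name.

```python
def trim_tool_results(messages, keep_last=3, placeholder="[older tool result omitted]"):
    """Return a copy in which all but the newest keep_last tool results are stubbed out."""
    tool_indexes = [i for i, m in enumerate(messages) if m.get("role") == "tool"]
    # Slicing with [:-0] would select nothing, so handle keep_last == 0 explicitly.
    stale = set(tool_indexes[:-keep_last]) if keep_last > 0 else set(tool_indexes)
    return [
        {**m, "content": placeholder} if i in stale else m
        for i, m in enumerate(messages)
    ]


history = [{"role": "user", "content": "q"}] + [
    {"role": "tool", "content": f"result {i}"} for i in range(5)
]
print([m["content"] for m in trim_tool_results(history, keep_last=2)])
```

Replacing the content rather than deleting the message keeps the turn structure intact, which matters for APIs that expect every tool call to be followed by a matching result. Swapping the placeholder for a short summary of the result is the natural next step.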

Tools that do the heavy lifting

You don't have to build all of this from scratch, and you shouldn't. There's a healthy ecosystem now — much of it open source — split roughly into two camps: observability tools that show you where the money goes (so you can find the spirals), and gateway/proxy tools that sit in front of your model calls and enforce limits centrally. The strongest setups pair one of each.

Cost-control best practices

Pulling it together, here's the checklist we'd want on the wall before any multi-agent system touches a real budget. The items you leave undone are exactly the ways your next bill surprises you.

The working checklist. Notice that none of these are about making the agent smarter — they're about bounding what it can spend when it isn't.

1. Design limits in from day one. Cost controls bolted on after launch always arrive after the first scary invoice. Treat a missing iteration cap like a missing null check.
2. Enforce, don't alert. Check budgets before the next call and refuse it. Keep the alerts too, but never rely on them to stop a runaway.
3. Cap iterations, tokens, dollars, and time. Four independent limits. Any one can save you; together they're hard to defeat.
4. Detect loops semantically. Block repeated identical actions and ping-pong between agents before the iteration cap even bites.
5. Bound delegation depth. Limit how many levels of sub-agents can spawn, or fan-out will do your budget's math for you.
6. Manage the context. Summarize or trim history between steps; retrieve only what's needed. The transcript is the meter.
7. Route by difficulty. Send routine steps to a cheap model; reserve the frontier tier for the reasoning that actually needs it.
8. Make every run observable. Log per-run token and dollar cost and trace it. You can't control what you can't see, and the spiral you can see is the spiral you can stop.
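The ping-pong half of item 4 can be sketched as a counter on handoffs. The version below assumes your orchestrator can give each subtask a stable ID and can see every handoff. `HandoffTracker` and its threshold are our own illustrative choices.

```python
from collections import Counter


class HandoffTracker:
    """Counts how often a task bounces between the same two agents."""

    def __init__(self, max_handoffs: int = 4):
        self.max_handoffs = max_handoffs
        self._counts = Counter()

    def allow(self, task_id: str, from_agent: str, to_agent: str) -> bool:
        """Record a handoff; return False once this pair has passed the task too often."""
        # frozenset makes A->B and B->A count against the same pair
        key = (task_id, frozenset((from_agent, to_agent)))
        self._counts[key] += 1
        return self._counts[key] <= self.max_handoffs


tracker = HandoffTracker(max_handoffs=2)
print(tracker.allow("task-1", "planner", "verifier"))  # True
print(tracker.allow("task-1", "verifier", "planner"))  # True
print(tracker.allow("task-1", "planner", "verifier"))  # False
```

When `allow` returns False, treat it like any other hard limit: halt the subtask, escalate, or return the best partial answer.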

The uncomfortable truth under all of this is that an autonomous system will, eventually, do something you didn't plan for. That's not a reason to avoid agents — it's the whole reason guardrails exist. You don't get to guarantee an agent never gets stuck. You do get to guarantee that when it does, it costs you fifty cents and a log line instead of a weekend and $47,000. That choice is yours to make, and the only wrong time to make it is after the fact.

The one-sentence version

You can't stop an agent from ever going wrong — but with a hard limit before every call, you can decide in advance exactly how much it's allowed to cost when it does.
