When Agents Burn Money: Cost Control in Multi-Agent Systems
A single agent stuck between tool calls can quietly spend more in a weekend than your whole team does in a month. Here's why multi-agent costs spiral, what a runaway actually looks like, and the guardrails — and tools — that stop the bleeding.
Someone on a team we know shipped a tidy little multi-agent workflow on a Friday. Four agents, a couple of tools, a plan that looked airtight in the demo. They went home. The agents did not. One of them hit a verification step that never quite passed, asked a teammate agent to help, got handed the task back, and the two of them settled into a polite, infinite conversation. Eleven days later someone opened the billing console. The number was $47,000.
That story has been told enough times now that it's practically folklore, but the details barely matter because the shape is always the same. Nothing crashed. No alert that mattered fired in time. The system did exactly what it was told to do — keep working until the task is done — and the task was never done. The only thing that grew was the bill.
We've spent a lot of words on this site teaching people how to make agents work. This post is about the other half of the job, the half nobody demos: making them work affordably, and making sure that when they go wrong — and they will — they fail cheap instead of failing expensive. If you take one idea away, make it this: in an agentic system, cost is not a billing concern you reconcile at month-end. It's a reliability property you design in from the first line.
Agent costs don't add up. They multiply.
Here's the mental model that gets everyone in trouble. We spent two years with chatbots, and chatbots taught us that cost scales linearly: one user message, one model response, a predictable handful of tokens. Double the users, double the bill. Easy to reason about, easy to budget.
Agents broke that intuition. An agent doesn't answer once — it reasons, calls a tool, reads the result, reasons again, maybe spawns a helper, maybe retries. Every one of those decision points can branch, and every branch can carry the entire conversation so far with it. So spend doesn't accumulate one message at a time. It compounds at every step.
Each branch can spawn sub-agents and recursive calls, so spend compounds at every decision instead of adding up one message at a time.
The numbers people report bear this out. A multi-step agent routinely consumes 5× to 50× the tokens of a single chatbot turn for the same nominal task, and a single careless top-level request can trigger a workflow that burns 100× what you'd expect. The reason is structural, not a bug you can patch: autonomy means the system decides how much work to do, and a confused system decides to do a lot.
The State of FinOps 2026 found that 98% of FinOps practices now manage some form of AI spend — up from 31% just two years earlier. Inference, not training, is now the dominant line item. Agentic workloads are the reason the curve bent.
The anatomy of a spiral: stuck between tool calls
The classic blow-up isn't a sudden explosion — it's a slow, steady drip that never stops. An agent operates in a loop: think, act, observe, repeat, until some condition says done. The danger lives in that last clause. If the stop condition is never satisfied — a tool keeps returning something the model doesn't accept, a verification step keeps failing, a goal is subtly impossible — there's nothing else in the loop that says give up. The agent just keeps calling.
What makes it expensive rather than merely annoying is the thing that makes agents work at all: they send the whole conversation back to the model on every step. The transcript of past reasoning and tool results is the agent's working memory, so it travels with each call. That means iteration 30 isn't paying for one step's worth of tokens — it's paying to re-read everything that happened in steps 1 through 29, again.
You don't pay for 3K once — you pay for the whole growing transcript on every call. After 8 tool calls you've been billed for ~108K input tokens, not 24K.
A loop that adds 3K tokens of context per step looks cheap per step. But because you re-pay for the whole transcript each time, fifteen quiet iterations can cost more than the entire successful run you were budgeting for. Slow drips drown you.
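To make that arithmetic concrete, here is a minimal sketch. It assumes the transcript grows by a fixed 3K tokens per step and ignores output tokens and any prompt caching your provider may offer; `cumulative_input_tokens` is a name made up for illustration.

```python
def cumulative_input_tokens(steps: int, tokens_per_step: int = 3_000) -> int:
    """Total input tokens billed when every call re-sends the whole transcript.

    Assumes the transcript grows by a fixed amount per step, so call k
    carries k * tokens_per_step input tokens.
    """
    return sum(k * tokens_per_step for k in range(1, steps + 1))


naive = 8 * 3_000                      # 24,000: paying for each step once
actual = cumulative_input_tokens(8)    # 108,000: re-reading the transcript every call
fifteen = cumulative_input_tokens(15)  # 360,000
```

The total grows with the square of the step count, which is why a loop that looks cheap per step stops being cheap long before it stops running.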
Four ways a multi-agent system bleeds money
“It got stuck” hides several distinct failure modes, and they don't all have the same fix. It's worth being able to name them, because the guardrail that stops one won't always stop another.
The fan-out problem
Multi-agent architectures love delegation: an orchestrator breaks a job into pieces and hands each to a sub-agent. That's powerful, and it's also where geometric cost lives. If the orchestrator spawns three workers, and each of those is itself allowed to spawn three more, you're two levels deep and already running thirteen agents. Let that recursion go one level further unchecked and you're funding an army.
An orchestrator that fans out to sub-agents — each of which fans out again — multiplies cost exponentially. One careless top-level request can spawn dozens of workers.
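The thirteen-agent figure is just a geometric series, and it is worth seeing how fast it grows. The sketch below assumes a full tree in which every agent spawns the same number of children; `total_agents` is a hypothetical helper, not part of any framework.

```python
def total_agents(branching: int, depth: int) -> int:
    """Count agents in a full delegation tree, orchestrator included.

    depth=0 is the orchestrator alone; each further level multiplies
    the previous level's agent count by `branching`.
    """
    return sum(branching ** level for level in range(depth + 1))


two_levels = total_agents(3, 2)    # 1 + 3 + 9 = 13
three_levels = total_agents(3, 3)  # 1 + 3 + 9 + 27 = 40
```

One extra level of unchecked recursion roughly triples the headcount here, and every one of those agents carries its own growing transcript.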
The retry storm and the ping-pong
Two cheaper-looking failures round out the set. A retry storm happens when a flaky tool or a rate limit triggers naïve retries — and because each retry re-sends the full context, ten retries cost ten full calls, not ten cheap pings. Tool ping-pong is the multi-agent version of a stuck loop: two agents hand the same subtask back and forth, each politely deferring to the other, neither ever converging. Both look like progress in the logs. Neither is.
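For the retry storm specifically, the usual remedy is to cap attempts and back off between them. The following is a rough sketch under our own assumptions: `call_with_bounded_retries` and `RetriesExhausted` are illustrative names, and real code should catch only errors it knows to be transient rather than every exception.

```python
import random
import time


class RetriesExhausted(Exception):
    """Raised when a call still fails after the allowed attempts."""


def call_with_bounded_retries(fn, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call fn(), retrying on failure a limited number of times.

    Waits base_delay * 2**attempt plus jitter between tries, so a flaky
    tool cannot trigger a tight loop of full-context calls.
    """
    last_error = None
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:  # narrow this to transient errors in real code
            last_error = exc
            if attempt < max_attempts - 1:
                sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
    raise RetriesExhausted(f"gave up after {max_attempts} attempts") from last_error
```

The `sleep` parameter exists only so tests can skip the waiting. The important property is the ceiling: three failed attempts cost three calls, then the run surfaces an error instead of paying for a fourth.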
Why your budget alert won't save you
Most teams' first instinct is to add a billing alert: “email me when we cross $500.” It feels responsible. It is almost useless for this problem. An alert is a detection mechanism, and it fires after the money is spent. By the time a human reads it — overnight, over a weekend, during a long meeting — the runaway has had hours to keep running.
Alert: fires *after* the money is already spent.
Enforcement: checked *before* the next call, and blocks it.
A token budget alert is not budget enforcement. The first tells you the house is on fire. The second is the sprinkler. (Paraphrasing the now-famous “$47,000 agent loop” write-up.)
The fix is to move the check before the call, not after the invoice. Before the agent makes its next model request, you ask: would this push us past the limit we set? If yes, the call doesn't happen. The run halts, escalates, or returns its best partial answer. Your maximum loss becomes the ceiling you chose — a number you decided on purpose, instead of one the agent discovered for you.
Building the guardrails
There's no single switch for this. Cost control in agents is defense in depth: a small stack of independent limits, any one of which can stop a runaway, layered so that the failure of one doesn't mean the failure of all. Here's the stack we reach for, roughly in order of how much grief each one saves you.
1. A hard iteration cap
The simplest and most important guardrail: every agent run gets a maximum number of steps. If it hasn't finished in, say, 15 iterations, it stops — with an error, a partial result, or an escalation to a human, but it stops. This single limit turns “infinite” into “bounded,” which is most of the battle.
    MAX_ITERS = 15
    MAX_USD = 0.50  # hard ceiling per run

    def run_agent(messages, model):
        # estimate_cost, halt, run_tool and append_tool_result are placeholders
        # for your own helpers; model.call stands in for your client's API
        spent = 0.0
        for step in range(MAX_ITERS):
            # estimate the cost of the NEXT call before making it
            projected = spent + estimate_cost(messages, model)
            if projected > MAX_USD:
                return halt(reason="cost_ceiling", spent=spent)
            resp = model.call(messages)
            spent += resp.usage.cost_usd  # track real spend per step
            if resp.is_final_answer:
                return resp
            messages = append_tool_result(messages, run_tool(resp))
        # loop exhausted without finishing: fail cheap, don't fail expensive
        return halt(reason="max_iters", spent=spent)

2. Token and dollar budgets, enforced pre-call
An iteration cap bounds steps, but steps aren't all equal — one call with a 200K-token context costs far more than ten small ones. So pair the step cap with a token budget (prompt + completion summed across the whole run) and, better still, a dollar ceiling checked before each call, exactly as in the snippet above. Gateways make this easy to enforce centrally rather than re-implementing it in every agent.
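If you do enforce this in application code, the two limits fit naturally into one small object. This is a sketch rather than any gateway's actual API: `RunBudget` and `BudgetExceeded` are hypothetical names, and the estimates passed to `check` are assumed to come from your own tokenizer and price table.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised before a call that would push the run past a limit."""


@dataclass
class RunBudget:
    max_tokens: int
    max_usd: float
    tokens_used: int = 0
    usd_spent: float = 0.0

    def check(self, est_tokens: int, est_usd: float) -> None:
        """Refuse the NEXT call if its estimate would breach either limit."""
        if self.tokens_used + est_tokens > self.max_tokens:
            raise BudgetExceeded("token budget")
        if self.usd_spent + est_usd > self.max_usd:
            raise BudgetExceeded("dollar ceiling")

    def record(self, tokens: int, usd: float) -> None:
        """Add the actual usage reported after a call completes."""
        self.tokens_used += tokens
        self.usd_spent += usd


budget = RunBudget(max_tokens=200_000, max_usd=0.50)
budget.check(est_tokens=12_000, est_usd=0.04)  # passes, so the call may proceed
budget.record(tokens=11_800, usd=0.039)        # then book what was really used
```

Keeping `check` and `record` separate matters: the estimate gates the call, while the real usage is what accumulates, so estimation errors do not compound across steps.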
3. A global timeout
Wall-clock time is a backstop for everything you didn't anticipate. A strict global timer — kill the entire chain after N seconds — catches the slow hang, the tool that never returns, the recursion you didn't bound. It's blunt, and that's the point: it doesn't need to understand why things went wrong to stop them.
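For an async agent chain, one way to get that blunt backstop is the standard library's `asyncio.wait_for`. The sketch below assumes the whole chain is a single awaitable; `run_with_deadline` and `stuck_chain` are illustrative names. Note that cancellation only reaches cooperative async code, so blocking calls or subprocesses need their own timeouts.

```python
import asyncio


async def run_with_deadline(chain, timeout_s: float):
    """Run an awaitable agent chain, cancelling it after timeout_s seconds.

    Returns (result, None) on success or (None, "timeout") if the deadline hit.
    """
    try:
        return await asyncio.wait_for(chain, timeout=timeout_s), None
    except asyncio.TimeoutError:
        return None, "timeout"


async def stuck_chain():
    # stand-in for a chain whose tool never returns
    await asyncio.sleep(3600)


result, reason = asyncio.run(run_with_deadline(stuck_chain(), timeout_s=0.05))
```

Here `result` is `None` and `reason` is `"timeout"`: the chain was cancelled without the wrapper knowing or caring what it was stuck on.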
4. Repeat-action detection
A loop has a tell: the agent keeps doing the same thing. Before executing an action, compare it against the last few steps. If it's about to call the same tool with the same arguments it used two steps ago, it isn't making progress — it's spinning. Block the duplicate, inject a nudge, or terminate. This catches stuck loops semantically, often well before the iteration cap would.
Before executing an action, compare it to the last few steps. If the agent is about to make the exact same call again, it's looping — break it instead of paying for it.
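The simplest version of this check is an exact match on tool name and arguments over a short window, sketched below. `RepeatDetector` is a hypothetical helper; it will not catch an agent that rephrases the same request with slightly different arguments, which needs fuzzier comparison.

```python
import json
from collections import deque


class RepeatDetector:
    """Flags a tool call identical to one made in the last `window` steps."""

    def __init__(self, window: int = 3):
        self.recent = deque(maxlen=window)

    @staticmethod
    def _fingerprint(tool: str, args: dict) -> str:
        # sort_keys makes {"a": 1, "b": 2} and {"b": 2, "a": 1} compare equal
        return tool + ":" + json.dumps(args, sort_keys=True, default=str)

    def is_repeat(self, tool: str, args: dict) -> bool:
        """Return True if this exact call was seen recently; record it either way."""
        fp = self._fingerprint(tool, args)
        seen = fp in self.recent
        self.recent.append(fp)
        return seen
```

Call `is_repeat` just before executing each action. On `True`, skip the tool call and either inject a nudge into the transcript or end the run.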
5. Budget pressure: tell the model it's running out
A subtler technique that works surprisingly well: don't just cut the agent off silently — warn it as it approaches the limit. Inject a system message a few iterations before the cap (“you have 3 steps left, wrap up and give your best answer; do not call more tools”). Models respond to this. It turns a hard, wasteful termination into a graceful landing, and often produces a usable answer instead of an error.
    # inside the agent loop, before each model call
    remaining = MAX_ITERS - step
    if remaining <= 3:
        messages.append({
            "role": "system",
            "content": (
                f"You have {remaining} steps left. Stop calling tools. "
                "Give your best final answer now with what you already know."
            ),
        })

6. Context management
Because the transcript is what makes long runs expensive, managing it is a cost lever, not just a quality one. After a few iterations, summarize or drop stale tool results so the context stops growing without bound. Retrieve only what the next step needs instead of carrying everything forward. And cap delegation depth so the fan-out tree can't recurse forever. None of these are exotic — they're hygiene — but skipping them is what turns a working agent into an expensive one.
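As one illustration of the "drop stale tool results" idea, the sketch below keeps the newest few tool outputs in full and replaces older ones with a short stub. It assumes messages are dicts with "role" and "content" keys and that tool outputs use the role "tool"; `trim_tool_results` is a made-up name, so adapt the shape to your framework.

```python
def trim_tool_results(messages, keep_last=3, stub="[older tool result omitted]"):
    """Return a copy of messages with all but the newest tool results stubbed out.

    Assumes each message is a dict with "role" and "content" keys and that
    tool outputs use role "tool"; adjust to your framework's message shape.
    """
    tool_indexes = [i for i, m in enumerate(messages) if m.get("role") == "tool"]
    stale = set(tool_indexes[:-keep_last]) if keep_last > 0 else set(tool_indexes)
    return [
        {**m, "content": stub} if i in stale else m
        for i, m in enumerate(messages)
    ]
```

Stubbing rather than deleting is a deliberate choice: the message count and order stay the same, which helps with APIs that expect every tool call to have a matching result. Summarizing the stale results with a cheap model is the natural next step if the stub loses too much.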
Tools that do the heavy lifting
You don't have to build all of this from scratch, and you shouldn't. There's a healthy ecosystem now — much of it open source — split roughly into two camps: observability tools that show you where the money goes (so you can find the spirals), and gateway/proxy tools that sit in front of your model calls and enforce limits centrally. The strongest setups pair one of each.
- LiteLLM — a unified gateway across providers with per-key and per-team budgets, rate limits, and spend caps built in. The most direct way to enforce a hard dollar ceiling without touching agent code.
- Portkey — gateway focused on routing, fallbacks, load balancing, and budget limits with minimal overhead; good when you want resilience and cost control in one hop.
- Helicone — a proxy that adds caching (don't pay twice for the same call) plus cost tracking; the cache alone can meaningfully cut spend on repetitive workloads.
- Langfuse (MIT) — the most full-featured open-source observability tool; traces every call with token and cost breakdowns, and ingests cost data directly from LiteLLM.
- Opik (Apache-2.0) and Phoenix (Arize) — open-source tracing and evaluation; self-hostable when prompts and data can't leave your infrastructure.
- OpenLLMetry — OpenTelemetry-based instrumentation, so your LLM spans flow into the same observability backend as the rest of your stack.
- Portal26 Agentic Token Controls and similar commercial entrants — purpose-built to cap runaway agent spend specifically, a sign the market now treats this as a first-class problem.
Self-host (Langfuse, Opik, LiteLLM proxy) when data residency matters or per-request pricing hurts at scale. Reach for the managed cloud tiers when time-to-value and zero-maintenance beat everything. Either way: a gateway for enforcement, an observability tool for visibility.
Cost-control best practices
Pulling it together, here's the checklist we'd want on the wall before any multi-agent system touches a real budget. The items you can't tick off are exactly the ways your next bill surprises you.
1. Design limits in from day one. Cost controls bolted on after launch always arrive after the first scary invoice. Treat a missing iteration cap like a missing null check.
2. Enforce, don't alert. Check budgets before the next call and refuse it. Keep the alerts too, but never rely on them to stop a runaway.
3. Cap iterations, tokens, dollars, and time. Four independent limits. Any one can save you; together they're hard to defeat.
4. Detect loops semantically. Block repeated identical actions and ping-pong between agents before the iteration cap even bites.
5. Bound delegation depth. Limit how many levels of sub-agents can spawn, or fan-out will do your budget's math for you.
6. Manage the context. Summarize or trim history between steps; retrieve only what's needed. The transcript is the meter.
7. Route by difficulty. Send routine steps to a cheap model; reserve the frontier tier for the reasoning that actually needs it.
8. Make every run observable. Log per-run token and dollar cost and trace it. You can't control what you can't see, and the spiral you can see is the spiral you can stop.
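Of the items above, bounding delegation depth is the one with no snippet elsewhere in this post, so here is a rough sketch. It assumes every spawn goes through one shared guard object per run; `SpawnGuard` and `DelegationLimitExceeded` are hypothetical names, and the default limits are arbitrary.

```python
class DelegationLimitExceeded(Exception):
    """Raised when a spawn would exceed the depth or total-agent limit."""


class SpawnGuard:
    def __init__(self, max_depth: int = 2, max_agents: int = 10):
        self.max_depth = max_depth
        self.max_agents = max_agents
        self.agents = 1  # the orchestrator itself

    def spawn(self, parent_depth: int) -> int:
        """Register a child agent and return its depth, or refuse the spawn."""
        child_depth = parent_depth + 1
        if child_depth > self.max_depth:
            raise DelegationLimitExceeded(f"depth {child_depth} > {self.max_depth}")
        if self.agents + 1 > self.max_agents:
            raise DelegationLimitExceeded(f"agent count would exceed {self.max_agents}")
        self.agents += 1
        return child_depth
```

Capping the total agent count as well as the depth covers the case where a shallow tree is simply very wide.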
The uncomfortable truth under all of this is that an autonomous system will, eventually, do something you didn't plan for. That's not a reason to avoid agents — it's the whole reason guardrails exist. You don't get to guarantee an agent never gets stuck. You do get to guarantee that when it does, it costs you fifty cents and a log line instead of a weekend and $47,000. That choice is yours to make, and the only wrong time to make it is after the fact.
You can't stop an agent from ever going wrong — but with a hard limit before every call, you can decide in advance exactly how much it's allowed to cost when it does.
Further reading & references
- The $47,000 Agent Loop: Why Token Budget Alerts Aren't Budget Enforcement (dev.to)
- AI Agents Burn 50x More Tokens Than Chats (LeanOps)
- Portal26 launches Agentic Token Controls to cap runaway AI agent spend (SiliconANGLE)
- Agent Iteration Budgets (LiteLLM docs)
- How to Prevent Infinite Loops and Spiraling Costs in Autonomous Agents (Codieshub)
- Agentic Resource Exhaustion: The “Infinite Loop” Attack of the AI Era (Medium)
- Beyond Max Tokens: Stealthy Resource Amplification via Tool Calling Chains (arXiv)
- AI Agent Cost Optimization in 2026: How to Cut Token Spend by 60% (NiteAgent)
- The Hidden Economics of AI Agents: Token Costs and Latency Trade-offs (Stevens Online)
- Token & Cost Tracking (Langfuse docs)
- 7 best free and open source LLM observability tools (PostHog)
- Langfuse vs Helicone vs Portkey: LLM Observability Compared (BuildMVPFast)
- AI Agent Token Budget Management: How Claude Code Prevents Runaway API Costs (MindStudio)