Structured Outputs · Frameworks · Python

Pydantic: The Contract Layer Your Agents Are Missing

Language models emit text. Your program wants objects it can trust. Pydantic is the validating border between the two — and once you've felt the difference, you stop letting an agent touch a tool without it. A walk from your first BaseModel to self-healing, type-safe agents.

AgentSwarms Authors

June 2, 2026 · 18 min read

The bug took three of us most of a morning. An invoicing agent had been quietly approving the wrong amounts for two days, and the trace looked fine — the model had clearly said the number. The problem was buried 40 lines downstream: the model had returned the total as the string "1,250.00", our code had happily concatenated it, and "1,250.00" + "49.00" is not a number you want on an invoice. Nothing threw. The types were a lie we'd agreed to believe.

We fixed it in the most boring way possible: we made that LLM call return a Pydantic model instead of a dict. The amount became a Decimal field, the comma-string was rejected at the boundary, and the bug — along with a dozen of its cousins we hadn't found yet — simply stopped being possible. This post is about why that small move is one of the highest-leverage things you can do when building agents, and how far it scales: from your first model to agents that fix their own mistakes.

If you remember one sentence

An LLM doesn't return data — it returns text that looks like data. Pydantic is what turns that text into something your program is allowed to trust, and it does it loudly, at the boundary, instead of silently three functions later.

The boundary problem

Every agent has the same seam running through it: on one side is a language model producing tokens; on the other is ordinary code that expects integers, dates, enums, and well-formed objects. The model is astonishingly good at producing things that look right, and it is occasionally, unpredictably wrong in ways that matter. It will give you "three" where you wanted 3, "yes" where you wanted true, a "kelvin" it invented for a units field, or valid JSON wrapped in an apologetic paragraph.

The naive approach — json.loads() the reply and reach into the dict — works in the demo and rots in production. It pushes the failure as far as possible from its cause: the data is wrong now, but you find out later, somewhere else, in a stack trace that points at the wrong line.

What the model actually returned:

{ "city": "Paris", "days": "three", "alerts": "yes" }

Hand-parsed, this is silently wrong: data["days"] is still the string "three". Forty lines later something does range(days) and crashes far from the real cause.

The same messy model output can be handled two ways: hand-parse it and hope, or validate it with Pydantic, which fails loudly at the boundary before anything downstream runs on bad data.
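To make the failure mode concrete, here is a minimal sketch of the hand-parsing path using only the standard library. The plan_forecast helper is hypothetical; it stands in for whatever code sits downstream of the parse.

```python
import json

# The reply from the example above: valid JSON, wrong types.
reply = '{"city": "Paris", "days": "three", "alerts": "yes"}'

data = json.loads(reply)  # succeeds, because the JSON itself is well-formed


def plan_forecast(query: dict) -> list[str]:
    """Hypothetical downstream code that assumes 'days' is an int."""
    return [f"{query['city']} day {i + 1}" for i in range(query["days"])]


try:
    plan_forecast(data)
    failure = None
except TypeError as exc:
    # The crash surfaces here, far from the parse step that let "three" in.
    failure = exc

print(type(data["days"]).__name__)  # str
print(failure)
```

Nothing at the json.loads() line hints at a problem; the TypeError only appears once some later function finally uses the value as a number.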

Start here: what Pydantic actually is

Strip away the agent context and Pydantic is a simple idea: declare the shape of your data with Python type hints, and have those hints enforced at runtime. You write a class; Pydantic validates, coerces, and gives you a real typed object back — or a precise error explaining what was wrong.

from typing import Literal
from pydantic import BaseModel, Field, ValidationError

class WeatherQuery(BaseModel):
    city: str
    units: Literal["c", "f"] = "c"
    days: int = Field(ge=1, le=7)        # 1 to 7, enforced
    alerts: bool = False

# The string an LLM might emit:
raw = '{"city": "Paris", "units": "c", "days": 3, "alerts": "yes"}'

q = WeatherQuery.model_validate_json(raw)
# q.days is the int 3; q.alerts is coerced to the bool True
print(q.city, q.days, q.alerts)   # Paris 3 True

# And when the model misbehaves:
try:
    WeatherQuery.model_validate_json('{"city": "Paris", "days": 14}')
except ValidationError as e:
    print(e)   # days: Input should be less than or equal to 7

That's the whole beginner story. Each annotation is a contract: days isn't documented as an int, it's required to be one (between 1 and 7), and units literally cannot be anything but "c" or "f". The type hint stops being a comment and becomes an enforced gate.

Anatomy of a model: each field's type and constraints define exactly what gets rejected, and the error you get back is readable. Take city in class WeatherQuery(BaseModel): it must be a string, so numbers, nulls, and missing values are rejected outright (42 → Input should be a valid string). This is the contract the model has to satisfy.

Use Pydantic v2 — and know why

Pydantic v2 rewrote the validation core in Rust (pydantic-core), making it roughly 5–50× faster than v1 depending on the workload. The API moved too: it's model_validate, model_validate_json, model_dump_json, and model_json_schema now (the v1 parse_obj, .json(), and .schema() names are deprecated). If a tutorial uses the old names, it most likely predates v2's release in mid-2023.
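As a quick orientation, this sketch exercises the v2 method names on a throwaway model. Point is a made-up class for illustration, and the comments list the deprecated v1 equivalents as we understand them.

```python
from pydantic import BaseModel


class Point(BaseModel):  # illustrative model, not from the article
    x: int
    y: int


p = Point.model_validate({"x": 1, "y": "2"})          # v1: Point.parse_obj(...)
same = Point.model_validate_json('{"x": 1, "y": 2}')  # v1: Point.parse_raw(...)

as_dict = p.model_dump()            # v1: p.dict()
as_json = p.model_dump_json()       # v1: p.json()
schema = Point.model_json_schema()  # v1: Point.schema()

print(as_dict)  # {'x': 1, 'y': 2}
print(as_json)  # {"x":1,"y":2}
```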

From parsing to contracts: structured outputs

Here's where it gets interesting for agents. A Pydantic model isn't just a validator you run after the model replies; it can shape what the model is allowed to reply in the first place. Most major LLM APIs accept a JSON Schema (as a tool definition or a response_format), and many can constrain generation to match it. And Pydantic emits that schema for you.

print(WeatherQuery.model_json_schema())
# Abridged (auto-generated "title" keys omitted):
# {
#   "properties": {
#     "city":   {"type": "string"},
#     "units":  {"enum": ["c", "f"], "default": "c"},
#     "days":   {"type": "integer", "minimum": 1, "maximum": 7},
#     "alerts": {"type": "boolean", "default": false}
#   },
#   "required": ["city", "days"], ...
# }

So a single class definition does triple duty: it documents the shape, it tells the LLM how to answer, and it validates the answer when it comes back. You never hand-write the JSON Schema, and you never hand-write the parser. That elimination of two error-prone, hand-maintained artifacts is a bigger deal than it sounds.

One model, three jobs. You write an ordinary Pydantic class with typed, described fields; this is the single source of truth. That class becomes the JSON Schema that constrains the LLM, then validates the reply back into a typed object. Define once, enforce everywhere.
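For a sense of what wiring this up by hand involves, here is a rough sketch with no network call. The tool-definition layout follows the widely used OpenAI-style function format, which is an assumption about your provider; get_weather is a hypothetical tool name, and simulated_reply stands in for what the model would send back.

```python
from typing import Literal

from pydantic import BaseModel, Field, ValidationError


class WeatherQuery(BaseModel):
    city: str = Field(description="City name, e.g. Paris")
    units: Literal["c", "f"] = "c"
    days: int = Field(ge=1, le=7, description="Forecast length in days")
    alerts: bool = False


# Jobs 1 and 2: the class documents the shape and yields the schema for the LLM.
# The wrapper layout is an assumed OpenAI-style tool definition.
tool_definition = {
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool name
        "description": "Look up a weather forecast.",
        "parameters": WeatherQuery.model_json_schema(),
    },
}

# Job 3: validate whatever arguments come back. This string is a stand-in
# for the model's reply; no API is called here.
simulated_reply = '{"city": "Paris", "days": 3}'

try:
    query = WeatherQuery.model_validate_json(simulated_reply)
except ValidationError as exc:
    query = None
    print(exc)

print(tool_definition["function"]["parameters"]["required"])  # ['city', 'days']
print(query)
```

The schema and the parser both come from the same class, so changing a field updates what the LLM is told and what your code accepts in one place.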

The Instructor pattern

You can wire this up by hand, but the library most teams reach for is Instructor. It patches your LLM client so that you pass a response_model and get a validated Pydantic instance straight back — schema generation, the API call, parsing, and validation all handled.

from typing import Literal

import instructor
from pydantic import BaseModel, Field

class Review(BaseModel):
    sentiment: Literal["positive", "negative", "neutral"]
    rating: int = Field(ge=1, le=10)
    summary: str

client = instructor.from_provider("openai/gpt-4o-mini")

review = client.chat.completions.create(
    response_model=Review,            # <- a Pydantic model, not a string
    messages=[{"role": "user", "content": "Summarise this review: ..."}],
)
# 'review' is a fully validated Review instance. review.rating is guaranteed 1–10.

Validators turn a parser into a guardrail

Types catch shape errors. But a lot of what you actually care about is semantic: a discount that can't exceed the price, an end date after a start date, a SKU that has to exist in your catalogue. Pydantic lets you attach custom checks with @field_validator (one field) and @model_validator (the whole object, after the fields are parsed).

from pydantic import BaseModel, field_validator, model_validator

class Booking(BaseModel):
    nights: int
    price_per_night: float
    discount: float = 0.0

    @field_validator("discount")
    @classmethod
    def discount_in_range(cls, v: float) -> float:
        if not 0 <= v <= 1:
            raise ValueError("discount must be a fraction between 0 and 1")
        return v

    @model_validator(mode="after")
    def discount_within_policy_cap(self):
        # Cross-field rule: the discount's dollar value is capped by policy.
        if self.discount * self.nights * self.price_per_night > 1000:
            raise ValueError("discount exceeds the policy cap of $1000")
        return self

Good LLM validation is just good validation

There's nothing LLM-specific about a validator that says 'a refund can't exceed the order total.' It's the same business rule you'd enforce on a web form. The shift in mindset is treating the model's output as untrusted user input — because that's exactly what it is.
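The refund rule mentioned above might look like the following sketch. RefundRequest and its field names are hypothetical, chosen only to illustrate a cross-field check with @model_validator.

```python
from decimal import Decimal

from pydantic import BaseModel, Field, ValidationError, model_validator


class RefundRequest(BaseModel):  # hypothetical model for illustration
    order_total: Decimal = Field(gt=0)
    refund: Decimal = Field(ge=0)

    @model_validator(mode="after")
    def refund_within_total(self) -> "RefundRequest":
        # Cross-field business rule: runs after both fields have been parsed.
        if self.refund > self.order_total:
            raise ValueError(
                f"refund {self.refund} exceeds order total {self.order_total}"
            )
        return self


ok = RefundRequest(order_total="120.00", refund="20.00")

try:
    RefundRequest(order_total="120.00", refund="500.00")
    rejected = False
except ValidationError as exc:
    rejected = True
    message = str(exc)

print(ok.refund)  # 20.00
print(rejected)   # True
```

The same class would guard a web form just as well; the only difference is that here the untrusted caller is a language model.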

The advanced move: self-healing agents

Now combine the two ideas — structured outputs and validators — and something genuinely powerful falls out. When validation fails, you don't have to crash. You have a precise, human-readable description of what was wrong. Feed that error straight back to the model as the next prompt, and it will usually fix its own mistake.

This is the heart of Instructor's max_retries: a failed ValidationError becomes a corrective message, the model tries again with the feedback, and you only see the result once it passes. Your validators effectively become the agent's quality bar — and the agent climbs to meet it.

The self-correction loop, with a retry budget:

Attempt 1 (fails): { rating: 11, summary: 'Great!' }
ValidationError: rating must be ≤ 10. The error is fed back into the prompt.

Attempt 2 (passes): { rating: 9, summary: 'Great product, minor gripes.' }
The model corrected itself from the error message, and the result is returned.

With a retry budget, a ValidationError isn't fatal: it's fed back as the next prompt, and the model reads its own mistake and fixes it. Without one, the first failure is where the run ends.

from pydantic import BaseModel, Field, field_validator

class Answer(BaseModel):
    rating: int = Field(ge=1, le=10)
    summary: str

    @field_validator("summary")
    @classmethod
    def must_be_grounded(cls, v: str) -> str:
        if len(v) < 20:
            # This message is shown to the model on retry — write it FOR the model.
            raise ValueError("summary too short; cite at least one concrete detail")
        return v

# 'client' is the Instructor client created in the earlier example.
answer = client.chat.completions.create(
    response_model=Answer,
    max_retries=2,        # on failure, re-ask with the validation error appended
    messages=[...],
)

Write error messages for the model, not just the log

Once a validator's error can become a prompt, its wording matters. 'Invalid input' helps no one. 'rating must be an integer from 1 to 10; you returned 11' tells the model exactly how to correct itself on the next pass. Your error strings are now part of your prompt engineering.
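To show the shape of that loop without a network call, here is a hand-rolled sketch. It is not Instructor's implementation: fake_llm is a hypothetical stand-in that returns a bad reply first and a corrected one once it sees feedback, and ask_with_retries is an illustrative helper.

```python
from pydantic import BaseModel, Field, ValidationError


class Answer(BaseModel):
    rating: int = Field(ge=1, le=10)
    summary: str


def fake_llm(messages: list[dict]) -> str:
    """Hypothetical model: wrong on the first try, right after feedback."""
    saw_feedback = any("Validation failed" in m["content"] for m in messages)
    if saw_feedback:
        return '{"rating": 9, "summary": "Great product, minor gripes."}'
    return '{"rating": 11, "summary": "Great!"}'


def ask_with_retries(prompt: str, max_retries: int = 2) -> Answer:
    messages = [{"role": "user", "content": prompt}]
    for attempt in range(max_retries + 1):
        reply = fake_llm(messages)
        try:
            return Answer.model_validate_json(reply)
        except ValidationError as exc:
            if attempt == max_retries:
                raise  # budget exhausted: surface the error to the caller
            # Feed the bad reply and the precise error back as the next turn.
            messages.append({"role": "assistant", "content": reply})
            messages.append(
                {"role": "user", "content": f"Validation failed:\n{exc}\nPlease fix."}
            )
    raise AssertionError("unreachable")


answer = ask_with_retries("Rate this review: ...")
print(answer.rating)  # 9
```

With max_retries=0 the same call raises on the first bad reply, which is the behaviour you get with no retry budget at all.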

Routing with discriminated unions

Agents constantly have to choose which of several things to do — search, send an email, issue a refund — each with completely different arguments. The fragile way is a free-form action string plus a bag of optional fields, validated by hand. The robust way is a discriminated union: a set of typed models tagged by a literal field, and Pydantic routes the model's output to exactly the right one.

from typing import Literal, Union
from pydantic import BaseModel, Field
from decimal import Decimal

class SearchAction(BaseModel):
    action: Literal["search"]
    query: str
    top_k: int = 5

class RefundAction(BaseModel):
    action: Literal["refund"]
    order_id: str
    amount: Decimal

class AgentStep(BaseModel):
    # The 'action' field discriminates which payload this must be.
    step: Union[SearchAction, RefundAction] = Field(discriminator="action")

# A refund with amount="a lot" can never parse into RefundAction —
# the malformed tool call is impossible by construction.

This is more than tidy code. It means a malformed tool call — the right action name with the wrong arguments — cannot reach your execution layer, because it never validates into the corresponding model. The agent's intent and its arguments are checked together, as a unit.

A tagged union as a type-safe router: the model picks an action, and the discriminator routes it to exactly one payload shape. For example, { action: 'search', query: 'GLP-1 side effects', top_k: 3 } is validated against SearchAction(query: str, top_k: int = 5). An action='refund' with an amount of 'a lot' is rejected; it never reaches your payments code, because the union won't parse it into a RefundAction. The agent's intent and its arguments are validated together.
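Here is a small, self-contained sketch of the routing in action. It repeats the models from the snippet above so it runs on its own; the dispatch function and its string return values are hypothetical placeholders for real handlers.

```python
from decimal import Decimal
from typing import Literal, Union

from pydantic import BaseModel, Field, ValidationError


class SearchAction(BaseModel):
    action: Literal["search"]
    query: str
    top_k: int = 5


class RefundAction(BaseModel):
    action: Literal["refund"]
    order_id: str
    amount: Decimal


class AgentStep(BaseModel):
    step: Union[SearchAction, RefundAction] = Field(discriminator="action")


def dispatch(raw: str) -> str:
    """Hypothetical dispatcher: only validated steps reach a handler."""
    step = AgentStep.model_validate_json(raw).step
    if isinstance(step, SearchAction):
        return f"search:{step.query}:{step.top_k}"
    return f"refund:{step.order_id}:{step.amount}"


routed = dispatch(
    '{"step": {"action": "search", "query": "GLP-1 side effects", "top_k": 3}}'
)

try:
    dispatch('{"step": {"action": "refund", "order_id": "A-1", "amount": "a lot"}}')
    blocked = False
except ValidationError as exc:
    blocked = True
    # Collect where validation failed, for logging or a retry prompt.
    error_locations = [err["loc"] for err in exc.errors()]

print(routed)   # search:GLP-1 side effects:3
print(blocked)  # True
```

Because the handler only ever receives a fully validated SearchAction or RefundAction, it needs no defensive checks of its own.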

Where Pydantic sits in a whole agent

Once you start seeing the model's output as untrusted input, you notice the same boundary repeating all over an agent. It's not one feature — it's a posture you apply everywhere untyped data tries to get in.

Pydantic isn't one feature in an agent; it's the validated boundary at every place untyped data tries to get in. These are the boundaries it guards in a real agent. The tool-argument edge is the highest-value place to start, because that's where a hallucinated call turns into a real-world side effect.

Tool arguments — validate the model's proposed call before any API or database is touched. This single guard prevents the largest category of agent damage.
Tool results — parse third-party responses into models so an upstream schema change fails fast instead of silently corrupting state.
Final output — hand downstream systems a validated object, not a hopeful string.
State & memory — typed scratchpads and plans, so a corrupt step can't quietly poison the next.
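A guard on the tool-argument boundary can be sketched as follows. SendEmailArgs, send_email, and run_tool are all hypothetical names, and the sent list stands in for a real side effect so the example stays runnable offline.

```python
from pydantic import BaseModel, Field, ValidationError


class SendEmailArgs(BaseModel):  # hypothetical tool schema
    to: str = Field(pattern=r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
    subject: str = Field(min_length=1, max_length=120)
    body: str


sent: list[SendEmailArgs] = []  # stands in for a real side effect


def send_email(args: SendEmailArgs) -> None:
    sent.append(args)


def run_tool(raw_arguments: str) -> str:
    """Validate the model's proposed arguments before anything executes."""
    try:
        args = SendEmailArgs.model_validate_json(raw_arguments)
    except ValidationError as exc:
        # Nothing has run yet; report the failure so the agent can retry.
        return f"rejected: {exc.error_count()} validation error(s)"
    send_email(args)
    return "sent"


good = run_tool('{"to": "ops@example.com", "subject": "Hi", "body": "Hello"}')
bad = run_tool('{"to": "not-an-address", "subject": "", "body": "Hello"}')

print(good)  # sent
print(bad)
```

The important property is ordering: the side effect sits after the validation call, so a rejected payload leaves the outside world untouched.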

PydanticAI: when the model becomes the type

In late 2024 the Pydantic team shipped PydanticAI, an agent framework built on exactly this philosophy — they describe the goal as bringing 'the FastAPI feeling' to GenAI. It reached v1.0 in September 2025 and has iterated hard since. The pitch: an agent whose inputs, outputs, tools, and dependencies are all validated by Pydantic models, with errors surfaced at development time rather than in production.

from dataclasses import dataclass
from pydantic import BaseModel
from pydantic_ai import Agent, RunContext

@dataclass
class Deps:                       # typed dependencies, injected — testable & swappable
    customer_id: str
    db: "Database"

class SupportReply(BaseModel):    # the agent's validated output type
    answer: str
    escalate: bool
    refund_amount: float = 0.0

agent = Agent(
    "openai:gpt-4o",
    deps_type=Deps,
    output_type=SupportReply,     # the run is guaranteed to end as a SupportReply
    system_prompt="You are a support agent. Be precise and cite balances.",
)

@agent.tool
async def get_balance(ctx: RunContext[Deps]) -> float:
    return await ctx.deps.db.balance(ctx.deps.customer_id)

# 'db' is your own Database instance, created elsewhere (not shown here).
result = agent.run_sync("Can I get a refund?", deps=Deps("c-42", db))
print(result.output.refund_amount)   # a float — validated, typed, safe

Notice what the types buy you. output_type=SupportReply means the entire run is guaranteed to terminate as a validated SupportReply — if the model returns something malformed, PydanticAI retries or raises a typed exception. deps_type=Deps means your tools receive typed dependencies you can swap for fakes in a test, no monkey-patching required. The framework is, essentially, this whole blog post turned into a product.

Observability comes along for the ride

Because everything is typed and structured, PydanticAI integrates cleanly with Logfire (also from the Pydantic team) for tracing — you can watch each validated step, retry, and tool call. Structured data in, structured traces out. That's not a coincidence; it's the payoff of typing the boundaries.

Streaming + validation aren't mutually exclusive

Pydantic can validate partial objects as they stream, so you get the responsiveness of token-by-token output and the safety of a typed result. You don't have to choose between a fast UI and a validated one.
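One low-level building block for this is partial JSON parsing in pydantic_core, sketched below. It assumes a Pydantic release recent enough to support the allow_partial flag on from_json (2.7 or later, as far as we know), and PartialReview, with every field optional, is a hypothetical model for rendering incomplete data. It says nothing about how any particular framework streams internally.

```python
from typing import Optional

from pydantic import BaseModel
from pydantic_core import from_json


class PartialReview(BaseModel):  # hypothetical: every field optional mid-stream
    sentiment: Optional[str] = None
    rating: Optional[int] = None
    summary: Optional[str] = None


# A reply cut off mid-stream, part-way through the "summary" key.
chunk = '{"sentiment": "positive", "rating": 9, "summ'

data = from_json(chunk, allow_partial=True)  # the incomplete trailing key is dropped
partial = PartialReview.model_validate(data)

print(data)
print(partial.rating, partial.summary)
```

Once the stream finishes, you would validate the complete payload against the strict, non-optional model before acting on it.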

The honest costs and gotchas

This isn't a free lunch, and pretending otherwise does you a disservice. A few things to keep in your peripheral vision:

Validation has a cost. It's fast, but it's not free — in a hot loop over millions of objects, Pydantic's flexibility shows up on the profile. For pure high-throughput (de)serialization without rich validation, a leaner tool can win (more on that next).
Over-strict schemas can hurt the model. A schema with 40 required fields, deep nesting, and exotic constraints can confuse the LLM into worse outputs or constant retries. Keep schemas as flat and as forgiving as correctness allows; validate hard only where it matters.
Coercion can surprise you. Pydantic will helpfully turn "3" into 3 and "yes"-ish values into booleans. Usually a feature; occasionally a foot-gun. Reach for strict mode when you would rather reject "3" for an int field than have it silently converted.
Retries cost tokens. Self-healing is wonderful until a pathological input loops to your max_retries ceiling on every request. Cap retries, and log how often you're hitting the ceiling — it's a quality signal.
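The coercion point is easy to see side by side. This sketch uses a made-up Quantity model and the strict=True argument to model_validate; setting strictness per model or per field is also possible but not shown.

```python
from pydantic import BaseModel, ValidationError


class Quantity(BaseModel):  # illustrative model
    count: int
    confirmed: bool


payload = {"count": "3", "confirmed": "yes"}

lax = Quantity.model_validate(payload)  # default mode: coerces both fields
print(lax.count, lax.confirmed)         # 3 True

try:
    Quantity.model_validate(payload, strict=True)  # strict: no coercion
    strict_errors = 0
except ValidationError as exc:
    strict_errors = exc.error_count()

print(strict_errors)  # 2
```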

So is there an alternative to Pydantic?

Yes — several, and a couple are genuinely better for specific jobs. The point isn't that Pydantic is the only tool; it's that it's the right default, and knowing when to deviate is the mark of someone who actually understands the trade-off.

Pydantic v2, scored directionally: Speed 3/5 · Validation depth 5/5 · LLM ergonomics 5/5 · Ecosystem 5/5. The default for a reason: Rust-fast for its class, deep validation, and every major agent framework speaks it. Some overhead vs pure-speed tools.

Scores are directional, not benchmarks. The honest summary: Pydantic wins on ergonomics and ecosystem, msgspec on raw speed, Zod if you're in TypeScript, and provider-native outputs still lean on a schema you probably wrote in Pydantic.

The realistic landscape follows, with where each library is strong and where it isn't. There's a right answer for each situation, and it isn't always Pydantic.

msgspec — 2–5× faster than Pydantic v2 for (de)serialization, built on rigid typed Structs. The pick when raw speed in a high-throughput service is your bottleneck. The trade: thinner validation, fewer conveniences, smaller ecosystem.
dataclasses / TypedDict — standard library, zero dependencies, and no runtime validation. Your IDE checks the hints; the running program does not. Fine for trusted internal data, dangerous as a guard against what an LLM hands you.
attrs + cattrs — mature, fast, flexible class-building with separate structuring. Less batteries-included for the JSON-Schema-for-LLMs workflow; you wire more of the glue yourself.
marshmallow — battle-tested serialization/validation from the web-API era. Verbose (schema as a separate class) and predates the structured-output pattern, but solid where it already lives.
Zod — if your agent is in TypeScript, this is the answer, not a compromise. Schema-first, superb type inference, and first-class support in the JS LLM SDKs. It's the Pydantic of that world.
Provider-native structured outputs — OpenAI and Anthropic can constrain generation to a JSON Schema directly. Powerful and worth using — but you still need something to define the schema and validate edge cases after the fact. In Python, that something is almost always Pydantic.
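The dataclasses point in the list above is worth seeing once, since it is the easiest one to get wrong. In this sketch QueryDC and QueryModel are illustrative names for the same shape declared two ways.

```python
from dataclasses import dataclass

from pydantic import BaseModel, ValidationError


@dataclass
class QueryDC:  # standard-library dataclass: the hints are not enforced
    city: str
    days: int


class QueryModel(BaseModel):  # same shape, enforced at runtime
    city: str
    days: int


loose = QueryDC(city="Paris", days="three")  # accepted without complaint
print(type(loose.days).__name__)             # str

try:
    QueryModel(city="Paris", days="three")
    caught = False
except ValidationError:
    caught = True

print(caught)  # True
```

A static type checker would flag the QueryDC call in your own code, but it never sees the values an LLM produces at runtime.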

The verdict

Use Pydantic v2 by default for Python agents — the ecosystem assumes it and the ergonomics are unmatched. Drop to msgspec when a profiler tells you to. Use Zod if you're in TypeScript. And lean on provider-native outputs as a complement, not a replacement — they constrain the model, but Pydantic still defines and verifies the contract.

A practical playbook

1. Make every LLM call that feeds code return a Pydantic model, not a string or a bare dict. This one habit removes the most bugs.
2. Let the Pydantic model generate your JSON Schema (model_json_schema()); never hand-maintain it.
3. Validate tool arguments before execution; it's the highest-value boundary in the whole agent.
4. Encode business rules as validators, and write their error messages for the model, because they become retry prompts.
5. Give the agent a small retry budget so validation errors self-heal, then log how often you hit the ceiling.
6. Use discriminated unions for action/tool selection so malformed calls can't parse.
7. Keep schemas flat and forgiving; validate hard only where correctness genuinely matters.
8. Reach for msgspec or provider-native outputs only when a real constraint (speed, platform) tells you to.

Where this lands in AgentSwarms

This thinking is baked into the platform. The LLM Tool-Calling JSON Schema Generator turns a function description into a valid schema (the same artifact Pydantic would emit), so the tool-argument boundary is typed from the start. And when you export a swarm or an agent to LangGraph, CrewAI, the OpenAI Agents SDK, or Strands, the generated code uses typed, validated tool signatures rather than free-form dictionaries — the structured-output discipline travels with your design.

A note on scope

AgentSwarms is a learning and prototyping platform, not a production runtime. The aim here isn't to sell you a validator — it's to give you the mental model so that whatever you ship treats the model's output as what it is: untrusted text, one validation away from being something you can trust.

Language models will keep getting better at sounding right. They will not stop occasionally being wrong in ways that matter — that's the nature of the thing. The teams that build agents you can rely on aren't the ones with the cleverest prompts; they're the ones who put a validating border at every seam and refused to let an unverified string become an action. Pydantic is the cheapest, most boring, most effective way to draw that border. Draw it early.
