Pydantic: The Contract Layer Your Agents Are Missing
Language models emit text. Your program wants objects it can trust. Pydantic is the validating border between the two — and once you've felt the difference, you stop letting an agent touch a tool without it. A walk from your first BaseModel to self-healing, type-safe agents.
The bug took three of us most of a morning. An invoicing agent had been quietly approving the wrong amounts for two days, and the trace looked fine — the model had clearly said the number. The problem was buried 40 lines downstream: the model had returned the total as the string "1,250.00", our code had happily concatenated it, and "1,250.00" + "49.00" is not a number you want on an invoice. Nothing threw. The types were a lie we'd agreed to believe.
We fixed it in the most boring way possible: we made that LLM call return a Pydantic model instead of a dict. The amount became a Decimal field, the comma-string was rejected at the boundary, and the bug — along with a dozen of its cousins we hadn't found yet — simply stopped being possible. This post is about why that small move is one of the highest-leverage things you can do when building agents, and how far it scales: from your first model to agents that fix their own mistakes.
An LLM doesn't return data — it returns text that looks like data. Pydantic is what turns that text into something your program is allowed to trust, and it does it loudly, at the boundary, instead of silently three functions later.
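The fix described above can be sketched in a few lines. This is an illustration rather than the original invoicing code: the `InvoiceTotal` model and its `total` field are hypothetical names, and the sketch assumes Pydantic v2.

```python
from decimal import Decimal

from pydantic import BaseModel, ValidationError


class InvoiceTotal(BaseModel):  # hypothetical model, for illustration only
    total: Decimal


# The silent failure: string concatenation "works" and produces nonsense.
broken = "1,250.00" + "49.00"
print(broken)  # 1,250.0049.00

# The loud failure: a comma-formatted string is not a valid Decimal.
try:
    InvoiceTotal.model_validate_json('{"total": "1,250.00"}')
    rejected = False
except ValidationError:
    rejected = True

# A clean numeric string is accepted and becomes a real Decimal.
ok = InvoiceTotal.model_validate_json('{"total": "1250.00"}')
new_total = ok.total + Decimal("49.00")
```

The point of the sketch is where the failure happens: at the moment the reply is parsed, not forty lines later.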
The boundary problem
Every agent has the same seam running through it: on one side is a language model producing tokens; on the other is ordinary code that expects integers, dates, enums, and well-formed objects. The model is astonishingly good at producing things that look right and occasionally, unpredictably wrong in ways that matter. It will give you "three" where you wanted 3, "yes" where you wanted true, a kelvin it invented for a units field, or valid JSON wrapped in an apologetic paragraph.
The naive approach — json.loads() the reply and reach into the dict — works in the demo and rots in production. It pushes the failure as far as possible from its cause: the data is wrong now, but you find out later, somewhere else, in a stack trace that points at the wrong line.
With the naive approach, data["days"] is still the string "three". Forty lines later something calls range(days) and crashes far from the real cause.
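A minimal sketch of that failure mode, using only the standard library. The `reply` string and the `build_forecast` helper are made up for illustration; the shape of the problem is what matters.

```python
import json

# What the model sent back: valid JSON, wrong type for "days".
reply = '{"city": "Paris", "days": "three"}'

data = json.loads(reply)  # parses without complaint
days = data["days"]       # still the string "three"


def build_forecast(days):
    """Hypothetical downstream code that assumes an int."""
    return [f"day {i + 1}" for i in range(days)]


# The crash happens here, far from the line that accepted the bad data.
try:
    build_forecast(days)
    error = None
except TypeError as exc:
    error = str(exc)

print(error)
```

Nothing in `json.loads` is at fault; it did its job. The missing piece is a step that checks the parsed value against what the rest of the program expects.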
Start here: what Pydantic actually is
Strip away the agent context and Pydantic is a simple idea: declare the shape of your data with Python type hints, and have those hints enforced at runtime. You write a class; Pydantic validates, coerces, and gives you a real typed object back — or a precise error explaining what was wrong.
from typing import Literal
from pydantic import BaseModel, Field, ValidationError

class WeatherQuery(BaseModel):
    city: str
    units: Literal["c", "f"] = "c"
    days: int = Field(ge=1, le=7)  # 1 to 7, enforced
    alerts: bool = False

# The string an LLM might emit:
raw = '{"city": "Paris", "units": "c", "days": 3, "alerts": "yes"}'
q = WeatherQuery.model_validate_json(raw)
# q.days is the int 3; q.alerts is coerced to the bool True
print(q.city, q.days, q.alerts)  # Paris 3 True

# And when the model misbehaves:
try:
    WeatherQuery.model_validate_json('{"city": "Paris", "days": 14}')
except ValidationError as e:
    print(e)  # days: Input should be less than or equal to 7
That's the whole beginner story. Each annotation is a contract: days isn't documented as an int, it's required to be one (between 1 and 7), and units literally cannot be anything but "c" or "f". The type hint stops being a comment and becomes an enforced gate.
Pydantic v2 rewrote the validation core in Rust (pydantic-core), making it roughly 5–50× faster than v1 depending on the workload. The API moved too: it's model_validate, model_validate_json, and model_json_schema now (the v1 parse_obj / .json() names are deprecated). If a tutorial uses the old names, it's pre-2023.
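As a quick orientation, the renames map roughly as follows. The `City` model is a hypothetical example; the v1 names appear only in comments, and the sketch assumes Pydantic v2 is installed.

```python
from pydantic import BaseModel


class City(BaseModel):  # hypothetical model, for illustration only
    name: str
    population: int


# v1: City.parse_obj({...})      -> v2: City.model_validate({...})
city = City.model_validate({"name": "Paris", "population": "2100000"})

# v1: City.parse_raw('...')      -> v2: City.model_validate_json('...')
same = City.model_validate_json('{"name": "Paris", "population": 2100000}')

# v1: city.dict() / city.json()  -> v2: city.model_dump() / city.model_dump_json()
as_dict = city.model_dump()
as_json = city.model_dump_json()

# v1: City.schema()              -> v2: City.model_json_schema()
schema = City.model_json_schema()
```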
From parsing to contracts: structured outputs
Here's where it gets interesting for agents. A Pydantic model isn't just a validator you run after the model replies; it can shape what the model is allowed to reply in the first place. Most major LLM APIs accept a JSON Schema (as a tool definition or a response_format), and many will constrain generation to match it. Pydantic emits that schema for you.
print(WeatherQuery.model_json_schema())
# {
#   "properties": {
#     "city": {"type": "string"},
#     "units": {"enum": ["c", "f"], "default": "c"},
#     "days": {"type": "integer", "minimum": 1, "maximum": 7},
#     "alerts": {"type": "boolean", "default": false}
#   },
#   "required": ["city", "days"], ...
# }
So a single class definition does triple duty: it documents the shape, it tells the LLM how to answer, and it validates the answer when it comes back. You never hand-write the JSON Schema, and you never hand-write the parser. That elimination of two error-prone, hand-maintained artifacts is a bigger deal than it sounds.
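To make the round trip concrete, here is one way the schema might be wrapped as a tool definition and the model's proposed arguments checked on the way back. The envelope below assumes the OpenAI-style function-tool shape; other providers wrap the same JSON Schema differently. The tool name `get_weather` and the helper `parse_tool_arguments` are hypothetical, and no API call is made.

```python
from typing import Literal, Optional

from pydantic import BaseModel, Field, ValidationError


class WeatherQuery(BaseModel):
    """Look up a weather forecast for a city."""

    city: str
    units: Literal["c", "f"] = "c"
    days: int = Field(ge=1, le=7)
    alerts: bool = False


# Assumed envelope: the OpenAI-style function-tool shape.
tool_definition = {
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool name
        "description": WeatherQuery.__doc__,
        "parameters": WeatherQuery.model_json_schema(),
    },
}


def parse_tool_arguments(arguments_json: str) -> Optional[WeatherQuery]:
    """Validate the argument string the model proposed; None if invalid."""
    try:
        return WeatherQuery.model_validate_json(arguments_json)
    except ValidationError:
        return None


good = parse_tool_arguments('{"city": "Oslo", "days": 2}')
bad = parse_tool_arguments('{"city": "Oslo", "days": 30}')
required = tool_definition["function"]["parameters"]["required"]
```

The same class produces the schema that goes out and validates the arguments that come back, which is the whole argument of this section in miniature.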
The Instructor pattern
You can wire this up by hand, but the library most teams reach for is Instructor. It patches your LLM client so that you pass a response_model and get a validated Pydantic instance straight back — schema generation, the API call, parsing, and validation all handled.
from typing import Literal

import instructor
from pydantic import BaseModel, Field

class Review(BaseModel):
    sentiment: Literal["positive", "negative", "neutral"]
    rating: int = Field(ge=1, le=10)
    summary: str

client = instructor.from_provider("openai/gpt-4o-mini")
review = client.chat.completions.create(
    response_model=Review,  # <- a Pydantic model, not a string
    messages=[{"role": "user", "content": "Summarise this review: ..."}],
)
# 'review' is a fully validated Review instance. review.rating is guaranteed 1–10.
Validators turn a parser into a guardrail
Types catch shape errors. But a lot of what you actually care about is semantic: a discount that can't exceed the price, an end date after a start date, a SKU that has to exist in your catalogue. Pydantic lets you attach custom checks with @field_validator (one field) and @model_validator (the whole object, after the fields are parsed).
from pydantic import BaseModel, field_validator, model_validator

class Booking(BaseModel):
    nights: int
    price_per_night: float
    discount: float = 0.0

    @field_validator("discount")
    @classmethod
    def discount_in_range(cls, v: float) -> float:
        if not 0 <= v <= 1:
            raise ValueError("discount must be a fraction between 0 and 1")
        return v

    @model_validator(mode="after")
    def discount_not_above_total(self):
        if self.discount * self.nights * self.price_per_night > 1000:
            raise ValueError("discount exceeds the policy cap of $1000")
        return self
There's nothing LLM-specific about a validator that says 'a refund can't exceed the order total.' It's the same business rule you'd enforce on a web form. The shift in mindset is treating the model's output as untrusted user input, because that's exactly what it is.
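The other two rules mentioned earlier (a SKU that must exist in the catalogue, an end date after a start date) follow the same pattern. A minimal sketch, assuming Pydantic v2; `RentalRequest` and the `KNOWN_SKUS` set are hypothetical stand-ins for a real catalogue lookup.

```python
from datetime import date

from pydantic import BaseModel, ValidationError, field_validator, model_validator

KNOWN_SKUS = {"TENT-2P", "STOVE-01"}  # hypothetical catalogue


class RentalRequest(BaseModel):  # hypothetical model
    sku: str
    start: date
    end: date

    @field_validator("sku")
    @classmethod
    def sku_must_exist(cls, v: str) -> str:
        if v not in KNOWN_SKUS:
            raise ValueError(f"unknown SKU {v!r}; choose one of {sorted(KNOWN_SKUS)}")
        return v

    @model_validator(mode="after")
    def end_after_start(self):
        # Runs only once every individual field has validated.
        if self.end <= self.start:
            raise ValueError("end date must be after start date")
        return self


def error_messages(payload: dict) -> list:
    """Return the validation messages for a payload (empty if it is valid)."""
    try:
        RentalRequest.model_validate(payload)
        return []
    except ValidationError as exc:
        return [err["msg"] for err in exc.errors()]


bad_sku = error_messages({"sku": "KAYAK-9", "start": "2025-07-10", "end": "2025-07-12"})
bad_dates = error_messages({"sku": "TENT-2P", "start": "2025-07-10", "end": "2025-07-08"})
fine = error_messages({"sku": "TENT-2P", "start": "2025-07-10", "end": "2025-07-12"})
```

Field-level checks catch a bad value in isolation; the model-level check catches a combination that is wrong even though each value is individually fine.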
The advanced move: self-healing agents
Now combine the two ideas — structured outputs and validators — and something genuinely powerful falls out. When validation fails, you don't have to crash. You have a precise, human-readable description of what was wrong. Feed that error straight back to the model as the next prompt, and it will usually fix its own mistake.
This is the heart of Instructor's max_retries: a ValidationError becomes a corrective message, the model tries again with the feedback, and you only see the result once it passes. Your validators effectively become the agent's quality bar, and the agent climbs to meet it.
The validation error isn't a dead end — it becomes the next prompt. The agent reads its own mistake and fixes it.
from pydantic import BaseModel, Field, field_validator

class Answer(BaseModel):
    rating: int = Field(ge=1, le=10)
    summary: str

    @field_validator("summary")
    @classmethod
    def must_be_grounded(cls, v: str) -> str:
        if len(v) < 20:
            # This message is shown to the model on retry, so write it FOR the model.
            raise ValueError("summary too short; cite at least one concrete detail")
        return v

# client is the Instructor client created in the previous example.
answer = client.chat.completions.create(
    response_model=Answer,
    max_retries=2,  # on failure, re-ask with the validation error appended
    messages=[...],
)
Once a validator's error can become a prompt, its wording matters. 'Invalid input' helps no one. 'rating must be an integer from 1 to 10; you returned 11' tells the model exactly how to correct itself on the next pass. Your error strings are now part of your prompt engineering.
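To see the mechanism without a library or a network call, here is a hand-rolled version of the retry loop. It is a sketch of the general idea, not Instructor's actual implementation: `ask_with_retries`, the message wording, and the scripted `fake_model` are all assumptions made for illustration.

```python
from typing import Callable

from pydantic import BaseModel, Field, ValidationError


class Answer(BaseModel):
    rating: int = Field(ge=1, le=10)
    summary: str


def ask_with_retries(
    call_model: Callable[[list], str],  # hypothetical: messages in, raw JSON text out
    messages: list,
    max_retries: int = 2,
) -> Answer:
    """Re-ask the model with the validation error until the reply parses."""
    history = list(messages)
    for attempt in range(max_retries + 1):
        raw = call_model(history)
        try:
            return Answer.model_validate_json(raw)
        except ValidationError as exc:
            if attempt == max_retries:
                raise  # retry budget exhausted: surface the error
            history.append({"role": "assistant", "content": raw})
            history.append({
                "role": "user",
                "content": f"Your reply failed validation:\n{exc}\nReturn corrected JSON only.",
            })
    raise AssertionError("unreachable")


# A scripted stand-in for a real LLM: wrong on the first call, corrected on the second.
scripted = iter(['{"rating": 11, "summary": "Great"}', '{"rating": 9, "summary": "Great"}'])
calls = []


def fake_model(history: list) -> str:
    calls.append(list(history))
    return next(scripted)


answer = ask_with_retries(fake_model, [{"role": "user", "content": "Rate this review."}])
print(answer.rating, len(calls))  # 9 2
```

The second call receives the first reply plus the validation error text, which is why the wording of that error matters so much.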
Routing with discriminated unions
Agents constantly have to choose which of several things to do — search, send an email, issue a refund — each with completely different arguments. The fragile way is a free-form action string plus a bag of optional fields, validated by hand. The robust way is a discriminated union: a set of typed models tagged by a literal field, and Pydantic routes the model's output to exactly the right one.
from decimal import Decimal
from typing import Literal, Union
from pydantic import BaseModel, Field

class SearchAction(BaseModel):
    action: Literal["search"]
    query: str
    top_k: int = 5

class RefundAction(BaseModel):
    action: Literal["refund"]
    order_id: str
    amount: Decimal

class AgentStep(BaseModel):
    # The 'action' field discriminates which payload this must be.
    step: Union[SearchAction, RefundAction] = Field(discriminator="action")

# A refund with amount="a lot" can never parse into RefundAction:
# the malformed tool call is impossible by construction.
This is more than tidy code. It means a malformed tool call (the right action name with the wrong arguments) cannot reach your execution layer, because it never validates into the corresponding model. The agent's intent and its arguments are checked together, as a unit.
The action discriminator routes each output to exactly one payload shape: action='refund' with an amount of 'a lot' never reaches your payments code, because the union won't parse it into a RefundAction. One tagged union is a type-safe router.
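A short usage sketch shows the routing in practice. It uses `TypeAdapter` to validate the union directly, without a wrapper model; the `dispatch` function and its return strings are hypothetical placeholders for real handlers.

```python
from decimal import Decimal
from typing import Annotated, Literal, Union

from pydantic import BaseModel, Field, TypeAdapter, ValidationError


class SearchAction(BaseModel):
    action: Literal["search"]
    query: str
    top_k: int = 5


class RefundAction(BaseModel):
    action: Literal["refund"]
    order_id: str
    amount: Decimal


Action = Annotated[Union[SearchAction, RefundAction], Field(discriminator="action")]
action_adapter = TypeAdapter(Action)


def dispatch(raw_json: str) -> str:
    """Validate first, then route on the concrete type."""
    try:
        step = action_adapter.validate_json(raw_json)
    except ValidationError:
        return "rejected"
    if isinstance(step, SearchAction):
        return f"search:{step.query}:{step.top_k}"
    return f"refund:{step.order_id}:{step.amount}"


print(dispatch('{"action": "search", "query": "late fees"}'))                     # search:late fees:5
print(dispatch('{"action": "refund", "order_id": "A-17", "amount": "a lot"}'))  # rejected
print(dispatch('{"action": "delete_everything"}'))                               # rejected
```

By the time a handler runs, the value in hand is already a concrete, fully validated action type, so the handler code needs no defensive checks of its own.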
Where Pydantic sits in a whole agent
Once you start seeing the model's output as untrusted input, you notice the same boundary repeating all over an agent. It's not one feature — it's a posture you apply everywhere untyped data tries to get in.
The single highest-value guard: the model's proposed tool call is validated against a schema before any code or API fires.
Pydantic isn't one feature in an agent — it's the validated boundary at every place untyped data tries to get in. Guard the tool-argument edge first; it pays for itself fastest.
- Tool arguments — validate the model's proposed call before any API or database is touched. This single guard prevents the largest category of agent damage.
- Tool results — parse third-party responses into models so an upstream schema change fails fast instead of silently corrupting state.
- Final output — hand downstream systems a validated object, not a hopeful string.
- State & memory — typed scratchpads and plans, so a corrupt step can't quietly poison the next.
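The first two boundaries in the list can be sketched briefly. This is one possible approach, assuming Pydantic v2: `validate_call` guards a tool function's arguments, and an ordinary model guards an upstream response. The `issue_refund` tool, the 500 cap, and the `InventoryResponse` payload are hypothetical.

```python
from decimal import Decimal
from typing import Annotated

from pydantic import BaseModel, Field, ValidationError, validate_call

refunds_issued = []  # stands in for a real side effect (payments API, database)


@validate_call
def issue_refund(
    order_id: Annotated[str, Field(min_length=1)],
    amount: Annotated[Decimal, Field(gt=0, le=500)],  # hypothetical policy cap
) -> None:
    refunds_issued.append((order_id, amount))


class InventoryResponse(BaseModel):  # hypothetical third-party payload
    sku: str
    in_stock: int = Field(ge=0)


# Tool arguments: bad input is rejected before the function body runs.
try:
    issue_refund(order_id="A-17", amount="a lot")
except ValidationError:
    pass
issue_refund(order_id="A-17", amount="42.10")

# Tool results: an upstream rename (in_stock -> stock) fails fast.
try:
    InventoryResponse.model_validate({"sku": "TENT-2P", "stock": 3})
    upstream_ok = True
except ValidationError:
    upstream_ok = False
```

Only the valid refund reaches the side-effect list, and the renamed upstream field is caught at the parse step rather than surfacing later as a missing value.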
PydanticAI: when the model becomes the type
In late 2024 the Pydantic team shipped PydanticAI, an agent framework built on exactly this philosophy — they describe the goal as bringing 'the FastAPI feeling' to GenAI. It reached v1.0 in September 2025 and has iterated hard since. The pitch: an agent whose inputs, outputs, tools, and dependencies are all validated by Pydantic models, with errors surfaced at development time rather than in production.
from dataclasses import dataclass
from pydantic import BaseModel
from pydantic_ai import Agent, RunContext

@dataclass
class Deps:  # typed dependencies, injected: testable and swappable
    customer_id: str
    db: "Database"  # Database is your own class (not shown here)

class SupportReply(BaseModel):  # the agent's validated output type
    answer: str
    escalate: bool
    refund_amount: float = 0.0

agent = Agent(
    "openai:gpt-4o",
    deps_type=Deps,
    output_type=SupportReply,  # a successful run ends as a validated SupportReply
    system_prompt="You are a support agent. Be precise and cite balances.",
)

@agent.tool
async def get_balance(ctx: RunContext[Deps]) -> float:
    return await ctx.deps.db.balance(ctx.deps.customer_id)

# db is an existing Database instance (not shown here).
result = agent.run_sync("Can I get a refund?", deps=Deps("c-42", db))
print(result.output.refund_amount)  # a float: validated, typed, safe
Notice what the types buy you. output_type=SupportReply means a run that completes hands you a validated SupportReply: if the model returns something malformed, PydanticAI retries or raises a typed exception. deps_type=Deps means your tools receive typed dependencies you can swap for fakes in a test, no monkey-patching required. The framework is, essentially, this whole blog post turned into a product.
Because everything is typed and structured, PydanticAI integrates cleanly with Logfire (also from the Pydantic team) for tracing — you can watch each validated step, retry, and tool call. Structured data in, structured traces out. That's not a coincidence; it's the payoff of typing the boundaries.
Pydantic can validate partial objects as they stream, so you get the responsiveness of token-by-token output and the safety of a typed result. You don't have to choose between a fast UI and a validated one.
The honest costs and gotchas
This isn't a free lunch, and pretending otherwise does you a disservice. A few things to keep in your peripheral vision:
- Validation has a cost. It's fast, but it's not free — in a hot loop over millions of objects, Pydantic's flexibility shows up on the profile. For pure high-throughput (de)serialization without rich validation, a leaner tool can win (more on that next).
- Over-strict schemas can hurt the model. A schema with 40 required fields, deep nesting, and exotic constraints can confuse the LLM into worse outputs or constant retries. Keep schemas as flat and as forgiving as correctness allows; validate hard only where it matters.
- Coercion can surprise you. Pydantic will helpfully turn "3" into 3 and "yes"-ish values into booleans. Usually a feature; occasionally a foot-gun. Reach for strict mode when you want a string to stay a string.
- Retries cost tokens. Self-healing is wonderful until a pathological input loops to your max_retries ceiling on every request. Cap retries, and log how often you're hitting the ceiling; it's a quality signal.
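The coercion point is easy to see side by side. A minimal sketch, assuming Pydantic v2, where `strict=True` is passed per call; the `Flags` model is hypothetical.

```python
from pydantic import BaseModel, ValidationError


class Flags(BaseModel):  # hypothetical model
    days: int
    alerts: bool


payload = {"days": "3", "alerts": "yes"}

# Default (lax) mode: strings are coerced where the conversion is unambiguous.
lax = Flags.model_validate(payload)
print(lax.days, lax.alerts)  # 3 True

# Strict mode: the same payload is rejected, field by field.
try:
    Flags.model_validate(payload, strict=True)
    strict_accepted = True
    failed_fields = []
except ValidationError as exc:
    strict_accepted = False
    failed_fields = sorted(err["loc"][0] for err in exc.errors())

print(strict_accepted, failed_fields)  # False ['alerts', 'days']
```

Which mode is right depends on the boundary: lax is forgiving toward a model that quotes its numbers, strict is safer when a silently converted value would be worse than a retry.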
So is there an alternative to Pydantic?
Yes — several, and a couple are genuinely better for specific jobs. The point isn't that Pydantic is the only tool; it's that it's the right default, and knowing when to deviate is the mark of someone who actually understands the trade-off.
Pydantic v2 is the default for a reason: Rust-fast for its class, deep validation, and every major agent framework speaks it, at the cost of some overhead versus pure-speed tools.
[Comparison chart: Pydantic v2 versus the alternatives. Scores are directional, not benchmarks.] The honest summary: Pydantic wins on ergonomics and ecosystem, msgspec on raw speed, Zod if you're in TypeScript, and provider-native outputs still lean on a schema you probably wrote in Pydantic.
- msgspec — 2–5× faster than Pydantic v2 for (de)serialization, built on rigid typed Structs. The pick when raw speed in a high-throughput service is your bottleneck. The trade: thinner validation, fewer conveniences, smaller ecosystem.
- dataclasses / TypedDict — standard library, zero dependencies, and no runtime validation. Your IDE checks the hints; the running program does not. Fine for trusted internal data, dangerous as a guard against what an LLM hands you.
- attrs + cattrs — mature, fast, flexible class-building with separate structuring. Less batteries-included for the JSON-Schema-for-LLMs workflow; you wire more of the glue yourself.
- marshmallow — battle-tested serialization/validation from the web-API era. Verbose (schema as a separate class) and predates the structured-output pattern, but solid where it already lives.
- Zod — if your agent is in TypeScript, this is the answer, not a compromise. Schema-first, superb type inference, and first-class support in the JS LLM SDKs. It's the Pydantic of that world.
- Provider-native structured outputs — OpenAI and Anthropic can constrain generation to a JSON Schema directly. Powerful and worth using — but you still need something to define the schema and validate edge cases after the fact. In Python, that something is almost always Pydantic.
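The dataclasses point in the list above is worth seeing once, because the failure is silent. A small sketch comparing a standard-library dataclass with a Pydantic model; both class names are hypothetical.

```python
from dataclasses import dataclass

from pydantic import BaseModel, ValidationError


@dataclass
class PlainQuery:
    days: int  # a hint only: nothing checks it at runtime


class CheckedQuery(BaseModel):
    days: int


plain = PlainQuery(days="three")  # accepted silently
print(type(plain.days).__name__)  # str

try:
    CheckedQuery(days="three")
    caught = False
except ValidationError:
    caught = True
print(caught)  # True
```

A static type checker would flag the first call, but an LLM reply never passes through a type checker; it only ever meets the running program.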
Use Pydantic v2 by default for Python agents — the ecosystem assumes it and the ergonomics are unmatched. Drop to msgspec when a profiler tells you to. Use Zod if you're in TypeScript. And lean on provider-native outputs as a complement, not a replacement — they constrain the model, but Pydantic still defines and verifies the contract.
A practical playbook
1. Make every LLM call that feeds code return a Pydantic model, not a string or a bare dict. This one habit removes the most bugs.
2. Let the Pydantic model generate your JSON Schema (model_json_schema()); never hand-maintain it.
3. Validate tool arguments before execution; it's the highest-value boundary in the whole agent.
4. Encode business rules as validators, and write their error messages for the model, because they become retry prompts.
5. Give the agent a small retry budget so validation errors self-heal, then log how often you hit the ceiling.
6. Use discriminated unions for action/tool selection so malformed calls can't parse.
7. Keep schemas flat and forgiving; validate hard only where correctness genuinely matters.
8. Reach for msgspec or provider-native outputs only when a real constraint (speed, platform) tells you to.
Where this lands in AgentSwarms
This thinking is baked into the platform. The LLM Tool-Calling JSON Schema Generator turns a function description into a valid schema (the same artifact Pydantic would emit), so the tool-argument boundary is typed from the start. And when you export a swarm or an agent to LangGraph, CrewAI, the OpenAI Agents SDK, or Strands, the generated code uses typed, validated tool signatures rather than free-form dictionaries — the structured-output discipline travels with your design.
AgentSwarms is a learning and prototyping platform, not a production runtime. The aim here isn't to sell you a validator — it's to give you the mental model so that whatever you ship treats the model's output as what it is: untrusted text, one validation away from being something you can trust.
Language models will keep getting better at sounding right. They will not stop occasionally being wrong in ways that matter — that's the nature of the thing. The teams that build agents you can rely on aren't the ones with the cleverest prompts; they're the ones who put a validating border at every seam and refused to let an unverified string become an action. Pydantic is the cheapest, most boring, most effective way to draw that border. Draw it early.
Further reading & references
- Pydantic AI — documentation
- PydanticAI on GitHub
- How to Use Pydantic for LLMs: Schema, Validation & Prompts — Pydantic
- Instructor — structured outputs, validation & retries
- Good LLM Validation Is Just Good Validation — Instructor
- The Complete Guide to Using Pydantic for Validating LLM Outputs — MachineLearningMastery
- Benchmarks: msgspec vs Pydantic v2 — msgspec docs