Career

50 Agentic AI Interview Questions Asked in 2026

Tiered from junior to staff, with the senior-architect answers — and a runnable lab for the concepts that are easier to show than to say. The questions that separate 'I read a blog' from 'I've shipped this'.

AgentSwarms Authors

May 23, 2026 · 16 min read

Agentic-AI interviews have a tell. Junior questions ask whether you can wire a loop. Senior questions ask whether you can keep a non-deterministic system from quietly destroying itself in production. If you only prepare definitions, you'll pass the first round and faceplant in the system-design one. This is the tiered question bank we'd study from — with the answers the interviewer is actually listening for.

The bar rises from “can you wire a loop?” (junior) to “can you operate a non-deterministic system safely?” (staff). Senior answers are about trade-offs, not trivia.

The same topics get harder by level. Junior: 'what is it?' Mid: 'how do the patterns compare?' Senior/Staff: 'how do you operate it safely when it's non-deterministic?'

Junior: do you understand the primitives?

Walk me through the ReAct loop. Thought → Action → Observation, repeated until the agent can answer. The point they want: each action is grounded in a real observation, not a guess.
In tool calling, who runs the code? Your code does — the model only emits a JSON request. This separation is what keeps keys and side-effects under your control.
Why does RAG reduce hallucination? It turns a closed-book memory test into an open-book exam: the model answers from retrieved facts, and can say 'not in the docs' instead of inventing.
What's a system prompt vs a user prompt? The system prompt sets persona and standing rules, and models are trained to weight it above the user turn. Crucially, though, it's not a security boundary: a determined user or injected content can still override it.
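The tool-calling answer above (the model emits a request, your code runs it) is easier to defend with a sketch in hand. The following is a minimal illustration, not any vendor's API: the JSON shape with `name` and `arguments`, the `get_weather` tool, and the `dispatch` helper are all assumptions made for the example.

```python
import json


def get_weather(city: str) -> str:
    """Hypothetical tool; a real one would call a weather API."""
    return f"Sunny in {city}"


# Only functions registered here can ever run, whatever the model asks for.
TOOLS = {"get_weather": get_weather}


def dispatch(model_output: str) -> str:
    """Parse the model's JSON tool request and execute it in our own process."""
    try:
        request = json.loads(model_output)
    except json.JSONDecodeError:
        return "error: tool request was not valid JSON"
    if not isinstance(request, dict):
        return "error: tool request must be a JSON object"
    name = request.get("name")
    if not isinstance(name, str) or name not in TOOLS:
        return f"error: unknown tool {name!r}"
    try:
        return TOOLS[name](**request.get("arguments", {}))
    except TypeError as exc:  # wrong, missing, or malformed arguments
        return f"error: bad arguments ({exc})"


print(dispatch('{"name": "get_weather", "arguments": {"city": "Oslo"}}'))
print(dispatch('{"name": "drop_tables", "arguments": {}}'))
```

Errors come back as strings rather than exceptions so that, in a real loop, the model can read them as an observation and try again.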

Mid: can you choose and combine patterns?

Orchestrator–workers vs peer-to-peer — when each? Orchestrator when steps need central control and accountability; peer-to-peer for emergent, debate-style reasoning. Expect a follow-up on cost.
How is MCP different from function calling, and why standardize? Function calling is one model calling your tools; MCP is a protocol so any agent can use any server — n+m instead of n×m integrations.
How do you evaluate a non-deterministic agent? Golden datasets + LLM-as-judge against a rubric, scored per dimension, calibrated against human labels, run in CI. 'I'd eyeball it' is a fail.
When does a single agent become a swarm? When one agent's tool list and instructions bloat its context and hurt tool choice — split into scoped specialists with a router.
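"Calibrated against human labels" in the evaluation answer above is the part candidates tend to hand-wave. One rough way to make it concrete, assuming integer rubric scores per dimension and a small hand-labeled set (the dimension names, scores, and the 0.7 gate below are invented for illustration):

```python
def agreement_by_dimension(judge, human, tolerance=0):
    """Fraction of cases, per rubric dimension, where judge and human scores agree.

    judge and human are parallel lists of dicts mapping dimension -> integer score.
    tolerance is the largest score difference still counted as agreement.
    """
    totals, hits = {}, {}
    for judge_scores, human_scores in zip(judge, human):
        for dim, human_score in human_scores.items():
            totals[dim] = totals.get(dim, 0) + 1
            if abs(judge_scores[dim] - human_score) <= tolerance:
                hits[dim] = hits.get(dim, 0) + 1
    return {dim: hits.get(dim, 0) / totals[dim] for dim in totals}


# Invented labels for four golden-set cases.
human = [
    {"correctness": 5, "tone": 4},
    {"correctness": 2, "tone": 4},
    {"correctness": 4, "tone": 3},
    {"correctness": 1, "tone": 5},
]
judge = [
    {"correctness": 5, "tone": 5},
    {"correctness": 2, "tone": 4},
    {"correctness": 3, "tone": 5},
    {"correctness": 1, "tone": 3},
]

report = agreement_by_dimension(judge, human)
MIN_AGREEMENT = 0.7  # illustrative CI gate, not a recommended value
untrusted = [dim for dim, rate in report.items() if rate < MIN_AGREEMENT]
print(report, untrusted)
```

Here the judge tracks humans on correctness but not on tone, so a CI job built on it should not gate on the tone score until the rubric or judge prompt is fixed.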

Senior / Staff: can you run it in production?

This is where offers are won or lost. Senior answers aren't longer — they're about trade-offs, failure modes, and operations. A few that come up constantly:

Design observability for a system where the same input gives different runs. Per-run traces (nested spans), continuous evals, and alerting on quality/cost/latency drift — because you can't reproduce a bug you can't see, and you can't unit-test non-determinism.
Architect a prompt-injection defense for an agent with DB and email tools. Deterministic input/output guardrails, least privilege per tool, human-in-the-loop on risky actions, and treating all retrieved/tool content as untrusted. 'Tell the model not to' is not an answer.
How do you govern cost across a swarm? Bounded loops, right-sized model routing, per-tenant rate limits, and cost tracked per resolved task — not per call.
When would you argue AGAINST using agents? The senior signal. If a single prompt or a deterministic workflow solves it, an agent adds latency, cost, and failure modes for nothing. Knowing when not to reach for agents is the most senior answer there is.
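The "per resolved task, not per call" point in the cost answer above is simple arithmetic, and worth being able to do on a whiteboard. A small sketch, with an invented `Run` record and made-up numbers:

```python
from dataclasses import dataclass


@dataclass
class Run:
    task_id: str
    calls: int        # LLM calls made during the run
    cost_usd: float   # total spend for the run
    resolved: bool    # did the task actually get solved?


def cost_report(runs):
    """Compare the flattering metric (per call) with the honest one (per resolved task)."""
    total_cost = sum(r.cost_usd for r in runs)
    total_calls = sum(r.calls for r in runs)
    resolved = sum(1 for r in runs if r.resolved)
    return {
        "cost_per_call": total_cost / total_calls if total_calls else 0.0,
        "cost_per_resolved_task": total_cost / resolved if resolved else float("inf"),
    }


runs = [
    Run("t1", calls=3, cost_usd=0.03, resolved=True),
    Run("t2", calls=12, cost_usd=0.12, resolved=False),  # looped, never finished
    Run("t3", calls=5, cost_usd=0.05, resolved=True),
]
print(cost_report(runs))
```

In this toy data each call costs about a cent, but each resolved task costs about ten cents, because the failed run's spend still has to be paid for by the tasks that succeeded.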

A model answer, in full

One-liners get you through the screen; structured answers get you the offer. Here's what a strong response to “When would you argue against using agents?” actually sounds like — notice it leads with a decision rule, names the costs, and ends with a concrete example:

“My default is the simplest thing that works, and an agent is rarely the simplest thing. I'd argue against agents whenever the task is deterministic or the path is known in advance — because an agent trades reliability for flexibility I don't need. Every agent adds non-determinism, latency, token cost, and new failure modes like loops and drift. For example, 'extract these five fields from an invoice' doesn't need an agent — it needs a single structured-output prompt, which is cheaper, faster, testable, and can't wander. I reach for agents only when the path genuinely depends on what the system discovers at runtime, and even then I start with one agent before a swarm.”

The shape of a senior answer

Decision rule → the trade-offs you're weighing → a concrete example → what you'd do instead. If your answer is a list of buzzwords, you sound like you read a blog. If it's a trade-off with an example, you sound like you've shipped this and felt the pain.

Framework-specific questions

If a framework is on your résumé, expect to defend it. The interviewer is checking that you understand its mental model, not just its API:

LangGraph — 'What is the state object and why is it explicit?' (A shared, typed state passed along edges; explicitness is what makes loops, branches, and checkpoints debuggable.) 'How do you resume a long run after a crash?' (Checkpointing.)
CrewAI — 'Explain roles, goals, tasks, and the difference between a sequential and hierarchical process.' (Roles give agents persona/scope; a hierarchical process adds a manager that delegates.)
AutoGen / AG2 — 'How does a group chat decide who speaks next, and how do you stop it running forever?' (A speaker-selection policy + a max-round cap — the same bounded-loop discipline as everywhere else.)

# A favorite live question: "Sketch a ReAct loop from scratch."
# They're watching for the observe-then-reason ordering and a stop condition.
MAX_STEPS = 8  # illustrative step budget

def react_loop(llm, run_tool, observation, system_prompt, user_question, tools):
    # llm, run_tool, and observation are placeholders for your own client and helpers
    messages = [system_prompt, user_question]
    for _ in range(MAX_STEPS):                 # bounded, always
        step = llm(messages, tools=tools)
        if step.final_answer:
            return step.final_answer           # the intended exit
        result = run_tool(step.tool_call)      # YOUR code executes, not the model
        messages.append(observation(result))   # feed the real result back in
    return "Couldn't finish within the step budget."  # graceful give-up

Foundations & model behavior

How is generative AI different from predictive ML? Predictive ML classifies or regresses over a fixed label space; generative models sample from a learned distribution to produce new tokens/pixels. The interview tell is naming distribution modeling, not "it writes text."
Explain self-attention and why it matters for LLMs. Every token attends to the other tokens in its context in parallel (in a decoder-only LLM, only to earlier ones, because of the causal mask), weighted by learned Q·K similarity. That's how the model resolves long-range dependencies without recurrence. The senior addition: it's O(n²) in sequence length, which is why context windows are expensive.
What are the practical challenges of a large context window? Cost and latency scale with tokens, and quality degrades in the middle of the prompt ("lost in the middle"). Bigger context ≠ better recall — retrieval + reranking usually beats stuffing.
Temperature vs top-k vs top-p (nucleus) sampling: what are the differences? Temperature divides the logits before the softmax, sharpening (T < 1) or flattening (T > 1) the whole distribution; top-k truncates to the k most likely tokens; top-p keeps the smallest set whose cumulative probability ≥ p. A common production starting point is low temperature plus top-p ≈ 0.9.
Base model vs instruction-tuned model — when do you reach for each? Base models are next-token predictors; instruction-tuned models are fine-tuned (SFT + RLHF/DPO) to follow directions. Use base only when you're doing your own fine-tune or want raw completion behavior.
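The three sampling knobs in the list above are easier to keep straight once you've seen them as code. This is a pure-Python sketch over a toy four-token distribution; the function names are invented for the example and real inference stacks do this on tensors.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then normalize into probabilities."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def _renormalize(probs, keep):
    """Zero out every token not in keep and rescale the rest to sum to 1."""
    kept = set(keep)
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]


def top_k_filter(probs, k):
    """Keep only the k most likely tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return _renormalize(probs, order[:k])


def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return _renormalize(probs, keep)


probs = [0.5, 0.3, 0.15, 0.05]
print(top_k_filter(probs, k=2))    # two tokens survive, whatever their mass
print(top_p_filter(probs, p=0.9))  # three tokens survive: 0.5 + 0.3 < 0.9
print(softmax_with_temperature([2.0, 1.0, 0.0], temperature=0.5))
```

The difference to name in an interview: top-k fixes the number of candidates, while top-p lets that number grow or shrink with how confident the model is.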

RAG in depth

What problem does RAG actually solve? Stale knowledge, hallucination, and lack of citation — by grounding generation in retrieved passages the model was not trained on.
RAG vs fine-tuning — how do you decide? RAG for changing facts and citations; fine-tuning for changing behavior, style, or format. They compose — fine-tune the tone, retrieve the facts.
How do you choose a chunking strategy? Start with semantic/recursive splits sized to your embedding model's sweet spot (~256–512 tokens) with overlap. Then measure retrieval hit-rate on a golden set — chunking is an eval problem, not a vibe.
What is hybrid search and when does it beat pure vector? BM25 (lexical) plus dense vectors, with the two ranked lists fused via reciprocal rank fusion (RRF). It wins whenever exact terms matter (product codes, error strings, names), where dense-only retrieval silently misses.
How do you evaluate a RAG pipeline in production? Decompose it: retrieval (hit@k, MRR), grounding (faithfulness), and answer quality (LLM-as-judge against a rubric). Log every retrieval so you can re-score after a model change.
What is GraphRAG and when would you use it? Build a knowledge graph over the corpus and query it alongside vectors. Wins on multi-hop questions ("which of our customers use X and had a Sev1 last quarter?") where flat chunks can't join.
Main failure modes of a naive RAG pipeline? Bad chunking, no reranker, retrieval hit but wrong passage on top, no query rewriting, no evaluation. "It works on my three test questions" is the classic red flag.
RAG vs agentic RAG — when is the extra complexity worth it? Agentic RAG lets the model plan retrievals (decompose the question, re-query, self-check). Worth it when questions are multi-hop; overkill for FAQ-style lookups.
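For the hybrid-search answer above, it helps to be able to write the fusion step from memory. A minimal sketch of reciprocal rank fusion, assuming each retriever returns an ordered list of document IDs; the IDs are invented, and `k=60` is a commonly used smoothing constant rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs into one.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    so it only needs ranks, never comparable scores, from each retriever.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID to keep output stable.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_hits = ["doc_err_42", "doc_a", "doc_b"]   # lexical: exact error string matched
dense_hits = ["doc_a", "doc_c", "doc_err_42"]  # dense: semantically close passages
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(fused)  # ['doc_a', 'doc_err_42', 'doc_c', 'doc_b']
```

Documents that both retrievers agree on rise to the top, and the exact-match hit that dense retrieval ranked third still lands second overall.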

Agents, patterns, and orchestration

What is an AI agent vs a workflow or chain? A workflow has a fixed path; an agent chooses its own next action from a tool set based on observations. Autonomy is the axis.
When would you NOT use ReAct? When the path is known — a deterministic workflow is cheaper, faster, and testable. ReAct earns its cost only when the next step genuinely depends on runtime observations.
How do you stop an agent from looping forever? Bounded steps, no-progress detection (state hash unchanged N times), tool-call deduplication, and a hard wall-clock budget. "The model will figure it out" is a fail.
How do you handle tool-call failures in production? Typed error results back to the model (not exceptions swallowed), retry with backoff for transient failures, and a circuit breaker so a broken tool doesn't burn your token budget.
RAG vs tool-calling — when do you use each? RAG for retrieving knowledge to answer from. Tool-calling for taking actions or fetching live data (APIs, DBs). Real systems use both.
When Plan-and-Execute over ReAct? When steps are expensive or side-effectful and you want a reviewable plan up front. The risk is that plans go stale when observations contradict them, so you need a replan trigger.
When should you actually reach for a multi-agent system? When one agent's tool list and instructions bloat context and hurt tool choice, or when you need parallelism. Not because "multi-agent" is on the roadmap.
What is a critic / verifier agent, and when is it worth the cost? A second pass that scores or rewrites the first agent's output. Worth it when a wrong answer costs far more than the roughly 2× inference bill, as in code, medical, or legal work.
Design a multi-agent system where agents delegate. Router → specialists → aggregator, with a shared typed state, bounded hops between agents, and traces per agent. Name the failure mode: infinite handoff.
Pick LangGraph, CrewAI, AutoGen, or OpenAI Agents SDK — defend it. LangGraph for explicit state machines and checkpointing; CrewAI for role/task DX; AutoGen for group-chat research; Agents SDK for OpenAI-only stacks. The senior answer names what you'd give up.
When would you NOT use a framework and roll your own orchestration? When your control flow is 3 steps and you're paying framework tax in debuggability and dependencies. "Boring is a feature" is the senior signal.
A prod agent is misbehaving at 3am — walk me through debug. Pull the trace for the failing run, diff spans against a good run, isolate the first divergent tool call/observation, reproduce with the same seed/inputs, patch, add a regression eval.
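The loop-prevention answer above names four guards: bounded steps, no-progress detection, tool-call deduplication, and a wall-clock budget. Here is one way they might fit together, assuming the agent state and tool arguments are JSON-serializable. The `LoopGuard` class and its thresholds are invented for illustration, and treating a duplicate call as a hard stop is a simplification (returning a cached result is often kinder).

```python
import hashlib
import json
import time


class LoopGuard:
    """Call check() once per agent step; a non-None result means stop."""

    def __init__(self, max_steps=10, max_stalled=3, max_seconds=60.0):
        self.max_steps = max_steps
        self.max_stalled = max_stalled
        self.deadline = time.monotonic() + max_seconds
        self.steps = 0
        self.stalled = 0
        self.last_state_hash = None
        self.seen_calls = set()

    def check(self, state, tool_name, tool_args):
        """Return a stop reason string, or None if the agent may continue."""
        self.steps += 1
        if self.steps > self.max_steps:
            return "step budget exhausted"
        if time.monotonic() > self.deadline:
            return "wall-clock budget exhausted"

        # Deduplicate on tool name plus canonicalized arguments.
        call_key = (tool_name, json.dumps(tool_args, sort_keys=True))
        if call_key in self.seen_calls:
            return "duplicate tool call"
        self.seen_calls.add(call_key)

        # No-progress detection: the state hash has not changed for N steps.
        encoded = json.dumps(state, sort_keys=True).encode()
        state_hash = hashlib.sha256(encoded).hexdigest()
        self.stalled = self.stalled + 1 if state_hash == self.last_state_hash else 0
        self.last_state_hash = state_hash
        if self.stalled >= self.max_stalled:
            return "no progress"
        return None


guard = LoopGuard(max_steps=5, max_stalled=2)
print(guard.check({"found": 1}, "search", {"q": "refund policy"}))  # None
print(guard.check({"found": 2}, "search", {"q": "refund policy"}))  # duplicate tool call
```

Returning the stop reason, rather than raising, makes it easy to log on the trace why a run ended.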

Prompting, tools & MCP

Chain-of-thought: when does it actually help? On multi-step reasoning tasks, though with modern reasoning models it's often already implicit. Explicit CoT still helps on math/logic and when you want a scratchpad you can eval against.
Zero-shot vs few-shot vs fine-tuning — how do you choose? Zero-shot for capable models on common tasks; few-shot to teach a format cheaply; fine-tune when the pattern is stable, the volume is high, and prompts are getting too long.
System prompt vs user prompt — why does it matter? System sets persona and rules and outranks the user turn — but it's a soft prior, not a security boundary. Never put secrets there.
What is function calling / tool use? The model emits a structured request; your code executes and returns a result the model reads on the next turn. The model never runs code — that separation is what keeps side-effects controlled.
What is MCP and why does it matter? A standard protocol for tool servers so any agent can consume any server. Turns n×m custom integrations into n+m, and lets you compose ecosystems.

Memory, evaluation & safety

How do you implement persistent memory for a long-running agent? Short-term in context, mid-term as summarized turns, long-term in a vector store keyed by user/session with a recall tool. Don't put memory in the prompt — put a retriever in the prompt.
What is "Lost in the Middle" and how does it change agent design? LLMs recall the start and end of long contexts better than the middle. Rank retrieved passages so the most important are at the edges — and prefer retrieval over stuffing.
What is an evaluation harness and why does every production agent need one? A golden set + graders + CI that catches regressions before users do. Without one, every prompt change is a coinflip.
LLM-as-Judge — failure modes? Positional bias, verbosity bias, self-preference (a model rates its own outputs higher). Calibrate against human labels, randomize order, and use pairwise comparisons.
How do you test a non-deterministic agent? Property-based tests over trajectories, statistical assertions over N runs, and rubric-scored eval sets — not equality assertions on strings.
Prompt injection — how do you defend? Treat all retrieved/tool content as untrusted, use deterministic input/output guardrails, isolate tool privileges, and require HITL for irreversible actions. "Tell the model not to" is not a defense.
Security risks specific to autonomous agents? Injection via retrieved content, tool-abuse chains, data exfiltration through outbound tools, and over-broad service accounts. The mitigation is least privilege per tool, not a bigger system prompt.
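"Statistical assertions over N runs" from the testing answer above can sound abstract, so here is a small sketch of the idea. The `flaky_agent` stand-in, its 90% accuracy, and the 0.80 threshold are all invented; a real test would call your agent and pick a threshold from observed baselines.

```python
import random


def flaky_agent(question, rng):
    """Stand-in for a real agent: answers correctly about 90% of the time."""
    return "The capital is Paris." if rng.random() < 0.9 else "The capital is Lyon."


def pass_rate(agent, question, check, n_runs, seed=0):
    """Run the agent n_runs times and return the fraction of outputs passing check."""
    rng = random.Random(seed)  # seeded so this sketch is repeatable
    passes = sum(1 for _ in range(n_runs) if check(agent(question, rng)))
    return passes / n_runs


def mentions_paris(answer):
    # A property of the answer, not string equality against one golden output.
    return "paris" in answer.lower()


rate = pass_rate(flaky_agent, "What is the capital of France?", mentions_paris, n_runs=200)
assert rate >= 0.8, f"pass rate {rate:.2f} fell below the 0.80 threshold"
print(f"pass rate over 200 runs: {rate:.2f}")
```

The assertion is on the rate, not on any single run, so one bad sample doesn't fail the build but a real regression does.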

Fine-tuning, system design & ops

What is LoRA and why is it preferred over full fine-tuning? Trains small low-rank adapters instead of all weights — cheaper, faster, and adapters are swappable per tenant/use-case without a new base model.
RLHF vs DPO — what are the trade-offs? RLHF uses a reward model + PPO (powerful, brittle, expensive). DPO optimizes preferences directly with a simpler loss — easier to run, less flexible for complex reward shaping.
Design a support agent handling 10k tickets/day at <2s response. Router (cheap model) → cached FAQ path → RAG path → escalation. Streaming tokens, async tool calls, per-tenant rate limits, and evals on every deploy.
Design a multi-tenant RAG system with strict data isolation. Per-tenant namespaces in the vector store, tenant-scoped API keys, row-level auth on metadata filters, and audit logs. Never rely on prompt-level "only use tenant X's data."
Design an eval platform for 50 different agent workflows. Shared trace schema, per-workflow golden sets, pluggable graders, drift alerts, and a UI where PMs can add cases. The staff answer is "evals as a product, not a script."
How do you implement HITL approval for high-risk agent actions? Model proposes → action serialized to a queue → human approves/rejects with reason → agent resumes with the decision as an observation. The trace must show who approved what.
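The HITL flow in the last answer above (propose, queue, approve or reject with a reason, resume with the decision as an observation) can be sketched in a few lines. Everything here is hypothetical: an in-memory dict stands in for a durable queue, and the `send_refund` tool and reviewer names are made up.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ProposedAction:
    tool: str
    args: dict
    proposed_by: str
    id: str = field(default_factory=lambda: uuid.uuid4().hex)
    status: str = "pending"
    reviewer: Optional[str] = None
    reason: Optional[str] = None


class ApprovalQueue:
    """In-memory stand-in for a durable queue of actions awaiting review."""

    def __init__(self):
        self._actions = {}

    def propose(self, tool, args, proposed_by):
        """The agent serializes a risky action instead of executing it."""
        action = ProposedAction(tool, args, proposed_by)
        self._actions[action.id] = action
        return action.id

    def decide(self, action_id, reviewer, approved, reason):
        """A human approves or rejects, with a reason; each action is decided once."""
        action = self._actions[action_id]
        if action.status != "pending":
            raise ValueError("action already decided")
        action.status = "approved" if approved else "rejected"
        action.reviewer, action.reason = reviewer, reason
        return action


def as_observation(action):
    """What the agent reads when it resumes; it doubles as the audit record."""
    return {
        "action_id": action.id,
        "tool": action.tool,
        "status": action.status,
        "reviewer": action.reviewer,
        "reason": action.reason,
    }


queue = ApprovalQueue()
action_id = queue.propose("send_refund", {"amount_usd": 500}, proposed_by="billing-agent")
decision = queue.decide(action_id, reviewer="alice", approved=False, reason="over refund limit")
print(as_observation(decision))
```

Because the reviewer and reason travel with the observation, the trace can show who approved what without a separate lookup.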

Cost, latency & behavioral

How do you reduce LLM latency in a production chatbot? Stream tokens, cache prompts/responses, route easy turns to a smaller model, parallelize tool calls, and cut prompt size — retrieval + summarization beats stuffing.
How do you build a cost-efficient agent for both simple and complex tasks? Router that classifies difficulty and picks the model tier, cached tool results, bounded loops, and cost budgets enforced per resolved task — not per call.
Tell me about a time an agent misbehaved in production. What did you do? Structure: incident → detection signal → hypothesis → rollback/mitigation → root cause → what eval you added so it can't happen silently again. The last part is the senior signal.
How do you stay current in agentic AI? Concrete answer: which papers/newsletters, which repos you actually read, which frameworks you've kicked the tires on this quarter. "I follow AI Twitter" is a fail.
Describe convincing stakeholders that a more conservative agent design was right. The senior tell: naming what you gave up (autonomy, wow-factor) to gain reliability, lower cost, and reviewability, and how you measured that the trade paid off.

The system-design round

Increasingly there's a dedicated agentic system-design interview: 'Design a customer-support agent platform.' They're checking whether you reach for the right building blocks and name the trade-offs unprompted.

The blocks a strong answer puts on the whiteboard:

Orchestrator
Stateless workers
Memory & state store
Guardrails
Model gateway
Observability
Cost controls

If you can name these and the trade-offs between them, you're interviewing at the senior level.

Show, don't just tell

The candidates who stand out demonstrate. Every concept here maps to something runnable in AgentSwarms — build the ReAct loop in the Agent Builder, wire a router-and-critic swarm on the canvas, or fix a Failure-Mode Lab live. 'Let me show you' beats 'I think' in every interview. Pair this with the /interview-questions page for the full bank.

Interview content ages well, but agentic AI moves fast — frameworks rebrand, protocols standardize, new failure modes get named. The durable prep isn't memorizing today's answers; it's building enough real swarms that the answers are just things you've seen happen. Go break a few on purpose. That's what the senior engineers across the table did.