Question 1

What is generative AI, and how is it different from predictive ML?

Accepted Answer

Generative AI is a class of machine-learning models that produce new content — text, images, audio, code — by learning the underlying patterns of their training data and then sampling from a probability distribution over what comes next. The classic example is a large language model like GPT or Claude, which predicts the next token given everything before it. The important contrast with predictive ML is that predictive systems output a fixed answer like a class label or a number, whereas generative systems sample from a distribution, so they're inherently non-deterministic. That non-determinism is what makes them creative, but it's also why building production systems on top of them is so different from traditional ML — you need evaluation harnesses, you need to control temperature, you need output validation, and you need to design for the fact that the same input can produce slightly different outputs each time. Architecturally, the same transformer can be used either way; what makes a model generative is the decoding objective, not the network itself.

Question 2

Explain self-attention in a transformer and why it matters for LLMs.

Accepted Answer

Self-attention is the mechanism that lets every token in a sequence look at every other token and decide how much each one matters for understanding it. Each token is projected into a query, a key, and a value vector; you compute attention scores as the dot products of queries with keys, scale them by the square root of the key dimension, apply a softmax, and then use the resulting weights to take a weighted sum over the values. The reason this matters so much for large language models is that it addresses the long-range dependency problem that recurrent networks struggled with, and it's what makes in-context learning and few-shot prompting possible at all. The catch is that attention is quadratic in the sequence length, which drives a large share of the modern engineering decisions in this space: FlashAttention, sliding-window attention, and the whole exploration into linear-time alternatives like state-space models. KV caching is closely related (it avoids recomputing keys and values for earlier tokens during decoding), while mixture of experts targets a different cost, the feed-forward layers, rather than attention. So when I think about why transformers won and why long context is expensive, the answer to both questions is largely the same: attention.
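
The score, softmax, and weighted-sum steps described above can be sketched in a few lines. This is a minimal single-head illustration in pure Python, assuming the query, key, and value vectors have already been projected; the function names are illustrative, not taken from any library, and there is no masking or batching.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention for one head, no masking.

    queries, keys: lists of n vectors of dimension d_k
    values: list of n vectors of dimension d_v
    """
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # One score per key. Doing this for each of the n queries
        # is where the quadratic cost in sequence length comes from.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # Weighted sum over the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs
```

With two identical keys the weights are uniform, so the output is the plain average of the two value vectors; a query that aligns strongly with one key pulls the output toward that key's value.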

Question 3

What is a context window, and what are the practical challenges of a large one?

Accepted Answer

The context window is the maximum number of tokens a model can read and reason over in a single pass, including both the prompt and the generated output. The obvious benefit of a large window is that you can fit more in-context examples, more retrieved documents, or longer conversations without truncating. The less obvious challenges are what really matter in production. First, attention is quadratic, so doubling the context roughly quadruples the compute and significantly grows memory pressure. Second, even when the model can technically read 200,000 tokens, recall actually degrades for information stuck in the middle of the context — there's a well-known U-shaped curve where the model remembers the beginning and the end much better than the middle. Third, longer context means higher latency and higher cost per call. So in practice, even with huge windows available, the right move is usually to keep prompts tight, retrieve only the most relevant chunks, and place the most important information at the start and the end rather than just stuffing everything in.

Question 4

What's the difference between temperature, top-k, and top-p (nucleus) sampling?

Accepted Answer

These three knobs all control how the model samples the next token, but they do it differently. Temperature divides the logits before the softmax: a lower temperature sharpens the distribution and makes the model more deterministic, a higher one flattens it and makes outputs more diverse. Top-k restricts sampling to the k most probable tokens and ignores the rest. Top-p, also called nucleus sampling, picks the smallest set of tokens whose cumulative probability crosses some threshold like 0.9 and samples from that. The reason most teams default to top-p over top-k is that top-p adapts to the shape of the distribution: when the model is confident, it samples from a tiny set, and when it's uncertain, it widens the pool. Top-k is constant and ignores how peaked the distribution is. In production, I generally use temperature zero for anything that has to be valid JSON or a tool call because I can't afford a syntax error, something low like 0.2 to 0.4 for grounded answers, and only push higher for genuinely creative tasks. With reasoning models, the guidance changes: some providers fix the temperature or recommend leaving it at the default of one when extended reasoning is on, so I check the provider's documentation instead of forcing it to zero.
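
To make the difference between the three knobs concrete, here is a small sketch that applies each one to a toy distribution. The function names are illustrative, not a provider API. Temperature zero is assumed to be handled separately as greedy (argmax) decoding, since dividing by zero is undefined.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Divide logits by temperature, then softmax. Requires temperature > 0."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most probable token ids and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def sample(dist, rng=random):
    """Draw one token id from a {token_id: probability} mapping."""
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]
```

On a peaked distribution such as `[0.95, 0.03, 0.02]`, `top_p_filter(probs, 0.9)` keeps a single token, while on a flat distribution such as `[0.25, 0.25, 0.25, 0.25]` it keeps all four. `top_k_filter(probs, 2)` keeps two tokens in both cases, which is the adaptivity argument made above.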

Question 5

What's the difference between a base model and an instruction-tuned model?

Accepted Answer

A base model is trained only on next-token prediction over huge amounts of text, so it's really good at continuing whatever you give it but it doesn't actually follow instructions — if you ask it a question, it might just write more questions. An instruction-tuned model takes that base and trains it further on instruction-and-response pairs, usually with supervised fine-tuning followed by some form of preference optimization like reinforcement learning from human feedback or direct preference optimization. The result is a model that responds to prompts the way a user expects. In production you almost always deploy the instruction-tuned version because base models will frustrate users immediately. But base models are still useful in specific situations — they're often a better starting point if you want to fine-tune for a very specific style, and they're often used directly for tasks like generating embeddings or classification heads where instruction-following isn't the goal. So the right framing isn't that one is better, it's that they're optimized for different jobs.

Question 6

What is Retrieval-Augmented Generation (RAG), and what problem does it solve?

Accepted Answer

Retrieval-Augmented Generation is a pattern where, instead of relying purely on what the model learned during training, you retrieve relevant documents from your own knowledge source at query time and inject them into the prompt so the model can answer using that context. The reason teams reach for it is that it solves three real LLM weaknesses at once — the knowledge cutoff, hallucination on specific facts, and the inability to access private or enterprise data without retraining. It's also much cheaper and faster to update than fine-tuning, and it's auditable because you can show citations for where the answer came from. The trade-off is that it adds retrieval latency and a brand new failure mode — bad retrieval. A naive cosine-similarity pipeline regularly retrieves chunks that look plausible but are wrong, and the model then confidently cites them. So most of the engineering effort in production RAG isn't really about the LLM at all; it's about getting retrieval right with hybrid search, reranking, good chunking, and faithfulness evaluations on the generated output.

Question 7

When would you use RAG vs fine-tuning?

Accepted Answer

I think of it as a decision tree. RAG is the right call when your knowledge changes frequently, when you need citations and auditability, when the corpus is too large to bake into weights, or when you're serving multiple tenants with isolated data. Fine-tuning, or more often LoRA fine-tuning, becomes the better choice when you need a consistent output format that the base model keeps drifting from, when you need very low latency and can't afford a retrieval hop, when you have at least a few hundred high-quality labeled examples for a narrow task, or when you need vocabulary the base model just doesn't know — things like medical codes or internal product names. The most common mistake is treating these as either-or. The classic production combo is to fine-tune the model for behavior and use RAG for facts. So you might fine-tune a smaller model to always respond in your support-ticket schema, and then RAG over your help center for the actual content. That gets you the best of both — predictable structure plus current, citeable knowledge.

Question 8

How do you choose a chunking strategy?

Accepted Answer

Chunking strategy should be driven by the kinds of questions users actually ask, not just by token counts. Fixed-size chunks with some overlap are simple and fine as a baseline. Recursive splitting that respects paragraphs and sentences usually beats fixed sizes because it keeps semantic units intact. Semantic chunking, where you group sentences by embedding similarity, helps when the corpus is messy. The deeper insight is that narrow factoid queries do better with smaller, focused chunks because the embedding stays sharp, while synthesis questions benefit from hierarchical chunking — retrieve a small chunk to find the right place, but pass the parent paragraph to the model so it has enough context. For structured documents like legal filings or technical manuals, chunking on headers beats any token-based scheme. The single best practice I'd call out is contextual retrieval — prepending a brief summary of which document and section a chunk came from before embedding it. That has been shown to dramatically reduce retrieval failures. And most importantly, you should always evaluate chunking with retrieval recall on a labeled set, not by intuition.
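
As a baseline for the strategies above, the following sketch shows fixed-size chunking with overlap, plus a simplified stand-in for the contextual prefix. It splits on whitespace rather than model tokens, and the `[title > section]` prefix format is an assumption for illustration; the approach described above would prepend a brief summary instead.

```python
def chunk_words(text, size, overlap):
    """Split text into chunks of `size` words, each sharing `overlap`
    words with the previous chunk."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in [0, size)")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks

def add_context(chunk, doc_title, section):
    """Prepend a short locator line to a chunk before embedding it
    (hypothetical format)."""
    return f"[{doc_title} > {section}] {chunk}"
```

Whatever splitter you settle on, the point from the answer still applies: compare variants by retrieval recall on a labeled set rather than by how the chunks look.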

Question 9

What is hybrid search, and when does it outperform pure vector search?

Accepted Answer

Hybrid search combines dense vector retrieval with sparse keyword retrieval like BM25 and then fuses the rankings, often using something like reciprocal rank fusion. It outperforms pure vector search whenever the query depends on exact tokens that the embedding model doesn't capture well — product codes, error codes, version numbers, people's names, anything where the discriminative signal is the literal string rather than the meaning. BM25 nails those because it's literally counting term overlap, while dense embeddings tend to wash that detail out. Pure dense search wins on paraphrase and concept queries where the user's words don't match the document's words. In real enterprise corpora, the query mix is bimodal — you get both kinds — which is why hybrid almost always wins. The next move on top of that is to add a cross-encoder reranker on the top results. Bi-encoders are fast but coarse; cross-encoders score the query and document jointly and reorder the top candidates much more accurately. In published benchmarks, hybrid plus reranking can outperform dense-only by a wide margin on enterprise search workloads.
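
Reciprocal rank fusion is simple enough to show directly. In this sketch each input is a list of document IDs ordered best first, for example one from a dense retriever and one from BM25; the constant `k=60` is a commonly used default, and the function name is illustrative.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Returns doc ids sorted by fused score.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)
```

Because only ranks are used, the dense and sparse scores never need to be put on the same scale, which is the main reason this fusion method is popular for hybrid search.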

Question 10

How do you evaluate a RAG pipeline in production?

Accepted Answer

I split RAG evaluation into retrieval and generation because they fail differently and the fix for each is different. For retrieval, I build a labeled set of questions paired with the chunks that should come back, and I track recall at k and mean reciprocal rank. Bad retrieval silently degrades everything downstream, and cosine similarity alone won't tell you. For generation, the metric I most care about is faithfulness: are the claims in the answer actually grounded in the retrieved context? That's the one I'd alert on in production because when faithfulness drops, it's almost always retrieval drift, like new documents getting ingested with bad parsing. I also like to add a lightweight unanswerable detector: if the top retrieval score is below a threshold, return "I don't know" rather than passing weak context to the model. And when using LLM-as-judge for any of this, I randomize the position of options to fight position bias and cap answer length to fight length bias. Without those guardrails, your eval numbers are theater.
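
The two retrieval metrics named above are short enough to write out. This sketch assumes each labeled question comes with the list of retrieved chunk IDs (best first) and the set of chunk IDs that should have come back; the function names are illustrative.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant ids that appear in the top k retrieved ids."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant id, or 0.0 if none was retrieved."""
    relevant = set(relevant)
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """runs: list of (retrieved, relevant) pairs, one per labeled question."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)
```

Recall at k tells you whether the right chunks made it into the prompt at all, and reciprocal rank tells you how far down they were, which is the signal that a reranker would help.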

Question 11

What is GraphRAG and when would you use it instead of vector RAG?

Accepted Answer

GraphRAG builds a knowledge graph of entities and relationships from your corpus during indexing, and then queries traverse that graph alongside vector retrieval. It significantly outperforms naive vector RAG on what I'd call global, synthesis-type questions — things like what are the dominant themes in this corpus, or how are these incidents related, where the answer requires combining information across many documents. Vector RAG is great for local lookup like what does the policy say about X, but it's bad at multi-hop relational questions like who knows what about X across our organization. The trade-off is that GraphRAG indexing is much more expensive — you're running a model to extract entities and relations on every chunk, so it can be 10 to 50 times the cost of a vector index. So you reach for it when relationships actually matter, when the corpus is stable enough that the indexing cost amortizes, and when you need auditability of multi-hop reasoning, like in fraud, compliance, or intelligence work. For typical Q&A over docs, vector RAG is still the right starting point.

Question 12

What are the main failure modes of a naive RAG pipeline?

Accepted Answer

The big ones I've seen are bad chunking, embedding-domain mismatch, no reranking, no off-topic guardrails, and no citation tracking, so you can't even debug what went wrong. The way I'd actually walk through debugging is by symptom. If the right chunk is retrieved but the answer is still wrong, it's a generation or grounding issue: lower the temperature and instruct the model to cite chunk IDs. If wrong chunks are coming back consistently, it's usually an embedding-domain mismatch; re-evaluate on a labeled set and consider domain-tuned embeddings. If the right chunks are there but buried at position eight, add a reranker. If the answer cites a chunk that doesn't actually say that, you have a faithfulness failure and you need a verification step that compares claims to context. If the same query produces different answers each time, look at sampling temperature and non-deterministic retrieval first; a semantic cache can then pin the answer for repeated queries. The hard-won lesson is that without traces, citations, and an eval suite, every one of these looks the same, so the first investment is observability, and then you can actually fix things.

Question 13

What is an AI agent? How does it differ from a workflow or chain?

Accepted Answer

An agent is a system where a language model dynamically directs its own process and tool use: it decides what to do next, what tool to call, and when to stop. That's distinct from a workflow, which is a system where models and tools are orchestrated through predefined code paths that you wrote. The key distinction is autonomy over control flow. The cost of that autonomy is real: every loop step is another model call, so latency and token spend are higher and harder to predict, and behavior is harder to reason about. The principle I've internalized from Anthropic's writing on this is to prefer the simplest pattern that works: start with a single model call, escalate to a workflow, and only escalate to a true agent when the task genuinely requires dynamic decisions. Saying "I default to agents for everything" is actually a red flag because agents add cost, complexity, and failure modes you don't need for most problems. Most production systems people call agents are actually workflows with a tool-use step.

Question 14

Explain the ReAct pattern. When would you NOT use it?

Accepted Answer

ReAct interleaves reasoning and acting — the model thinks out loud about what to do, takes an action like a tool call, observes the result, and then loops. It's the foundation of most agent frameworks because it gives the model a clear way to plan and adjust. Where I'd avoid it is when latency matters, because every reasoning step is a full model call and a five-step ReAct loop is five times the latency of a direct call. I'd also avoid it when the task is simple enough that direct tool calling does the job, or when you need deterministic behavior — the verbalized thoughts add variance. There are also better-fitting alternatives now. Plan-and-execute decouples planning from execution with one big planning call followed by cheaper execution steps. Reflexion adds a self-critique step that boosts quality at the cost of more tokens. The honest production reality is that most things people call ReAct agents could be replaced by a two-step workflow — classify, then act — at a fraction of the cost.

Question 15

How do you stop an agent from getting stuck in an infinite loop?

Accepted Answer

I treat this as a layered defense problem because no single check catches everything. First, hard caps on iterations and on tokens and dollars per session — agents can spend faster than you can react. Second, loop detection by hashing the action name and normalized arguments, and halting after a few repeats. Third, progress-based termination — if the last several steps haven't actually changed the agent's plan or visible state, that's a stuck signal even if the actions look different on the surface. Fourth, a supervisor agent or watchdog that monitors the trajectory and can interrupt. Fifth, a user-visible escalation path — when the agent gets stuck, hand off to a human or fall back to a deterministic response rather than failing silently. And underneath all of it, structured trace logging on every step so you can replay loops post-mortem. Without that last piece, you'll never figure out why the agent got stuck, and you'll keep firefighting the same patterns instead of designing them out.
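
The first two layers (hard caps and repeat detection) can be sketched as a small guard object that the agent loop consults before every action. The class name, thresholds, and normalization rules below are illustrative assumptions; real normalization depends on what your tools accept.

```python
import hashlib
import json

class LoopGuard:
    """Tracks step count and repeated actions for one agent session."""

    def __init__(self, max_steps=20, max_repeats=3):
        self.max_steps = max_steps
        self.max_repeats = max_repeats
        self.steps = 0
        self.seen = {}

    @staticmethod
    def fingerprint(action, args):
        # Lowercasing and stripping strings, plus sorted keys, makes
        # trivially different spellings of the same call hash identically.
        normalized = {k: v.strip().lower() if isinstance(v, str) else v
                      for k, v in args.items()}
        payload = json.dumps([action, normalized], sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def check(self, action, args):
        """Return None to continue, or a reason string if the agent should halt."""
        self.steps += 1
        if self.steps > self.max_steps:
            return "max_steps_exceeded"
        key = self.fingerprint(action, args)
        self.seen[key] = self.seen.get(key, 0) + 1
        if self.seen[key] >= self.max_repeats:
            return "repeated_action"
        return None
```

The returned reason string is what you would log in the trace and surface in the escalation path, so a post-mortem can tell a budget stop from a detected loop.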

Question 16

How do you handle tool-call failures in a production agent?

Accepted Answer

The lazy answer is to wrap each tool in a try-catch and retry on failure. The real answer is to teach the model the error so it can recover. I return structured tool results that include a status, an error code, a hint like "try a narrower query", and any retry-after timing. The model then uses that hint to do something different rather than looping the same broken call. I also distinguish three error classes: transient, like a network blip or a rate limit, which I retry with backoff and jitter; semantic, like a validation failure or most other 4xx responses, which I return to the model without retrying; and permanent, like missing auth, which halts and asks the user. For any tool that writes to the world, I always require an idempotency key so a retry doesn't double-charge a card or send two emails. And when reasonable, I add a fallback tool: if the structured database query fails, fall back to search, and if that fails, fall back to "I don't know". That layered design is what separates a real production agent from a demo.
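
Here is one way the structured result and the three error classes could fit together. Everything in this sketch is an assumption for illustration: the `ToolResult` fields, the mapping from HTTP-style status codes to classes, and the convention that a tool returns a `(status, payload)` pair. Idempotency keys and fallback tools are left out to keep it short.

```python
import random
import time
from dataclasses import dataclass
from typing import Optional

@dataclass
class ToolResult:
    status: str                        # "ok" or "error"
    data: Optional[dict] = None
    error_code: Optional[str] = None
    hint: Optional[str] = None         # guidance the model can act on

def classify(http_status):
    """Map an HTTP-style status to an error class (illustrative mapping)."""
    if http_status in (401, 403):
        return "permanent"
    if http_status == 429 or http_status >= 500:
        return "transient"
    if 400 <= http_status < 500:
        return "semantic"
    return "ok"

def backoff_delay(attempt, base=0.5, cap=8.0, rng=random):
    """Exponential backoff with full jitter."""
    return rng.uniform(0.0, min(cap, base * (2 ** attempt)))

HINTS = {
    "semantic": "fix the arguments and try a different call",
    "permanent": "stop and ask the user to re-authenticate",
    "transient": "service unavailable, try a fallback tool",
}

def call_with_policy(tool, args, max_retries=3, sleep=time.sleep):
    """tool(args) returns (http_status, payload). Only transient errors are retried."""
    for attempt in range(max_retries + 1):
        status, payload = tool(args)
        kind = classify(status)
        if kind == "ok":
            return ToolResult(status="ok", data=payload)
        if kind == "transient" and attempt < max_retries:
            sleep(backoff_delay(attempt))
            continue
        # Semantic and permanent errors, and exhausted retries, go back
        # to the model as a structured error it can reason about.
        return ToolResult(status="error", error_code=f"{kind}:{status}",
                          hint=HINTS[kind])
```

The `sleep` parameter is injectable so the retry path can be tested without waiting.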

Question 17

How does RAG differ from tool-calling, and when would you use each?

Accepted Answer

RAG retrieves passive knowledge and injects it into the prompt; it's read-only and idempotent. Tool calling lets the model take actions in the world, like calling an API, querying a live system, or writing to a database. The reason it's important to keep these distinct is safety. RAG has no side effects because it doesn't change anything, although retrieved text is still untrusted input that can carry injected instructions; tool calls can mutate state, so they need approval flows, idempotency keys, and tighter guardrails. So the practical rule is RAG for stable knowledge that fits in a corpus (docs, manuals, policies) because it's cheap and cacheable, and tool calling for live data like current order status or real-time prices, and for any side effect like sending an email or refunding a charge. In practice, modern agents use both: RAG to ground what we know about this customer, plus tool calling to act on it. The simplest baseline pattern I reach for is a single model with retrieval, tools, and memory wrapped around it before adding any agentic loop on top.

Question 18

When should you actually use a multi-agent system instead of a single agent?

Accepted Answer

I actually push back when teams jump to multi-agent too early, because in my experience they over-reach before they've exhausted single-agent designs. The real reasons to go multi-agent are pretty narrow. First, genuine parallelism — for example researching five vendors at once where the work is independent. Second, verification, where a separate critic agent catches the generator's mistakes; that improves quality measurably but doubles cost. Third, hard role separation for safety, like a tool-using agent kept separate from a planner that has no tool access. Fourth, when the task chain is so long that a single agent forgets early steps. The bad reasons are that it sounds more advanced or that it's nice for each agent to have a personality. And once you do go multi-agent, you have to design against amplification loops, where two agents keep escalating each other's confidence. The fixes are a shared scratchpad with versioning, a supervisor with veto power, and hard cycle limits so things can't spiral.

Question 19

Design a multi-agent system where agents delegate tasks to each other.

Accepted Answer

I'd structure it as an orchestrator that decomposes the task and delegates subtasks to specialist workers, with a shared state object passing results back, and the orchestrator aggregating the final answer. The piece I'd really focus on is the contract between agents, because that's where multi-agent systems usually fail in practice. Each specialist should publish a capability descriptor — its name, when to use it, the input schema, the output schema, the rough cost, the typical latency. The orchestrator routes based on those descriptors instead of free-form prompting, which is far more reliable. I'd protect against three classic failure modes: circular delegation, by tagging messages with a depth counter and capping it; state pollution, by giving each subgraph its own state and only passing messages at the boundary; and trace fragmentation, by propagating a single trace ID end-to-end so you can reconstruct who did what. Humans in the loop sit at the orchestration layer, not inside every specialist, because that's where the context to make decisions actually exists.
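
The contract pieces described above (capability descriptors, a depth counter, and a propagated trace ID) might look like the following. All names and fields are hypothetical, and the routing rule (cheapest capability whose required inputs are present) is just one simple policy chosen for the sketch.

```python
import uuid
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Capability:
    name: str
    when_to_use: str
    input_schema: dict       # e.g. {"required": ["query"]}
    output_schema: dict
    est_cost_usd: float
    typical_latency_s: float

@dataclass
class Message:
    task: str
    payload: dict
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    depth: int = 0

MAX_DEPTH = 3  # cap on delegation depth to stop circular delegation

def delegate(parent, task, payload):
    """Create a child message that inherits the trace ID and increments depth."""
    if parent.depth + 1 > MAX_DEPTH:
        raise RuntimeError(
            f"delegation depth {parent.depth + 1} exceeds cap {MAX_DEPTH}")
    return Message(task=task, payload=payload,
                   trace_id=parent.trace_id, depth=parent.depth + 1)

def route(capabilities, payload):
    """Pick the cheapest capability whose required input keys are all present."""
    eligible = [c for c in capabilities
                if set(c.input_schema.get("required", [])) <= set(payload)]
    if not eligible:
        raise LookupError("no capability accepts this payload")
    return min(eligible, key=lambda c: c.est_cost_usd)
```

Because every child message carries the parent's `trace_id`, a single query over the logs reconstructs the whole delegation tree, and the depth cap turns a circular hand-off into an explicit error instead of an unbounded loop.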

Question 20

What is a 'critic' or 'verifier' agent and when is it worth the cost?

Accepted Answer

A critic agent reviews another agent's output for errors, policy violations, or quality problems before it ships. It's worth the extra tokens and latency whenever the cost of being wrong outweighs the cost of an extra model call. Concrete examples are code generation where a critic that runs the unit tests and feeds errors back catches real bugs, anything in legal or medical or financial output where a hallucination is a liability event, and agent-generated SQL before you actually execute it. The anti-patterns are critics on creative writing, which is too subjective to score reliably, and critics that share the same model and the same prompt as the generator, because they share the same blind spots. The fix is to use a different model or a different perspective prompt. The most powerful version of this is constitutional self-critique, where the model reviews its own draft against a written set of principles before responding. I also track critic flip rate — how often the critic actually changes the answer — because if it's too low, it's noise, and if it's too high, your generator needs work.

Question 21

What is chain-of-thought (CoT) prompting and when does it actually help?

Accepted Answer

Chain-of-thought prompting is when you ask the model to reason step by step before giving its final answer. It improves performance on tasks where the intermediate steps are verifiable: math, multi-step logic, code, structured reasoning. It mostly helps on larger models; it was identified as an emergent capability that shows up reliably above a certain model scale. Where it doesn't help, or actively hurts, is on simple classification, on retrieval tasks, and anywhere latency matters, because every reasoning token is a billed token and added latency. With reasoning models you generally shouldn't prompt for chain of thought because they do it internally, and your "please think step by step" just wastes tokens. The pattern I prefer in production is structured chain of thought: ask for a scratchpad section followed by a clean final answer, so you get the reasoning benefit but you can strip the scratchpad before showing the user. That keeps the UX clean while still capturing the lift.
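
Stripping the scratchpad is a one-line post-processing step once the model is asked to wrap its reasoning in a known delimiter. The `<scratchpad>` tag name below is an assumed convention that you would specify in your own prompt, not something the model emits by default.

```python
import re

# Non-greedy match across newlines so multiple scratchpad blocks are each removed.
SCRATCHPAD = re.compile(r"<scratchpad>.*?</scratchpad>", re.DOTALL)

def strip_scratchpad(model_output):
    """Remove scratchpad sections and return only the user-facing answer."""
    return SCRATCHPAD.sub("", model_output).strip()
```

It is usually worth logging the full output before stripping, since the scratchpad is the most useful debugging artifact when an answer turns out to be wrong.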

Question 22

What's the difference between zero-shot, few-shot, and fine-tuning?

Accepted Answer

Zero-shot is just describing the task in the prompt and asking for an answer with no examples. Few-shot is including a handful of input-output examples in the prompt so the model can pick up the pattern in context. Fine-tuning actually updates the model's weights on a labeled dataset, so the behavior is baked in instead of relying on the prompt. The interesting practical point is that well-chosen few-shot examples close most of the gap between zero-shot and fine-tuning for classification and extraction tasks, which is a lot cheaper than training. Fine-tuning starts to win when you need consistent format at scale, when latency matters and you can't afford a long prompt full of examples, or when you need vocabulary the base model doesn't really know. The most common mistake is using random few-shot examples — what you actually want is a diverse set that covers the edge cases of the task. The advanced version is dynamic few-shot, where at query time you retrieve the nearest past examples to the current input and inject those, which is how a lot of production extraction pipelines work.

Question 23

What's the difference between a system prompt and a user prompt?

Accepted Answer

The system prompt sets the persona, tone, capabilities, and guardrails for the whole conversation. The user prompt is each individual turn from the user. The technical reason this distinction matters is that providers train their models to weight the system prompt higher than user input, so user content can't easily override safety or persona instructions. That's actually the first line of defense against prompt injection, but it depends on the rule that you never put untrusted content inside the system prompt and that you clearly delimit where the system prompt ends and untrusted content begins, often with XML tags or special markers. There's also a real cost angle. Both major providers offer prompt caching where the cached prefix is much cheaper, which is why production systems put their long stable instructions and knowledge base headers in the system prompt and put the dynamic per-request content in the user turn. So the system-versus-user split is partly about safety, partly about cost, and partly about hierarchy of trust.

Question 24

What is function calling / tool use in LLMs?

Accepted Answer

Function calling is when you give the model a list of tool schemas in JSON, and instead of replying in plain text, the model returns a structured object naming the tool and its arguments when a tool would be useful. Your application then executes the tool and feeds the result back into the conversation. The crucial thing to understand is that the model is not actually executing anything — it's just emitting structured JSON, and your code is in charge of the actual call. That separation is what makes tool use safe and testable. The biggest lever for whether and how often a tool gets called is the quality of the tool description — clear, specific descriptions with examples beat vague ones by a wide margin. For high-stakes tools like writes or money movement, you should validate the JSON against a schema, attach idempotency keys, and put a human approval gate in front of irreversible actions. The Model Context Protocol is becoming the standard interface for this, so a tool you build once can plug into any compatible client without re-wiring.
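
The separation between "the model emits JSON" and "your code makes the call" can be shown with a tiny dispatcher. The tool name, the `{"name": ..., "arguments": ...}` shape, and the simplified type map are assumptions for this sketch; provider APIs differ in the exact envelope and typically describe parameters with JSON Schema.

```python
import json

# Hypothetical tool registry: description, expected argument types, handler.
TOOLS = {
    "get_order_status": {
        "description": "Look up the current status of one order by its ID.",
        "parameters": {"order_id": str},
        "handler": lambda order_id: {"order_id": order_id, "status": "shipped"},
    }
}

def dispatch(model_output):
    """Validate a model-emitted tool call, then execute it in application code."""
    call = json.loads(model_output)
    spec = TOOLS.get(call.get("name"))
    if spec is None:
        return {"status": "error", "error": f"unknown tool: {call.get('name')}"}
    args = call.get("arguments", {})
    expected = spec["parameters"]
    if set(args) != set(expected):
        return {"status": "error",
                "error": f"expected arguments {sorted(expected)}"}
    for key, typ in expected.items():
        if not isinstance(args[key], typ):
            return {"status": "error", "error": f"{key} must be {typ.__name__}"}
    return {"status": "ok", "result": spec["handler"](**args)}
```

Validation failures are returned as structured errors rather than raised, so the result can be fed back into the conversation and the model can correct its call.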

Question 25

What is the Model Context Protocol (MCP) and why does it matter?

Accepted Answer

The Model Context Protocol is an open standard that defines how language model applications connect to external tools and data sources through a consistent JSON-RPC interface. The way I explain its importance is by analogy — MCP is to agents what USB-C is to devices, or what the Language Server Protocol is to code editors. Before it, every framework had its own tool format, so a Notion connector built for one didn't work in another. Now you build an MCP server once and any MCP-aware client can use it — desktop apps, IDEs, agent SDKs. For a production-grade MCP server, the things that matter are validating every tool input with a strict schema, never logging secrets, returning structured errors so the model can recover gracefully, and versioning your tool schemas so a client update doesn't break agents in production. The honest caveat is that MCP standardizes the surface but not the quality — a poorly described tool is still a poorly described tool, and that's still on you to get right.

Question 26

How do you implement persistent memory for a long-running agent?

Accepted Answer

I think of agent memory as four distinct things, because most candidates conflate them. Working memory is what's in the current context window — cheap and ephemeral, gone when the session ends. Episodic memory is the structured record of past interactions with timestamps and outcomes, stored in a regular database or vector store and retrieved by relevance. Semantic memory is facts and knowledge — that's basically your RAG corpus. And procedural memory is learned workflows or skills, sometimes baked into fine-tuned weights. To make episodic memory actually useful in production, I add a recency-weighted relevance score so old irrelevant memories don't crowd out new ones, explicit handling for contradictions when two memories disagree about the current state, time-to-live on memories that go stale, and a forget API for compliance reasons like the right to be forgotten. Most frameworks expose persistent thread state, but they don't solve contradiction or forgetting for you — that's still your engineering work, and it's where most agent memory systems quietly break.
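
A recency-weighted relevance score with a time-to-live can be sketched as follows. The exponential half-life decay, the one-week default, and the memory record fields are all assumptions chosen for illustration; contradiction handling and the forget API are not shown.

```python
import math

WEEK_S = 7 * 24 * 3600

def memory_score(relevance, age_s, half_life_s=WEEK_S):
    """Relevance in [0, 1] multiplied by an exponential recency decay."""
    return relevance * math.pow(0.5, age_s / half_life_s)

def recall(memories, now, ttl_s, top_n=3):
    """memories: dicts with 'text', 'relevance', and 'created_at' (seconds).

    Drops entries older than ttl_s, then ranks the rest by decayed relevance.
    """
    live = [m for m in memories if now - m["created_at"] <= ttl_s]
    ranked = sorted(
        live,
        key=lambda m: memory_score(m["relevance"], now - m["created_at"]),
        reverse=True,
    )
    return ranked[:top_n]
```

With a one-week half-life, a memory with relevance 0.9 from four weeks ago scores about 0.06, so a moderately relevant memory from yesterday outranks it.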

Question 27

What is the 'Lost in the Middle' problem and how does it affect agent design?

Accepted Answer

Lost in the middle is a finding from the 2023 paper "Lost in the Middle: How Language Models Use Long Contexts", led by researchers at Stanford, showing that language models recall information much better when it's at the start or the end of a long context, while information in the middle gets ignored or under-weighted. There's a clear U-shaped curve. That has direct design implications for agents. I put the most important instruction at the start and restate it at the end of the system prompt. I cap retrieved chunks at five to eight even when the context window could fit hundreds, because quality beats quantity. I order retrieval results so the most relevant chunk is first, the second most relevant is last, and weaker ones go in the middle. I add needle-in-a-haystack probes to my eval suite (synthetic questions whose answer is placed at varying positions) so I can actually see when this starts hurting me. And when in doubt, I prefer retrieval over stuffing, even when stuffing would technically fit. The mental model is that long context isn't a free lunch; it's a leakier bucket the larger it gets.
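
The ordering rule above (best chunk first, second best last, weaker ones in the middle) generalizes to alternating chunks between the front and the back of the context. This is a minimal sketch with an illustrative function name; the input is assumed to be sorted most relevant first.

```python
def order_for_context(chunks_by_relevance):
    """Place the strongest chunks at the edges and the weakest in the middle.

    Even-indexed items fill the front in order; odd-indexed items fill the
    back in reverse, so rank 1 ends up first and rank 2 ends up last.
    """
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]
```

For five chunks ranked 1 to 5, this yields the order 1, 3, 5, 4, 2, which puts the least relevant chunk in the position where recall is weakest.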

Question 28

What is an evaluation harness and why does every production agent need one?

Accepted Answer

An eval harness is a suite of test inputs with expected outputs or behaviors that lets you measure agent quality consistently across changes. The way I'd describe its role to a non-technical stakeholder is that it's a regression test suite for prompts and models. Without it, you can't actually tell whether a model upgrade or a prompt tweak helped or hurt — you'll ship things that feel better and silently degrade users. The structure I aim for has three layers. A golden set of fifty to a few hundred hand-curated cases that I treat as ground truth. An adversarial set of known-bad inputs like prompt injection attempts and edge cases that should fail safely. And a stream of production samples that I sample randomly each week and score with LLM-as-judge plus occasional human review. The harness runs on every prompt or model change in CI and blocks merges that regress beyond an SLO. Honestly, having actually built and maintained an eval suite is probably the single highest-signal indicator of a senior agent engineer.
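
At its core, the CI gate described above is a pass rate plus a regression threshold. The sketch below assumes each case is an input paired with a check function and that `predict` wraps whatever prompt and model combination is under test; the names and the 2-point default threshold are illustrative, not a recommendation.

```python
def run_suite(cases, predict):
    """cases: list of (input, check) pairs where check(output) returns a bool.

    Returns the fraction of cases that pass.
    """
    passed = sum(1 for inp, check in cases if check(predict(inp)))
    return passed / len(cases)

def gate(baseline_rate, candidate_rate, max_regression=0.02):
    """Return True if the candidate may merge, False if it regresses too far."""
    return candidate_rate >= baseline_rate - max_regression
```

In practice you would run this separately on the golden set and the adversarial set, since a change that improves average quality can still break a safety case.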

Question 29

What is LLM-as-Judge, and what are its failure modes?

Accepted Answer

LLM-as-judge means using a strong model to score the outputs of another model against a rubric. It's much cheaper than human evaluation and tends to correlate reasonably well with human judgment when set up carefully. The catch is the biases, and I'd want any interviewer to know I'm aware of three in particular. Position bias — judges tend to prefer the option presented first in a pairwise comparison; the fix is randomizing order or running both orders and averaging. Self-enhancement bias — a model judges its own outputs more favorably than others', which means you should never use the same model as both judge and generator. And length bias — longer answers tend to score higher regardless of quality, so you either control for length or include it explicitly in the rubric. Beyond the biases, I always give the judge a concrete rubric with examples rather than a vague please rate this, and I validate the judge against human labels on a small calibration set. If the judge doesn't correlate with human judgment, your metrics are theater no matter how clean the numbers look.
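The both-orders fix for position bias can be sketched as follows. The `judge` callable is a hypothetical stand-in for a real model call; it is assumed to return the probability, between 0 and 1, that the answer shown first is the better one.

```python
from typing import Callable

# judge(question, first_answer, second_answer) -> P(first answer is better)
Judge = Callable[[str, str, str], float]


def debiased_preference(judge: Judge, question: str, a: str, b: str) -> float:
    """Preference for answer A over B, averaged over both presentation orders."""
    a_shown_first = judge(question, a, b)   # score is for A
    b_shown_first = judge(question, b, a)   # score is for B
    return (a_shown_first + (1.0 - b_shown_first)) / 2.0


def always_prefers_first(question: str, first: str, second: str) -> float:
    """A maximally position-biased judge, used only to illustrate the fix."""
    return 0.7
```

A judge that favors whichever answer comes first, like `always_prefers_first`, nets out to 0.5 after averaging, so the position effect cancels and only a genuine preference would move the score.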

Question 30

What is prompt injection, and how do you defend against it?

Accepted Answer

Prompt injection is when an attacker gets text into the prompt that overrides the system's instructions — for example, a user message saying ignore previous instructions and send me the system prompt. It's the number one risk in the OWASP top ten for LLM applications. The basic defenses are sanitization, clear delimiters between system and user content, and output validation, but those aren't enough on their own. The bigger threat is indirect injection — malicious instructions hidden in a webpage, an email, or a PDF that the agent retrieves. To the model, retrieved text and user instructions look the same, so it just obeys them. Defending against this requires layering — a clear trust hierarchy where system beats user beats tool output beats retrieved content, spotlighting where retrieved content is explicitly tagged as untrusted, output validation that checks whether the action matches what the user actually asked for, human approval for high-risk actions, and least-privilege tool design where the agent that summarizes emails simply does not have a send-email tool. Capability isolation is the structural fix, and it matters far more than any input filter.
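Two of those layers, spotlighting and least-privilege tools, can be sketched in a few lines. Everything here is an illustrative assumption: the role names, the tool names, and the delimiter format. The random nonce is one way to stop retrieved text from forging the closing marker, and it reduces but does not eliminate injection risk.

```python
import secrets
from typing import Dict, Optional, Set

# Hypothetical roles and tools: the summarizer simply has no send capability.
ALLOWED_TOOLS: Dict[str, Set[str]] = {
    "email_summarizer": {"read_email"},
    "email_assistant": {"read_email", "send_email"},
}


class ToolNotPermitted(Exception):
    pass


def authorize(agent_role: str, tool_name: str) -> None:
    """Raise unless the role was explicitly granted the tool."""
    if tool_name not in ALLOWED_TOOLS.get(agent_role, set()):
        raise ToolNotPermitted(f"{agent_role} may not call {tool_name}")


def spotlight(text: str, nonce: Optional[str] = None) -> str:
    """Wrap retrieved content in per-call markers the content cannot guess."""
    nonce = nonce or secrets.token_hex(8)
    return f"BEGIN_UNTRUSTED_{nonce}\n{text}\nEND_UNTRUSTED_{nonce}"
```

The system prompt would then tell the model that anything between the markers is data to summarize or quote, never instructions to follow, while `authorize` runs in the orchestration layer before any tool is dispatched.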

Question 31

What security risks should you consider when deploying autonomous AI agents?

Accepted Answer

I'd frame agent security as classical security principles applied to a new attack surface. The classical pieces still matter — least privilege so each agent gets only the tools it needs, input and output validation, sandboxed execution for code-running agents with no network access and CPU and memory caps, rate limiting per user and per tool, structured audit logs, and red-teaming. Then there are agent-specific risks that classical security misses. Capability creep — adding tools makes the blast radius grow non-linearly because tools combine in unexpected ways. Transitive trust — your agent calls a tool that calls another tool and the security boundary collapses. Side-channel exfiltration — a malicious document convinces the agent to encode secrets into a URL it then fetches, leaking data without ever touching a tool that says exfil. The mental model I use is that an autonomous agent should be designed like a junior employee with credentials — assume they will do exactly what they're told, including malicious instructions hidden in their inputs, and architect for that.

Question 32

What is LoRA and why is it preferred over full fine-tuning?

Accepted Answer

LoRA, short for Low-Rank Adaptation, freezes the base model's weights and trains small rank-decomposition matrices whose product is added to selected weight matrices, typically the attention projections. In the original paper's GPT-3 setup that cut trainable parameters by roughly four orders of magnitude, and in practice it means you can fine-tune a seven-billion-parameter model on a single GPU instead of needing a cluster. The reason it works is that most of the adaptation needed for a specific task lives in a low-rank subspace of weight updates: you don't need to move every parameter, you just need a small task-specific delta. QLoRA pushes this further by quantizing the frozen base weights down to four bits, which lets even a sixty-five-billion-parameter model be fine-tuned on a single 48 GB GPU. The production wins are huge: adapter files are megabytes instead of gigabytes, you can serve many adapters off the same base model with adapter swapping, and iteration is cheap. The trade-off is that the absolute quality ceiling can be slightly lower than a full fine-tune, but for most use cases LoRA gets you the vast majority of the value at a tiny fraction of the cost.
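To make the parameter saving concrete, here is the arithmetic for a single weight matrix. For a frozen matrix of shape d_out by d_in, LoRA trains two factors of shapes d_out by r and r by d_in, so the trainable count is r times (d_out + d_in). The 4096 by 4096 shape and rank 8 below are illustrative assumptions, not figures from the text.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Parameters updated when the whole matrix is fine-tuned."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the two low-rank factors B (d_out x r) and A (r x d_in)."""
    return rank * (d_out + d_in)


# One hypothetical 4096 x 4096 projection adapted at rank 8.
full = full_params(4096, 4096)        # 16,777,216
lora = lora_params(4096, 4096, 8)     # 65,536
reduction = full // lora              # 256
```

The per-matrix reduction here is 256 times. The whole-model ratio is larger still because adapters are usually attached to only a subset of the matrices while everything else stays frozen.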

Question 33

What is RLHF, and what are its known limitations? What about DPO?

Accepted Answer

Reinforcement learning from human feedback trains a reward model on human preference data and then uses reinforcement learning, usually PPO, to optimize the language model against that reward. It's the technique that turned raw base models into assistants like ChatGPT and Claude that people actually want to talk to. The known limitations are real. Reward hacking: the model learns to game the reward model rather than genuinely improve. Labeler variance: preferences are noisy and culturally biased. Policy drift: without a strong KL penalty against the reference model, the policy drifts off-distribution into weird outputs. And it's expensive and slow because you're training three models in sequence: the supervised fine-tuned model, the reward model, and the RL policy. DPO, Direct Preference Optimization, reformulates the math so you can skip the explicit reward model entirely: same preference data, single training stage, much simpler implementation. Most open-source post-training in the last couple of years uses DPO or its variants because it's just easier to get right. Constitutional AI from Anthropic goes a different direction by replacing most human labelers with an AI critic against a written constitution, which is what made their training process scale.
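To show why DPO needs no separate reward model, here is the per-pair loss as a minimal sketch. It takes four sequence log-probabilities, the policy and the frozen reference model each scoring the chosen and the rejected response, and returns the negative log-sigmoid of the scaled margin. The function name and the default beta of 0.1 are assumptions for illustration; a real trainer would compute this over batches of tensors.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair.

    margin = (log pi(y_w) - log pi_ref(y_w)) - (log pi(y_l) - log pi_ref(y_l))
    loss   = -log sigmoid(beta * margin)
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # -log(sigmoid(z)) equals log(1 + exp(-z)); this form avoids overflow.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
```

When the policy matches the reference the margin is zero and the loss is log 2. The loss falls as the policy raises the chosen response's likelihood relative to the rejected one, measured against the reference, which is the implicit reward the explicit reward model would otherwise have supplied.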

Question 34

Design a customer-support agent that handles 10,000 tickets/day with <2s response time.

Accepted Answer

I'd walk it through like a real architecture review. At the front door, a small fast classifier model routes incoming tickets into three buckets — deterministic FAQ templates for known questions, RAG-only Q&A for grounded informational queries, and a tool-using agent for actions that touch a customer's account. I'd target time-to-first-token under five hundred milliseconds because that's the latency users actually feel, and stream tokens out as they're generated. I'd cache aggressively — prompt caching for the long stable system prompt, and semantic caching for repeated questions. RAG would run over the help center plus per-customer context like recent orders and tickets, with hybrid search and a reranker. Any tool call that mutates state goes through an approval queue, with human approval required above a dollar threshold. Observability on every conversation, with alerts on faithfulness drops and escalation-rate spikes. On cost, the math typically lands somewhere around a fraction of a cent to a few cents per resolved ticket, versus several dollars for a human agent — and there's public reporting from companies like Klarna showing AI assistants handling the workload of hundreds of human agents at parity satisfaction.

Question 35

How would you design a multi-tenant RAG system where each tenant's data is isolated?

Accepted Answer

The first cut is namespace or metadata filtering at the vector store, with every query carrying a tenant ID and the retrieval layer enforcing it. RBAC sits at the API layer above. But the failure modes are what interviewers care about. Cross-tenant leakage is the P0 — one missed filter and another tenant's data leaks. The defense is in depth — filter at the vector store, filter at the API gateway, and validate the tenant ID in the metadata of every retrieved chunk after the fact, refusing to forward any chunk whose tenant ID doesn't match the requester. Embedding inversion is another risk — embeddings can leak source content, so I treat them as PII rather than as opaque vectors. Prompt injection from one tenant's documents trying to escalate access to another tenant's data needs capability isolation. For very sensitive or regulated tenants, I'd give them their own vector index — more expensive, but zero shared blast radius. On top of that, per-tenant rate limits, per-tenant cost caps, and per-tenant audit logs that record who asked what, what was retrieved, and what was returned.
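The after-the-fact check on retrieved chunks can be sketched as a last-line filter. This is a minimal illustration that assumes each chunk carries its tenant ID in metadata; `Chunk` and `enforce_tenant` are hypothetical names, and the filter is meant to sit behind the vector-store and gateway filters, not replace them.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Chunk:
    text: str
    tenant_id: str


def enforce_tenant(chunks: List[Chunk],
                   requester_tenant_id: str) -> Tuple[List[Chunk], int]:
    """Forward only chunks owned by the requester.

    Returns the safe chunks plus the number dropped. Any non-zero count
    means an upstream filter failed and should page someone.
    """
    safe = [c for c in chunks if c.tenant_id == requester_tenant_id]
    dropped = len(chunks) - len(safe)
    return safe, dropped
```

Returning the dropped count alongside the safe chunks keeps the request working for the user while still surfacing the upstream failure as an alertable signal.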

Question 36

How would you build an evaluation platform for a company running 50 different agent workflows?

Accepted Answer

I'd treat the platform like a CI/CD pipeline for prompts, with three layers. First, offline regression — every prompt or model change runs the workflow's golden set in CI and blocks merges if quality regresses beyond an agreed SLO. Second, shadow or canary — the new version runs alongside the old on a small slice of production traffic, scored automatically, with rollback if metrics drop. Third, production telemetry — every conversation traced, sampled, and scored, with weekly trend reports per workflow. The metadata model tags each trace with the workflow, the version, the model, the tenant, latency, cost, and quality. Dashboards show quality versus cost versus latency on each version so PMs can see the trade-offs explicitly. For subjective metrics, I use an ensemble of judges plus periodic human spot-checks. The hard parts are not the LLM judge — they're dataset hygiene, the per-workflow rubric design, and the org change management to actually make engineering teams wait for evals before shipping. Tools like Arize Phoenix, Langfuse, or Braintrust handle the plumbing, but the discipline is on you.

Question 37

Describe how you would implement a human-in-the-loop approval step for high-risk agent actions.

Accepted Answer

The core flow is straightforward — pause the workflow before the high-risk action, persist the agent's full state, notify an approver with the context, and resume after the decision, using idempotency keys to prevent double execution. The details are where it gets interesting. I'd codify what counts as high-risk in policy, not in code comments — for example, any tool that writes, any action above a dollar threshold, any action affecting another user. The approval needs to include the full reasoning trace and the proposed effect, not just the action name, because approvers can't approve what they don't understand. There has to be a timeout policy that auto-denies or escalates after some number of hours. The idempotency key has to be generated before approval, attached to persisted state, and reused on resume so a network blip doesn't double-execute. And every approval and denial gets recorded in an audit log with the approver's identity and timestamp, which regulators do care about. Critically, you should only escalate genuinely high-risk actions, because if approvers are asked too often they'll rubber-stamp and the human checkpoint becomes worthless.
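A minimal in-memory sketch of that flow, showing the idempotency key being minted before approval and reused on resume. `ApprovalQueue` and its method names are assumptions for illustration; a real system would persist state durably, and the timeout policy and approver notification are omitted here.

```python
import time
import uuid
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class PendingAction:
    idempotency_key: str
    tool: str
    args: Dict[str, Any]
    status: str = "pending"  # pending -> approved | denied -> executed


class ApprovalQueue:
    def __init__(self) -> None:
        self._actions: Dict[str, PendingAction] = {}
        self.audit_log: List[Dict[str, Any]] = []

    def request(self, tool: str, args: Dict[str, Any]) -> str:
        """Persist the proposed action; the key exists before any approval."""
        key = str(uuid.uuid4())
        self._actions[key] = PendingAction(key, tool, args)
        return key

    def decide(self, key: str, approved: bool, approver: str) -> None:
        action = self._actions[key]
        if action.status != "pending":
            raise ValueError("decision already recorded")
        action.status = "approved" if approved else "denied"
        self.audit_log.append({"key": key, "approver": approver,
                               "approved": approved, "at": time.time()})

    def resume(self, key: str,
               execute: Callable[[str, Dict[str, Any]], None]) -> bool:
        """Run the action once; a retried resume is a no-op returning False."""
        action = self._actions[key]
        if action.status == "executed":
            return False
        if action.status != "approved":
            raise PermissionError(f"action is {action.status}")
        execute(action.tool, action.args)
        action.status = "executed"
        return True
```

Because the status flips to executed only after `execute` returns, a failure inside the tool leaves the action approved and retryable, while a duplicate resume after success does nothing.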

Question 38

How do you reduce LLM latency in a production chatbot?

Accepted Answer

Order matters here because it's about where the time actually goes. The first step is profiling, because in most chatbots retrieval and tool latency dominate, not the model itself. Then I'd attack in this order. Stream the response, because time to first token is the metric users feel, and I aim for under five hundred milliseconds. Use prompt caching, because the major providers offer prefix caching that makes cached input tokens both cheaper and faster to process, so the long stable system prompt belongs there. Route by model: classify queries first with a cheap model and route only the genuinely complex ones to the flagship. Run tool calls in parallel: modern model APIs support parallel tool calling, and you should never serialize independent calls. For self-hosted inference, speculative decoding, where a small draft model proposes tokens and the big model verifies them in one pass, gets you a noticeable speedup. One thing I'd correct any junior on: temperature zero doesn't actually help latency, it just helps reproducibility, because the same compute happens either way.
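The point about not serializing independent tool calls can be illustrated with `asyncio.gather`. The tool names and the 50 millisecond latencies are made up for the sketch; `call_tool` stands in for a real network call.

```python
import asyncio
from typing import List


async def call_tool(name: str, latency_s: float) -> str:
    """Hypothetical tool call; the sleep stands in for network latency."""
    await asyncio.sleep(latency_s)
    return f"{name}: ok"


async def fetch_context() -> List[str]:
    # The three lookups do not depend on each other, so they run
    # concurrently. Wall-clock time is roughly the slowest call rather
    # than the sum of all three.
    results = await asyncio.gather(
        call_tool("orders", 0.05),
        call_tool("tickets", 0.05),
        call_tool("profile", 0.05),
    )
    return list(results)
```

Calling `asyncio.run(fetch_context())` returns the three results in the order the calls were listed, regardless of which finishes first.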

Question 39

How do you build a cost-efficient agent that handles both simple and complex tasks?

Accepted Answer

I make this concrete by quantifying it. A typical support workflow might cost a tiny fraction of a cent for a classifier call, around a tenth of a cent for retrieval, half a cent for generation with a small model, and several cents with a flagship model. Routing eighty percent of queries to the small model cuts the blended cost to roughly a third of the all-flagship figure at those prices, and closer to an order of magnitude when the price gap is wider or more traffic qualifies for the small model, usually with minimal quality drop measured on your eval suite. On top of that, prompt caching can save fifty to eighty percent on long stable prefixes. Semantic caching catches near-duplicate questions, but you have to set the similarity threshold carefully: too low and you serve a cached answer to a question that only looks similar, too high and the cache rarely hits. I always set per-user and per-tenant budget caps with auto-disable, because there have been well-publicized incidents of agents running up huge bills overnight. I instrument cost-per-resolved-task as the real KPI, not cost-per-token, because tokens are an implementation detail. And for batchable workloads like overnight summarization or embedding refreshes, the batch APIs from major providers cut the cost roughly in half.
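The routing arithmetic is worth working through once. The sketch below uses the per-call figures from the answer, with two stated assumptions: "a tiny fraction of a cent" is taken as 0.01 cents and "several cents" as 5 cents. The all-flagship baseline skips the classifier because nothing needs routing.

```python
# All prices in US cents per query. CLASSIFIER_CENTS and FLAGSHIP_CENTS are
# assumed stand-ins for "a tiny fraction of a cent" and "several cents".
CLASSIFIER_CENTS = 0.01
RETRIEVAL_CENTS = 0.1
SMALL_MODEL_CENTS = 0.5
FLAGSHIP_CENTS = 5.0


def all_flagship_cost() -> float:
    """Baseline: every query is retrieved for and answered by the flagship."""
    return RETRIEVAL_CENTS + FLAGSHIP_CENTS


def routed_cost(small_share: float) -> float:
    """Blended cost when small_share of queries go to the small model."""
    if not 0.0 <= small_share <= 1.0:
        raise ValueError("small_share must be between 0 and 1")
    generation = (small_share * SMALL_MODEL_CENTS
                  + (1.0 - small_share) * FLAGSHIP_CENTS)
    return CLASSIFIER_CENTS + RETRIEVAL_CENTS + generation


def savings_factor(small_share: float) -> float:
    return all_flagship_cost() / routed_cost(small_share)
```

With these assumed prices, an eighty percent small-model share brings the blended cost to about 1.51 cents against 5.1 cents, a saving of roughly 3.4 times. The residual flagship traffic dominates the bill, so getting near an order of magnitude requires either a wider price gap between the two models or a higher small-model share.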

Question 40

Tell me about a time an agent you built behaved unexpectedly in production. What did you do?

Accepted Answer

The structure I use is a STAR plus postmortem, because that's what senior interviewers actually want. I'd start with a specific observable symptom — say, faithfulness scores on our support agent dropped overnight by fifteen percent. I'd describe what I owned in that situation. Then the actions, in order — first an immediate mitigation to stop the bleeding, like flipping the kill switch or routing affected traffic to a human, then root-causing through traces rather than guessing, then a systemic fix, which usually means an eval test that would have caught it plus a monitoring alert plus a guardrail. Then a blameless postmortem written and shared. The result has to include numbers — incident closed in some number of minutes, regression test added, recurrence rate measured. The worst version of this answer is just I fixed the bug, because that signals you'll keep firefighting the same problem. The version interviewers grade well is the one where the systemic fix means the same class of bug can't ship again.

Question 41

How do you stay current in a field that changes as fast as agentic AI?

Accepted Answer

I'd point to active learning rather than passive consumption, because interviewers can tell the difference. Concrete signals I'd give — I build a small prototype every couple of weeks to actually test a new technique, whether that's prompt caching, an MCP server, or trying GraphRAG on a corpus I know well. I read or contribute to open source — looking at how LangGraph handles state, or how vLLM does paged attention, teaches you more than another blog post. I have an opinion on a recent paper or release that I can defend, like why I think contextual retrieval is going to become a default. And I share what I learn, even if it's just a brief internal write-up or a Loom for the team. For sources, I'd name two or three I actually use rather than rattling off ten — the engineering blogs from Anthropic and OpenAI, a podcast like Latent Space, and arXiv-sanity for paper triage. The thing I'd avoid is the answer everyone gives, which is I read a lot — that's true of every candidate.

Question 42

Describe a time you convinced stakeholders that a more conservative agent design was the right call.

Accepted Answer

I'd frame this with a specific example so it doesn't sound theoretical. The structure is — they wanted some level of autonomy, I argued for a guardrail, and the reason I won the argument was that I brought a measured risk model rather than just a feeling. For example, on an action that mutated billing, the PM wanted full agent autonomy. I went back with the eval data — at the agent's then-current accuracy of around ninety-two percent, an autonomous setup would create roughly eighty incorrect billing changes per thousand interactions. Each of those is a customer service escalation that costs real money to resolve, plus the customer trust impact that's harder to put a number on. We agreed on a human-approval gate above a dollar threshold, with explicit exit criteria — once we got the agent above ninety-nine percent on the eval, we'd revisit the threshold. That structure of measured risk, business cost translated into dollars, and agreed exit criteria is what senior interviewers grade on. It shows you can disagree well, you can collaborate, and you can speak the language of the people making the decision.

Question 43

How does an agent loop know when to stop, and what safeguards prevent a runaway loop?

Accepted Answer

The honest answer is that a single max-iterations cap is not enough — and the canonical reference for why is the LangChain incident in late 2025 where four agents with no step cap got into a clarification ping-pong that ran for eleven days and burned forty-seven thousand dollars. So when I design termination, I think about three independent safeguards, not one. First, a hard step ceiling per conversation, enforced in the orchestration layer, not in a prompt — because the model will happily ignore a prompt instruction once context drifts. Second, a per-conversation budget gate that is a real hard stop and returns a structured budget-exceeded response. Alerts are not enforcement — that's the precise lesson from the postmortem. And third, a duplicate-input hash so the agent can detect it has already seen this observation and refuse to recurse on it. If I have time I'll also mention context drift — as the history grows, the weight of the original system prompt falls relative to recent turns, and the model starts re-deriving steps it already finished. Putting an explicit completed-actions record inside the prompt helps a lot. And ideally the budget stop is resumable through a checkpointer like LangGraph's, so the agent can be reviewed and continued rather than restarted from scratch.
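The three safeguards can be sketched as one guard object that the orchestration layer consults before every step. `LoopGuard` and its stop-reason strings are hypothetical names, and the limits in the usage note are placeholders; the point is that each check is enforced in code and returns a structured reason instead of relying on the prompt.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class LoopGuard:
    max_steps: int
    budget_usd: float
    steps: int = 0
    spent_usd: float = 0.0
    seen: Set[str] = field(default_factory=set)

    def check(self, observation: str, step_cost_usd: float) -> Optional[str]:
        """Return a stop reason, or None if the loop may continue."""
        digest = hashlib.sha256(observation.encode("utf-8")).hexdigest()
        if digest in self.seen:
            return "duplicate_observation"   # already processed this input
        self.steps += 1
        if self.steps > self.max_steps:
            return "step_ceiling"            # hard cap, not a prompt hint
        self.spent_usd += step_cost_usd
        if self.spent_usd > self.budget_usd:
            return "budget_exceeded"         # hard stop, not an alert
        self.seen.add(digest)
        return None
```

A loop would create something like `LoopGuard(max_steps=20, budget_usd=2.0)` per conversation, call `check` with each new observation and the estimated cost of the next step, and return the stop reason to the caller as a structured response when it is not None.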

Question 44

When would you choose Plan-and-Execute over ReAct, and what's the cost of being wrong?

Accepted Answer

I treat this as a cost and recovery trade-off, not a style choice. Plan-and-Execute has two real advantages — you can parallelize independent subtasks, which is a big wall-clock win when each sub-LLM call is expensive, and you get a human-reviewable plan before you commit any execution spend. That's actually why Anthropic's orchestrator-workers pattern is a Plan-and-Execute variant under the hood. The brittleness is also real — if reality contradicts the plan, you either re-plan, which doubles your planner cost, or you push through with a stale plan and the executor starts doing things that don't make sense. ReAct flips that. Each step sees the last observation, so it adapts naturally — but every step is a full LLM call, errors compound across the chain, and the plan only exists implicitly so the trace is harder to debug. My decision rule is — if a re-plan is cheap and the path is genuinely unknown, like web research or debugging, I'll go ReAct. If the subtasks are independent and I can parallelize them, like generating a multi-section report or pulling data from several tools, I'll go Plan-and-Execute. The asymmetric cost matters — picking ReAct on a parallelizable task costs you latency; picking Plan-and-Execute on an exploratory task costs you re-planning loops on every surprise.

Question 45

It's 3am and a production agent is misbehaving for a customer. Walk me through how you debug it.

Accepted Answer

First thing I do is pull the trace, not the logs — and that distinction matters. In LangSmith or Arize Phoenix I can see every LLM call, every tool invocation, and the intermediate state, and the question I'm asking is which step deviated. Then I walk through the three failure classes the 2026 agentic-AI taxonomy survey calls out, in order — was it hallucination in action, where the model asserted something false and the next step treated it as fact? Was it an unbounded loop, where the step count is climbing but there's no goal-met signal? Or was it prompt injection through a malicious tool response? Once I've localized the step, I make one more split — is this deterministic or stochastic? If the same input always fails, it's a prompt or a config issue and I can hotfix. If it's intermittent, a hotfix will burn me and I need an eval harness to characterize it before I touch anything. At three in the morning the mitigation order is — feature-flag the misbehaving tool or fall back to the previous prompt version so the customer is unblocked, and only roll a real code change after I can reproduce it in an eval. The instinct I never let myself follow is re-running the agent and hoping for a different result — that's the tell that someone hasn't actually been on-call.

Question 46

How do you test a non-deterministic agent? Unit tests don't work — what does?

Accepted Answer

The mental shift that has to happen is moving from asserting on outputs to asserting on behaviors and constraints. Outputs are non-deterministic, but behaviors are stable enough to test. So instead of saying 'the answer must contain X', I'll assert things like 'it called a tool in the search or calc category', 'the step count stayed under eight', 'it never called a write-tool without a confirmation step', or 'every structured output passed schema validation'. Then I run each test input five to ten times and I assert on pass rate over repeated runs rather than exact equality. For test coverage, I use Arize's five failure modes as a checklist — hallucination cascade, context overflow, unbounded loop, tool misuse, cascading timeout — and I write at least one behavioral test per mode. For debugging, LangSmith or Arize lets me diff traces between runs to see which step actually changed. The trap I'd warn against is snapshot tests on outputs — they feel productive but they're a flake engine. They break on temperature noise, they break on model upgrades, and they tell you nothing about why something failed.
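A sketch of what behavioral assertions over repeated runs can look like. The `Trajectory` record, the tool names, and the seeded fake agent are assumptions for illustration; in practice `run_once` would invoke the real agent and the trajectory would come from its trace.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trajectory:
    tool_calls: List[str]


def stays_under_step_cap(traj: Trajectory, cap: int = 8) -> bool:
    return len(traj.tool_calls) < cap


def never_writes_unconfirmed(traj: Trajectory) -> bool:
    """Every write_* call must be preceded by its own confirm step."""
    confirmed = False
    for call in traj.tool_calls:
        if call == "confirm":
            confirmed = True
        elif call.startswith("write_"):
            if not confirmed:
                return False
            confirmed = False  # a confirmation covers one write only
    return True


def pass_rate(run_once: Callable[[], Trajectory],
              check: Callable[[Trajectory], bool], runs: int = 10) -> float:
    return sum(1 for _ in range(runs) if check(run_once())) / runs


def make_fake_agent(seed: int) -> Callable[[], Trajectory]:
    """Stand-in agent whose search count varies from run to run."""
    rng = random.Random(seed)

    def run_once() -> Trajectory:
        searches = ["search"] * rng.randint(1, 4)
        return Trajectory(searches + ["confirm", "write_ticket"])

    return run_once
```

The test then asserts something like `pass_rate(agent, never_writes_unconfirmed) == 1.0` for safety constraints and a softer threshold such as 0.9 for quality behaviors, so run-to-run variation in the path does not fail the suite but a constraint violation always does.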

Question 47

Pick a framework for a multi-agent system: LangGraph, CrewAI, AutoGen, or OpenAI Agents SDK. Defend the choice.

Accepted Answer

I evaluate these against four axes that map directly to production failure modes (native step-cap support, native budget-gate enforcement, native observability, and native human-in-the-loop primitives), not developer experience or GitHub stars. On that grid, LangGraph is native on three of the four: step caps through the graph's recursion limit, observability through LangSmith and OpenTelemetry, and HITL through the interrupt mechanism that lets you pause, inspect, and modify state at a checkpoint. Budget gates you still wire up through callback hooks. AutoGen is native on step caps and HITL and gives you an actor model for distributed event-driven coordination, which matters at scale. CrewAI is the fastest to develop in, but almost every production safeguard is manual, and that's the trade. The OpenAI Agents SDK is the lightest weight, with very tight integration with the GPT tool surface, but you re-implement most of the safeguards yourself. My default leans LangGraph for anything that needs durability and resume-from-checkpoint, AutoGen when the system is distributed, CrewAI for prototypes and clearly bounded role workflows, and OpenAI's SDK when the integration story with GPT tools outweighs the orchestration work I'll have to own.

Question 48

When would you NOT use an agent framework, and just write the orchestration yourself?

Accepted Answer

I push toward primitives in three pretty specific situations. First, when the workflow is fixed and sequential — if the execution order doesn't change based on what the model says, it's not really an agent, it's a pipeline, and prompt chaining with three function calls is going to be more debuggable than an AgentExecutor wrapper. Second, when framework abstractions are hiding the failure modes I care about — historically the high-level LangChain abstractions made it harder to inspect intermediate state, and on primitives I can see every prompt, every tool response, and every state transition. Third, when latency or cost are first-class constraints — every framework layer adds overhead, and if I need a sub-hundred-millisecond agent path in production I'm usually going to direct API calls with custom orchestration. The principle I borrow from Anthropic's guidance is start simple, add complexity only when the simpler thing actually breaks. The signal I think senior interviewers are reading here is whether I have the judgment to not reach for the most impressive-sounding tool — because reaching for it by default is the easiest way to ship a system that's harder to debug than the problem it was meant to solve.

Question 49

Design a multi-agent system. What's the failure mode that interviewers most want to hear you defend against?

Accepted Answer

The failure mode interviewers want me to name explicitly is the coordination contract gap, and the canonical example is the LangChain forty-seven-thousand-dollar incident — four agents, no agent had authority to terminate, and the conversation ping-ponged for eleven days. So I start the design with the contract, not the topology. The contract has four parts. One — which agent has authority to terminate, and what signal does it send. Two — how does an agent distinguish 'task complete' from 'I'm confused', because the orchestrator needs to tell those apart. Three — what's the shared state schema, ideally versioned, so two agents can't silently disagree on the world model. And four — how do I bound coordination cost, because every inter-agent message is an LLM call, and a peer-to-peer chat with no termination clause is unbounded by construction. Topology then falls out of the contract — orchestrator-workers when subtasks are unpredictable and I want a single decider, supervisor-hierarchical when the roles are stable, and peer-to-peer only when I genuinely need agent negotiation, because peer-to-peer needs the strictest termination contract precisely because no one is in charge.

Question 50

What's the difference between RAG and 'agentic RAG', and when is the extra complexity worth it?

Accepted Answer

Classical RAG is a fixed pipeline — retrieve once, stuff the chunks into the prompt, generate. Agentic RAG turns retrieval into a decision the model gets to make in a loop — it can reformulate the query, retrieve again, decide whether the evidence is sufficient, optionally pick a different index, and only then answer. The cost is real — you're typically paying three to five times the LLM calls per query and adding latency — so the threshold for it being worth it has to be concrete. Worth it when the user query is under-specified and a single retrieval reliably misses, like multi-hop questions or ambiguous entities. Worth it when you have multiple sources and the right one isn't predictable from the query. And worth it when the answer needs to cite which retrieval it came from for a human reviewer. Not worth it for short factual queries, single-document workflows, or anything where p95 latency matters more than recall. The other shift people forget — the eval changes. RAGAS faithfulness and context relevance are still useful, but agentic RAG also needs trajectory eval — did the agent stop retrieving at the right point, or did it loop. Without that, you'll happily ship an agent that retrieves four times when one would have done.
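The retrieve, judge, reformulate loop can be sketched with the model-driven pieces passed in as callables. `retrieve`, `is_sufficient`, and `reformulate` are hypothetical stand-ins for a search call and two LLM calls, and the three-round cap is an assumed default. Returning the list of queries issued gives the trajectory eval something to score.

```python
from typing import Callable, List, Tuple


def agentic_retrieve(
    query: str,
    retrieve: Callable[[str], List[str]],
    is_sufficient: Callable[[str, List[str]], bool],
    reformulate: Callable[[str, List[str]], str],
    max_rounds: int = 3,
) -> Tuple[List[str], List[str]]:
    """Retrieve in a loop until the evidence is judged sufficient.

    Returns (evidence, trajectory), where trajectory lists every query
    actually issued, in order.
    """
    evidence: List[str] = []
    trajectory: List[str] = []
    current = query
    for _ in range(max_rounds):
        trajectory.append(current)
        evidence.extend(retrieve(current))
        if is_sufficient(query, evidence):
            break
        current = reformulate(query, evidence)
    return evidence, trajectory
```

A trajectory check can then assert that `len(trajectory)` stays at one for questions a single retrieval should answer, which catches the agent that retrieves four times when one would have done.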

GenAI & Agentic AI Interview Questions

Foundations