RAG · Production · Observability

Keeping RAG Honest When Your Documents Change

Your RAG demo was perfect. Then someone edited a doc, deleted a page, and shipped a new policy — and your assistant kept citing the old one. Here's how to build a retrieval layer that doesn't quietly rot.

AgentSwarms Authors

May 22, 2026 · 16 min read

The demo was flawless. We dropped a folder of policy PDFs into a knowledge base, wired up retrieval, and the assistant answered everything with crisp citations. Two weeks later, support escalated a ticket: the bot had confidently quoted a refund window that legal had changed a month earlier. Nobody touched the code. The documents had moved on without us.

This is the failure almost nobody teaches. Every RAG tutorial ends at “…and then it answers from your documents.” But documents are not a fixed thing you index once. They're alive — edited, versioned, deprecated, deleted, reorganized. The moment your corpus changes and your index doesn't, your beautifully grounded assistant starts grounding itself in the past.

If you only remember one sentence from this post, make it this one: a RAG system is only as fresh as its index, and your index does not update itself. Everything below is about closing the gap between what your documents say now and what your vector store thinks they say.

The quiet failure mode

Most production incidents are loud — a 500, a stack trace, a pager. Stale retrieval is the opposite. Nothing throws. The pipeline runs, the vector search returns chunks, the model writes a fluent, well-cited answer. It's just wrong, because the chunk it cited describes a world that no longer exists. The system is behaving exactly as designed; the design simply assumed the documents would hold still.

[Figure: Source doc → Chunk → Embed → Vector store (fresh) → Retrieval returns the current text ✓]

A document's journey through a RAG pipeline, and where it goes stale. When the source gets edited, the chunk sitting in the vector store still holds the old text, and retrieval happily returns it.

Why it's so easy to miss

Staleness has no error signal. Your dashboards stay green, latency is fine, costs look normal. The only symptom is a slow erosion of answer quality that you won't notice until a user — or worse, a customer — does.

Documents change in more ways than you think

“The docs changed” hides at least four distinct events, and each one needs a different response from your indexing layer:

Edits — a paragraph is rewritten, a number is updated. The document's identity is the same; its content isn't. You need to re-chunk and re-embed the affected parts.
New versions — v2 of a contract supersedes v1, but v1 may still be legally relevant. Now you have a versioning problem, not just a freshness one.
Deletions — a page is removed or a product is sunset. Its chunks must leave the index, or your assistant will keep citing a ghost.
Reorganizations — content is split, merged, or moved between files. Chunk boundaries shift, IDs you relied on disappear, and naïve diffing sees the whole corpus as “new.”

Deletions are the one teams forget. Adding fresh content feels like progress, so re-ingestion pipelines tend to upsert and call it a day. But a vector store that only ever grows is a vector store that never forgets — and in retrieval, a confidently-returned deleted chunk is indistinguishable from a current one.
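
A deletion sweep can be sketched as a plain set difference between what the source system lists today and what the index still holds. The helper name `findOrphanedDocIds` and both of its inputs are hypothetical; how you enumerate source documents and indexed ids depends entirely on your stack.

```typescript
// Hypothetical helper: which indexed documents no longer exist at the source?
function findOrphanedDocIds(sourceDocIds: string[], indexedDocIds: string[]): string[] {
  const live = new Set(sourceDocIds);
  const alreadyReported = new Set<string>();
  const orphaned: string[] = [];
  for (const id of indexedDocIds) {
    // Indexed, missing from the source, and not yet reported (ids repeat once per chunk).
    if (!live.has(id) && !alreadyReported.has(id)) {
      alreadyReported.add(id);
      orphaned.push(id);
    }
  }
  return orphaned;
}

// Example: "pricing-2023" was removed upstream but its chunks are still indexed.
const orphanedExample = findOrphanedDocIds(
  ["refund-policy", "shipping"],
  ["refund-policy", "shipping", "pricing-2023", "pricing-2023"],
);
```

Everything this returns is a candidate for removal from the index. An upsert-only pipeline never computes this list, which is exactly how ghosts survive.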

Step one: detect what actually changed

Re-embedding your entire corpus on every change is simple and, for a few thousand documents, perfectly fine. It stops being fine the moment you have millions of chunks and an embedding bill to match. The scalable move is to only touch what changed — which means you need a cheap, reliable way to know what changed.

The workhorse here is content hashing. For every chunk (or every document, then every chunk), compute a stable hash of its normalized text. Store that hash alongside the vector as metadata. On the next ingestion run, hash the incoming content and compare:

[Figure: content-hash diffing over three chunks. A “Refund policy intro…” (#a1f3), B “Refunds within 14 days…” (#b7e1), C “Legacy clause (removed)…” (#c4d8). Before any edit, every hash matches and every chunk is skipped.]

When B is edited and C disappears from the source, only B gets re-embedded; A is skipped (free); C is removed from the index. That's the whole cost-saving idea.

Content-hash diffing: only the chunks whose hash moved get re-embedded, unchanged chunks are skipped, and chunks that vanished from the source get tombstoned.

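// A minimal change-detection pass over one document's chunks.
// `store` and `embed` are placeholders for your vector-store client and embedding call;
// the shapes declared here are assumptions, not any particular library's API.
import { createHash } from "node:crypto";

declare const store: {
  list(q: { filter: { docId: string } }): Promise<{ id: string; contentHash: string }[]>;
  upsert(r: { id: string; vector: number[]; text: string; contentHash: string; docId: string }): Promise<void>;
  delete(id: string): Promise<void>;
};
declare function embed(text: string): Promise<number[]>;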

const hash = (text: string) =>
  createHash("sha256").update(text.trim().replace(/\s+/g, " ")).digest("hex");

async function reconcile(docId: string, freshChunks: string[]) {
  // What's currently indexed for this document?
  const existing = await store.list({ filter: { docId } }); // [{ id, contentHash }]
  const existingByHash = new Map(existing.map((c) => [c.contentHash, c]));

  const seen = new Set<string>();
  for (const text of freshChunks) {
    const h = hash(text);
    seen.add(h);
    if (existingByHash.has(h)) continue;      // unchanged → skip (no re-embed)
    const vector = await embed(text);          // changed or new → embed
    await store.upsert({ id: `${docId}:${h}`, vector, text, contentHash: h, docId });
  }

  // Anything indexed but no longer present in the source was deleted.
  for (const c of existing) {
    if (!seen.has(c.contentHash)) await store.delete(c.id); // tombstone
  }
}

Normalize before you hash

Whitespace, smart quotes, and trailing newlines will wreck your diff — every chunk will look “changed” after a harmless reformat. Normalize aggressively (collapse whitespace, standardize quotes) so the hash reflects meaning, not formatting noise.
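
Here's one way the normalization step might look. The function name `normalizeForHash` and the exact rule set are assumptions; which rewrites are safe depends on your corpus (code samples, for instance, may need their whitespace preserved). Note that this only feeds the hash. The text you store and show to the model stays untouched.

```typescript
// A sketch of pre-hash normalization. Only the hash input is normalized;
// the original chunk text is what gets stored and embedded.
function normalizeForHash(text: string): string {
  return text
    .normalize("NFC")                  // unify composed vs. decomposed Unicode forms
    .replace(/[\u2018\u2019]/g, "'")   // curly single quotes → straight
    .replace(/[\u201C\u201D]/g, '"')   // curly double quotes → straight
    .replace(/\s+/g, " ")              // collapse whitespace runs, including newlines
    .trim();
}
```

With this in place, a harmless reformat such as re-wrapped lines or a word processor's smart quotes produces the same hash as before, so nothing gets re-embedded.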

Keep a small ingestion manifest per source: the document's own version or last-modified timestamp, plus the set of chunk hashes you produced. On the next run you can skip untouched documents entirely before you even chunk them, and you have an audit trail of exactly what the index believed at any point in time.
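
As a rough sketch, the manifest can be as small as a map from document id to the version and chunk hashes you last ingested. The `ManifestEntry` shape and both function names below are hypothetical, and `sourceVersion` is whatever your source system gives you (a revision id, an ETag, a last-modified timestamp).

```typescript
// Hypothetical manifest entry: one per source document.
interface ManifestEntry {
  docId: string;
  sourceVersion: string;  // e.g. a revision id, ETag, or last-modified timestamp
  chunkHashes: string[];  // hashes produced by the last successful ingestion
  ingestedAt: string;     // ISO 8601, kept for the audit trail
}

// True when the source is unchanged since the last run, so chunking can be skipped.
function canSkipDocument(
  manifest: Map<string, ManifestEntry>,
  docId: string,
  currentVersion: string,
): boolean {
  const entry = manifest.get(docId);
  return entry !== undefined && entry.sourceVersion === currentVersion;
}

// Returns a new manifest with the document's latest ingestion recorded.
function recordIngestion(
  manifest: Map<string, ManifestEntry>,
  docId: string,
  sourceVersion: string,
  chunkHashes: string[],
  now: Date,
): Map<string, ManifestEntry> {
  const next = new Map(manifest);
  next.set(docId, {
    docId,
    sourceVersion,
    chunkHashes: chunkHashes.slice(),
    ingestedAt: now.toISOString(),
  });
  return next;
}
```

This version keeps only the latest entry per document. If you want the full "what did the index believe last Tuesday" history, append each entry to a log instead of overwriting it.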

Step two: choose a re-indexing strategy

Once you know what changed, you have to decide how to apply it. There's no single right answer — it's a trade between simplicity, cost, and how much you can tolerate a half-updated index serving live traffic.

[Figure: strategy card for incremental updates. Touches: only changed chunks. Cost / speed: $ · fast. Users see: briefly inconsistent mid-update. Cheapest. Re-embeds the diff, deletes the gone. Pair with periodic full rebuilds.]

The three strategies you'll actually choose between, compared by how each touches the index, what it costs, and what users see while it runs:

Full rebuild — re-chunk and re-embed everything from scratch into a clean index. Dead simple, immune to drift, and easy to reason about. It's also the most expensive and slowest, so it works best on small corpora or on a nightly cadence.
Incremental — use your hash diff to re-embed only changed and new chunks, and delete the gone ones, in place. Cheap and fast. The catch: while it runs, your index is momentarily inconsistent (some chunks updated, some not), which can produce briefly weird answers.
Versioned / blue-green — build the updated index beside the live one, validate it, then flip traffic over atomically. The gold standard for anything user-facing.

For most teams the pragmatic path is incremental updates for routine edits, with a periodic full rebuild as a safety net to wash out any drift, fragmentation, or chunking-logic changes that incremental updates can accumulate over time.

Versioned indexes: never serve a half-built index

The single highest-leverage practice for a serious RAG system is to treat your index like you treat application deploys: immutable, versioned, and swapped atomically. You don't edit production in place while users are hitting it — you build the new version, run it through checks, and cut over.

[Figure: index v1 (LIVE) beside index v2 (rebuilt + validated); queries → kb-current → v1. The app always queries the stable alias. Flipping it is atomic and instantly reversible.]

Blue/green indexing. Queries keep hitting v1 while v2 is built and validated in the background. When v2 passes its evals, an alias flips and every new query goes to v2 — with zero downtime and an instant rollback if something looks off.

Many managed vector stores support this directly through aliases or namespaces: your application queries a stable name (say, kb-current) that points at a concrete underlying index (kb-2026-05-22). Re-indexing builds a new concrete index, you validate it, then you re-point the alias. Rollback is just pointing it back. No user ever sees a partially-updated state.
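
To make the mechanics concrete, here is an in-memory stand-in for an alias table. It is not any vendor's API: the `IndexAliases` class and its method names are made up for illustration, and `kb-2026-05-15` is an invented name for the previous build. A real store exposes its own calls for this, but the contract is the same: resolve, flip, roll back.

```typescript
// In-memory illustration of alias-based cutover. Not a real vector-store client.
class IndexAliases {
  private current = new Map<string, string>();
  private previous = new Map<string, string>();

  // Point `alias` at `indexName`, remembering the old target for rollback.
  flip(alias: string, indexName: string): void {
    const old = this.current.get(alias);
    if (old !== undefined) this.previous.set(alias, old);
    this.current.set(alias, indexName);
  }

  // Re-point `alias` at whatever it targeted before the last flip.
  rollback(alias: string): void {
    const old = this.previous.get(alias);
    if (old === undefined) throw new Error(`no previous target for ${alias}`);
    this.current.set(alias, old);
    this.previous.delete(alias);
  }

  // The concrete index a query against `alias` should hit right now.
  resolve(alias: string): string {
    const target = this.current.get(alias);
    if (target === undefined) throw new Error(`unknown alias ${alias}`);
    return target;
  }
}

const aliases = new IndexAliases();
aliases.flip("kb-current", "kb-2026-05-15"); // last week's build (hypothetical name)
aliases.flip("kb-current", "kb-2026-05-22"); // cut over to the validated rebuild
```

The application only ever calls `resolve("kb-current")`, so neither a cutover nor a rollback requires a code change or a deploy.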

Carry metadata like your life depends on it

Every chunk should travel with its source id, document version, last-updated timestamp, and content hash. This is what makes incremental diffing, deletions, version filtering (“only answer from the current contract”), and debugging a bad answer possible. Thin metadata is the root cause of most “why did it retrieve that?” mysteries.
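
A sketch of what that metadata might look like, plus the version filter it enables. The field names and the numeric `docVersion` are assumptions; use whatever versioning scheme your sources already have. In a real system you would push this filter down into the vector store's query rather than filtering results in application code.

```typescript
// Hypothetical per-chunk metadata, mirroring the four fields named above.
interface ChunkMetadata {
  sourceId: string;
  docVersion: number;  // assumed to increase with each new version of the source
  updatedAt: string;   // ISO 8601
  contentHash: string;
}

interface IndexedChunk {
  id: string;
  text: string;
  metadata: ChunkMetadata;
}

// Keep only chunks from the newest version of each source
// ("only answer from the current contract").
function onlyLatestVersions(chunks: IndexedChunk[]): IndexedChunk[] {
  const latest = new Map<string, number>();
  for (const c of chunks) {
    const seen = latest.get(c.metadata.sourceId);
    if (seen === undefined || c.metadata.docVersion > seen) {
      latest.set(c.metadata.sourceId, c.metadata.docVersion);
    }
  }
  return chunks.filter((c) => latest.get(c.metadata.sourceId) === c.metadata.docVersion);
}
```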

The chunk that lost its context

Even with a perfectly fresh index, there's a subtler failure that gets worse as documents grow and change: a chunk, ripped out of its document and embedded on its own, often loses the context that made it meaningful. A sentence like “The figure rose 18% in this period” is useless in isolation — which figure, which period, which company?

Contextual embeddings (popularized by Anthropic's contextual retrieval work) fix this cheaply: before embedding a chunk, prepend a short, document-aware blurb that situates it. You generate that blurb once per chunk with a fast, cheap model — and because the surrounding document rarely changes when a single chunk does, you can cache it and only regenerate context for chunks whose neighborhood actually moved.

[Figure: what gets embedded, with context prepended. From the FY24 annual report, Acme Corp revenue section: “The figure rose 18% in this period.” Query: “How much did Acme revenue grow in FY24?” → retrieved ✓]

Same chunk, two embeddings. Embedded raw, the chunk lands in an ambiguous region and loses to better-worded competitors. With a one-line generated context prepended before embedding, the same query now lands it cleanly.

// Prepend a short, generated context before embedding each chunk.
const context = await llm.complete({
  model: "fast-cheap-model",
  prompt: `Document: ${docTitle}
Here is a chunk from it:
"""${chunk}"""
In one sentence, situate this chunk within the document so it stands alone.`,
});

const enriched = `${context}\n\n${chunk}`;
const vector = await embed(enriched); // embed the context + chunk together
await store.upsert({ id, vector, text: chunk, context, contentHash: hash(chunk) });
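
The caching idea mentioned earlier can be sketched with a cache key built from a chunk's own hash plus its neighbors' hashes. Treating "neighborhood" as one chunk on either side is our assumption for illustration; widen the window if your blurbs depend on more of the document. Both function names are hypothetical.

```typescript
// Cache key for a chunk's generated context: its own hash plus its immediate neighbors'.
function contextCacheKey(chunkHashes: string[], index: number): string {
  const prev = index > 0 ? chunkHashes[index - 1] : "";
  const next = index < chunkHashes.length - 1 ? chunkHashes[index + 1] : "";
  return [prev, chunkHashes[index], next].join("|");
}

// Indices whose context blurb must be regenerated because their key is not cached.
function staleContextIndices(chunkHashes: string[], cache: Map<string, string>): number[] {
  const stale: number[] = [];
  chunkHashes.forEach((_hash, i) => {
    if (!cache.has(contextCacheKey(chunkHashes, i))) stale.push(i);
  });
  return stale;
}
```

Editing one chunk in the middle of a document invalidates three keys (the chunk and its two neighbors) and leaves every other cached blurb reusable.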

Pair it with hybrid search

Contextual embeddings raise recall, but exact terms (error codes, SKUs, names) still belong to keyword search. Blending dense vectors with a classic keyword index (BM25) and a reranker on top is the most reliable retrieval stack we know of — and it's resilient to the wording drift that comes with edited docs.
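
The post doesn't prescribe how to blend the two result lists. One common, score-free option is reciprocal rank fusion (RRF), sketched below as an illustration rather than as the method used here. Each input is a list of chunk ids ordered best-first, and `k = 60` is the conventional default constant; a reranker would then run over the fused top results.

```typescript
// Reciprocal rank fusion: score(id) = sum over lists of 1 / (k + rank), with 1-based ranks.
function reciprocalRankFusion(rankings: string[][], k: number = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, i) => {
      const soFar = scores.get(id);
      scores.set(id, (soFar === undefined ? 0 : soFar) + 1 / (k + i + 1));
    });
  }
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1]) // highest fused score first
    .map((entry) => entry[0]);
}

// Example: dense retrieval and BM25 disagree on order; fusion rewards agreement.
const fusedExample = reciprocalRankFusion([
  ["c1", "c2", "c3"], // dense vector results
  ["c3", "c1", "c4"], // keyword (BM25) results
]);
```

RRF only looks at ranks, so you never have to reconcile cosine similarities with BM25 scores, which live on unrelated scales.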

You can't fix what you can't see

Because staleness is silent, you have to go looking for it. Treat retrieval like any other production system and instrument it:

Log every retrieval — the query, the chunks returned, their scores, their source ids and versions. When an answer is wrong, you want to replay exactly what the model saw.
Track which chunks get cited — chunks that are retrieved but never useful are noise; chunks that are cited constantly are load-bearing and deserve extra care when their source changes.
Watch the freshness gap — alert when a source's last-modified time is newer than the index's last-ingested time for that source. That single metric catches most staleness before a user does.
Sample and review — periodically pull real queries and eyeball the retrieved context. Drift hides in the long tail.
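
The freshness-gap check from the list above is small enough to sketch in full. The types and the `findStaleSources` name are hypothetical, and `graceMs` is an assumed knob for tolerating the normal delay between an edit and the next ingestion run.

```typescript
interface SourceFreshness {
  sourceId: string;
  sourceModifiedAt: Date;       // reported by the source system
  lastIngestedAt: Date | null;  // null means the source was never ingested
}

interface FreshnessAlert {
  sourceId: string;
  lagMs: number; // how long the index has been behind the source
}

// Sources whose content is newer than the index's view of them, past the grace period.
function findStaleSources(
  sources: SourceFreshness[],
  now: Date,
  graceMs: number = 0,
): FreshnessAlert[] {
  const alerts: FreshnessAlert[] = [];
  for (const s of sources) {
    const behind =
      s.lastIngestedAt === null ||
      s.sourceModifiedAt.getTime() > s.lastIngestedAt.getTime();
    const lagMs = now.getTime() - s.sourceModifiedAt.getTime();
    if (behind && lagMs > graceMs) alerts.push({ sourceId: s.sourceId, lagMs });
  }
  return alerts;
}
```

Run it on a schedule and page on a non-empty result, or export the largest `lagMs` as a gauge and alert on a threshold.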

Re-indexing without an eval is a coin flip

Here's the trap: re-indexing feels safe, so teams ship it blind. But a chunking tweak, a new embedding model, or a botched deletion can quietly tank retrieval quality — and you've now baked that regression into your fresh, confident-looking index.

The fix is to gate every re-index behind an evaluation, exactly like you'd gate a code deploy behind tests. Maintain a golden set — a few dozen representative questions with known-good answers and the chunks that should be retrieved. Run it against the candidate index before you flip the alias. If retrieval recall or answer faithfulness drops, the new index doesn't ship.
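
The retrieval half of that gate might look like the sketch below: mean recall@k over the golden set, compared against a pass bar. The `GoldenRun` shape and function names are hypothetical, and answer faithfulness is deliberately left out because it usually needs a model-graded or human check rather than set arithmetic.

```typescript
interface GoldenRun {
  question: string;
  expectedChunkIds: string[];   // chunks that should be retrieved
  retrievedChunkIds: string[];  // what the candidate index returned, best first
}

// Fraction of expected chunks found in the top-k results for one question.
function recallAtK(run: GoldenRun, k: number): number {
  if (run.expectedChunkIds.length === 0) return 1;
  const topK = new Set(run.retrievedChunkIds.slice(0, k));
  const hits = run.expectedChunkIds.filter((id) => topK.has(id)).length;
  return hits / run.expectedChunkIds.length;
}

// Mean recall@k across the whole golden set.
function meanRecallAtK(runs: GoldenRun[], k: number): number {
  if (runs.length === 0) throw new Error("golden set is empty");
  const total = runs.reduce((sum, run) => sum + recallAtK(run, k), 0);
  return total / runs.length;
}

// The gate: the candidate index ships only if it clears the pass bar.
function passesGate(runs: GoldenRun[], k: number, passBar: number): boolean {
  return meanRecallAtK(runs, k) >= passBar;
}
```

Wire `passesGate` in front of the alias flip so a failing candidate never becomes the live index. Comparing against the live index's score, not just a fixed bar, also catches slow regressions.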

[Figure: Build candidate → Golden-set eval → Flip alias → live. Example run: eval score 0.91 against a pass bar of 0.80, so the candidate ships ✓]

The re-index gate. A candidate index only goes live if it clears the golden-set eval: a good rebuild passes; a regression gets caught and rolled back.

Make the loop boring

Detect change → re-embed only what moved → build a new versioned index → run the golden-set eval → flip the alias if it passes → keep the freshness metric green. None of these steps is exotic. The teams whose RAG stays trustworthy are simply the ones who made this loop automatic and unglamorous.

A practical playbook

1. Attach rich metadata to every chunk from day one: source id, version, updated_at, and content_hash. You can't add this retroactively without a full rebuild.
2. Normalize text, then hash each chunk. Diff against the index to find edits, additions, and, critically, deletions.
3. Re-embed only what changed; tombstone what's gone. Keep a periodic full rebuild as a drift-washing safety net.
4. Build re-indexes into a new versioned index; never mutate the live one in place.
5. Gate the cutover on a golden-set eval. Flip the alias only if quality holds.
6. Use contextual embeddings + hybrid search + a reranker so retrieval survives wording drift.
7. Instrument retrieval and alert on the freshness gap. Silence is not success.

Where this lands in AgentSwarms

We built the Knowledge Base in AgentSwarms with exactly these failure modes in mind. Documents are chunked and embedded for you, chunk inserts are idempotent (so a re-ingestion run won't duplicate or corrupt your index), and the UI surfaces an embedding_failed status so a silent half-indexed document doesn't slip by. You can feel the whole retrieval loop end-to-end — including what a broken one looks like — in the Failure-Mode Labs.

A note on scope

AgentSwarms is a learning and prototyping platform, not a production RAG runtime. The point of this post isn't to sell you our index — it's to give you the mental model and the playbook so that whatever you run in production stays honest as your documents change.

Your documents will keep changing. That's not a bug in your knowledge base — it's the whole reason it exists. Build the loop that keeps up with them, and your assistant stops being a snapshot of last month and starts being a reliable window into what's true right now.
