RAG · Retrieval · LlamaIndex

Retrieval, Chunking, and Reranking: The Parts of RAG That Actually Decide Quality

Everyone obsesses over the model. But the answer was decided three steps earlier — in how you split the text, how you searched it, and whether you bothered to re-rank. Here's the part of the pipeline that quietly makes or breaks RAG, plus an honest take on when LlamaIndex earns its place and when it's just ceremony.

AgentSwarms Authors
June 13, 2026 · 19 min read

The bug report said the assistant was “hallucinating.” It wasn't. I pulled the trace, and the model had answered faithfully from the three chunks it was handed — they just happened to be the wrong three. The chunk that actually answered the question was sitting at retrieval rank 7, four slots outside the three-chunk window we passed to the model. We hadn't given it a chance. We'd given it a worse problem and blamed it for failing.

I've watched a dozen teams reach for a bigger model when the real fix was upstream. RAG quality is mostly decided before the LLM ever runs — in three unglamorous steps: how you cut the text, how you find the candidates, and whether you re-rank them before stuffing the prompt. Get those right and a mid-tier model looks brilliant. Get them wrong and GPT-5 will confidently answer from garbage.

This post is about those three steps, in the order they bite you: chunking, retrieval, and reranking. Then the question everyone eventually asks — do I need LlamaIndex for this, or am I adding a framework to justify the diagram?

Retrieval is two jobs, not one

The single most useful mental model I can give you: retrieval has to do two things that pull in opposite directions. First, recall — cast a wide enough net that the right passage is somewhere in your candidates. Second, precision — make sure the few chunks you actually hand the model are the best ones, not just the on-topic ones.

A single vector search is decent at the first job and bad at the second. It will happily return 50 chunks that are all about the topic, but it's surprisingly bad at telling you which one answers the question. That's why serious RAG is almost always two-stage: a fast retriever casts the net, then a slower, sharper reranker decides the final order. Here is what reaches the model when the second stage is skipped:

[Figure: two-stage retrieval, shown with the reranker skipped. Stage 1 (dense retrieval) returns the top 50: a wide net with high recall. Stage 2 (reranker) is skipped, so chunks #1, #2 and #3 go straight into the LLM context. The answer chunk is sitting at rank #7, outside the window, so the LLM never sees it and improvises.]

Two-stage retrieval. Stage one trades precision for recall — grab 50 candidates, cheap and fast. Stage two re-scores them properly and keeps the best 3. Turn it off and the answer chunk (rank #7) never reaches the model.
The one-line version

Retrieve wide, rerank narrow, then generate. If you only add one thing to a naïve RAG pipeline this year, add the reranker — it's the highest-leverage, lowest-effort upgrade in the stack.

Chunking: the decision you make once and regret for months

Everything downstream inherits your chunking. The embedding model can only embed what you give it; the retriever can only return chunks that exist; the reranker can only reorder what was retrieved. If you split a procedure across two chunks so neither one is complete, no amount of reranking will reassemble it. Chunking is the foundation, and like most foundations, nobody notices it until it cracks.

The core tension is simple. Small chunks match precisely — a tight passage about exactly one thing embeds into a sharp, specific vector. But small chunks are fragments: the model gets a sentence with no surrounding context. Large chunks carry their context with them, but the embedding has to average a paragraph of mixed ideas into one vector, so it matches everything weakly and nothing strongly — and it drags noise into your prompt. The two pull against each other as chunk size changes:

[Figure: chunk-size gauges, shown at a chunk size of 256 tokens. Match precision: 84%. Context completeness: 66%. Sweet spot: a coherent passage that stands on its own, precise enough to match and whole enough to answer. Most prose lives here.]

The chunk-size tradeoff. Precision falls as chunks grow; context completeness rises. For most prose the sweet spot lives around 256–512 tokens — big enough to hold one whole idea, small enough to stay specific.

Sizes are only half of it — how you split matters as much as how big. The strategies I actually reach for, roughly in order of how often:

  • Recursive / structure-aware — split on the document's own boundaries (headings, paragraphs, then sentences) before falling back to a character count. This is the sane default. It respects the shape of the text instead of guillotining mid-sentence.
  • Sentence-window (small-to-big) — embed single sentences for precise matching, but at retrieval time return the sentence plus its neighbours. You get the precision of small chunks and the context of large ones. This one quietly fixes a lot of “the answer was half-there” complaints.
  • Parent-document — index small child chunks, but feed the model the larger parent they came from. Same idea as sentence-window, coarser granularity.
  • Semantic — split where the topic actually shifts (using embedding distance between adjacent sentences) rather than at a fixed length. Lovely in theory, more expensive, and worth it mainly for long, rambling documents where fixed sizes cut across ideas.
  • Fixed-size with overlap — the blunt instrument. A fixed token count with ~10–20% overlap so a fact straddling a boundary survives in at least one chunk. Fine for uniform text; crude for structured docs.
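The first strategy in the list can be sketched in plain Python. This is a minimal illustration rather than any particular library's splitter: the separator list, the `recursive_split` name, and the use of whitespace word counts as a stand-in for tokens are all assumptions made for the example, and it leaves overlap out to keep the recursion easy to follow.

```python
import re

# Coarse-to-fine boundaries: blank-line paragraphs, sentence ends, then single words.
SEPARATORS = [r"\n\s*\n", r"(?<=[.!?])\s+", r"\s+"]


def n_words(text: str) -> int:
    """Whitespace word count, used here as a rough stand-in for a token count."""
    return len(text.split())


def recursive_split(text: str, max_words: int = 400, level: int = 0) -> list[str]:
    """Split on the coarsest boundary that works, recursing only into oversized pieces."""
    text = text.strip()
    if not text:
        return []
    if n_words(text) <= max_words or level >= len(SEPARATORS):
        return [text]

    chunks: list[str] = []
    current = ""
    for piece in re.split(SEPARATORS[level], text):
        piece = piece.strip()
        if not piece:
            continue
        candidate = f"{current} {piece}" if current else piece
        if n_words(candidate) <= max_words:
            current = candidate  # keep packing neighbours while they fit
            continue
        if current:
            chunks.append(current)
            current = ""
        if n_words(piece) <= max_words:
            current = piece
        else:
            # Too big even on its own: retry with the next, finer separator.
            chunks.extend(recursive_split(piece, max_words, level + 1))
    if current:
        chunks.append(current)
    return chunks


doc = (
    "Cancelling a plan.\n\n"
    "Open Settings and choose Billing. Click Cancel plan and confirm. "
    "Cancel before the renewal date to avoid the next charge.\n\n"
    "Contact support."
)
for chunk in recursive_split(doc, max_words=12):
    print(repr(chunk))
```

The oversized middle paragraph is the only piece that gets split further, and it breaks at a sentence boundary instead of mid-sentence. A real pipeline would count tokens with the embedding model's tokenizer and add heading-level separators ahead of the paragraph one.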
Overlap is not optional

If you split with zero overlap, every chunk boundary is a place where a fact can be cut in half and lost from both sides. A little overlap is cheap insurance. Zero overlap is a guaranteed class of silent retrieval misses.
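A small sketch makes the boundary problem concrete. The `chunk_with_overlap` helper and the toy token list are hypothetical, written for this illustration; the defaults of 400 tokens and 60 tokens of overlap (15%) simply mirror the ranges suggested in this post.

```python
def chunk_with_overlap(tokens: list[str], size: int = 400, overlap: int = 60) -> list[list[str]]:
    """Fixed-size windows where each chunk repeats the last `overlap` tokens of the previous one."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


def pair_survives(chunks: list[list[str]], first: str, second: str) -> bool:
    """True if some chunk contains both tokens, i.e. the fact was not cut in half."""
    return any(first in chunk and second in chunk for chunk in chunks)


tokens = [f"t{i}" for i in range(10)]
no_overlap = chunk_with_overlap(tokens, size=4, overlap=0)
with_overlap = chunk_with_overlap(tokens, size=4, overlap=1)

# The "fact" t3 t4 straddles the first boundary when windows do not overlap.
print(pair_survives(no_overlap, "t3", "t4"))
print(pair_survives(with_overlap, "t3", "t4"))
```

With zero overlap the pair lands in two different chunks; with one token of overlap the second window starts at `t3` and carries both.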

There's no universal best chunk size, and anyone who quotes you one hasn't seen your documents. Dense API references want different treatment than chatty support articles. The honest workflow is: pick recursive splitting at ~400 tokens with overlap as a starting point, build a small eval set of real questions, and measure retrieval before you touch the model.
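Measuring retrieval can be as simple as a hit-rate check: for each question, did the chunk that answers it land in the top k? The sketch below assumes an eval set of (question, gold chunk id) pairs and a `retrieve` callable that returns ranked chunk ids. Both shapes, and the canned rankings, are invented for illustration.

```python
from typing import Callable, Sequence


def hit_rate_at_k(
    eval_set: Sequence[tuple[str, str]],
    retrieve: Callable[[str], Sequence[str]],
    k: int = 3,
) -> float:
    """Fraction of questions whose gold chunk id appears in the top-k retrieved ids."""
    if not eval_set:
        raise ValueError("eval_set is empty")
    hits = 0
    for question, gold_id in eval_set:
        if gold_id in list(retrieve(question))[:k]:
            hits += 1
    return hits / len(eval_set)


# Hypothetical canned rankings standing in for a real retriever.
FAKE_RANKINGS = {
    "How do I cancel?": ["c1", "c2", "c3", "c4", "c5", "c6", "c7"],
    "What is the refund window?": ["c9", "c8", "c2"],
    "Which plan has SSO?": ["c5", "c4", "c3"],
}
eval_set = [
    ("How do I cancel?", "c7"),            # gold chunk sits at rank 7
    ("What is the refund window?", "c9"),  # gold chunk sits at rank 1
    ("Which plan has SSO?", "c4"),         # gold chunk sits at rank 2
]

print(hit_rate_at_k(eval_set, FAKE_RANKINGS.__getitem__, k=3))
print(hit_rate_at_k(eval_set, FAKE_RANKINGS.__getitem__, k=10))
```

Running the same check at the window you pass to the model and at the first-stage fetch size tells you whether a miss is a recall problem or a ranking problem.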

Why dense retrieval alone disappoints

Dense (vector) retrieval works by embedding your query and your chunks into the same space and grabbing the nearest neighbours. It's fast, it scales to millions of chunks, and it captures meaning that keyword search misses. It's also, on its own, a blunt ranker. Here's the failure I see constantly: the top candidates all score within a hair of each other — 0.81, 0.80, 0.79 — because the embedding can tell they're all on topic but can't tell which one actually answers the question.

Here is what that looks like. A bi-encoder has returned seven on-topic chunks with nearly identical scores, and the one that contains the real answer is buried at rank 6:

Rank  Chunk                              Dense score
1     Pricing tiers overview             0.81
2     Refund policy summary              0.80
3     Account settings FAQ               0.79
4     Billing cycle basics               0.78
5     Plan comparison table              0.78
6     Exact cancellation steps + cutoff  0.76
7     Contact support hours              0.75

Dense scores are bunched (0.75–0.81) — the retriever knows these are all relevant but can't separate them. The cross-encoder reads each chunk against the query and the real answer leaps from #6 to #1.

This is not a sign your embeddings are bad. It's structural. A bi-encoder embeds the query and the document separately and never lets them interact — it's comparing two summaries from across the room. That's exactly what makes it fast enough to search millions of chunks, and exactly what makes it imprecise at the top.

What a reranker actually is

A reranker is almost always a cross-encoder, and the difference from your retriever is the whole story. A bi-encoder runs the query and a document through the model separately and compares the two output vectors. A cross-encoder concatenates them — [query + document] — and runs them through the model together, so every word of the query can attend to every word of the document. Then it emits a single relevance score. The bi-encoder side looks like this:

[Figure: bi-encoder architecture. query → encoder → vec q; document → encoder → vec d (precomputed). Compute: embed once, reuse. At query time: ANN over millions. Trade: fast but coarse. Two separate towers: document vectors are baked ahead of time, so search is just nearest-neighbour math, milliseconds over millions of chunks.]

Bi-encoder vs cross-encoder. The bi-encoder's two towers let you precompute document vectors and search them with nearest-neighbour math. The cross-encoder sees query and document together — far more accurate, but you can't precompute it.

That “can't precompute it” line is the entire reason for two stages. A cross-encoder has to do a fresh forward pass for every query–document pair, so running it over your whole corpus at query time is hopeless — that's millions of forward passes per question. So you don't. You let the cheap bi-encoder narrow a million chunks down to ~50, then spend the cross-encoder's expensive attention only on those 50. You get cross-encoder precision at bi-encoder scale. That's the trick, and it's most of why reranking feels like magic the first time you add it.
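To put rough numbers on that, here is a back-of-the-envelope sketch. The 5 ms per pair is a made-up placeholder, not a benchmark, and it assumes pairs are scored one after another with no batching; the corpus size and candidate count are the ones used above.

```python
from typing import Optional


def cross_encoder_passes(corpus_size: int, candidates: Optional[int] = None) -> int:
    """Forward passes per query: the whole corpus without a first stage, else just the candidates."""
    if candidates is None:
        return corpus_size
    return min(candidates, corpus_size)


CORPUS_SIZE = 1_000_000
MS_PER_PAIR = 5  # hypothetical cost of one query-document forward pass

brute_force_s = cross_encoder_passes(CORPUS_SIZE) * MS_PER_PAIR / 1000
two_stage_s = cross_encoder_passes(CORPUS_SIZE, candidates=50) * MS_PER_PAIR / 1000

print(f"cross-encoder over everything: {brute_force_s:,.0f} s per query")
print(f"cross-encoder over 50 candidates: {two_stage_s:.2f} s per query")
```

Under those assumptions the full-corpus pass costs 5,000 seconds per question and the two-stage version costs a quarter of a second, plus whatever the nearest-neighbour lookup takes. The absolute numbers will differ on your hardware; the 20,000x gap in forward passes will not.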

Rule of thumb for the window sizes

Retrieve 25–100 candidates from the vector store, rerank, and keep the top 3–5 for the prompt. Retrieving too few starves the reranker (it can't promote a chunk that was never fetched). Keeping too many after reranking just re-adds the noise you paid to remove.
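The starvation failure is easy to reproduce with a toy. In the sketch below, the first-stage order is the seven-chunk example from earlier in this post, while the rerank scores are invented stand-ins for a real cross-encoder; `two_stage` and its parameter names are hypothetical too.

```python
from typing import Callable, Sequence


def two_stage(
    query: str,
    first_stage_ids: Sequence[str],
    rerank_score: Callable[[str, str], float],
    fetch_k: int = 50,
    keep_k: int = 4,
) -> list[str]:
    """Take the top fetch_k first-stage ids, re-score them, and keep the best keep_k."""
    if keep_k > fetch_k:
        raise ValueError("keep_k cannot exceed fetch_k")
    candidates = list(first_stage_ids)[:fetch_k]
    reranked = sorted(candidates, key=lambda cid: rerank_score(query, cid), reverse=True)
    return reranked[:keep_k]


# Dense order from the example above: the answer chunk sits at rank 6.
DENSE_ORDER = [
    "pricing_tiers", "refund_policy", "account_faq", "billing_cycle",
    "plan_comparison", "cancellation_steps", "support_hours",
]
# Invented cross-encoder scores; anything not listed gets a low default.
FAKE_SCORES = {"cancellation_steps": 0.97, "refund_policy": 0.41, "billing_cycle": 0.22}


def fake_rerank(query: str, chunk_id: str) -> float:
    return FAKE_SCORES.get(chunk_id, 0.05)


query = "How do I cancel before I get charged again?"
starved = two_stage(query, DENSE_ORDER, fake_rerank, fetch_k=5, keep_k=3)
fed = two_stage(query, DENSE_ORDER, fake_rerank, fetch_k=7, keep_k=3)
print(starved)
print(fed)
```

With `fetch_k=5` the answer chunk is never fetched, so even a perfect reranker cannot surface it. Widening the fetch to 7 puts it first.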

Popular reranker models, and when to reach for each

The good news is you rarely train these. There's a healthy market of hosted APIs and open weights, and the choice mostly comes down to four questions: how much latency you can spend, whether the data can leave your network, whether you need multiple languages, and what you're willing to pay. Here is how the usual options line up:

Model               Access      Hosting    Latency  Notes
Cohere Rerank 3.5   API         hosted     low      Strong general-purpose default. Multilingual, no infra.
Voyage rerank-2     API         hosted     low      Top-tier quality on retrieval benchmarks; pairs with Voyage embeddings.
Jina Reranker v2    API + open  either     low
BGE-reranker-v2-m3  open        self-host  med
mxbai-rerank-large  open        self-host  med      Competitive open weights from mixedbread.
ms-marco-MiniLM-L6  open        self-host  tiny
ColBERTv2           open        self-host  low      Late interaction: a middle ground between bi- and cross-encoder.

Reranker landscape by constraint. Hosted APIs (Cohere, Voyage) are the fastest path to good results. Open weights (BGE, mxbai, the MiniLM cross-encoders) win when cost or privacy says self-host. Most teams start with a hosted API and move to BGE or MiniLM when one of those constraints kicks in.

The version I give people who don't want to read a table:

  • Just want it to work, data can leave the network? Start with Cohere Rerank or Voyage rerank. One API call, multilingual, genuinely strong. You'll have a better pipeline this afternoon.
  • Need to self-host (privacy, cost, air-gapped)? BGE-reranker-v2-m3 is the strong free default — multilingual, runs on a modest GPU, no per-call bill. mxbai-rerank is a solid alternative.
  • Latency-critical or CPU-only? The classic ms-marco-MiniLM cross-encoders from sentence-transformers are tiny, fast, and have been the reliable baseline for years. Not the highest ceiling, but hard to beat on cost per millisecond.
  • Want a middle ground between retriever and reranker? ColBERT (late interaction) scores at the token level and lands between a bi-encoder's speed and a cross-encoder's precision — more infrastructure, but a real option at scale.

Whatever you pick, the integration shape is the same — score the candidates, sort, truncate:

from sentence_transformers import CrossEncoder

# Stage 1: dense retrieval casts the wide net (recall).
# `vector_store` and `query` stand in for your own vector DB client and user question.
candidates = vector_store.search(query, top_k=50)

# Stage 2: cross-encoder reranks for precision
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

pairs = [(query, c.text) for c in candidates]
scores = reranker.predict(pairs)              # one forward pass per pair
ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)

# Keep only the best few for the prompt
context = [c for c, _ in ranked[:4]]

About a dozen lines, no framework, and it's the biggest quality jump most RAG systems will ever get. Which is a good moment to talk about frameworks.

Where LlamaIndex fits — and what it actually adds

If you've only ever heard “LlamaIndex is a RAG framework,” here's the more useful framing: it's a set of opinionated abstractions over the exact seven layers you'd otherwise hand-write. Loaders, splitters, embeddings, the index, the retriever, the postprocessors (where rerankers live), and the response synthesizer. Nothing it does is impossible without it — the question is whether you want to own that code. Here is the raw stack, layer by layer:

Layer  Raw stack                         Ownership
1      for-loop over files + PDF parser  you write it
2      your own text splitter            you write it
3      embed API calls + batching        you write it
4      vector DB SDK + upserts           you write it
5      hand-written top-k query          you write it
6      glue code for the reranker        you write it
7      f-string prompt assembly          you write it

Every layer is yours to build and maintain. Total control, zero indirection — and a lot more code to keep correct as requirements grow.

Same seven layers, two ownership models. Raw: every box is code you write and maintain. LlamaIndex: each box is a swappable object — change a splitter or drop in a reranker by editing one line.

So what's the real difference between “normal retrieval” and “retrieval with LlamaIndex”? With the raw approach you call an embeddings API, talk to your vector DB's SDK (pgvector, Qdrant, Pinecone), and write the query, the reranker glue, and the prompt assembly yourself. With LlamaIndex, those become configured components — and, more importantly, you get the advanced retrieval patterns for free instead of building each one:

  • Heterogeneous loaders — 100+ connectors via LlamaHub, so Notion + Postgres + a folder of PDFs all become the same node type without three bespoke parsers.
  • Advanced retrieval out of the box — sentence-window, auto-merging, recursive/small-to-big, metadata filtering, and query transforms like HyDE or sub-question decomposition. These are exactly the patterns that are annoying to build by hand.
  • Rerankers as one-line postprocessors — CohereRerank, SentenceTransformerRerank, or an LLM reranker drop into the query engine as a node postprocessor. Swapping rerankers is a config change.
  • Index types beyond plain vector search — summary, tree, keyword, and property-graph indices when a flat top-k isn't the right structure for your data.
The honest value

LlamaIndex's real payoff isn't the first vector search — that's a weekend either way. It's the third, fourth, and fifth retrieval pattern you'd otherwise build and maintain yourself, plus a consistent way to swap components as requirements shift.

When LlamaIndex is overkill

I like LlamaIndex. I also routinely talk people out of it. A framework is a trade: you exchange a pile of glue code for a pile of abstractions, and abstractions have a cost — indirection when you debug, a version-churn tax, and the moments where the framework's opinion fights yours and you spend an afternoon learning its way to do a thing you could have written in ten lines. That trade is worth it when complexity is genuinely high and a waste when it isn't. Here are the two ends of that spectrum:

[Figure: a scale running from "simple · single source" (raw SDK wins) to "complex · many sources" (LlamaIndex earns its keep). Shown at the simple end: a single source and a plain top-k query. A vector-DB SDK and 40 lines do this; LlamaIndex is overkill.]

When the framework earns its keep. Single source plus a plain top-k query: raw SDK wins, LlamaIndex is ceremony. Many sources plus multi-step retrieval: the abstractions save real work.

Concretely, skip the framework when most of these are true:

  • You have one source and one retrieval pattern — a single pgvector table, plain top-k. A vector-DB SDK and ~40 lines (including the reranker above) cover it with less surface area than the framework's config.
  • Your corpus is small and fairly static — a few hundred to a few thousand chunks. Honestly, plain dense retrieval plus a reranker is often all you need; the fancy index types solve problems you don't have.
  • You need tight control over latency or the exact prompt — frameworks add layers between you and the wire, and that's the last place you want surprises in a hot path.
  • Your team will maintain this for years and values reading plain code over learning a framework's release notes.

And reach for it when the opposite is true: many heterogeneous sources, retrieval that needs routing or multiple hops, a fast-moving prototype where you're trying five patterns this week, or a team that would rather configure than build. The mistake in both directions is the same — choosing the tool before you've described the job.

A default recipe that holds up

If you want somewhere to start that won't embarrass you in production, this is the stack I reach for before I know anything special about the data:

  1. Chunk with recursive, structure-aware splitting at ~400 tokens and ~15% overlap. Reach for sentence-window if early evals show truncated answers.
  2. Embed with a current general-purpose model and store in whatever vector DB you already run. Don't agonize here; the reranker covers a lot of embedding sins.
  3. Retrieve the top 50 candidates. Add keyword/BM25 as a hybrid second retriever if your domain is full of exact terms (codes, SKUs, names) — dense search is weak on those.
  4. Rerank with a cross-encoder (a hosted API to start, BGE if you self-host) and keep the top 3–5.
  5. Generate, and — this is the part everyone skips — build a 30-question eval set and measure retrieval hit-rate before you blame the model for anything.
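Step 3 leaves open how the dense and keyword result lists get combined. One common option, and not necessarily what this post's pipeline uses, is reciprocal rank fusion: each retriever votes for a chunk with 1 / (k + rank), and the votes are summed. The sketch below assumes each retriever hands back a ranked list of chunk ids; `k=60` is a widely used default, and the ids are made up.

```python
from collections import defaultdict
from typing import Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> list[str]:
    """Merge several ranked id lists; ids ranked well by more than one retriever rise."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical results for a query that contains an exact SKU.
dense_ids = ["overview", "pricing", "faq", "returns", "sku_4471_spec"]
bm25_ids = ["sku_4471_spec", "overview"]

fused = reciprocal_rank_fusion([dense_ids, bm25_ids])
print(fused)
```

The chunk with the exact SKU match sits fifth in the dense list but moves up to second after fusion, which puts it comfortably inside the candidate set the reranker sees in step 4.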
Build the pipeline, then break it

You can wire retrieval, chunking, and a reranker into an agent on the AgentSwarms canvas and watch each stage in the trace — then run the Failure-Mode Labs to see what a broken retrieval step actually looks like before it happens to you in production.

The throughline of all of this: RAG quality is decided in the boring middle of the pipeline, not at the model. Cut your text so each chunk holds one whole idea. Retrieve wide so the answer is somewhere in the candidates. Rerank narrow so the best chunk reaches the model. Add a framework only when the job is complex enough to need one. Do those four things and you'll spend a lot less time accusing your LLM of hallucinating when it was only ever answering the question you actually gave it.

