Retrieval, Chunking, and Reranking: The Parts of RAG That Actually Decide Quality
Everyone obsesses over the model. But the answer was decided three steps earlier — in how you split the text, how you searched it, and whether you bothered to re-rank. Here's the part of the pipeline that quietly makes or breaks RAG, plus an honest take on when LlamaIndex earns its place and when it's just ceremony.
The bug report said the assistant was “hallucinating.” It wasn't. I pulled the trace, and the model had answered faithfully from the three chunks it was handed; they just happened to be the wrong three. The chunk that actually answered the question was sitting at retrieval rank 7, outside the window we passed to the model. We hadn't given it a chance. We'd given it a worse problem and blamed it for failing.
I've watched a dozen teams reach for a bigger model when the real fix was upstream. RAG quality is mostly decided before the LLM ever runs — in three unglamorous steps: how you cut the text, how you find the candidates, and whether you re-rank them before stuffing the prompt. Get those right and a mid-tier model looks brilliant. Get them wrong and GPT-5 will confidently answer from garbage.
This post is about those three steps, in the order they bite you: chunking, retrieval, and reranking. Then the question everyone eventually asks — do I need LlamaIndex for this, or am I adding a framework to justify the diagram?
Retrieval is two jobs, not one
The single most useful mental model I can give you: retrieval has to do two things that pull in opposite directions. First, recall — cast a wide enough net that the right passage is somewhere in your candidates. Second, precision — make sure the few chunks you actually hand the model are the best ones, not just the on-topic ones.
A single vector search is decent at the first job and bad at the second. It will happily return 50 chunks that are all about the topic, but it's surprisingly bad at telling you which one answers the question. That's why serious RAG is almost always two-stage: a fast retriever casts the net, then a slower, sharper reranker decides the final order.
Retrieve wide, rerank narrow, then generate. If you only add one thing to a naïve RAG pipeline this year, add the reranker — it's the highest-leverage, lowest-effort upgrade in the stack.
Chunking: the decision you make once and regret for months
Everything downstream inherits your chunking. The embedding model can only embed what you give it; the retriever can only return chunks that exist; the reranker can only reorder what was retrieved. If you split a procedure across two chunks so neither one is complete, no amount of reranking will reassemble it. Chunking is the foundation, and like most foundations, nobody notices it until it cracks.
The core tension is simple. Small chunks match precisely: a tight passage about exactly one thing embeds into a sharp, specific vector. But small chunks are fragments, so the model gets a sentence with no surrounding context. Large chunks carry their context with them, but the embedding has to average a paragraph of mixed ideas into one vector, so it matches everything weakly and nothing strongly, and it drags noise into your prompt.
The sweet spot sits between the two: a coherent passage that stands on its own, precise enough to match and whole enough to answer. Most prose lives here.
Sizes are only half of it — how you split matters as much as how big. The strategies I actually reach for, roughly in order of how often:
- Recursive / structure-aware — split on the document's own boundaries (headings, paragraphs, then sentences) before falling back to a character count. This is the sane default. It respects the shape of the text instead of guillotining mid-sentence.
- Sentence-window (small-to-big) — embed single sentences for precise matching, but at retrieval time return the sentence plus its neighbours. You get the precision of small chunks and the context of large ones. This one quietly fixes a lot of “the answer was half-there” complaints.
- Parent-document — index small child chunks, but feed the model the larger parent they came from. Same idea as sentence-window, coarser granularity.
- Semantic — split where the topic actually shifts (using embedding distance between adjacent sentences) rather than at a fixed length. Lovely in theory, more expensive, and worth it mainly for long, rambling documents where fixed sizes cut across ideas.
- Fixed-size with overlap — the blunt instrument. A fixed token count with ~10–20% overlap so a fact straddling a boundary survives in at least one chunk. Fine for uniform text; crude for structured docs.
If you split with zero overlap, every chunk boundary is a place where a fact can be cut in half and lost from both sides. A little overlap is cheap insurance. Zero overlap is a guaranteed class of silent retrieval misses.
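To make the boundary problem concrete, here is a minimal sketch of fixed-size chunking with overlap. The function names, the toy sentence, and the tiny sizes are all illustrative assumptions, and whitespace-separated words stand in for real tokenizer tokens.

```python
def chunk_with_overlap(tokens, size=9, overlap=2):
    """Split a token list into windows of `size` that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


def contains_phrase(chunk, phrase):
    """True if `phrase` (a token list) appears contiguously inside `chunk`."""
    n = len(phrase)
    return any(chunk[i:i + n] == phrase for i in range(len(chunk) - n + 1))


# Toy document: words stand in for tokens (an assumption for illustration).
tokens = "to reset the device hold the power button for ten seconds then release".split()
fact = ["for", "ten", "seconds"]  # this phrase straddles the 9-token boundary

no_overlap = chunk_with_overlap(tokens, size=9, overlap=0)
with_overlap = chunk_with_overlap(tokens, size=9, overlap=2)

# Without overlap the phrase is cut in two; with overlap one chunk holds it whole.
print(any(contains_phrase(c, fact) for c in no_overlap))
print(any(contains_phrase(c, fact) for c in with_overlap))
```

In this toy case the zero-overlap split ends the first chunk on "for" and starts the second on "ten", so neither chunk contains the full instruction. The overlapping split repeats the last two tokens, which is enough for the second chunk to carry the phrase intact.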
There's no universal best chunk size, and anyone who quotes you one hasn't seen your documents. Dense API references want different treatment than chatty support articles. The honest workflow is: pick recursive splitting at ~400 tokens with overlap as a starting point, build a small eval set of real questions, and measure retrieval before you touch the model.
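As a rough illustration of what "recursive, structure-aware" means, the sketch below tries paragraph breaks first, then line breaks, then sentence boundaries, then spaces, and only recurses to a finer separator when a piece is still too big. It is a simplified stand-in rather than any library's splitter. The separator list, the `recursive_split` name, and the use of word counts instead of tokens are assumptions. Overlap is left out to keep it short, and splitting on ". " drops the period at a chunk boundary.

```python
SEPARATORS = ["\n\n", "\n", ". ", " "]  # paragraphs, lines, sentences, words


def recursive_split(text, max_words=400, separators=SEPARATORS):
    """Split on the coarsest boundary that keeps every chunk under max_words."""
    if len(text.split()) <= max_words:
        return [text.strip()] if text.strip() else []
    if not separators:
        # Nothing left to split on: fall back to a hard cut by word count.
        words = text.split()
        return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

    sep, finer = separators[0], separators[1:]
    chunks, buffer = [], ""
    for piece in text.split(sep):
        candidate = piece if not buffer else buffer + sep + piece
        if len(candidate.split()) <= max_words:
            buffer = candidate  # keep packing pieces into the current chunk
            continue
        if buffer.strip():
            chunks.append(buffer.strip())
        if len(piece.split()) <= max_words:
            buffer = piece
        else:
            # This piece alone is too big, so retry it with a finer separator.
            chunks.extend(recursive_split(piece, max_words, finer))
            buffer = ""
    if buffer.strip():
        chunks.append(buffer.strip())
    return chunks


doc = (
    "Intro paragraph one two three.\n\n"
    "Second paragraph has a first sentence here. And a second sentence follows it.\n\n"
    "Short tail."
)
chunks = recursive_split(doc, max_words=8)
for c in chunks:
    print(repr(c))
```

With the deliberately tiny `max_words=8`, the first and last paragraphs survive whole, and only the oversized middle paragraph is broken at its sentence boundary. That is the behaviour you want from a default: cut along the document's own seams before resorting to a count.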
Why dense retrieval alone disappoints
Dense (vector) retrieval works by embedding your query and your chunks into the same space and grabbing the nearest neighbours. It's fast, it scales to millions of chunks, and it captures meaning that keyword search misses. It's also, on its own, a blunt ranker. Here's the failure I see constantly: the top candidates all score within a hair of each other — 0.81, 0.80, 0.79 — because the embedding can tell they're all on topic but can't tell which one actually answers the question.
Picture the typical result. A bi-encoder has returned seven on-topic chunks with scores bunched between 0.75 and 0.81, and the one that contains the real answer is buried at rank 6. The retriever can tell these are all on-topic, but not which one actually answers.
This is not a sign your embeddings are bad. It's structural. A bi-encoder embeds the query and the document separately and never lets them interact — it's comparing two summaries from across the room. That's exactly what makes it fast enough to search millions of chunks, and exactly what makes it imprecise at the top.
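A small sketch of that structure, using made-up three-dimensional vectors in place of real embeddings: the chunk vectors are computed ahead of time, and the query is only ever compared to them by cosine similarity. The chunk ids, the vectors, and the `dense_search` helper are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def dense_search(query_vec, index, top_k=3):
    """Score every precomputed chunk vector against the query, best first."""
    scored = [(chunk_id, cosine(query_vec, vec)) for chunk_id, vec in index.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_k]


# Hypothetical precomputed chunk embeddings (toy values, not real model output).
index = {
    "refund-policy-overview": [0.90, 0.40, 0.10],
    "refund-window-30-days": [0.80, 0.50, 0.20],
    "refund-form-location": [0.85, 0.45, 0.05],
    "shipping-rates": [0.10, 0.20, 0.95],
}
query_vec = [0.88, 0.42, 0.12]  # stands in for an embedded refund question

results = dense_search(query_vec, index, top_k=3)
for chunk_id, score in results:
    print(chunk_id, round(score, 3))
```

The off-topic chunk is cleanly excluded, but the three refund chunks land within about a hundredth of each other. The query and the chunks never interact beyond a single similarity number, so nothing in this stage can say which refund chunk holds the answer.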
What a reranker actually is
A reranker is almost always a cross-encoder, and the difference from your retriever is the whole story. A bi-encoder runs the query and a document through the model separately and compares the two output vectors. A cross-encoder concatenates them as [query + document] and runs them through the model together, so every word of the query can attend to every word of the document. Then it emits a single relevance score.
The bi-encoder is two separate towers. Document vectors are baked ahead of time, so search is just nearest-neighbour math: milliseconds over millions of chunks. The cross-encoder has nothing it can bake ahead of time, because its score depends on the query and the document together.
That inability to precompute is the entire reason for two stages. A cross-encoder has to do a fresh forward pass for every query–document pair, so running it over your whole corpus at query time is hopeless: that's millions of forward passes per question. So you don't. You let the cheap bi-encoder narrow a million chunks down to ~50, then spend the cross-encoder's expensive attention only on those 50. You get cross-encoder precision at bi-encoder scale. That's the trick, and it's most of why reranking feels like magic the first time you add it.
Retrieve 25–100 candidates from the vector store, rerank, and keep the top 3–5 for the prompt. Retrieving too few starves the reranker (it can't promote a chunk that was never fetched). Keeping too many after reranking just re-adds the noise you paid to remove.
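The "starving the reranker" failure is easy to reproduce with stand-ins. In the sketch below, `toy_retrieve` and `toy_rerank_score` are hypothetical placeholders for a real vector store and a real cross-encoder, and the answer chunk is planted at retrieval rank 7. The only things that change between the two runs are the two knobs, here called `fetch_k` and `keep_k`.

```python
def retrieve_then_rerank(query, retrieve, rerank_score, fetch_k=50, keep_k=4):
    """Fetch fetch_k candidates, reorder them by reranker score, keep keep_k."""
    candidates = retrieve(query, fetch_k)
    ranked = sorted(candidates, key=lambda c: rerank_score(query, c), reverse=True)
    return ranked[:keep_k]


# Toy first-stage ordering: the chunk with the answer sits at rank 7.
RETRIEVAL_ORDER = ["c1", "c2", "c3", "c4", "c5", "c6", "answer", "c8"]


def toy_retrieve(query, k):
    """Placeholder retriever: returns the first k chunks of a fixed ranking."""
    return RETRIEVAL_ORDER[:k]


def toy_rerank_score(query, chunk):
    """Placeholder reranker: only the answer chunk scores high."""
    return 1.0 if chunk == "answer" else 0.1


starved = retrieve_then_rerank("q", toy_retrieve, toy_rerank_score, fetch_k=3, keep_k=3)
wide = retrieve_then_rerank("q", toy_retrieve, toy_rerank_score, fetch_k=8, keep_k=3)

print(starved)
print(wide)
```

With `fetch_k=3` the reranker never sees the answer, so no scoring quality can rescue it. With `fetch_k=8` the same reranker promotes it to the top, and `keep_k=3` still keeps the prompt short.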
Popular reranker models, and when to reach for each
The good news is you rarely train these. There's a healthy market of hosted APIs and open weights, and the choice mostly comes down to four questions: how much latency you can spend, whether the data can leave your network, whether you need multiple languages, and what you're willing to pay.
Most teams start with a hosted API (Cohere/Voyage) and move to BGE or MiniLM when cost or data-privacy says self-host.
The short version, by constraint:
- Just want it to work, data can leave the network? Start with Cohere Rerank or Voyage rerank. One API call, multilingual, genuinely strong. You'll have a better pipeline this afternoon.
- Need to self-host (privacy, cost, air-gapped)? BGE-reranker-v2-m3 is the strong free default — multilingual, runs on a modest GPU, no per-call bill. mxbai-rerank is a solid alternative.
- Latency-critical or CPU-only? The classic ms-marco-MiniLM cross-encoders from sentence-transformers are tiny, fast, and have been the reliable baseline for years. Not the highest ceiling, but hard to beat on cost per millisecond.
- Want a middle ground between retriever and reranker? ColBERT (late interaction) scores at the token level and lands between a bi-encoder's speed and a cross-encoder's precision — more infrastructure, but a real option at scale.
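For readers who haven't met late interaction, this is roughly what "scores at the token level" means: each query token keeps the similarity of its best-matching document token, and those maxima are summed. The sketch uses toy two-dimensional token vectors and a hypothetical `maxsim_score` helper. A real ColBERT setup produces the per-token embeddings with a trained model and indexes them for fast lookup.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens, doc_tokens):
    """Late interaction: sum, over query tokens, of the best doc-token match."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy per-token embeddings (illustrative values only).
query_tokens = [[1.0, 0.0], [0.0, 1.0]]              # two query tokens
doc_a_tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # covers both query tokens
doc_b_tokens = [[1.0, 0.0], [0.9, 0.1]]              # covers only the first

score_a = maxsim_score(query_tokens, doc_a_tokens)
score_b = maxsim_score(query_tokens, doc_b_tokens)
print(score_a, score_b)
```

Document token vectors can still be computed ahead of time, which is what keeps this cheaper than a cross-encoder, while the per-token matching lets a document that covers every part of the query outscore one that covers only part of it.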
Whatever you pick, the integration shape is the same — score the candidates, sort, truncate:
from sentence_transformers import CrossEncoder

# Stage 1: dense retrieval casts the wide net (recall).
# `vector_store` and `query` are assumed to exist already.
candidates = vector_store.search(query, top_k=50)

# Stage 2: the cross-encoder scores each (query, chunk) pair jointly (precision).
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = [(query, c.text) for c in candidates]
scores = reranker.predict(pairs)  # one relevance score per pair

# Sort by score, best first, and keep only the best few for the prompt.
ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
context = [c for c, _ in ranked[:4]]

Forty lines, no framework, and it's the biggest quality jump most RAG systems will ever get. Which is a good moment to talk about frameworks.
Where LlamaIndex fits — and what it actually adds
If you've only ever heard “LlamaIndex is a RAG framework,” here's the more useful framing: it's a set of opinionated abstractions over the exact seven layers you'd otherwise hand-write. Loaders, splitters, embeddings, the index, the retriever, the postprocessors (where rerankers live), and the response synthesizer. Nothing it does is impossible without it; the question is whether you want to own that code.
In the raw stack, every layer is yours to build and maintain. That means total control and zero indirection, and a lot more code to keep correct as requirements grow.
So what's the real difference between “normal retrieval” and “retrieval with LlamaIndex”? With the raw approach you call an embeddings API, talk to your vector DB's SDK (pgvector, Qdrant, Pinecone), and write the query, the reranker glue, and the prompt assembly yourself. With LlamaIndex, those become configured components — and, more importantly, you get the advanced retrieval patterns for free instead of building each one:
- Heterogeneous loaders — 100+ connectors via LlamaHub, so Notion + Postgres + a folder of PDFs all become the same node type without three bespoke parsers.
- Advanced retrieval out of the box — sentence-window, auto-merging, recursive/small-to-big, metadata filtering, and query transforms like HyDE or sub-question decomposition. These are exactly the patterns that are annoying to build by hand.
- Rerankers as one-line postprocessors: CohereRerank, SentenceTransformerRerank, or an LLM reranker drop into the query engine as a node postprocessor. Swapping rerankers is a config change.
- Index types beyond plain vector search: summary, tree, keyword, and property-graph indices when a flat top-k isn't the right structure for your data.
LlamaIndex's real payoff isn't the first vector search — that's a weekend either way. It's the third, fourth, and fifth retrieval pattern you'd otherwise build and maintain yourself, plus a consistent way to swap components as requirements shift.
When LlamaIndex is overkill
I like LlamaIndex. I also routinely talk people out of it. A framework is a trade: you exchange a pile of glue code for a pile of abstractions, and abstractions have a cost. You pay in indirection when you debug, in a version-churn tax, and in the moments where the framework's opinion fights yours and you spend an afternoon learning its way to do a thing you could have written in ten lines. That trade is worth it when complexity is genuinely high and a waste when it isn't.
Take the simplest scenario: a single source and a plain top-k query. A vector-DB SDK and 40 lines do this; LlamaIndex is overkill.
Concretely, skip the framework when most of these are true:
- You have one source and one retrieval pattern — a single pgvector table, plain top-k. A vector-DB SDK and ~40 lines (including the reranker above) cover it with less surface area than the framework's config.
- Your corpus is small and fairly static — a few hundred to a few thousand chunks. Honestly, plain dense retrieval plus a reranker is often all you need; the fancy index types solve problems you don't have.
- You need tight control over latency or the exact prompt — frameworks add layers between you and the wire, and that's the last place you want surprises in a hot path.
- Your team will maintain this for years and values reading plain code over learning a framework's release notes.
And reach for it when the opposite is true: many heterogeneous sources, retrieval that needs routing or multiple hops, a fast-moving prototype where you're trying five patterns this week, or a team that would rather configure than build. The mistake in both directions is the same — choosing the tool before you've described the job.
A default recipe that holds up
If you want somewhere to start that won't embarrass you in production, this is the stack I reach for before I know anything special about the data:
1. Chunk with recursive, structure-aware splitting at ~400 tokens and ~15% overlap. Reach for sentence-window if early evals show truncated answers.
2. Embed with a current general-purpose model and store in whatever vector DB you already run. Don't agonize here; the reranker covers a lot of embedding sins.
3. Retrieve the top 50 candidates. Add keyword/BM25 as a hybrid second retriever if your domain is full of exact terms (codes, SKUs, names), because dense search is weak on those.
4. Rerank with a cross-encoder (a hosted API to start, BGE if you self-host) and keep the top 3–5.
5. Generate, and (this is the part everyone skips) build a 30-question eval set and measure retrieval hit-rate before you blame the model for anything.
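Measuring retrieval hit-rate needs very little machinery. A minimal sketch, assuming each eval example records the id of the chunk that should be retrieved: the `gold_chunk_id` field, the `fake_retrieve` stand-in, and the four toy questions are all hypothetical, and a real run would call your actual retriever over your 30 or so real questions.

```python
def hit_rate_at_k(eval_set, retrieve, k=5):
    """Fraction of questions whose gold chunk appears in the top-k results."""
    hits = 0
    for example in eval_set:
        retrieved_ids = retrieve(example["question"], k)
        if example["gold_chunk_id"] in retrieved_ids:
            hits += 1
    return hits / len(eval_set)


# Toy retrieval results per question, in ranked order (illustrative only).
FAKE_RESULTS = {
    "q1": ["a", "b", "c", "d"],
    "q2": ["e", "f", "g", "h"],
    "q3": ["i", "j", "k", "l"],
    "q4": ["m", "n", "o", "p"],
}


def fake_retrieve(question, k):
    """Placeholder for a real retriever: returns the first k stored ids."""
    return FAKE_RESULTS[question][:k]


eval_set = [
    {"question": "q1", "gold_chunk_id": "a"},  # found at rank 1
    {"question": "q2", "gold_chunk_id": "h"},  # found at rank 4
    {"question": "q3", "gold_chunk_id": "j"},  # found at rank 2
    {"question": "q4", "gold_chunk_id": "z"},  # never retrieved
]

print(hit_rate_at_k(eval_set, fake_retrieve, k=3))
print(hit_rate_at_k(eval_set, fake_retrieve, k=4))
```

Running it at two values of k is the useful habit. A question that misses at k=3 but hits at k=4 points at ranking, which a reranker can fix. One that never hits at any k points at chunking or the retriever itself.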
You can wire retrieval, chunking, and a reranker into an agent on the AgentSwarms canvas and watch each stage in the trace — then run the Failure-Mode Labs to see what a broken retrieval step actually looks like before it happens to you in production.
The throughline of all of this: RAG quality is decided in the boring middle of the pipeline, not at the model. Cut your text so each chunk holds one whole idea. Retrieve wide so the answer is somewhere in the candidates. Rerank narrow so the best chunk reaches the model. Add a framework only when the job is complex enough to need one. Do those four things and you'll spend a lot less time accusing your LLM of hallucinating when it was only ever answering the question you actually gave it.