Retrieval-Augmented Generation (RAG): the practical guide to building grounded LLM apps

Large language models are great at producing fluent text, but they’re not reliably correct or up to date, and they usually can’t cite where an answer came from. Retrieval-Augmented Generation (RAG) is the most common pattern for fixing that: fetch relevant knowledge first, then generate an answer grounded in it.
This post explains:
what RAG is (and isn’t),
the core building blocks,
the “modern” upgrades that actually determine quality,
and how to evaluate + debug RAG like an engineer.
What RAG is
RAG = Retriever + Generator
Instead of forcing the LLM to “remember everything”, you:
retrieve the most relevant chunks from your knowledge base (docs, PDFs, wiki, tickets, code, etc.)
inject those chunks into the prompt
generate an answer conditioned on that evidence (ideally with citations)
The canonical pipeline
This basic idea was formalized in the original RAG paper by Lewis et al. (2020). (arXiv:2005.11401)
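The three steps above can be sketched in a few lines. This is a toy illustration, not the method from the paper: the word-overlap `retrieve` function, the `kb` dictionary, and the `generate` callable (a stand-in for whatever LLM client you use) are all assumptions made up for the example.

```python
from typing import Callable

def retrieve(query: str, chunks: dict[str, str], k: int = 2) -> list[str]:
    """Toy retriever: rank chunk IDs by how many query words they share."""
    q_words = set(query.lower().split())
    ranked = sorted(
        chunks,
        key=lambda cid: len(q_words & set(chunks[cid].lower().split())),
        reverse=True,
    )
    return ranked[:k]

def answer(query: str, chunks: dict[str, str], generate: Callable[[str], str]) -> str:
    """Retrieve -> inject into the prompt -> generate."""
    top_ids = retrieve(query, chunks)
    context = "\n".join(f"[{cid}] {chunks[cid]}" for cid in top_ids)
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer with citations:"
    return generate(prompt)

# Hypothetical knowledge base; the "LLM" here is a stub that echoes the prompt.
kb = {
    "C1": "Refunds are issued within 14 days of purchase.",
    "C2": "The API rate limit is 100 requests per minute.",
    "C3": "Support is available on weekdays.",
}
print(answer("what is the api rate limit", kb, generate=lambda prompt: prompt))
```

Swapping the stub for a real model call and the toy retriever for a real index is what the rest of this post is about.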
The 4 core building blocks
1. Chunking (how you split knowledge)
Chunking is the step where you split your source documents into retrieval units (“chunks”) that will be embedded, indexed, and later fed to the LLM as evidence. It’s not just a preprocessing detail - it directly shapes recall, precision, and even whether the model can cite the right passage. Good chunking usually respects natural document structure (headings, paragraphs, code blocks, tables) and aims to keep each chunk semantically coherent so a single retrieved piece can stand on its own.
Chunking controls what retrieval can “see.”
Too small → missing context
Too big → irrelevant junk + token waste
Common baseline: 300–800 tokens with 10–20% overlap, then iterate.
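A minimal sketch of fixed-size chunking with overlap. It assumes the text is already tokenized into a list (whitespace words work as a rough approximation of model tokens); the function name `chunk_tokens` and its defaults are illustrative choices, not a standard API.

```python
def chunk_tokens(tokens: list[str], size: int = 500, overlap: int = 75) -> list[list[str]]:
    """Split a token list into windows of `size` that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks

words = "one two three four five six seven eight nine ten".split()
for chunk in chunk_tokens(words, size=4, overlap=1):
    print(chunk)
```

With `size=4` and `overlap=1`, each chunk starts with the last token of the previous one, which is the property that keeps boundary-straddling details retrievable.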
2. Embeddings (how you represent text for search)
Embeddings are how we turn text into numbers that capture meaning, so we can do “semantic search” instead of pure keyword search. Concretely, an embedding model maps a query and each chunk into vectors, and retrieval becomes “find the chunks whose vectors are closest to the query.” This matters because real users rarely use the same wording as your docs - embeddings help match paraphrases, concepts, and intent (e.g., “cancel subscription” ↔ “terminate plan”). Your RAG quality often lives or dies on whether your embedding model preserves the right signals for your content (definitions, procedures, code, product names, etc.).
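To make "closest vectors" concrete, here is a small sketch using cosine similarity. The 3-dimensional vectors are invented for illustration; a real embedding model would produce hundreds of dimensions, and the helper names `cosine` and `nearest` are assumptions of this example.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def nearest(query_vec: list[float], chunk_vecs: dict[str, list[float]], k: int = 1) -> list[str]:
    """Return the k chunk IDs whose vectors are most similar to the query."""
    ranked = sorted(chunk_vecs, key=lambda cid: cosine(query_vec, chunk_vecs[cid]), reverse=True)
    return ranked[:k]

# Made-up vectors standing in for real embeddings.
query = [0.9, 0.1, 0.0]                      # "cancel subscription"
chunks = {
    "terminate plan": [0.8, 0.2, 0.1],
    "reset password": [0.1, 0.9, 0.2],
}
print(nearest(query, chunks))
```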
3. Retrieval (how you fetch candidates)
Two major retrieval families show up in almost every RAG system, and they behave differently because they’re matching different signals:
a. Sparse retrieval (BM25 / keyword search)
BM25 is a classic information-retrieval scoring method used by many search engines. It ranks documents mainly by keyword overlap (with smart weighting so rare/important words count more than common ones like “the” or “and”). It’s great when the user’s query contains exact tokens you must match:
IDs (INC-12491)
error codes (CUDA_ERROR_ILLEGAL_ADDRESS)
version numbers (v2.3.1)
names or exact phrases
Easy way to remember: BM25 is like Ctrl+F on steroids - it finds the pages that literally contain what you typed, and ranks the best matches first.
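For intuition, here is a compact sketch of BM25 scoring. It assumes whitespace tokenization, the commonly used defaults `k1=1.5` and `b=0.75`, and the "+1 inside the log" IDF variant that keeps scores non-negative; production engines add proper analyzers and inverted indexes on top of this.

```python
import math
from collections import Counter

def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with a basic BM25 formula."""
    if not docs:
        return []
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many docs does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores

docs = [
    "ticket INC-12491 was closed yesterday",
    "the outage ticket is still open",
    "release notes for v2.3.1",
]
print(bm25_scores("INC-12491", docs))
```

Only the first document contains the exact ID, so it is the only one with a non-zero score, which is the "exact anchor" behavior described above.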
b. Dense retrieval (vector / embedding search)
Dense retrieval turns text into embeddings (vectors) so it can match meaning, not exact wording. This helps when users and docs say the same thing differently:
“cancel subscription” ↔ “terminate plan”
“model is slow” ↔ “high latency inference”
“add login” ↔ “implement authentication”
Easy way to remember: dense retrieval is like asking, “Which passages talk about the same idea?” even if they don’t share the same words.
c. Why you often want both (hybrid retrieval)
Real user queries mix both:
“What does error 0x80070005 mean?” (exact token + intent)
“Why did performance drop after upgrading?” (semantic, broad)
So many production RAG systems use hybrid retrieval:
BM25 to catch exact anchors (codes, names, IDs)
Dense retrieval to catch paraphrases and concepts
A reranker to sort the best few (covered below)
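One simple way to combine the two candidate lists is a weighted sum of normalized scores. This is a sketch of one option among several (rank-based fusion is another); the function names, the `alpha` weight, and the example scores are assumptions for illustration.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1] so sparse and dense scores are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {cid: 1.0 for cid in scores}
    return {cid: (s - lo) / (hi - lo) for cid, s in scores.items()}

def hybrid_merge(sparse: dict[str, float], dense: dict[str, float],
                 alpha: float = 0.5, top_n: int = 20) -> list[str]:
    """Blend BM25-style and embedding-style scores; alpha weights the sparse side."""
    s, d = min_max(sparse), min_max(dense)
    combined = {
        cid: alpha * s.get(cid, 0.0) + (1 - alpha) * d.get(cid, 0.0)
        for cid in set(s) | set(d)
    }
    # Sort by score descending, then by ID so ties are deterministic.
    return sorted(combined, key=lambda cid: (-combined[cid], cid))[:top_n]

# Hypothetical scores from the two retrievers.
sparse_hits = {"C1": 12.0, "C2": 3.0, "C4": 1.0}
dense_hits = {"C2": 0.82, "C3": 0.80, "C1": 0.40}
print(hybrid_merge(sparse_hits, dense_hits))
```

A chunk found by both retrievers (C2 here) tends to rise to the top, while chunks found by only one still stay in the pool for the reranker.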
4. Generation (how you answer from evidence)
Generation is where the LLM turns retrieved chunks into a final response. The goal isn’t just fluent text - it’s grounded answers: the model should respond based on the evidence you retrieved, not its own assumptions. Think of the retrieved chunks as “open book notes” the model is allowed to use. If the notes don’t contain the answer, the correct behavior is to say so.
To make generation reliable, add three constraints:
Citations (chunk IDs): Require citations like [C2] or [DocA#12] for each major claim. This keeps answers auditable and discourages hallucinations.
Refusal when evidence is missing: If the answer isn’t supported by the retrieved context, the model should explicitly say “I don’t know from these sources” and suggest what to retrieve next.
A short “What I used” section: End with a compact list of the chunks that actually supported the response. This makes debugging easy (“retrieval failed” vs “model ignored evidence”).
Example output format
Answer: … [C1][C4]
Key evidence:
… [C1]
… [C4]
What I used: C1, C4
If missing: “Not enough evidence in the provided context to answer.”
Prompt snippet
“Use only the provided context. If unsupported, say you don’t know.”
“Cite chunk IDs like [C#] for each key claim.”
“End with What I used: list of chunk IDs.”
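The prompt snippet above can be assembled programmatically. A minimal sketch, assuming chunks arrive as a list of strings and are numbered C1, C2, ... in order; `build_prompt` is a hypothetical helper, not part of any library.

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Assemble a grounded-answer prompt with numbered, citable chunks."""
    context = "\n".join(f"[C{i}] {text}" for i, text in enumerate(chunks, start=1))
    return (
        "Use only the provided context. If unsupported, say you don't know.\n"
        "Cite chunk IDs like [C#] for each key claim.\n"
        "End with 'What I used:' followed by a list of chunk IDs.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
    )

print(build_prompt(
    "How long do refunds take?",
    ["Refunds are issued within 14 days.", "Shipping is free over $50."],
))
```

Keeping the chunk IDs in the prompt is what makes the citations checkable later.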
The “modern RAG” upgrades that really matter
Once you have a basic RAG pipeline working, most quality gains come from a few “systems” upgrades. These aren’t fancy extras - they’re the difference between mostly relevant and consistently useful.
Upgrade A: Re-ranking (precision booster)
Vector search is great at pulling a decent candidate set, but the top results often include “near misses.” A re-ranker takes the top-N retrieved chunks (e.g., 20–100) and sorts them using a stronger relevance signal, so the final top-K context is much tighter and easier for the LLM to ground on.
Late-interaction retrieval like ColBERT improved precision by scoring query–document similarity at a finer (token) level. (arXiv:2004.12832)
ColBERTv2 made late interaction more practical by reducing the storage footprint while improving quality. (arXiv:2112.01488)
In practice, lightweight general rerankers (e.g., BGE rerankers) are often the easiest win because they’re drop-in and noticeably improve top-k relevance.
Key takeaway: Retrieve wide → rerank narrow → prompt only the best.
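The "retrieve wide, rerank narrow" step can be sketched with a pluggable scoring function. In a real system `score` would typically call a cross-encoder reranker model; the `overlap_score` function below is only a toy stand-in so the example runs without any model.

```python
from typing import Callable

def rerank(query: str, candidates: list[str],
           score: Callable[[str, str], float], top_k: int = 5) -> list[str]:
    """Re-sort first-stage candidates with a stronger scorer and keep top_k."""
    ranked = sorted(candidates, key=lambda passage: score(query, passage), reverse=True)
    return ranked[:top_k]

def overlap_score(query: str, passage: str) -> float:
    """Toy relevance signal: fraction of query words found in the passage."""
    q_words = set(query.lower().split())
    p_words = set(passage.lower().split())
    return len(q_words & p_words) / len(q_words) if q_words else 0.0

candidates = [
    "Our office is closed on public holidays.",
    "Model latency improves with quantization.",
    "High latency inference can be caused by large batch sizes.",
]
print(rerank("why is model inference latency high", candidates, overlap_score, top_k=2))
```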
Upgrade B: Multi-query retrieval (coverage booster)
Single queries are brittle: users omit keywords, use unusual phrasing, or ask multi-part questions. Multi-query retrieval generates a few query variations (rephrasings, sub-questions), retrieves for each, then fuses results into one stronger candidate pool.
RAG-Fusion is a popular approach: generate multiple queries, retrieve for each, then merge rankings using Reciprocal Rank Fusion (RRF). (arXiv:2402.03367)
Easy way to remember: Ask the search engine the same question 5 different ways, then combine the best hits.
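Reciprocal Rank Fusion itself is only a few lines: each document gets 1 / (k + rank) from every ranking it appears in, and the sums are sorted. The sketch below assumes the commonly used constant `k=60`; the chunk IDs are made up.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of IDs into one list using RRF."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Higher fused score first; break ties by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# One ranked list per query variation (hypothetical results).
per_query_results = [
    ["C1", "C2", "C3"],
    ["C2", "C4"],
    ["C2", "C1"],
]
print(reciprocal_rank_fusion(per_query_results))
```

Because RRF uses ranks rather than raw scores, it also works for merging BM25 and dense results whose scores are on different scales.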
Upgrade C: “Retrieve only when needed” (reduce noise + cost)
Not every question benefits from retrieval (e.g., “Explain RAG” or “Rewrite this paragraph”). Always retrieving can add irrelevant context, increase latency, and sometimes even reduce answer quality. In cases like these, use conditional retrieval: only fetch evidence when the question actually depends on external knowledge.
Self-RAG trains the model to decide when to retrieve, judge relevance, and critique whether evidence supports the answer. (arXiv:2310.11511)
Easy way to remember: If it’s about your private docs → retrieve. If it’s general knowledge / writing help → skip.
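A crude rule-based router illustrates the idea. To be clear, this is not how Self-RAG works (Self-RAG learns the decision); it is a hand-written heuristic, and the keyword lists are hypothetical placeholders you would replace with terms from your own domain or with a small classifier.

```python
import re

# Hypothetical hints; tune these for your own corpus and users.
PRIVATE_HINTS = {"our", "my", "policy", "ticket", "internal", "invoice"}
SKIP_PREFIXES = ("rewrite", "explain", "translate", "summarize this")

def should_retrieve(question: str) -> bool:
    """Decide whether a question likely needs evidence from the knowledge base."""
    q = question.lower().strip()
    words = set(re.findall(r"[a-z0-9']+", q))
    if words & PRIVATE_HINTS:
        return True          # mentions something that lives in private docs
    if q.startswith(SKIP_PREFIXES):
        return False         # writing help or a general explanation
    return True              # when unsure, retrieving is the safer default

for q in ["Rewrite this paragraph", "Explain RAG", "What is our refund policy?"]:
    print(q, "->", should_retrieve(q))
```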
Upgrade D: Hierarchical / structured retrieval for long docs (global context)
Chunk retrieval is “local”: it’s good at finding a paragraph, but long documents and large corpora often require global understanding (themes, policies, cross-document synthesis). Hierarchical/structured retrieval adds a higher-level index so the system can retrieve both “big picture” summaries and drill-down details.
RAPTOR builds a tree of summaries and retrieves at multiple abstraction levels (high-level + detailed). (arXiv:2401.18059)
GraphRAG builds an entity/relationship graph and uses community summaries to answer broad corpus-wide questions. (arXiv:2404.16130)
Easy way to remember: Don’t just retrieve paragraphs - retrieve the map of the corpus too.
Failure modes (and how to debug them)
Even strong RAG systems fail in predictable ways. The fastest way to improve is to identify which stage broke (retrieval, ranking, context packing, or generation) and apply a targeted fix.
1. Good retrieval, bad answer (the model ignores evidence)
What it looks like: the retrieved chunks clearly contain the answer, but the model responds with something generic, incomplete, or incorrect.
Common causes:
the prompt doesn’t strongly enforce “use only context”
too much context noise buries the relevant lines
the truly relevant chunk is ranked low and gets truncated out
Fixes:
Constrain the prompt: “Answer using ONLY the provided context. If it’s not there, say you don’t know.”
Shorten / clean the context: reduce top-K, remove duplicates, or use tighter chunking so evidence is easier to spot
Add a reranker: retrieve top-N broadly, rerank, then keep only top-K for generation
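The "shorten / clean the context" fix can be as simple as dropping duplicates and truncating. A minimal sketch, assuming the input list is already sorted best-first; `pack_context` is an illustrative name, and exact-match deduplication after whitespace and case normalization is a deliberately simple choice.

```python
def pack_context(ranked_chunks: list[str], top_k: int = 5) -> list[str]:
    """Keep the first top_k distinct chunks, preserving rank order."""
    seen: set[str] = set()
    packed: list[str] = []
    for chunk in ranked_chunks:
        key = " ".join(chunk.lower().split())  # normalize case and whitespace
        if key in seen:
            continue
        seen.add(key)
        packed.append(chunk)
        if len(packed) == top_k:
            break
    return packed

ranked = [
    "Refunds are issued within 14 days.",
    "refunds are issued  within 14 days.",   # duplicate from another source
    "Contact support to start a refund.",
    "Shipping is free over $50.",
]
print(pack_context(ranked, top_k=2))
```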
2. Bad retrieval, confident answer (hallucination)
What it looks like: retrieved chunks don’t support the answer (or are irrelevant), but the model answers confidently anyway.
Fixes:
Score thresholding: if retrieval scores are low (or relevance is uncertain), return “I don’t know from these sources” instead of guessing
Require citations for claims: force the model to cite chunk IDs per claim; uncited claims are treated as invalid
Use query rewriting / HyDE: rewrite the query into something more “retrieval-friendly,” especially when the user query is vague or mismatched to doc phrasing
HyDE in one sentence: generate a hypothetical “ideal” answer document, embed that, then retrieve real documents near it in embedding space. (arXiv:2212.10496)
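The first two fixes are easy to automate. The sketch below assumes retrieval returns `(chunk_id, score)` pairs and that citations look like `[C1]`; the `min_score` value is an arbitrary placeholder that you would calibrate on your own data, and the sentence splitter is intentionally naive.

```python
import re

def gate_on_score(scored_chunks: list[tuple[str, float]], min_score: float = 0.35):
    """Return chunks at or above the threshold, or None to signal a refusal."""
    kept = [(cid, score) for cid, score in scored_chunks if score >= min_score]
    return kept or None

def uncited_sentences(answer: str) -> list[str]:
    """List sentences in the answer that carry no [C#] citation."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    return [s for s in sentences if not re.search(r"\[C\d+\]", s)]

hits = [("C1", 0.21), ("C2", 0.12)]
if gate_on_score(hits) is None:
    print("I don't know from these sources.")

draft = "Refunds take 14 days [C1]. Shipping is always free."
print(uncited_sentences(draft))
```

Sentences returned by `uncited_sentences` can be stripped from the answer or sent back to the model for revision.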
3. Chunking mismatch (the answer is split across chunks)
What it looks like: no single chunk contains enough evidence, because key details are split across boundaries (e.g., definition in one chunk, conditions in the next).
Fixes:
Increase overlap (or slightly increase chunk size) so related details stay together
Use structure-aware chunking: split by headings/sections, keep code blocks intact, treat tables as units
Consider hierarchical retrieval: approaches like RAPTOR retrieve both summaries and detailed leaves, helping when information is scattered across long docs. (arXiv:2401.18059)
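Structure-aware chunking can start very simply. This sketch assumes the source documents are Markdown: it splits at heading lines but never inside a fenced code block, so code stays intact. `split_by_headings` is an illustrative helper; tables and other formats would need their own rules.

```python
FENCE = "`" * 3  # Markdown code-fence marker

def split_by_headings(markdown: str) -> list[str]:
    """Split Markdown into sections at headings, keeping code blocks whole."""
    sections: list[str] = []
    current: list[str] = []
    in_code = False
    for line in markdown.splitlines():
        if line.startswith(FENCE):
            in_code = not in_code
        # A '#' line only starts a new section when we are outside code.
        if line.startswith("#") and not in_code and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())
    return [s for s in sections if s]

doc = "\n".join([
    "# Intro", "Hello",
    "## Setup", FENCE, "# not a heading, just a comment", FENCE,
    "## Usage", "Run it",
])
for section in split_by_headings(doc):
    print(repr(section))
```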
Evaluation: how to know your RAG is improving
A common mistake is evaluating RAG only by “does the final answer look good?” That hides where things are actually failing. In practice, you want to measure the pipeline in layers, scoring each of these separately:
retrieval quality (did we fetch relevant evidence?)
groundedness (did the answer stick to evidence?)
answer quality (is it helpful, complete, correct?)
RAGAS (Retrieval Augmented Generation Assessment) proposes reference-free metrics that score retrieval + faithfulness + answer quality without requiring gold labels for every query. (arXiv:2309.15217)
BEIR (Benchmarking Information Retrieval) is a widely used benchmark suite for evaluating retrieval models across many different datasets and domains - not just one curated task. The big lesson from BEIR is that retrieval models that look great on a single dataset often fail when you switch domains, query styles, or document types. That’s exactly what happens in real RAG apps: users ask messy questions, and your corpus has its own vocabulary and structure. If your RAG answer quality is inconsistent, a BEIR-style evaluation across varied data is a good diagnostic: your retriever might be “overfit” to one style of data and not robust to your actual workload. (arXiv:2104.08663)
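For the retrieval-quality layer, two classic metrics are quick to compute once you have even a small labeled set. Unlike the reference-free RAGAS metrics, this sketch assumes you know which chunk IDs are relevant for each test question; the IDs below are made up.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0

retrieved = ["C7", "C2", "C9", "C4"]   # what the retriever returned, best first
relevant = {"C2", "C4"}                # hand-labeled ground truth
print(recall_at_k(retrieved, relevant, k=3))   # 0.5
print(reciprocal_rank(retrieved, relevant))    # 0.5
```

Averaging these over the test set before and after a change (new chunking, new embedding model, added reranker) shows whether retrieval itself improved, independent of the generator.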
A practical “starter” RAG recipe (works surprisingly well)
Chunk at ~500 tokens with ~15% overlap
Use a strong general embedding model (E5-class) (arXiv:2212.03533)
Retrieve top-30, then rerank down to top-5
Prompt with:
instruction to only use provided context
citations per sentence or per paragraph
refusal when evidence is missing
Evaluate with a small test set + RAGAS metrics (arXiv:2309.15217)
Add multi-query fusion when recall is the problem (arXiv:2402.03367)
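The numeric defaults from this recipe fit in a small config object, which makes them easy to version and tweak. The class and field names are illustrative assumptions; only the values come from the list above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StarterRagConfig:
    """Starting-point settings from the recipe; tune them on your own data."""
    chunk_tokens: int = 500       # ~500 tokens per chunk
    overlap_ratio: float = 0.15   # ~15% overlap between neighbors
    retrieve_top_n: int = 30      # wide first-stage retrieval
    rerank_top_k: int = 5         # narrow set passed to the prompt

    @property
    def overlap_tokens(self) -> int:
        """Overlap expressed in tokens rather than as a ratio."""
        return round(self.chunk_tokens * self.overlap_ratio)

config = StarterRagConfig()
print(config, config.overlap_tokens)
```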
Closing thought
RAG isn’t one trick - it’s a system. Your quality comes from how well retrieval, chunking, reranking, prompting, and evaluation work together.
Reading list: important RAG papers
Foundations
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks: https://arxiv.org/abs/2005.11401
REALM: Retrieval-Augmented Language Model Pre-Training: https://arxiv.org/abs/2002.08909
Dense Passage Retrieval for Open-Domain Question Answering: https://arxiv.org/abs/2004.04906
Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering: https://arxiv.org/abs/2007.01282
Retrieval quality and candidate generation
BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models: https://arxiv.org/abs/2104.08663
Unsupervised Dense Information Retrieval with Contrastive Learning: https://arxiv.org/abs/2112.09118
Text Embeddings by Weakly-Supervised Contrastive Pre-training: https://arxiv.org/abs/2212.03533
Reranking and late interaction
ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT: https://arxiv.org/abs/2004.12832
ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction: https://arxiv.org/abs/2112.01488
SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking: https://arxiv.org/abs/2107.05720
Query rewriting, fusion, robustness
Precise Zero-Shot Dense Retrieval without Relevance Labels: https://arxiv.org/abs/2212.10496
RAG-Fusion: a New Take on Retrieval-Augmented Generation: https://arxiv.org/abs/2402.03367
Adaptive and agentic RAG
Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection: https://arxiv.org/abs/2310.11511
Long-context corpora and structured retrieval
RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval: https://arxiv.org/abs/2401.18059
From Local to Global: A Graph RAG Approach to Query-Focused Summarization: https://arxiv.org/abs/2404.16130
Evaluation
RAGAS: Automated Evaluation of Retrieval Augmented Generation: https://arxiv.org/abs/2309.15217
KILT: a Benchmark for Knowledge Intensive Language Tasks: https://arxiv.org/abs/2009.02252
