<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Sahoo Labs]]></title><description><![CDATA[Sahoo Labs explores reliable agentic AI: tools, planning, memory, and evaluation - sharing practical lessons and experiments beyond day-to-day work.]]></description><link>https://blogs.sahoo-labs.dev</link><generator>RSS for Node</generator><lastBuildDate>Sun, 31 May 2026 17:45:50 GMT</lastBuildDate><atom:link href="https://blogs.sahoo-labs.dev/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[RepoRAG: Repo-Native RAG for Local Folders & Remote Git Repos]]></title><description><![CDATA[When I am dropped into a new codebase, I don’t want “general advice.” I want specific, grounded answers:

Where is the main LLM pipeline entry point?

How does the end-to-end inference flow work?

Where is the retrieval feature implemented?


The problem...]]></description><link>https://blogs.sahoo-labs.dev/reporag-repo-native-rag-for-local-folders-and-remote-git-repos</link><guid isPermaLink="true">https://blogs.sahoo-labs.dev/reporag-repo-native-rag-for-local-folders-and-remote-git-repos</guid><category><![CDATA[generative ai]]></category><category><![CDATA[llm]]></category><category><![CDATA[RAG ]]></category><category><![CDATA[openai]]></category><dc:creator><![CDATA[Himanshu Shekhar Sahoo]]></dc:creator><pubDate>Mon, 02 Feb 2026 18:09:53 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1770055824763/5bded5aa-a172-41ac-b4dd-a097025936f8.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>When I am dropped into a new codebase, I don’t want “general advice.” I want specific, grounded answers:</p>
<ol>
<li><p>Where is the main LLM pipeline entry point?</p>
</li>
<li><p>How does the end-to-end inference flow work?</p>
</li>
<li><p>Where is the retrieval feature implemented?</p>
</li>
</ol>
<p>The problem is that codebases are large, fast-moving, and full of local conventions. Even good LLMs can hallucinate or give high-level answers unless you force them to retrieve evidence.</p>
<p><strong>RepoRAG</strong> is my attempt at a pragmatic middle ground: a small CLI that builds a <strong>FAISS vector index</strong> from a local folder or a remote Git repo, then answers questions with <strong>retrieval-augmented generation</strong> and <strong>file-path citations</strong> so you can verify quickly.</p>
<p>This post explains the design, the tradeoffs, and what I learned building a repo-native RAG tool that’s meant to be practical, not magical.</p>
<h2 id="heading-what-reporag-is-and-what-it-isnt"><strong>What RepoRAG is (and what it isn’t)</strong></h2>
<p>RepoRAG is a CLI that:</p>
<ol>
<li><p>Ingests a local folder into a FAISS index</p>
</li>
<li><p>Ingests a remote Git repo (clone → index → delete clone)</p>
</li>
<li><p>Answers questions against the saved index with grounded answers + source file paths</p>
</li>
<li><p>Uses file filters + smaller embedding models to speed up large repo ingest</p>
</li>
</ol>
<p>It’s not trying to be a full agent framework, a PR reviewer, or an IDE plugin (yet). It’s intentionally a simple workflow you can run in a terminal when you need answers that are traceable back to the repo.</p>
<h2 id="heading-why-file-path-citations-are-the-killer-feature">Why file-path citations are the “killer feature”</h2>
<p>A lot of RAG demos show citations as chunks of text. For code, that’s not enough. What I actually want is:</p>
<ol>
<li><p>a short answer, and</p>
</li>
<li><p>a shortlist of files that I should open next.</p>
</li>
</ol>
<p>RepoRAG is optimized around that: answers are grounded and include source file paths (so I can jump into the code immediately). This changes the UX from “trust the model” to “use the model to triage where to look.”</p>
<h2 id="heading-high-level-architecture">High-Level Architecture</h2>
<p>At a high level, RepoRAG is the standard RAG loop applied to code:</p>
<ol>
<li><p><strong>Collect documents</strong> (files from a local folder or a cloned repo)</p>
</li>
<li><p><strong>Chunk</strong> documents</p>
</li>
<li><p><strong>Embed</strong> chunks</p>
</li>
<li><p>Store in <strong>FAISS</strong></p>
</li>
<li><p>On question: <strong>retrieve top-k chunks</strong></p>
</li>
<li><p><strong>Generate</strong> an answer constrained by retrieved context</p>
</li>
<li><p>Return answer + <strong>file-path citations</strong></p>
</li>
</ol>
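<p>The loop above can be sketched in a few lines of plain Python. This is a minimal illustration, not RepoRAG’s implementation: the bag-of-words <code>embed</code> function is a toy stand-in for a real embedding model, a plain list stands in for FAISS, the generation step is omitted, and every function name here is hypothetical.</p>
<pre><code class="lang-python">import math
import re
from collections import Counter

def embed(text):
    # Toy stand-in for an embedding model: a bag-of-words count vector.
    return Counter(re.findall(r"[a-z_]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_index(files, chunk_lines=20):
    # files maps a relative path to its text. Every chunk keeps its source path.
    index = []
    for path, text in files.items():
        lines = text.splitlines()
        for start in range(0, len(lines), chunk_lines):
            chunk = "\n".join(lines[start:start + chunk_lines])
            index.append((path, chunk, embed(chunk)))
    return index

def retrieve(index, question, k=4):
    # Rank every chunk by similarity to the question and keep the top k.
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[2]), reverse=True)
    return ranked[:k]

def cited_paths(hits):
    # Deduplicate source paths while preserving rank order.
    return list(dict.fromkeys(path for path, _, _ in hits))
</code></pre>
<p>The point of the sketch is the shape of the data: each chunk carries its source path from ingest to answer, which is what makes file-path citations cheap to return.</p>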
<h2 id="heading-repo-structure-and-why-it-matters">Repo structure and why it matters</h2>
<p>RepoRAG keeps the layout intentionally small and readable. The project structure (as documented) separates the CLI entry point, configuration, ingestion, retrieval, and remote handling.</p>
<ul>
<li><p><code>rag_cli.py</code> – CLI entry point</p>
</li>
<li><p><code>reporag/</code> – config, loaders, ingest, rag, remote</p>
</li>
<li><p><code>.reporag_index/</code> – this is the vector index, created after ingest</p>
</li>
</ul>
<h2 id="heading-workflow-local-ingest-remote-ingest-and-ask">Workflow: local ingest, remote ingest, and ask</h2>
<p>RepoRAG exposes three core user actions:</p>
<h3 id="heading-1-index-a-local-folder">1. Index a local folder</h3>
<pre><code class="lang-python">python3 rag_cli.py ingest .
</code></pre>
<p>This is the “index whatever is here” workflow.</p>
<h3 id="heading-2-index-a-remote-repo-clone-index-delete-clone">2. Index a remote repo (clone → index → delete clone)</h3>
<pre><code class="lang-python">python3 rag_cli.py ingest-remote https://github.com/langgenius/dify.git --depth <span class="hljs-number">1</span>
</code></pre>
<p>This is useful when you want to interrogate a repo you haven’t checked out. RepoRAG’s documented behavior is to delete the cloned repo directory after indexing. <code>--depth 1</code> requests a <strong>shallow clone</strong>: only the <strong>latest commit</strong> (roughly, the newest snapshot) is fetched instead of the full history, so cloning is faster and the download is smaller.</p>
<h3 id="heading-3-ask-questions-with-k">3. Ask questions (with <code>k</code>)</h3>
<pre><code class="lang-python">python3 rag_cli.py ask <span class="hljs-string">"Where is the LLM entry point?"</span>
python3 rag_cli.py ask <span class="hljs-string">"How is authentication handled end-to-end?"</span> --k <span class="hljs-number">10</span>
</code></pre>
<p><code>--k 10</code> is the <strong>number of retrieved chunks</strong> RepoRAG pulls from the FAISS index to use as context for answering. The <code>--k</code> control matters: it’s the simplest knob for trading off recall vs noise.</p>
<h2 id="heading-provider-abstraction-openai-and-local-options">Provider abstraction: OpenAI and local options</h2>
<p>A core design goal is portability: the examples use OpenAI, but the LLM and embeddings provider can be swapped via <code>reporag/config.py</code> or environment variables.</p>
<p>An example <code>.env</code> using OpenAI includes:</p>
<ul>
<li><p><code>LLM_PROVIDER=openai</code></p>
</li>
<li><p><code>OPENAI_API_KEY=...</code></p>
</li>
<li><p><code>OPENAI_EMBED_MODEL=text-embedding-3-small</code></p>
</li>
<li><p><code>OPENAI_EMBED_DIM=768</code> (optional)</p>
</li>
<li><p><code>OPENAI_CHAT_MODEL=gpt-4o-mini</code></p>
</li>
<li><p><code>INDEX_DIR=.reporag_index</code></p>
</li>
</ul>
<p>Even if you run locally (Ollama, etc.), this split stays the same conceptually: you want an embeddings model that’s fast and a chat model that’s good at “answer from evidence.”</p>
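<p>To make the split concrete, here is one way such settings could be read from the environment. It is a sketch under assumptions, not the contents of <code>reporag/config.py</code>: the <code>Settings</code> class, the <code>load_settings</code> function, and the fallback defaults are hypothetical, and only the variable names come from the example <code>.env</code> above.</p>
<pre><code class="lang-python">import os
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Settings:
    provider: str
    embed_model: str
    embed_dim: Optional[int]  # None when OPENAI_EMBED_DIM is unset
    chat_model: str
    index_dir: str

def load_settings(env=None):
    # Accept any mapping so the loader can be exercised without touching os.environ.
    env = os.environ if env is None else env
    raw_dim = env.get("OPENAI_EMBED_DIM", "")
    return Settings(
        provider=env.get("LLM_PROVIDER", "openai"),
        embed_model=env.get("OPENAI_EMBED_MODEL", "text-embedding-3-small"),
        embed_dim=int(raw_dim) if raw_dim else None,
        chat_model=env.get("OPENAI_CHAT_MODEL", "gpt-4o-mini"),
        index_dir=env.get("INDEX_DIR", ".reporag_index"),
    )
</code></pre>
<p>Keeping the embedding settings and the chat settings as separate fields mirrors the conceptual split: either side can change without touching the other.</p>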
<h2 id="heading-the-real-engineering-decisions-in-reporag">The real engineering decisions in RepoRAG</h2>
<p>Most repo-RAG tools live or die on boring details. Here are the big ones.</p>
<h3 id="heading-1-chunking-strategy-code-isnt-prose">1. Chunking strategy: code isn’t prose</h3>
<p>Chunking code is tricky because semantics are often non-local:</p>
<ul>
<li><p>A function signature in one file, implementation elsewhere</p>
</li>
<li><p>A config key defined in a <code>.yaml</code> and read in Python</p>
</li>
<li><p>Auth flows spanning middleware + routes + handlers</p>
</li>
</ul>
<p>A naive chunking strategy can either:</p>
<ul>
<li><p>fragment important context, or</p>
</li>
<li><p>create huge chunks that embed slowly and retrieve noisily</p>
</li>
</ul>
<p>RepoRAG explicitly calls out chunk size/overlap as a performance lever (increase chunk size / reduce overlap to reduce total chunks). That lever isn’t only about speed: chunking also affects <em>retrieval accuracy</em>. Bigger chunks are more likely to contain the full idea, but they can also increase irrelevant matches.</p>
<p><strong>Practical heuristic:</strong> start with conservative chunking (smaller, more overlap) for correctness; once it works, tune chunking for speed.</p>
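<p>The size/overlap tradeoff is easy to see with a small line-based chunker. This is a sketch for illustration; <code>chunk_lines</code> and its defaults are hypothetical, and RepoRAG’s actual splitter may work on characters or tokens instead of lines.</p>
<pre><code class="lang-python">def chunk_lines(text, size=40, overlap=10):
    """Split text into windows of `size` lines that share `overlap` lines."""
    if overlap not in range(size):
        raise ValueError("overlap must be non-negative and smaller than size")
    lines = text.splitlines()
    if not lines:
        return []
    step = size - overlap
    # A window starting at or beyond this point would only repeat the previous tail.
    stop = max(len(lines) - overlap, 1)
    return ["\n".join(lines[s:s + size]) for s in range(0, stop, step)]
</code></pre>
<p>On a 100-line file, <code>size=20, overlap=10</code> yields 9 chunks, while <code>size=40, overlap=10</code> yields 3. Fewer chunks means fewer embedding calls, at the cost of coarser retrieval units.</p>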
<h3 id="heading-2-retrieval-k-the-simplest-knob-that-matters">2. Retrieval <code>k</code>: the simplest knob that matters</h3>
<p>RepoRAG supports increasing <code>k</code> (example shows <code>--k 10</code>).</p>
<ul>
<li><p>Low <code>k</code> → cleaner context, but higher chance you miss the key file</p>
</li>
<li><p>High <code>k</code> → better recall, but more distractions (and longer prompts)</p>
</li>
</ul>
<p>In practice, I like:</p>
<ul>
<li><p><code>k=4–6</code> for “where is X implemented?”</p>
</li>
<li><p><code>k=8–12</code> for “explain end-to-end flow”</p>
</li>
</ul>
<h3 id="heading-3-index-location-and-reproducibility">3. Index location and reproducibility</h3>
<p>RepoRAG stores the index under a configurable directory (by default <code>INDEX_DIR=.reporag_index</code>).</p>
<p>That is a good default because:</p>
<ul>
<li><p>it keeps the index near the repo</p>
</li>
<li><p>it’s easy to delete and rebuild</p>
</li>
<li><p>it doesn’t pollute global state</p>
</li>
</ul>
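<p><strong>Note:</strong> Rebuild the index whenever the embedding model or its dimensions change (delete <code>.reporag_index</code> and re-ingest). A dimension mismatch usually fails outright at query time. A different model with the same dimension is the subtle case: retrieval still runs, but query vectors and stored vectors come from different embedding spaces, so results degrade silently.</p>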
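<p>One way to catch a stale index is to record the embedding settings next to it and compare them before answering. This is a hypothetical guard, not something RepoRAG is documented to do; the <code>manifest.json</code> file name and both function names are assumptions.</p>
<pre><code class="lang-python">import json
from pathlib import Path

MANIFEST = "manifest.json"  # hypothetical file name, not part of RepoRAG

def write_manifest(index_dir, model, dim):
    # Call this at the end of ingest, next to the saved index files.
    path = Path(index_dir)
    path.mkdir(parents=True, exist_ok=True)
    (path / MANIFEST).write_text(json.dumps({"model": model, "dim": dim}))

def index_is_stale(index_dir, model, dim):
    # A missing manifest, or one with different settings, means re-ingest.
    manifest = Path(index_dir) / MANIFEST
    if not manifest.exists():
        return True
    return json.loads(manifest.read_text()) != {"model": model, "dim": dim}
</code></pre>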
<h3 id="heading-4-large-repo-runtime-performance-embeddings-dominate">4. Large repo runtime performance: embeddings dominate</h3>
<p>For large repos, embedding time dominates ingestion.</p>
<p>Suggested optimizations include:</p>
<ul>
<li><p>Use a smaller embedding model (example: <code>text-embedding-3-small</code>)</p>
</li>
<li><p>Optionally reduce embedding dimensions (example: <code>OPENAI_EMBED_DIM=768</code>)</p>
</li>
<li><p>Restrict indexing to key folders (example list: <code>["api","web","docs"]</code>)</p>
</li>
<li><p>Tune chunk size / overlap to reduce chunk count</p>
</li>
</ul>
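<p>The folder restriction can be sketched as a simple predicate applied to each file before it is chunked. This is an illustration only: <code>keep_file</code> is a hypothetical helper, the folder list reuses the example above, and the extension list is an assumption.</p>
<pre><code class="lang-python">from pathlib import PurePosixPath

def keep_file(rel_path, include_dirs=("api", "web", "docs"), extensions=(".py", ".ts", ".md")):
    # rel_path is relative to the repo root and uses forward slashes.
    path = PurePosixPath(rel_path)
    if not path.parts:
        return False
    return path.parts[0] in include_dirs and path.suffix in extensions
</code></pre>
<p>Filtering before embedding is the cheapest optimization on the list, but it is also the one most likely to cause “the index doesn’t contain it” failures later, so keep the include list visible and easy to change.</p>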
<h2 id="heading-security-your-index-is-sensitive">Security: your index is sensitive</h2>
<p>RepoRAG includes a security note that’s easy to ignore but extremely important:</p>
<ul>
<li><p>The <code>ingest-remote</code> command deletes the cloned repo directory after indexing</p>
</li>
<li><p><strong>but</strong> the FAISS index contains embedded chunks of text used for retrieval</p>
</li>
<li><p>treat the index as sensitive if the repo is private</p>
</li>
</ul>
<p>In practice, this implies:</p>
<ul>
<li><p>Don’t casually upload <code>.reporag_index</code> to public storage</p>
</li>
<li><p>Don’t email the index around</p>
</li>
<li><p>Consider excluding it from git</p>
</li>
<li><p>Consider encrypting at rest if you store it somewhere shared</p>
</li>
</ul>
<h2 id="heading-what-good-looks-like-grounded-answers-fast-verification">What “good” looks like: grounded answers + fast verification</h2>
<p>When RepoRAG is working well, the interaction looks like:</p>
<ol>
<li><p>I ask a question</p>
</li>
<li><p>RepoRAG returns a short explanation</p>
</li>
<li><p>It gives me <strong>a few file paths</strong></p>
</li>
<li><p>I open those files and confirm quickly</p>
</li>
</ol>
<p>These file-path citations are important. They turn the LLM from an oracle into a <strong>navigation assistant</strong>.</p>
<h2 id="heading-failure-modes-i-expect-and-how-i-think-about-them">Failure modes I expect (and how I think about them)</h2>
<p>Even without fancy evaluation harnesses, repo-RAG systems fail in predictable ways:</p>
<h3 id="heading-1-the-index-doesnt-contain-it">1. “The index doesn’t contain it”</h3>
<ul>
<li><p>You filtered out the folder that contains the answer (trading coverage for ingestion speed)</p>
</li>
<li><p>The file type wasn’t loaded</p>
</li>
<li><p>The repo changed since ingest</p>
</li>
</ul>
<p><strong>Fix:</strong> re-ingest; relax filters; ensure key folders are included.</p>
<h3 id="heading-2-retrieved-chunks-are-close-but-not-the-answer">2. “Retrieved chunks are close, but not the answer”</h3>
<p>Common with:</p>
<ul>
<li><p>generic naming (<code>utils.py</code>, <code>helpers.ts</code>)</p>
</li>
<li><p>repeated patterns (auth middleware in multiple apps)</p>
</li>
<li><p>generated code</p>
</li>
</ul>
<p><strong>Fix:</strong> increase <code>k</code>; tighten chunking; add metadata signals (future improvement).</p>
<h3 id="heading-3-llm-answers-beyond-the-evidence">3. “LLM answers beyond the evidence”</h3>
<p>This is the classic RAG hallucination problem.</p>
<p><strong>Fix:</strong> prompt contract: <em>answer only from retrieved context; if insufficient, say so and cite what you do have.</em></p>
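<p>A prompt contract like this can be assembled mechanically from the retrieved chunks. The sketch below is one possible shape, not RepoRAG’s actual prompt; <code>build_prompt</code> and the exact wording are assumptions.</p>
<pre><code class="lang-python">def build_prompt(question, hits):
    # hits: list of (file_path, chunk_text) pairs from the retriever.
    context = "\n\n".join(f"[{path}]\n{text}" for path, text in hits)
    return (
        "Answer only from the context below. "
        "If the context is insufficient, say so and cite what you do have.\n"
        "Cite file paths in square brackets.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
</code></pre>
<p>Tagging each chunk with its path inside the prompt gives the model something concrete to cite, and gives you something concrete to check in the answer.</p>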
<h2 id="heading-what-i-would-improve-next-if-i-keep-iterating">What I would improve next (if I keep iterating)</h2>
<p>These are not commitments, just the natural next steps for RepoRAG:</p>
<ol>
<li><p><strong>Evaluation loop</strong>: a small curated Q/A set per repo (“golden questions”) + regression checks</p>
</li>
<li><p><strong>Hybrid retrieval</strong>: BM25 + vectors (code identifiers and exact strings matter a lot)</p>
</li>
<li><p><strong>Better citations</strong>: include file path + line ranges when possible</p>
</li>
<li><p><strong>Incremental indexing</strong>: only re-embed changed files</p>
</li>
<li><p><strong>Tracing</strong>: log retrieval hits, chunk IDs, prompt length, latency (so tuning becomes data-driven)</p>
</li>
</ol>
<h2 id="heading-try-it-yourself">Try it yourself</h2>
<p>RepoRAG is on GitHub:<br /><a target="_blank" href="https://github.com/hisahoo009/RepoRAG">https://github.com/hisahoo009/RepoRAG</a></p>
<h3 id="heading-quick-start-local-folder">Quick start (local folder)</h3>
<pre><code class="lang-python">git clone https://github.com/hisahoo009/RepoRAG.git
cd RepoRAG

python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install -r requirements.txt

cp .env.example .env

python3 rag_cli.py ingest .
python3 rag_cli.py ask <span class="hljs-string">"Where is the backend entry point?"</span>
</code></pre>
<h3 id="heading-index-a-remote-repo-example">Index a remote repo (example)</h3>
<pre><code class="lang-python">python3 rag_cli.py ingest-remote https://github.com/OpenPipe/ART.git --depth <span class="hljs-number">1</span>
python3 rag_cli.py ask <span class="hljs-string">"How do I run ART ?"</span>
</code></pre>
]]></content:encoded></item><item><title><![CDATA[Retrieval-Augmented Generation (RAG): the practical guide to building grounded LLM apps]]></title><description><![CDATA[Large Language Models are great at fluent text - but they’re not great at being consistently correct, up-to-date, or able to cite where an answer came from. Retrieval-Augmented Generation (RAG) is the most common pattern for fixing that: fetch releva...]]></description><link>https://blogs.sahoo-labs.dev/retrieval-augmented-generation-rag-the-practical-guide-to-building-grounded-llm-apps</link><guid isPermaLink="true">https://blogs.sahoo-labs.dev/retrieval-augmented-generation-rag-the-practical-guide-to-building-grounded-llm-apps</guid><category><![CDATA[RAG ]]></category><category><![CDATA[rag chatbot]]></category><category><![CDATA[llm]]></category><category><![CDATA[generative ai]]></category><dc:creator><![CDATA[Himanshu Shekhar Sahoo]]></dc:creator><pubDate>Fri, 30 Jan 2026 23:16:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1769814810322/1069b88c-5c03-4895-a707-f3336900f31d.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Large Language Models are great at fluent text - but they’re not great at being <em>consistently correct</em>, <em>up-to-date</em>, or <em>able to cite where an answer came from</em>. Retrieval-Augmented Generation (RAG) is the most common pattern for fixing that: <strong>fetch relevant knowledge first, then generate an answer grounded in it</strong>.</p>
<p>This post explains:</p>
<ul>
<li><p>what RAG is (and isn’t),</p>
</li>
<li><p>the core building blocks,</p>
</li>
<li><p>the “modern” upgrades that actually determine quality,</p>
</li>
<li><p>and how to evaluate + debug RAG like an engineer.</p>
</li>
</ul>
<h2 id="heading-what-rag-is">What RAG is</h2>
<p><strong>RAG = Retriever + Generator</strong></p>
<p>Instead of forcing the LLM to “remember everything”, you:</p>
<ol>
<li><p><strong>retrieve</strong> the most relevant chunks from your knowledge base (docs, PDFs, wiki, tickets, code, etc.)</p>
</li>
<li><p><strong>inject</strong> those chunks into the prompt</p>
</li>
<li><p><strong>generate</strong> an answer <em>conditioned on that evidence</em> (ideally with citations)</p>
</li>
</ol>
<h3 id="heading-the-canonical-pipeline">The canonical pipeline</h3>
<pre><code class="lang-mermaid">flowchart LR
  Q[User Query] --&gt; R[Retriever]
  R --&gt;|Top-k chunks| C[Context Builder]
  C --&gt; P[Prompt]
  P --&gt; L[LLM]
  L --&gt; A[Answer + Citations]

  classDef step fill:#f5f5f5,stroke:#333,stroke-width:1px,rx:8,ry:8;
  class Q,R,C,P,L,A step;
</code></pre>
<p>This basic idea was formalized in the original RAG paper by Lewis et al. (2020). (<a target="_blank" href="https://arxiv.org/abs/2005.11401">arXiv:2005.11401</a>)</p>
<h2 id="heading-the-4-core-building-blocks">The 4 core building blocks</h2>
<h3 id="heading-1-chunking-how-you-split-knowledge">1. Chunking (how you split knowledge)</h3>
<p>Chunking is the step where you split your source documents into retrieval units (“chunks”) that will be embedded, indexed, and later fed to the LLM as evidence. It’s not just a preprocessing detail - it directly shapes recall, precision, and even whether the model can cite the right passage. Good chunking usually respects natural document structure (headings, paragraphs, code blocks, tables) and aims to keep each chunk semantically coherent so a single retrieved piece can stand on its own.</p>
<p>Chunking controls what retrieval can “see.”</p>
<ul>
<li><p>Too small → missing context</p>
</li>
<li><p>Too big → irrelevant junk + token waste</p>
</li>
<li><p>Common baseline: <strong>300–800 tokens</strong>, with <strong>10–20% overlap</strong>, then iterate.</p>
</li>
</ul>
<h3 id="heading-2-embeddings-how-you-represent-text-for-search">2. Embeddings (how you represent text for search)</h3>
<p>Embeddings are how we turn text into numbers that capture meaning, so we can do “semantic search” instead of pure keyword search. Concretely, an embedding model maps a query and each chunk into vectors, and retrieval becomes “find the chunks whose vectors are closest to the query.” This matters because real users rarely use the same wording as your docs - embeddings help match paraphrases, concepts, and intent (e.g., “cancel subscription” ↔ “terminate plan”). Your RAG quality often lives or dies on whether your embedding model preserves the right signals for your content (definitions, procedures, code, product names, etc.).</p>
<h3 id="heading-3-retrieval-how-you-fetch-candidates">3. Retrieval (how you fetch candidates)</h3>
<p>Two major retrieval families show up in almost every RAG system, and they behave differently because they’re matching <strong>different signals</strong>:</p>
<p><strong>a. Sparse retrieval (BM25 / keyword search)</strong></p>
<p><strong>BM25</strong> is a classic information-retrieval scoring method used by many search engines. It ranks documents mainly by <strong>keyword overlap</strong> (with smart weighting so rare/important words count more than common ones like “the” or “and”). It’s great when the user’s query contains <strong>exact tokens</strong> you must match:</p>
<ul>
<li><p>IDs (<code>INC-12491</code>)</p>
</li>
<li><p>error codes (<code>CUDA_ERROR_ILLEGAL_ADDRESS</code>)</p>
</li>
<li><p>version numbers (<code>v2.3.1</code>)</p>
</li>
<li><p>names or exact phrases</p>
</li>
</ul>
<p><strong>Easy way to remember:</strong> BM25 is like <em>Ctrl+F on steroids</em> - it finds the pages that literally contain what you typed, and ranks the best matches first.</p>
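<p>For intuition, here is a compact BM25 scorer in plain Python. It is a teaching sketch rather than a production search engine: the tokenizer is deliberately simple, the smoothed IDF is one common variant, and <code>k1=1.5</code>, <code>b=0.75</code> are typical defaults rather than tuned values.</p>
<pre><code class="lang-python">import math
import re
from collections import Counter

def tokenize(text):
    # Keep hyphens and underscores so IDs like INC-12491 stay in one token.
    return re.findall(r"[a-z0-9_-]+", text.lower())

def bm25_scores(query, docs, k1=1.5, b=0.75):
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for t in tokenized for term in set(t))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Rare terms get a larger weight than common ones.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Term frequency saturates, and long documents are penalized.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
</code></pre>
<p>A query such as <code>"INC-12491 login"</code> ranks the ticket that literally contains the ID above a general login guide, because the rare token carries a high IDF weight.</p>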
<p><strong>b. Dense retrieval (vector / embedding search)</strong></p>
<p>Dense retrieval turns text into <strong>embeddings</strong> (vectors) so it can match <strong>meaning</strong>, not exact wording. This helps when users and docs say the same thing differently:</p>
<ul>
<li><p>“cancel subscription” ↔ “terminate plan”</p>
</li>
<li><p>“model is slow” ↔ “high latency inference”</p>
</li>
<li><p>“add login” ↔ “implement authentication”</p>
</li>
</ul>
<p><strong>Easy way to remember:</strong> dense retrieval is like asking, <em>“Which passages talk about the same idea?”</em> even if they don’t share the same words.</p>
<p><strong>c. Why you often want both (hybrid retrieval)</strong></p>
<p>Real user queries mix both:</p>
<ul>
<li><p>“What does error <strong>0x80070005</strong> mean?” (exact token + intent)</p>
</li>
<li><p>“Why did performance drop after upgrading?” (semantic, broad)</p>
</li>
</ul>
<p>So many production RAG systems use <strong>hybrid retrieval</strong>:</p>
<ul>
<li><p>BM25 to catch exact anchors (codes, names, IDs)</p>
</li>
<li><p>Dense retrieval to catch paraphrases and concepts</p>
</li>
<li><p>A reranker to pick the best few from the combined candidates (covered under re-ranking below)</p>
</li>
</ul>
<h3 id="heading-4-generation-how-you-answer-from-evidence">4. Generation (how you answer from evidence)</h3>
<p>Generation is where the LLM turns retrieved chunks into a final response. The goal isn’t just fluent text - it’s <strong>grounded answers</strong>: the model should respond based on the evidence you retrieved, not its own assumptions. Think of the retrieved chunks as “open book notes” the model is allowed to use. If the notes don’t contain the answer, the correct behavior is to say so.</p>
<p>To make generation reliable, add three constraints:</p>
<ul>
<li><p><strong>Citations (chunk IDs):</strong> Require citations like <strong>[C2]</strong> or <strong>[DocA#12]</strong> for each major claim. This keeps answers auditable and discourages hallucinations.</p>
</li>
<li><p><strong>Refusal when evidence is missing:</strong> If the answer isn’t supported by the retrieved context, the model should explicitly say “I don’t know from these sources” and suggest what to retrieve next.</p>
</li>
<li><p><strong>A short “What I used” section:</strong> End with a compact list of the chunks that actually supported the response. This makes debugging easy (“retrieval failed” vs “model ignored evidence”).</p>
</li>
</ul>
<p><strong>Example output format</strong></p>
<ul>
<li><p><strong>Answer:</strong> … <strong>[C1][C4]</strong></p>
</li>
<li><p><strong>Key evidence:</strong></p>
<ul>
<li><p>… <strong>[C1]</strong></p>
</li>
<li><p>… <strong>[C4]</strong></p>
</li>
</ul>
</li>
<li><p><strong>What I used:</strong> C1, C4</p>
</li>
<li><p><strong>If missing:</strong> “Not enough evidence in the provided context to answer.”</p>
</li>
</ul>
<p><strong>Prompt snippet</strong></p>
<ul>
<li><p>“Use <strong>only</strong> the provided context. If unsupported, say you don’t know.”</p>
</li>
<li><p>“Cite chunk IDs like <strong>[C#]</strong> for each key claim.”</p>
</li>
<li><p>“End with <strong>What I used:</strong> list of chunk IDs.”</p>
</li>
</ul>
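<p>Once answers follow this format, the citation rule can be checked in code instead of by eye. The helper below is a hypothetical sketch that assumes chunk IDs of the form <code>[C1]</code>, <code>[C2]</code>, and so on, as in the example output above.</p>
<pre><code class="lang-python">import re

def check_citations(answer, provided_ids):
    # Collect every [C#] marker that appears in the answer text.
    cited = set(re.findall(r"\[(C\d+)\]", answer))
    return {
        "cited": sorted(cited),
        # IDs the model cited that were never in the prompt context.
        "unknown": sorted(cited - set(provided_ids)),
        "has_citations": bool(cited),
    }
</code></pre>
<p>An answer with no citations, or with citations to chunks that were never provided, is a cheap signal that the model went beyond the evidence.</p>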
<h2 id="heading-the-modern-rag-upgrades-that-really-matter">The “modern RAG” upgrades that really matter</h2>
<p>Once you have a basic RAG pipeline working, most quality gains come from a few “systems” upgrades. These aren’t fancy extras - they’re the difference between <em>mostly relevant</em> and <em>consistently useful</em>.</p>
<h3 id="heading-upgrade-a-re-ranking-precision-booster">Upgrade A: Re-ranking (precision booster)</h3>
<p>Vector search is great at pulling a <em>decent</em> candidate set, but the top results often include “near misses.” A <strong>re-ranker</strong> takes the top-N retrieved chunks (e.g., 20–100) and sorts them using a stronger relevance signal, so the final top-K context is much tighter and easier for the LLM to ground on.</p>
<ul>
<li><p>Late-interaction retrieval like <strong>ColBERT</strong> improved precision by scoring query–document similarity at a finer (token) level. (<a target="_blank" href="https://arxiv.org/abs/2004.12832">arXiv:2004.12832</a>)</p>
</li>
<li><p><strong>ColBERTv2</strong> made late interaction more practical by reducing the storage footprint while improving quality. (<a target="_blank" href="https://arxiv.org/abs/2112.01488">arXiv:2112.01488</a>)</p>
</li>
<li><p>In practice, lightweight general rerankers (e.g., <a target="_blank" href="https://github.com/FlagOpen/FlagEmbedding"><strong>BGE rerankers</strong></a>) are often the easiest win because they’re drop-in and noticeably improve top-k relevance.</p>
</li>
</ul>
<p><strong>Key takeaway:</strong> <em>Retrieve wide → rerank narrow → prompt only the best.</em></p>
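<p>The two-stage pattern is independent of which models do the scoring. In this sketch both scorers are toy stand-ins (word overlap, and overlap plus an exact-phrase bonus); in a real system the first stage would be vector or BM25 search and the second a cross-encoder reranker. All function names are hypothetical.</p>
<pre><code class="lang-python">def overlap(query, text):
    # Toy first-stage scorer: number of shared words.
    return len(set(query.lower().split()).intersection(text.lower().split()))

def phrase_bonus(query, text):
    # Toy reranker: shared words plus a bonus for an exact phrase match.
    return overlap(query, text) + (2 if query.lower() in text.lower() else 0)

def retrieve_then_rerank(query, candidates, first_stage, reranker, wide_n=30, final_k=5):
    # Stage 1: cheap scorer over every candidate, keep a wide pool.
    pool = sorted(candidates, key=lambda c: first_stage(query, c), reverse=True)[:wide_n]
    # Stage 2: stronger (slower) scorer over the pool only.
    return sorted(pool, key=lambda c: reranker(query, c), reverse=True)[:final_k]
</code></pre>
<p>The expensive scorer only ever sees <code>wide_n</code> candidates, which is why reranking stays affordable even on large corpora.</p>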
<h3 id="heading-upgrade-b-multi-query-retrieval-coverage-booster">Upgrade B: Multi-query retrieval (coverage booster)</h3>
<p>Single queries are brittle: users omit keywords, use unusual phrasing, or ask multi-part questions. Multi-query retrieval generates a few query variations (rephrasings, sub-questions), retrieves for each, then <strong>fuses</strong> results into one stronger candidate pool.</p>
<ul>
<li><strong>RAG-Fusion</strong> is a popular approach: generate multiple queries, retrieve for each, then merge rankings using <strong>Reciprocal Rank Fusion (RRF)</strong>. (<a target="_blank" href="https://arxiv.org/abs/2402.03367">arXiv:2402.03367</a>)</li>
</ul>
<p><strong>Easy way to remember:</strong> <em>Ask the search engine the same question 5 different ways, then combine the best hits.</em></p>
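<p>Reciprocal Rank Fusion itself is only a few lines: each document earns <code>1 / (k + rank)</code> from every ranked list it appears in, and the totals decide the fused order. A minimal sketch, assuming each query variant has already produced a ranked list of document IDs; <code>k=60</code> is a commonly used constant, not a tuned value.</p>
<pre><code class="lang-python">from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    # rankings: list of ranked lists of document IDs, best first.
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)
</code></pre>
<p>Because RRF uses ranks rather than raw scores, it can merge result lists from retrievers whose scores are not comparable, which is also why it works for BM25 + vector hybrids.</p>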
<h3 id="heading-upgrade-c-retrieve-only-when-needed-reduce-noise-cost">Upgrade C: “Retrieve only when needed” (reduce noise + cost)</h3>
<p>Not every question benefits from retrieval (e.g., “Explain RAG” or “Rewrite this paragraph”). Always retrieving can add irrelevant context, increase latency, and sometimes even reduce answer quality. In cases like these, use <strong>conditional retrieval</strong>: only fetch evidence when the question actually depends on external knowledge.</p>
<ul>
<li><strong>Self-RAG</strong> trains the model to decide <em>when</em> to retrieve, judge relevance, and critique whether evidence supports the answer. (<a target="_blank" href="https://arxiv.org/abs/2310.11511">arXiv:2310.11511</a>)</li>
</ul>
<p><strong>Easy way to remember:</strong> <em>If it’s about your private docs → retrieve. If it’s general knowledge / writing help → skip.</em></p>
<h3 id="heading-upgrade-d-hierarchical-structured-retrieval-for-long-docs-global-context">Upgrade D: Hierarchical / structured retrieval for long docs (global context)</h3>
<p>Chunk retrieval is “local”: it’s good at finding a paragraph, but long documents and large corpora often require <strong>global understanding</strong> (themes, policies, cross-document synthesis). Hierarchical/structured retrieval adds a higher-level index so the system can retrieve both “big picture” summaries and drill-down details.</p>
<ul>
<li><p><strong>RAPTOR</strong> builds a tree of summaries and retrieves at multiple abstraction levels (high-level + detailed). (<a target="_blank" href="https://arxiv.org/abs/2401.18059">arXiv:2401.18059</a>)</p>
</li>
<li><p><strong>GraphRAG</strong> builds an entity/relationship graph and uses community summaries to answer broad corpus-wide questions. (<a target="_blank" href="https://arxiv.org/abs/2404.16130">arXiv:2404.16130</a>)</p>
</li>
</ul>
<p><strong>Easy way to remember:</strong> <em>Don’t just retrieve paragraphs - retrieve the map of the corpus too.</em></p>
<h2 id="heading-failure-modes-and-how-to-debug-them">Failure modes (and how to debug them)</h2>
<p>Even strong RAG systems fail in predictable ways. The fastest way to improve is to identify <em>which stage</em> broke (retrieval, ranking, context packing, or generation) and apply a targeted fix.</p>
<h3 id="heading-1-good-retrieval-bad-answer-the-model-ignores-evidence">1. Good retrieval, bad answer (the model ignores evidence)</h3>
<p><strong>What it looks like:</strong> the retrieved chunks clearly contain the answer, but the model responds with something generic, incomplete, or incorrect.</p>
<p><strong>Common causes:</strong></p>
<ul>
<li><p>the prompt doesn’t strongly enforce “use only context”</p>
</li>
<li><p>too much context noise buries the relevant lines</p>
</li>
<li><p>the truly relevant chunk is ranked low and gets truncated out</p>
</li>
</ul>
<p><strong>Fixes:</strong></p>
<ul>
<li><p><strong>Constrain the prompt:</strong> “Answer using ONLY the provided context. If it’s not there, say you don’t know.”</p>
</li>
<li><p><strong>Shorten / clean the context:</strong> reduce top-K, remove duplicates, or use tighter chunking so evidence is easier to spot</p>
</li>
<li><p><strong>Add a reranker:</strong> retrieve top-N broadly, rerank, then keep only top-K for generation</p>
</li>
</ul>
<h3 id="heading-2-bad-retrieval-confident-answer-hallucination">2. Bad retrieval, confident answer (hallucination)</h3>
<p><strong>What it looks like:</strong> retrieved chunks don’t support the answer (or are irrelevant), but the model answers confidently anyway.</p>
<p><strong>Fixes:</strong></p>
<ul>
<li><p><strong>Score thresholding:</strong> if retrieval scores are low (or relevance is uncertain), return “I don’t know from these sources” instead of guessing</p>
</li>
<li><p><strong>Require citations for claims:</strong> force the model to cite chunk IDs per claim; uncited claims are treated as invalid</p>
</li>
<li><p><strong>Use query rewriting / HyDE:</strong> rewrite the query into something more “retrieval-friendly,” especially when the user query is vague or mismatched to doc phrasing</p>
</li>
</ul>
<p><strong>HyDE in one sentence:</strong> generate a hypothetical “ideal” answer document, embed that, then retrieve real documents near it in embedding space. (<a target="_blank" href="https://arxiv.org/abs/2212.10496">arXiv:2212.10496</a>)</p>
<h3 id="heading-3-chunking-mismatch-the-answer-is-split-across-chunks">3. Chunking mismatch (the answer is split across chunks)</h3>
<p><strong>What it looks like:</strong> no single chunk contains enough evidence, because key details are split across boundaries (e.g., definition in one chunk, conditions in the next).</p>
<p><strong>Fixes:</strong></p>
<ul>
<li><p><strong>Increase overlap</strong> (or slightly increase chunk size) so related details stay together</p>
</li>
<li><p><strong>Use structure-aware chunking:</strong> split by headings/sections, keep code blocks intact, treat tables as units</p>
</li>
<li><p><strong>Consider hierarchical retrieval:</strong> approaches like RAPTOR retrieve both summaries and detailed leaves, helping when information is scattered across long docs. (<a target="_blank" href="https://arxiv.org/abs/2401.18059">arXiv:2401.18059</a>)</p>
</li>
</ul>
<h2 id="heading-evaluation-how-to-know-your-rag-is-improving">Evaluation: how to know your RAG is improving</h2>
<p>A common mistake is evaluating RAG only by “does the final answer look good?” That hides where things are actually failing. In practice, measure the pipeline in layers:</p>
<ul>
<li><p><strong>retrieval quality</strong> (did we fetch relevant evidence?)</p>
</li>
<li><p><strong>groundedness</strong> (did the answer stick to evidence?)</p>
</li>
<li><p><strong>answer quality</strong> (is it helpful, complete, correct?)</p>
</li>
</ul>
<p><strong>RAGAS</strong> (Retrieval Augmented Generation Assessment) proposes reference-free metrics that score retrieval, faithfulness, and answer quality without requiring gold labels for every query. (<a target="_blank" href="https://arxiv.org/abs/2309.15217">arXiv:2309.15217</a>)</p>
<p><strong>BEIR</strong> (Benchmarking Information Retrieval) is a widely used benchmark suite for evaluating retrieval models across <strong>many different datasets and domains</strong> - not just one curated task. The big lesson from BEIR is that retrieval models that look great on a single dataset often fail when you switch domains, query styles, or document types. That’s exactly what happens in real RAG apps: users ask messy questions, and your corpus has its own vocabulary and structure. If your RAG answer quality is inconsistent, a BEIR-style cross-domain evaluation is a useful diagnostic: your retriever might be “overfit” to one style of data and not robust to your actual workload. (<a target="_blank" href="https://arxiv.org/abs/2104.08663">arXiv:2104.08663</a>)</p>
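<p>For the retrieval layer specifically, a small hand-labeled test set goes a long way. The sketch below computes recall@k and mean reciprocal rank; unlike the reference-free RAGAS metrics, it assumes you have labeled which chunk IDs are relevant for each test query. The function names and data layout are illustrative.</p>

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant chunk IDs that appear in the top-k results."""
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0
    return len(set(retrieved[:k]) & relevant_set) / len(relevant_set)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    relevant_set = set(relevant)
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant_set:
            return 1.0 / rank
    return 0.0


def evaluate(runs, k=5):
    """runs: list of (retrieved_ids, relevant_ids) pairs, one per test query."""
    if not runs:
        raise ValueError("need at least one test query")
    n = len(runs)
    return {
        f"recall@{k}": sum(recall_at_k(r, g, k) for r, g in runs) / n,
        "mrr": sum(reciprocal_rank(r, g) for r, g in runs) / n,
    }
```

<p>If recall@k is low, no amount of prompt tuning will help, because the evidence never reaches the model. If recall@k is high but answers are still wrong, the problem is more likely in reranking, prompting, or generation.</p>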
<h2 id="heading-a-practical-starter-rag-recipe-works-surprisingly-well">A practical “starter” RAG recipe (works surprisingly well)</h2>
<ol>
<li><p>Chunk at ~500 tokens with ~15% overlap</p>
</li>
<li><p>Use a strong general embedding model (E5-class) (<a target="_blank" href="https://arxiv.org/abs/2212.03533">arXiv:2212.03533</a>)</p>
</li>
<li><p>Retrieve top-30, then rerank down to top-5</p>
</li>
<li><p>Prompt with:</p>
<ul>
<li><p>instruction to only use provided context</p>
</li>
<li><p>citations per sentence or per paragraph</p>
</li>
<li><p>refusal when evidence is missing</p>
</li>
</ul>
</li>
<li><p>Evaluate with a small test set + RAGAS metrics (<a target="_blank" href="https://arxiv.org/abs/2309.15217">arXiv:2309.15217</a>)</p>
</li>
<li><p>Add multi-query fusion when recall is the problem (<a target="_blank" href="https://arxiv.org/abs/2402.03367">arXiv:2402.03367</a>)</p>
</li>
</ol>
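<p>Steps 3 and 6 of the recipe can be sketched as follows. Here <code>retrieve</code> and <code>rerank_score</code> are hypothetical callables standing in for your vector search and reranker, and the fusion step uses reciprocal rank fusion (RRF), the scheme RAG-Fusion applies to multi-query results, with the commonly used constant <code>k=60</code>. Treat this as a sketch under those assumptions rather than a reference implementation.</p>

```python
from collections import defaultdict


def retrieve_then_rerank(query, retrieve, rerank_score, fetch_k=30, final_k=5):
    """Fetch a wide candidate set, then keep the best few by reranker score."""
    candidates = retrieve(query, fetch_k)
    ranked = sorted(candidates, key=lambda c: rerank_score(query, c), reverse=True)
    return ranked[:final_k]


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs into one list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    so documents ranked well by several query variants rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by doc ID for stable output.
    return sorted(scores, key=lambda d: (-scores[d], d))
```

<p>For multi-query fusion, generate a few rewrites of the user query, run retrieval once per rewrite, and pass the resulting ranked lists to <code>reciprocal_rank_fusion</code> before reranking.</p>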
<h2 id="heading-closing-thought">Closing thought</h2>
<p>RAG isn’t one trick - it’s a <strong>system</strong>. Your quality comes from how well retrieval, chunking, reranking, prompting, and evaluation work together.</p>
<h2 id="heading-reading-list-important-rag-papers">Reading list: important RAG papers</h2>
<h3 id="heading-foundations">Foundations</h3>
<ol>
<li><p><strong>Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks</strong>: <a target="_blank" href="https://arxiv.org/abs/2005.11401?utm_source=chatgpt.com">https://arxiv.org/abs/2005.11401</a></p>
</li>
<li><p><strong>REALM: Retrieval-Augmented Language Model Pre-Training</strong>: <a target="_blank" href="https://arxiv.org/abs/2002.08909?utm_source=chatgpt.com">https://arxiv.org/abs/2002.08909</a></p>
</li>
<li><p><strong>Dense Passage Retrieval for Open-Domain Question Answering</strong>: <a target="_blank" href="https://arxiv.org/abs/2004.04906?utm_source=chatgpt.com">https://arxiv.org/abs/2004.04906</a></p>
</li>
<li><p><strong>Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering</strong>: <a target="_blank" href="https://arxiv.org/abs/2007.01282?utm_source=chatgpt.com">https://arxiv.org/abs/2007.01282</a></p>
</li>
</ol>
<h3 id="heading-retrieval-quality-and-candidate-generation">Retrieval quality and candidate generation</h3>
<ol start="5">
<li><p><strong>BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models</strong>: <a target="_blank" href="https://arxiv.org/abs/2104.08663?utm_source=chatgpt.com">https://arxiv.org/abs/2104.08663</a></p>
</li>
<li><p><strong>Unsupervised Dense Information Retrieval with Contrastive Learning</strong>: <a target="_blank" href="https://arxiv.org/abs/2112.09118?utm_source=chatgpt.com">https://arxiv.org/abs/2112.09118</a></p>
</li>
<li><p><strong>Text Embeddings by Weakly-Supervised Contrastive Pre-training</strong>: <a target="_blank" href="https://arxiv.org/abs/2212.03533?utm_source=chatgpt.com">https://arxiv.org/abs/2212.03533</a></p>
</li>
</ol>
<h3 id="heading-reranking-and-late-interaction">Reranking and late interaction</h3>
<ol start="8">
<li><p><strong>ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT</strong>: <a target="_blank" href="https://arxiv.org/abs/2004.12832?utm_source=chatgpt.com">https://arxiv.org/abs/2004.12832</a></p>
</li>
<li><p><strong>ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction</strong>: <a target="_blank" href="https://arxiv.org/abs/2112.01488?utm_source=chatgpt.com">https://arxiv.org/abs/2112.01488</a></p>
</li>
<li><p><strong>SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking</strong>: <a target="_blank" href="https://arxiv.org/abs/2107.05720?utm_source=chatgpt.com">https://arxiv.org/abs/2107.05720</a></p>
</li>
</ol>
<h3 id="heading-query-rewriting-fusion-robustness">Query rewriting, fusion, robustness</h3>
<ol start="11">
<li><p><strong>Precise Zero-Shot Dense Retrieval without Relevance Labels</strong>: <a target="_blank" href="https://arxiv.org/abs/2212.10496?utm_source=chatgpt.com">https://arxiv.org/abs/2212.10496</a></p>
</li>
<li><p><strong>RAG-Fusion: a New Take on Retrieval-Augmented Generation</strong>: <a target="_blank" href="https://arxiv.org/abs/2402.03367?utm_source=chatgpt.com">https://arxiv.org/abs/2402.03367</a></p>
</li>
</ol>
<h3 id="heading-adaptive-and-agentic-rag">Adaptive and agentic RAG</h3>
<ol start="13">
<li><strong>Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection</strong>: <a target="_blank" href="https://arxiv.org/abs/2310.11511?utm_source=chatgpt.com">https://arxiv.org/abs/2310.11511</a></li>
</ol>
<h3 id="heading-long-context-corpora-and-structured-retrieval">Long-context corpora and structured retrieval</h3>
<ol start="14">
<li><p><strong>RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval</strong>: <a target="_blank" href="https://arxiv.org/abs/2401.18059?utm_source=chatgpt.com">https://arxiv.org/abs/2401.18059</a></p>
</li>
<li><p><strong>From Local to Global: A Graph RAG Approach to Query-Focused Summarization</strong>: <a target="_blank" href="https://arxiv.org/abs/2404.16130?utm_source=chatgpt.com">https://arxiv.org/abs/2404.16130</a></p>
</li>
</ol>
<h3 id="heading-evaluation">Evaluation</h3>
<ol start="16">
<li><p><strong>RAGAS: Automated Evaluation of Retrieval Augmented Generation</strong>: <a target="_blank" href="https://arxiv.org/abs/2309.15217?utm_source=chatgpt.com">https://arxiv.org/abs/2309.15217</a></p>
</li>
<li><p><strong>KILT: a Benchmark for Knowledge Intensive Language Tasks</strong>: <a target="_blank" href="https://arxiv.org/abs/2009.02252?utm_source=chatgpt.com">https://arxiv.org/abs/2009.02252</a></p>
</li>
</ol>
]]></content:encoded></item></channel></rss>