Sahoo Labs

RepoRAG: Repo-Native RAG for Local Folders & Remote Git Repos

Himanshu Shekhar Sahoo — Mon, 02 Feb 2026 18:09:53 GMT

When I am dropped into a new codebase, I don’t want “general advice.” I want specific, grounded answers:

Where is the main LLM pipeline entry point?
How does the end-to-end inference flow work?
Where is the retrieval feature implemented?

The problem is that codebases are large, fast-moving, and full of local conventions. Even good LLMs can hallucinate or give high-level answers unless you force them to retrieve evidence.

RepoRAG is my attempt at a pragmatic middle ground: a small CLI that builds a FAISS vector index from a local folder or a remote Git repo, then answers questions with retrieval-augmented generation and file-path citations so you can verify quickly.

This post explains the design, the tradeoffs, and what I learned building a repo-native RAG tool that’s meant to be practical, not magical.

What RepoRAG is (and what it isn’t)

RepoRAG is a CLI that:

Ingests a local folder into a FAISS index
Ingests a remote Git repo (clone → index → delete clone)
Answers questions against the saved index with grounded answers + source file paths
Uses file filters + smaller embedding models to speed up large repo ingest

It’s not trying to be a full agent framework, a PR reviewer, or an IDE plugin (yet). It’s intentionally a simple workflow you can run in a terminal when you need answers that are traceable back to the repo.

Why file-path citations are the “killer feature”

A lot of RAG demos show citations as chunks of text. For code, that’s not enough. What I actually want is:

a short answer, and
a shortlist of files that I should open next.

RepoRAG is optimized around that: answers are grounded and include source file paths (so I can jump into the code immediately). This changes the UX from “trust the model” to “use the model to triage where to look.”

High-Level Architecture

At a high level, RepoRAG is the standard RAG loop applied to code:

Collect documents (files from a local folder or a cloned repo)
Chunk documents
Embed chunks
Store in FAISS
On question: retrieve top-k chunks
Generate an answer constrained by retrieved context
Return answer + file-path citations
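
The steps above can be sketched in a few lines of plain Python. This is a toy illustration, not RepoRAG's actual code: `embed` is a bag-of-words stand-in for a real embedding model, a brute-force list scan stands in for FAISS, and every function name here is hypothetical. The generation step is left out; it would receive the retrieved chunks as context.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: sparse bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_index(files, lines_per_chunk=20):
    """Steps 1-4: files is {path: text}; returns a list of (path, chunk, vector)."""
    index = []
    for path, text in files.items():
        lines = text.splitlines()
        for start in range(0, len(lines), lines_per_chunk):
            piece = "\n".join(lines[start:start + lines_per_chunk])
            index.append((path, piece, embed(piece)))
    return index

def retrieve(index, question, k=4):
    """Step 5: brute-force top-k search (a real tool would query FAISS here)."""
    query_vec = embed(question)
    ranked = sorted(index, key=lambda entry: cosine(query_vec, entry[2]), reverse=True)
    return ranked[:k]

def cited_paths(hits):
    """Step 7: deduplicate file paths while preserving rank order."""
    return list(dict.fromkeys(path for path, _, _ in hits))

# Made-up two-file "repo" for illustration.
files = {
    "app/auth.py": "def login(user, password):\n    return check_password(user, password)",
    "app/db.py": "def connect(url):\n    return open_connection(url)",
}
index = build_index(files)
hits = retrieve(index, "where is login password checked", k=1)
paths = cited_paths(hits)  # file paths to show next to the generated answer
```

The part worth noticing is the last function: the citation list is derived from the retrieved chunks, not from whatever the model says, which is what keeps the file paths trustworthy.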

Repo structure and why it matters

RepoRAG keeps the layout intentionally small and readable. The project structure (as documented) separates the CLI entry point, configuration, ingestion, retrieval, and remote handling.

rag_cli.py – CLI entry point
reporag/ – config, loaders, ingest, rag, remote
.reporag_index/ – this is the vector index, created after ingest

Workflow: local ingest, remote ingest, and ask

RepoRAG exposes three core user actions:

1. Index a local folder

python3 rag_cli.py ingest .

This is the “index whatever is here” workflow.

2. Index a remote repo (clone → index → delete clone)

python3 rag_cli.py ingest-remote https://github.com/langgenius/dify.git --depth 1

This is useful when you want to interrogate a repo you haven’t checked out. RepoRAG’s documented behavior is to delete the cloned repo directory after indexing. --depth 1 requests a shallow clone: only the most recent commit is fetched (roughly, the newest snapshot) instead of the full history, so cloning is faster and smaller.

3. Ask questions (with `k`)

python3 rag_cli.py ask "Where is the LLM entry point?"
python3 rag_cli.py ask "How is authentication handled end-to-end?" --k 10

--k sets how many chunks RepoRAG retrieves from the FAISS index to use as context for the answer (10 in the second example). This control matters: it’s the simplest knob for trading off recall vs noise.

Provider abstraction: OpenAI and local options

A core design goal is portability. The examples use OpenAI, but the LLM and embeddings provider can be swapped via reporag/config.py and environment variables.

An example .env using OpenAI includes:

LLM_PROVIDER=openai
OPENAI_API_KEY=...
OPENAI_EMBED_MODEL=text-embedding-3-small
# optional
OPENAI_EMBED_DIM=768
OPENAI_CHAT_MODEL=gpt-4o-mini
INDEX_DIR=.reporag_index

Even if you run locally (Ollama, etc.), this split stays the same conceptually: you want an embeddings model that’s fast and a chat model that’s good at “answer from evidence.”
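
One way to picture this split is a small settings object read from the environment. The sketch below is an assumption about how such a loader could look, not the contents of reporag/config.py: the `Settings` class and `load_settings` function are hypothetical, and the fallback values are taken from the example .env above (only the INDEX_DIR default is stated by the project itself).

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

@dataclass(frozen=True)
class Settings:
    llm_provider: str
    embed_model: str          # fast model used at ingest and query time
    chat_model: str           # model that answers from retrieved evidence
    index_dir: str
    embed_dim: Optional[int] = None  # optional reduced embedding dimension

def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read provider settings from environment-style key/value pairs."""
    raw_dim = env.get("OPENAI_EMBED_DIM")
    return Settings(
        llm_provider=env.get("LLM_PROVIDER", "openai"),
        embed_model=env.get("OPENAI_EMBED_MODEL", "text-embedding-3-small"),
        chat_model=env.get("OPENAI_CHAT_MODEL", "gpt-4o-mini"),
        index_dir=env.get("INDEX_DIR", ".reporag_index"),
        embed_dim=int(raw_dim) if raw_dim else None,
    )

# Passing a plain dict makes the loader easy to exercise without touching os.environ.
settings = load_settings({"LLM_PROVIDER": "openai", "OPENAI_EMBED_DIM": "768"})
```

Keeping the embedding and chat settings as separate fields is what makes it possible to change one without the other.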

The real engineering decisions in RepoRAG

Tools like RepoRAG live or die on boring details. Here are the big ones.

1. Chunking strategy: code isn’t prose

Chunking code is tricky because semantics are often non-local:

A function signature in one file, implementation elsewhere
A config key defined in a .yaml and read in Python
Auth flows spanning middleware + routes + handlers

A naive chunking strategy can either:

fragment important context, or
create huge chunks that embed slowly and retrieve noisily

RepoRAG explicitly calls out chunk size/overlap as a performance lever (increase chunk size / reduce overlap to reduce total chunks). That’s not just a performance setting: changing chunking also affects retrieval accuracy. Bigger chunks are more likely to contain the full idea, but can also increase irrelevant matches.

Practical heuristic: start with conservative chunking (smaller, more overlap) for correctness; once it works, tune chunking for speed.
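
To make the size/overlap lever concrete, here is a minimal line-based sliding-window chunker. It is a sketch under assumptions: RepoRAG's real chunker may split on characters or tokens instead of lines, and `chunk_lines` with its default values is hypothetical.

```python
def chunk_lines(text, size=40, overlap=10):
    """Return (start_line, end_line, chunk_text) windows; line numbers are 1-based."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    lines = text.splitlines()
    step = size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(lines), step):
        window = lines[start:start + size]
        chunks.append((start + 1, start + len(window), "\n".join(window)))
        if start + size >= len(lines):
            break  # this window already reaches the end of the file
    return chunks

# A made-up 100-line file: smaller windows with more overlap produce more chunks.
source = "\n".join(f"line {i}" for i in range(1, 101))
conservative = chunk_lines(source, size=20, overlap=10)  # 9 chunks
tuned = chunk_lines(source, size=60, overlap=0)          # 2 chunks
```

Every chunk has to be embedded, so going from the conservative setting to the tuned one cuts embedding work on this file by more than four times, at the cost of coarser retrieval units.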

2. Retrieval `k`: the simplest knob that matters

RepoRAG supports increasing k (example shows --k 10).

Low k → cleaner context, but higher chance you miss the key file
High k → better recall, but more distractions (and longer prompts)

In practice, I like:

k=4–6 for “where is X implemented?”
k=8–12 for “explain end-to-end flow”

3. Index location and reproducibility

RepoRAG stores the index under a configurable directory (by default INDEX_DIR=.reporag_index).

That is a good default because:

it keeps the index near the repo
it’s easy to delete and rebuild
it doesn’t pollute global state

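Note: rebuild the index whenever the embedding model or its dimension changes (delete .reporag_index and re-ingest). This is critical: a dimension mismatch typically makes the search call fail outright, while switching to a different model with the same dimension is subtler, because queries and stored chunks then live in different vector spaces and retrieval quietly returns poor matches.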
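
A cheap guard against a stale index is to store the embedding settings next to it and compare them before answering. The sketch below illustrates that idea only; RepoRAG is not documented to write such a file, and `manifest.json`, `write_manifest`, and `index_is_compatible` are hypothetical names.

```python
import json
import tempfile
from pathlib import Path

MANIFEST_NAME = "manifest.json"  # hypothetical file stored inside the index directory

def write_manifest(index_dir, embed_model, embed_dim):
    """Record which embedding settings were used to build the index."""
    path = Path(index_dir)
    path.mkdir(parents=True, exist_ok=True)
    payload = {"embed_model": embed_model, "embed_dim": embed_dim}
    (path / MANIFEST_NAME).write_text(json.dumps(payload))

def index_is_compatible(index_dir, embed_model, embed_dim):
    """True only if the saved settings match the ones about to be used."""
    manifest = Path(index_dir) / MANIFEST_NAME
    if not manifest.exists():
        return False
    saved = json.loads(manifest.read_text())
    return saved.get("embed_model") == embed_model and saved.get("embed_dim") == embed_dim

# Demo in a throwaway directory: same settings pass, a changed dimension does not.
with tempfile.TemporaryDirectory() as tmp:
    write_manifest(tmp, "text-embedding-3-small", 768)
    same = index_is_compatible(tmp, "text-embedding-3-small", 768)
    changed = index_is_compatible(tmp, "text-embedding-3-small", 1536)
```

When the check fails, the safe response is the one described above: delete the index directory and re-ingest.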

4. Large repo runtime performance: embeddings dominate

For large repos, embedding time dominates ingestion.

Suggested optimizations include:

Use a smaller embedding model (example: text-embedding-3-small)
Optionally reduce embedding dimensions (example: OPENAI_EMBED_DIM=768)
Restrict indexing to key folders (example list: ["api","web","docs"])
Tune chunk size / overlap to reduce chunk count
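
Folder restriction is the cheapest of these optimizations because filtered files are never embedded at all. Here is a minimal sketch of such a filter; the folder list mirrors the example above, while the extension list and the `should_index` helper are assumptions for illustration.

```python
from pathlib import PurePosixPath

def should_index(rel_path, include_dirs=("api", "web", "docs"),
                 extensions=(".py", ".ts", ".md")):
    """Keep a repo-relative path only if its top-level folder and suffix are allowed."""
    path = PurePosixPath(rel_path)
    if not path.parts or path.parts[0] not in include_dirs:
        return False
    return path.suffix in extensions

# Made-up repo listing for illustration.
candidates = [
    "api/server.py",
    "web/src/app.ts",
    "docs/setup.md",
    "docs/logo.png",
    "node_modules/pkg/index.ts",
]
kept = [p for p in candidates if should_index(p)]
```

The tradeoff is the one listed under failure modes: anything the filter drops cannot be retrieved later.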

Security: your index is sensitive

RepoRAG includes a security note that’s easy to ignore but extremely important:

the ingest-remote command deletes the cloned repo directory after indexing
but the FAISS index contains embedded chunks of text used for retrieval
treat the index as sensitive if the repo is private

In practice, this implies:

Don’t casually upload .reporag_index to public storage
Don’t email the index around
Consider excluding it from git
Consider encrypting at rest if you store it somewhere shared

What “good” looks like: grounded answers + fast verification

When RepoRAG is working well, the interaction looks like:

I ask a question
RepoRAG returns a short explanation
It gives me a few file paths
I open those files and confirm quickly

These file-path citations are important. They turn the LLM from an oracle into a navigation assistant.

Failure modes I expect (and how I think about them)

Even without fancy evaluation harnesses, repo-level RAG systems fail in predictable ways:

1. “The index doesn’t contain it”

You filtered out the folder that contains the answer (trading coverage for ingestion speed)
The file type wasn’t loaded
The repo changed since ingest

Fix: re-ingest; relax filters; ensure key folders are included.

2. “Retrieved chunks are close, but not the answer”

Common with:

generic naming (utils.py, helpers.ts)
repeated patterns (auth middleware in multiple apps)
generated code

Fix: increase k; tighten chunking; add metadata signals (future improvement).

3. “LLM answers beyond the evidence”

This is the classic RAG hallucination problem.

Fix: prompt contract: answer only from retrieved context; if insufficient, say so and cite what you do have.

What I would improve next (if I keep iterating)

These are not commitments, just the natural next steps for RepoRAG:

Evaluation loop: a small curated Q/A set per repo (“golden questions”) + regression checks
Hybrid retrieval: BM25 + vectors (code identifiers and exact strings matter a lot)
Better citations: include file path + line ranges when possible
Incremental indexing: only re-embed changed files
Tracing: log retrieval hits, chunk IDs, prompt length, latency (so tuning becomes data-driven)
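
Of these, incremental indexing is the easiest to sketch: keep a content hash per file from the last ingest, and re-embed only the paths whose hash differs. This is one possible design and not something RepoRAG implements today; `file_digest` and `plan_reindex` are hypothetical names.

```python
import hashlib

def file_digest(content: bytes) -> str:
    """Content hash used to detect changes between ingests."""
    return hashlib.sha256(content).hexdigest()

def plan_reindex(previous, current_files):
    """previous: {path: digest} from the last ingest; current_files: {path: bytes}.

    Returns (paths to re-embed, paths to drop from the index, new digest map).
    """
    current = {path: file_digest(data) for path, data in current_files.items()}
    changed = sorted(p for p, digest in current.items() if previous.get(p) != digest)
    removed = sorted(p for p in previous if p not in current)
    return changed, removed, current

# Made-up before/after snapshots of a repo.
_, _, before = plan_reindex({}, {"a.py": b"x = 1", "b.py": b"y = 2", "old.py": b"z = 3"})
changed, removed, after = plan_reindex(before, {"a.py": b"x = 1", "b.py": b"y = 20", "c.py": b"w = 4"})
```

Only the changed paths need new embeddings; chunks belonging to removed or changed files would have to be deleted from the vector index first, which is where most of the real complexity lives.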

Try it yourself

RepoRAG is on GitHub:
https://github.com/hisahoo009/RepoRAG

Quick start (local folder)

git clone https://github.com/hisahoo009/RepoRAG.git
cd RepoRAG

python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install -r requirements.txt

cp .env.example .env

python3 rag_cli.py ingest .
python3 rag_cli.py ask "Where is the backend entry point?"

Index a remote repo (example)

python3 rag_cli.py ingest-remote https://github.com/OpenPipe/ART.git --depth 1
python3 rag_cli.py ask "How do I run ART?"

Retrieval-Augmented Generation (RAG): the practical guide to building grounded LLM apps

Himanshu Shekhar Sahoo — Fri, 30 Jan 2026 23:16:00 GMT

Large Language Models are great at fluent text - but they’re not great at being consistently correct, up-to-date, or able to cite where an answer came from. Retrieval-Augmented Generation (RAG) is the most common pattern for fixing that: fetch relevant knowledge first, then generate an answer grounded in it.

This post explains:

what RAG is (and isn’t),
the core building blocks,
the “modern” upgrades that actually determine quality,
and how to evaluate + debug RAG like an engineer.

What RAG is

RAG = Retriever + Generator

Instead of forcing the LLM to “remember everything”, you:

retrieve the most relevant chunks from your knowledge base (docs, PDFs, wiki, tickets, code, etc.)
inject those chunks into the prompt
generate an answer conditioned on that evidence (ideally with citations)

The canonical pipeline

flowchart LR
  Q[User Query] --> R[Retriever]
  R -->|Top-k chunks| C[Context Builder]
  C --> P[Prompt]
  P --> L[LLM]
  L --> A[Answer + Citations]

  classDef step fill:#f5f5f5,stroke:#333,stroke-width:1px,rx:8,ry:8;
  class Q,R,C,P,L,A step;

This basic idea was formalized in the original RAG paper by Lewis et al. (2020). (arXiv:2005.11401)

The 4 core building blocks

1. Chunking (how you split knowledge)

Chunking is the step where you split your source documents into retrieval units (“chunks”) that will be embedded, indexed, and later fed to the LLM as evidence. It’s not just a preprocessing detail - it directly shapes recall, precision, and even whether the model can cite the right passage. Good chunking usually respects natural document structure (headings, paragraphs, code blocks, tables) and aims to keep each chunk semantically coherent so a single retrieved piece can stand on its own.

Chunking controls what retrieval can “see.”

Too small → missing context
Too big → irrelevant junk + token waste
Common baseline: 300–800 tokens, with 10–20% overlap, then iterate.
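
A quick way to reason about these numbers is to estimate how many chunks a corpus produces under a fixed-size sliding window. The helper below is a back-of-the-envelope sketch: `estimate_chunks` is a hypothetical name, the 100,000-token corpus is made up, and real chunkers that respect document structure will produce somewhat different counts.

```python
import math

def estimate_chunks(total_tokens, chunk_size, overlap):
    """Number of fixed-size windows needed to cover total_tokens.

    Each window after the first advances by (chunk_size - overlap) tokens.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    if total_tokens <= 0:
        return 0
    if total_tokens <= chunk_size:
        return 1
    stride = chunk_size - overlap
    return 1 + math.ceil((total_tokens - chunk_size) / stride)

# Same 100,000-token corpus at three points of the baseline range.
small = estimate_chunks(100_000, chunk_size=300, overlap=60)   # 20% overlap -> 417
middle = estimate_chunks(100_000, chunk_size=500, overlap=75)  # 15% overlap -> 236
large = estimate_chunks(100_000, chunk_size=800, overlap=80)   # 10% overlap -> 139
```

The small setting creates three times as many chunks as the large one, which means three times the embedding calls and index entries for the same content.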

2. Embeddings (how you represent text for search)

Embeddings are how we turn text into numbers that capture meaning, so we can do “semantic search” instead of pure keyword search. Concretely, an embedding model maps a query and each chunk into vectors, and retrieval becomes “find the chunks whose vectors are closest to the query.” This matters because real users rarely use the same wording as your docs - embeddings help match paraphrases, concepts, and intent (e.g., “cancel subscription” ↔ “terminate plan”). Your RAG quality often lives or dies on whether your embedding model preserves the right signals for your content (definitions, procedures, code, product names, etc.).
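
"Closest" usually means highest cosine similarity. The sketch below shows the comparison on hand-made three-dimensional vectors; the numbers are invented for illustration, since a real embedding model produces vectors with hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def nearest(query_vec, chunk_vecs, k=2):
    """Return the names of the k chunks most similar to the query."""
    ranked = sorted(chunk_vecs.items(),
                    key=lambda item: cosine_similarity(query_vec, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]

# Invented vectors: the first axis loosely means "billing", the second "accounts",
# the third "hardware".
chunk_vecs = {
    "terminate plan": [0.9, 0.1, 0.0],
    "reset password": [0.1, 0.9, 0.1],
    "gpu drivers": [0.0, 0.1, 0.9],
}
query_vec = [0.8, 0.2, 0.0]  # stands in for "cancel subscription"
best = nearest(query_vec, chunk_vecs, k=1)
```

The query shares no words with "terminate plan", yet it lands closest because the vectors point in nearly the same direction.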

3. Retrieval (how you fetch candidates)

Two major retrieval families show up in almost every RAG system, and they behave differently because they’re matching different signals:

a. Sparse retrieval (BM25 / keyword search)

BM25 is a classic information-retrieval scoring method used by many search engines. It ranks documents mainly by keyword overlap (with smart weighting so rare/important words count more than common ones like “the” or “and”). It’s great when the user’s query contains exact tokens you must match:

IDs (INC-12491)
error codes (CUDA_ERROR_ILLEGAL_ADDRESS)
version numbers (v2.3.1)
names or exact phrases

Easy way to remember: BM25 is like Ctrl+F on steroids - it finds the pages that literally contain what you typed, and ranks the best matches first.
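
For readers who want to see the scoring itself, here is a compact BM25 sketch. It uses the common Okapi formula with the smoothed, non-negative IDF variant and commonly used values for `k1` and `b`; the tokenizer is deliberately naive, and the documents are made up. Production systems would normally rely on a search engine or an established library for this.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase whitespace split, trimming surrounding punctuation."""
    stripped = (token.strip(".,:;!?()\"'") for token in text.lower().split())
    return [token for token in stripped if token]

class BM25:
    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.doc_tokens = [tokenize(doc) for doc in docs]
        self.term_freqs = [Counter(tokens) for tokens in self.doc_tokens]
        total_len = sum(len(tokens) for tokens in self.doc_tokens)
        self.avg_len = (total_len / len(docs)) if docs and total_len else 1.0
        # Number of documents that contain each term at least once.
        self.doc_freq = Counter(term for tf in self.term_freqs for term in tf)

    def idf(self, term):
        n = self.doc_freq.get(term, 0)
        return math.log((len(self.doc_tokens) - n + 0.5) / (n + 0.5) + 1.0)

    def score(self, query, doc_id):
        tf = self.term_freqs[doc_id]
        length_norm = 1 - self.b + self.b * len(self.doc_tokens[doc_id]) / self.avg_len
        total = 0.0
        for term in tokenize(query):
            freq = tf.get(term, 0)
            if freq:
                total += self.idf(term) * freq * (self.k1 + 1) / (freq + self.k1 * length_norm)
        return total

    def rank(self, query):
        """Document ids ordered from best to worst match."""
        return sorted(range(len(self.doc_tokens)),
                      key=lambda i: self.score(query, i), reverse=True)

docs = [
    "Ticket INC-12491: login fails with CUDA_ERROR_ILLEGAL_ADDRESS on startup",
    "How to terminate your plan and request a refund",
    "Release notes for v2.3.1 fix the login timeout",
]
bm25 = BM25(docs)
ranking = bm25.rank("INC-12491 login")
```

The rare token `INC-12491` carries more weight than the more common `login`, which is why the ticket outranks the release notes even though both mention logging in.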

b. Dense retrieval (vector / embedding search)

Dense retrieval turns text into embeddings (vectors) so it can match meaning, not exact wording. This helps when users and docs say the same thing differently:

“cancel subscription” ↔ “terminate plan”
“model is slow” ↔ “high latency inference”
“add login” ↔ “implement authentication”

Easy way to remember: dense retrieval is like asking, “Which passages talk about the same idea?” even if they don’t share the same words.

c. Why you often want both (hybrid retrieval)

Real user queries mix both:

“What does error 0x80070005 mean?” (exact token + intent)
“Why did performance drop after upgrading?” (semantic, broad)

So many production RAG systems use hybrid retrieval:

BM25 to catch exact anchors (codes, names, IDs)
Dense retrieval to catch paraphrases and concepts
a reranker to sort the best few (we will cover this below)

4. Generation (how you answer from evidence)

Generation is where the LLM turns retrieved chunks into a final response. The goal isn’t just fluent text - it’s grounded answers: the model should respond based on the evidence you retrieved, not its own assumptions. Think of the retrieved chunks as “open book notes” the model is allowed to use. If the notes don’t contain the answer, the correct behavior is to say so.

To make generation reliable, add three constraints:

Citations (chunk IDs): Require citations like [C2] or [DocA#12] for each major claim. This keeps answers auditable and discourages hallucinations.
Refusal when evidence is missing: If the answer isn’t supported by the retrieved context, the model should explicitly say “I don’t know from these sources” and suggest what to retrieve next.
A short “What I used” section: End with a compact list of the chunks that actually supported the response. This makes debugging easy (“retrieval failed” vs “model ignored evidence”).

Example output format

Answer: … [C1][C4]
Key evidence:
- … [C1]
- … [C4]
What I used: C1, C4
If missing: “Not enough evidence in the provided context to answer.”

Prompt snippet

“Use only the provided context. If unsupported, say you don’t know.”
“Cite chunk IDs like [C#] for each key claim.”
“End with What I used: list of chunk IDs.”
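
Put together, the prompt rules and the citation requirement might look like the sketch below. The helper names are hypothetical, and the citation check is only a sanity test: it catches chunk IDs that were never provided, not claims that misrepresent a real chunk.

```python
import re

RULES = (
    "Use only the provided context. If unsupported, say you don't know.\n"
    "Cite chunk IDs like [C#] for each key claim.\n"
    "End with 'What I used:' followed by the list of chunk IDs."
)

def build_prompt(question, chunks):
    """chunks is a list of (source, text); IDs C1, C2, ... follow list order."""
    context = "\n\n".join(
        f"[C{i}] ({source})\n{text}" for i, (source, text) in enumerate(chunks, start=1)
    )
    return f"{RULES}\n\nContext:\n{context}\n\nQuestion: {question}\nAnswer:"

def invalid_citations(answer, n_chunks):
    """Return cited chunk numbers that do not exist in the provided context."""
    cited = {int(number) for number in re.findall(r"\[C(\d+)\]", answer)}
    return sorted(n for n in cited if not 1 <= n <= n_chunks)

# Made-up chunks and a made-up model answer.
chunks = [("docs/a.md", "Refunds take 5 days."), ("docs/b.md", "Plans renew monthly.")]
prompt = build_prompt("How long do refunds take?", chunks)
bad = invalid_citations("Refunds take 5 days [C1]. Fees apply [C4].", n_chunks=len(chunks))
```

A non-empty result from `invalid_citations` is a cheap signal that the answer should be rejected or regenerated.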

The “modern RAG” upgrades that really matter

Once you have a basic RAG pipeline working, most quality gains come from a few “systems” upgrades. These aren’t fancy extras - they’re the difference between mostly relevant and consistently useful.

Upgrade A: Re-ranking (precision booster)

Vector search is great at pulling a decent candidate set, but the top results often include “near misses.” A re-ranker takes the top-N retrieved chunks (e.g., 20–100) and sorts them using a stronger relevance signal, so the final top-K context is much tighter and easier for the LLM to ground on.

Late-interaction retrieval like ColBERT improved precision by scoring query–document similarity at a finer (token) level. (arXiv:2004.12832)
ColBERTv2 made late interaction more practical by reducing the storage footprint while improving quality. (arXiv:2112.01488)
In practice, lightweight general rerankers (e.g., BGE rerankers) are often the easiest win because they’re drop-in and noticeably improve top-k relevance.

Key takeaway: Retrieve wide → rerank narrow → prompt only the best.
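
The two-stage shape is easy to express in code. In this sketch both scoring functions are toy stand-ins chosen so the example runs without any model: a shared-token count plays the role of the first-stage retriever, and a bigram match plays the role of the stronger reranker (in practice, a cross-encoder or similar). All names are hypothetical.

```python
def tokens(text):
    return text.lower().split()

def overlap_score(query, doc):
    """Cheap first-stage signal: number of distinct shared tokens."""
    return len(set(tokens(query)) & set(tokens(doc)))

def bigram_score(query, doc):
    """Toy 'stronger' signal: share of query bigrams that appear in the doc."""
    q, d = tokens(query), tokens(doc)
    q_bigrams = set(zip(q, q[1:]))
    d_bigrams = set(zip(d, d[1:]))
    return len(q_bigrams & d_bigrams) / len(q_bigrams) if q_bigrams else 0.0

def retrieve_then_rerank(query, candidates, first_stage, reranker, wide_n=30, final_k=5):
    """Retrieve wide with a cheap scorer, then rerank narrow with a better one."""
    wide = sorted(candidates, key=lambda c: first_stage(query, c), reverse=True)[:wide_n]
    return sorted(wide, key=lambda c: reranker(query, c), reverse=True)[:final_k]

docs = [
    "password policy and reset schedule for password rotation",
    "how to reset password from the login page",
    "gpu driver install guide",
]
top = retrieve_then_rerank("reset password", docs, overlap_score, bigram_score,
                           wide_n=2, final_k=1)
```

The first stage cannot tell the first two documents apart, since both contain "reset" and "password". The reranker breaks the tie in favor of the document that actually contains the phrase.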

Upgrade B: Multi-query retrieval (coverage booster)

Single queries are brittle: users omit keywords, use unusual phrasing, or ask multi-part questions. Multi-query retrieval generates a few query variations (rephrasings, sub-questions), retrieves for each, then fuses results into one stronger candidate pool.

RAG-Fusion is a popular approach: generate multiple queries, retrieve for each, then merge rankings using Reciprocal Rank Fusion (RRF). (arXiv:2402.03367)

Easy way to remember: Ask the search engine the same question 5 different ways, then combine the best hits.
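
Reciprocal Rank Fusion itself is only a few lines: each document earns 1 / (k + rank) from every ranked list it appears in, and the sums decide the final order. The sketch below uses k = 60, the constant commonly used with RRF; the document IDs and the alphabetical tie-break are choices made for this example.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids into one list, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by id so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Made-up results for three rephrasings of the same question.
rankings = [
    ["a", "b", "c"],
    ["b", "a", "d"],
    ["b", "c", "a"],
]
fused = reciprocal_rank_fusion(rankings)
```

Because only ranks are used, the lists can come from retrievers whose raw scores are not comparable, which is also why the same function works for fusing BM25 and dense results.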

Upgrade C: “Retrieve only when needed” (reduce noise + cost)

Not every question benefits from retrieval (e.g., “Explain RAG” or “Rewrite this paragraph”). Always retrieving can add irrelevant context, increase latency, and sometimes even reduce answer quality. For cases like those, use conditional retrieval: only fetch evidence when the question actually depends on external knowledge.

Self-RAG trains the model to decide when to retrieve, judge relevance, and critique whether evidence supports the answer. (arXiv:2310.11511)

Easy way to remember: If it’s about your private docs → retrieve. If it’s general knowledge / writing help → skip.

Upgrade D: Hierarchical / structured retrieval for long docs (global context)

Chunk retrieval is “local”: it’s good at finding a paragraph, but long documents and large corpora often require global understanding (themes, policies, cross-document synthesis). Hierarchical/structured retrieval adds a higher-level index so the system can retrieve both “big picture” summaries and drill-down details.

RAPTOR builds a tree of summaries and retrieves at multiple abstraction levels (high-level + detailed). (arXiv:2401.18059)
GraphRAG builds an entity/relationship graph and uses community summaries to answer broad corpus-wide questions. (arXiv:2404.16130)

Easy way to remember: Don’t just retrieve paragraphs - retrieve the map of the corpus too.

Failure modes (and how to debug them)

Even strong RAG systems fail in predictable ways. The fastest way to improve is to identify which stage broke (retrieval, ranking, context packing, or generation) and apply a targeted fix.

1. Good retrieval, bad answer (the model ignores evidence)

What it looks like: the retrieved chunks clearly contain the answer, but the model responds with something generic, incomplete, or incorrect.

Common causes:

the prompt doesn’t strongly enforce “use only context”
too much context noise buries the relevant lines
the truly relevant chunk is ranked low and gets truncated out

Fixes:

Constrain the prompt: “Answer using ONLY the provided context. If it’s not there, say you don’t know.”
Shorten / clean the context: reduce top-K, remove duplicates, or use tighter chunking so evidence is easier to spot
Add a reranker: retrieve top-N broadly, rerank, then keep only top-K for generation

2. Bad retrieval, confident answer (hallucination)

What it looks like: retrieved chunks don’t support the answer (or are irrelevant), but the model answers confidently anyway.

Fixes:

Score thresholding: if retrieval scores are low (or relevance is uncertain), return “I don’t know from these sources” instead of guessing
Require citations for claims: force the model to cite chunk IDs per claim; uncited claims are treated as invalid
Use query rewriting / HyDE: rewrite the query into something more “retrieval-friendly,” especially when the user query is vague or mismatched to doc phrasing

HyDE in one sentence: generate a hypothetical “ideal” answer document, embed that, then retrieve real documents near it in embedding space. (arXiv:2212.10496)
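
Score thresholding, the first fix in the list above, can be sketched as a gate in front of the generator. The threshold value here is arbitrary and must be tuned per embedding model and metric, and the sketch assumes higher scores mean more similar; if the vector index returns distances, where lower is better, the comparison has to be flipped. All names are hypothetical.

```python
REFUSAL = "I don't know from these sources."

def select_context(hits, min_score=0.35, max_chunks=5):
    """hits is a list of (chunk_id, similarity); keep only confident matches."""
    ranked = sorted(hits, key=lambda hit: hit[1], reverse=True)
    return [hit for hit in ranked if hit[1] >= min_score][:max_chunks]

def answer_or_refuse(hits, generate, min_score=0.35):
    """Call the generator only when at least one chunk clears the threshold."""
    context = select_context(hits, min_score)
    if not context:
        return REFUSAL
    return generate([chunk_id for chunk_id, _ in context])

def fake_generate(chunk_ids):
    """Stand-in for the LLM call; it just reports which chunks it received."""
    return "answer based on " + ", ".join(chunk_ids)

weak = answer_or_refuse([("C1", 0.12), ("C2", 0.20)], fake_generate)
strong = answer_or_refuse([("C1", 0.81), ("C2", 0.20), ("C3", 0.44)], fake_generate)
```

Refusing before generation is cheaper and more reliable than asking the model to notice on its own that the context is irrelevant.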

3. Chunking mismatch (the answer is split across chunks)

What it looks like: no single chunk contains enough evidence, because key details are split across boundaries (e.g., definition in one chunk, conditions in the next).

Fixes:

Increase overlap (or slightly increase chunk size) so related details stay together
Use structure-aware chunking: split by headings/sections, keep code blocks intact, treat tables as units
Consider hierarchical retrieval: approaches like RAPTOR retrieve both summaries and detailed leaves, helping when information is scattered across long docs. (arXiv:2401.18059)

Evaluation: how to know your RAG is improving

A common mistake is evaluating RAG only by “does the final answer look good?” That hides where things are actually failing. In practice, you want to measure the pipeline in layers:

retrieval quality (did we fetch relevant evidence?)
groundedness (did the answer stick to evidence?)
answer quality (is it helpful, complete, correct?)
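
The first layer, retrieval quality, can be measured without calling an LLM at all. Given a small test set of questions with known relevant chunk or file IDs, recall@k and mean reciprocal rank are two standard retrieval metrics. The sketch below is a minimal version; the test set and the stub retriever are made up, and the function names are hypothetical.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant ids that appear in the top-k retrieved ids."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant id, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def evaluate(test_set, retrieve, k=5):
    """test_set is a list of (question, relevant_ids); retrieve maps a question to ranked ids."""
    recalls, ranks = [], []
    for question, relevant in test_set:
        retrieved = retrieve(question)
        recalls.append(recall_at_k(retrieved, relevant, k))
        ranks.append(reciprocal_rank(retrieved, relevant))
    n = len(test_set) or 1
    return {"recall@k": sum(recalls) / n, "mrr": sum(ranks) / n}

# Made-up test set and a stub retriever that always returns the same ranking.
test_set = [("q1", {"a"}), ("q2", {"z"})]
report = evaluate(test_set, retrieve=lambda question: ["a", "b"], k=2)
```

Tracking these two numbers across changes to chunking or embeddings shows whether retrieval improved, independently of how the generated answers read.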

RAGAS (Retrieval Augmented Generation Assessment) proposes reference-free metrics that score retrieval + faithfulness + answer quality without requiring gold labels for every query. (arXiv:2309.15217)

BEIR (Benchmarking Information Retrieval) is another widely used benchmark suite for evaluating retrieval models across many different datasets and domains - not just one curated task. The big lesson from BEIR is that retrieval models that look great on a single dataset often fail when you switch domains, query styles, or document types. That’s exactly what happens in real RAG apps: users ask messy questions, and your corpus has its own vocabulary and structure. If your RAG answer quality is inconsistent, that lesson is worth checking against your own setup: your retriever might be “overfit” to one style of data and not robust to your actual workload. (arXiv:2104.08663)

A practical “starter” RAG recipe (works surprisingly well)

Chunk at ~500 tokens with ~15% overlap
Use a strong general embedding model (E5-class) (arXiv:2212.03533)
Retrieve top-30, then rerank down to top-5
Prompt with:
- instruction to only use provided context
- citations per sentence or per paragraph
- refusal when evidence is missing
Evaluate with a small test set + RAGAS metrics (arXiv)
Add multi-query fusion when recall is the problem (arXiv)

Closing thought

RAG isn’t one trick - it’s a system. Your quality comes from how well retrieval, chunking, reranking, prompting, and evaluation work together.

Reading list: important RAG papers

Foundations

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks: https://arxiv.org/abs/2005.11401
REALM: Retrieval-Augmented Language Model Pre-Training: https://arxiv.org/abs/2002.08909
Dense Passage Retrieval for Open-Domain Question Answering: https://arxiv.org/abs/2004.04906
Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering: https://arxiv.org/abs/2007.01282

Retrieval quality and candidate generation

BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models: https://arxiv.org/abs/2104.08663
Unsupervised Dense Information Retrieval with Contrastive Learning: https://arxiv.org/abs/2112.09118
Text Embeddings by Weakly-Supervised Contrastive Pre-training: https://arxiv.org/abs/2212.03533

Reranking and late interaction

ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT: https://arxiv.org/abs/2004.12832
ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction: https://arxiv.org/abs/2112.01488
SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking: https://arxiv.org/abs/2107.05720

Query rewriting, fusion, robustness

Precise Zero-Shot Dense Retrieval without Relevance Labels: https://arxiv.org/abs/2212.10496
RAG-Fusion: a New Take on Retrieval-Augmented Generation: https://arxiv.org/abs/2402.03367

Adaptive and agentic RAG

Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection: https://arxiv.org/abs/2310.11511

Long-context corpora and structured retrieval

RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval: https://arxiv.org/abs/2401.18059
From Local to Global: A Graph RAG Approach to Query-Focused Summarization: https://arxiv.org/abs/2404.16130

Evaluation

RAGAS: Automated Evaluation of Retrieval Augmented Generation: https://arxiv.org/abs/2309.15217
KILT: a Benchmark for Knowledge Intensive Language Tasks: https://arxiv.org/abs/2009.02252
