RepoRAG: Repo-Native RAG for Local Folders & Remote Git Repos

When I am dropped into a new codebase, I don’t want “general advice.” I want specific, grounded answers:

  1. Where is the main LLM pipeline entry point?

  2. How does the end-to-end inference flow work?

  3. Where is the retrieval feature implemented?

The problem is that codebases are large, fast-moving, and full of local conventions. Even good LLMs can hallucinate or give high-level answers unless you force them to retrieve evidence.

RepoRAG is my attempt at a pragmatic middle ground: a small CLI that builds a FAISS vector index from a local folder or a remote Git repo, then answers questions with retrieval-augmented generation and file-path citations so you can verify quickly.

This post explains the design, the tradeoffs, and what I learned building a repo-native RAG tool that’s meant to be practical, not magical.

What RepoRAG is (and what it isn’t)

RepoRAG is a CLI that:

  1. Ingests a local folder into a FAISS index

  2. Ingests a remote Git repo (clone → index → delete clone)

  3. Answers questions against the saved index with grounded answers + source file paths

  4. Uses file filters + smaller embedding models to speed up large repo ingest

It’s not trying to be a full agent framework, a PR reviewer, or an IDE plugin (yet). It’s intentionally a simple workflow you can run in a terminal when you need answers that are traceable back to the repo.

Why file-path citations are the “killer feature”

A lot of RAG demos show citations as chunks of text. For code, that’s not enough. What I actually want is:

  1. a short answer, and

  2. a shortlist of files that I should open next.

RepoRAG is optimized around that: answers are grounded and include source file paths (so I can jump into the code immediately). This changes the UX from “trust the model” to “use the model to triage where to look.”

High-Level Architecture

At a high level, RepoRAG is the standard RAG loop applied to code:

  1. Collect documents (files from a local folder or a cloned repo)

  2. Chunk documents

  3. Embed chunks

  4. Store in FAISS

  5. On question: retrieve top-k chunks

  6. Generate an answer constrained by retrieved context

  7. Return answer + file-path citations
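
The loop above can be sketched in a few lines. This is a toy illustration, not RepoRAG's code: `toy_embed` is a hypothetical hashed bag-of-words stand-in for a real embedding model, a plain Python list stands in for the FAISS index, and the generation step is left out.

```python
import hashlib
import math
import re

DIM = 256  # toy embedding size (assumption, not a RepoRAG setting)

def toy_embed(text: str) -> list[float]:
    """Hypothetical stand-in for an embedding model: hashed bag of words, L2-normalized."""
    vec = [0.0] * DIM
    for token in re.findall(r"[a-z0-9_]+", text.lower()):
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def build_index(chunks):
    """Steps 3-4: embed each (path, text) chunk and keep the vector next to its metadata."""
    return [(path, text, toy_embed(text)) for path, text in chunks]

def retrieve(index, question, k=4):
    """Step 5: score every chunk by cosine similarity and keep the top k."""
    q = toy_embed(question)
    scored = [(sum(a * b for a, b in zip(q, vec)), path, text) for path, text, vec in index]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]

def cited_paths(hits):
    """Step 7: unique file paths in rank order, used as citations."""
    return list(dict.fromkeys(path for _, path, _ in hits))

# Made-up chunks, purely for illustration.
chunks = [
    ("app/auth.py", "def login(user, password): check password and issue session token"),
    ("app/pipeline.py", "def run_pipeline(prompt): call the llm and return the completion"),
    ("README.md", "project overview and install instructions"),
]
index = build_index(chunks)
hits = retrieve(index, "where does the llm pipeline run", k=2)
print(cited_paths(hits))
```

The part worth noticing is that each vector travels with its file path, which is what makes path citations cheap to produce at answer time.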

Repo structure and why it matters

RepoRAG keeps the layout intentionally small and readable. The project structure (as documented) separates the CLI entry point, configuration, ingestion, retrieval, and remote handling.

  • rag_cli.py – CLI entry point

  • reporag/ – config, loaders, ingest, rag, remote

  • .reporag_index/ – this is the vector index, created after ingest

Workflow: local ingest, remote ingest, and ask

RepoRAG exposes three core user actions:

1. Index a local folder

python3 rag_cli.py ingest .

This is the “index whatever is here” workflow.

2. Index a remote repo (clone → index → delete clone)

python3 rag_cli.py ingest-remote https://github.com/langgenius/dify.git --depth 1

This is useful when you want to interrogate a repo you haven’t checked out. RepoRAG’s documented behavior is to delete the cloned repo directory after indexing. --depth 1 requests a shallow clone: Git fetches only the most recent commit instead of the full history, so the clone is faster and smaller.
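
For illustration, the clone → index → delete pattern could look roughly like the sketch below. It is not RepoRAG's implementation: `ingest_remote`, `clone_command`, and the `index_fn` callback are hypothetical names, and the only real external piece is the standard `git clone --depth` option.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

def clone_command(url: str, dest: Path, depth: int = 1) -> list[str]:
    """Build the shallow-clone command; --depth is a standard git clone option."""
    return ["git", "clone", "--depth", str(depth), url, str(dest)]

def ingest_remote(url, index_fn, depth=1, run=subprocess.run):
    """Clone into a temp dir, index it, and always delete the clone afterwards.

    index_fn is a hypothetical callback that builds the index from a folder.
    """
    workdir = Path(tempfile.mkdtemp(prefix="reporag_clone_"))
    try:
        run(clone_command(url, workdir / "repo", depth), check=True)
        return index_fn(workdir / "repo")
    finally:
        # Runs even if cloning or indexing fails, so no clone is left behind.
        shutil.rmtree(workdir, ignore_errors=True)
```

Putting the cleanup in `finally` is the design choice that matters: a failed ingest should not leave a copy of someone's repo sitting in a temp folder.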

3. Ask questions (with k)

python3 rag_cli.py ask "Where is the LLM entry point?"
python3 rag_cli.py ask "How is authentication handled end-to-end?" --k 10

--k sets the number of chunks RepoRAG retrieves from the FAISS index as context for the answer (10 in the second example). It matters because it’s the simplest knob for trading off recall vs noise.

Provider abstraction: OpenAI and local options

A core design goal is portability: the LLM and embeddings providers are configured in reporag/config.py and via environment variables, so they can be swapped. The examples here use OpenAI.

An example .env using OpenAI includes:

  • LLM_PROVIDER=openai

  • OPENAI_API_KEY=...

  • OPENAI_EMBED_MODEL=text-embedding-3-small

  • optional OPENAI_EMBED_DIM=768

  • OPENAI_CHAT_MODEL=gpt-4o-mini

  • INDEX_DIR=.reporag_index

Even if you run locally (Ollama, etc.), this split stays the same conceptually: you want an embeddings model that’s fast and a chat model that’s good at “answer from evidence.”
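
As a rough sketch of how these variables could be read into one settings object (I'm not reproducing reporag/config.py here): `Settings` and `load_settings` are hypothetical names, the fallback values other than INDEX_DIR are simply the example values above, and the sketch assumes the .env values are already exported into the environment.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

@dataclass(frozen=True)
class Settings:
    provider: str
    embed_model: str
    embed_dim: Optional[int]  # None means "use the model's native dimension"
    chat_model: str
    index_dir: str

def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read provider settings from environment-style key/value pairs."""
    dim = env.get("OPENAI_EMBED_DIM")
    return Settings(
        provider=env.get("LLM_PROVIDER", "openai"),
        embed_model=env.get("OPENAI_EMBED_MODEL", "text-embedding-3-small"),
        embed_dim=int(dim) if dim else None,  # optional setting
        chat_model=env.get("OPENAI_CHAT_MODEL", "gpt-4o-mini"),
        index_dir=env.get("INDEX_DIR", ".reporag_index"),
    )

settings = load_settings({"OPENAI_EMBED_DIM": "768"})
print(settings.embed_model, settings.embed_dim, settings.index_dir)
```

Keeping the embedding settings and the chat settings as separate fields is what makes the "fast embedder, careful answerer" split easy to preserve when you change providers.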

The real engineering decisions in RepoRAG

Most RAG tools live or die on boring details. Here are the big ones for RepoRAG.

1. Chunking strategy: code isn’t prose

Chunking code is tricky because semantics are often non-local:

  • A function signature in one file, implementation elsewhere

  • A config key defined in a .yaml and read in Python

  • Auth flows spanning middleware + routes + handlers

A naive chunking strategy can either:

  • fragment important context, or

  • create huge chunks that embed slowly and retrieve noisily

RepoRAG explicitly calls out chunk size/overlap as a performance lever (increase chunk size / reduce overlap to reduce total chunks). But it’s not only about performance: chunking also affects retrieval accuracy. Bigger chunks are more likely to contain the full idea, but they can also increase irrelevant matches.

Practical heuristic: start with conservative chunking (smaller, more overlap) for correctness; once it works, tune chunking for speed.
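
To make the size/overlap tradeoff concrete, here is a minimal character-based chunker. It is a sketch under assumptions: `chunk_text` is a hypothetical helper, the numbers are arbitrary, and RepoRAG's actual splitter may work differently (for example, by tokens or by code structure).

```python
def chunk_text(text: str, chunk_size: int = 800, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that overlap by `overlap`."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

source = "x" * 10_000  # stand-in for a 10,000-character file
print(len(chunk_text(source, chunk_size=800, overlap=200)))   # 17
print(len(chunk_text(source, chunk_size=1600, overlap=100)))  # 7
```

Same file, less than half the chunks to embed. That is the speed win, and also the reason retrieval behavior changes when you tune these numbers.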

2. Retrieval k: the simplest knob that matters

RepoRAG supports increasing k (example shows --k 10).

  • Low k → cleaner context, but higher chance you miss the key file

  • High k → better recall, but more distractions (and longer prompts)

In practice, I like:

  • k=4–6 for “where is X implemented?”

  • k=8–12 for “explain end-to-end flow”

3. Index location and reproducibility

RepoRAG stores the index under a configurable directory (by default INDEX_DIR=.reporag_index).

That is a good default because:

  • it keeps the index near the repo

  • it’s easy to delete and rebuild

  • it doesn’t pollute global state

Note: You need to rebuild the index whenever the embedding model or its dimensions change (delete .reporag_index and re-ingest). This is critical: vectors from a different model or dimension are not comparable with the ones already in the index, so retrieval either errors out or quietly returns poor matches.
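
One way to catch this early is to record the embedding settings next to the index and compare them before answering. The sketch below is my own illustration, not something RepoRAG ships: the `manifest.json` file name and both helper functions are hypothetical.

```python
import json
from pathlib import Path

MANIFEST = "manifest.json"  # hypothetical file name, not part of RepoRAG

def write_manifest(index_dir, embed_model, embed_dim):
    """Record which embedding settings produced the index."""
    path = Path(index_dir)
    path.mkdir(parents=True, exist_ok=True)
    payload = {"embed_model": embed_model, "embed_dim": embed_dim}
    (path / MANIFEST).write_text(json.dumps(payload))

def index_is_compatible(index_dir, embed_model, embed_dim) -> bool:
    """Return True only if the saved settings match the current ones."""
    manifest = Path(index_dir) / MANIFEST
    if not manifest.exists():
        return False  # unknown provenance: safer to rebuild
    saved = json.loads(manifest.read_text())
    return saved == {"embed_model": embed_model, "embed_dim": embed_dim}
```

With a check like this, a mismatch becomes an explicit "please re-ingest" message instead of a confusing answer.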

4. Large repo runtime performance: embeddings dominate

For large repos, embedding time dominates ingestion.

Suggested optimizations include:

  • Use a smaller embedding model (example: text-embedding-3-small)

  • Optionally reduce embedding dimensions (example: OPENAI_EMBED_DIM=768)

  • Restrict indexing to key folders (example list: ["api","web","docs"])

  • Tune chunk size / overlap to reduce chunk count
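
The folder restriction is the cheapest of these to picture. Here is a minimal sketch of a file filter, assuming a hypothetical `iter_source_files` helper and an extension allow-list that I made up for the example; only the ["api","web","docs"] folder list comes from the options above.

```python
from pathlib import Path

INCLUDE_DIRS = ["api", "web", "docs"]      # example folder list from above
INCLUDE_SUFFIXES = {".py", ".ts", ".md"}   # assumption: extensions worth indexing

def iter_source_files(root, include_dirs=INCLUDE_DIRS, suffixes=INCLUDE_SUFFIXES):
    """Yield repo-relative paths under the chosen folders with allowed suffixes."""
    root = Path(root)
    for name in include_dirs:
        folder = root / name
        if not folder.is_dir():
            continue  # a listed folder may not exist in every repo
        for path in sorted(folder.rglob("*")):
            if path.is_file() and path.suffix in suffixes:
                yield path.relative_to(root)
```

Everything outside the listed folders is never read, chunked, or embedded, which is where the time savings come from, and also why a too-narrow filter can hide the answer from the index.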

Security: your index is sensitive

RepoRAG includes a security note that’s easy to ignore but extremely important:

  • the ingest-remote command deletes the cloned repo directory after indexing

  • but the FAISS index contains embedded chunks of text used for retrieval

  • treat the index as sensitive if the repo is private

In practice, this implies:

  • Don’t casually upload .reporag_index to public storage

  • Don’t email the index around

  • Consider excluding it from git

  • Consider encrypting at rest if you store it somewhere shared

What “good” looks like: grounded answers + fast verification

When RepoRAG is working well, the interaction looks like:

  1. I ask a question

  2. RepoRAG returns a short explanation

  3. It gives me a few file paths

  4. I open those files and confirm quickly

These file-path citations are important. They turn the LLM from an oracle into a navigation assistant.

Failure modes I expect (and how I think about them)

Even without fancy evaluation harnesses, RAG systems like RepoRAG fail in predictable ways:

1. “The index doesn’t contain it”

  • You filtered out the folder that contains the answer (trading coverage for ingestion speed)

  • The file type wasn’t loaded

  • The repo changed since ingest

Fix: re-ingest; relax filters; ensure key folders are included.

2. “Retrieved chunks are close, but not the answer”

Common with:

  • generic naming (utils.py, helpers.ts)

  • repeated patterns (auth middleware in multiple apps)

  • generated code

Fix: increase k; tighten chunking; add metadata signals (future improvement).

3. “LLM answers beyond the evidence”

This is the classic RAG hallucination problem.

Fix: prompt contract: answer only from retrieved context; if insufficient, say so and cite what you do have.
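
A prompt contract of this kind could be assembled roughly as follows. The wording and the `build_prompt` helper are illustrative assumptions, not RepoRAG's actual prompt.

```python
def build_prompt(question: str, hits: list[tuple[str, str]]) -> str:
    """Build a grounded prompt from (file_path, chunk_text) pairs."""
    context = "\n\n".join(
        f"[{i}] {path}\n{text}" for i, (path, text) in enumerate(hits, 1)
    )
    return (
        "Answer using ONLY the context below.\n"
        "If the context is insufficient, say so and cite what you do have.\n"
        "Cite the file path for every claim.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
    )

# Made-up chunk, purely for illustration.
prompt = build_prompt(
    "Where is the LLM entry point?",
    [("app/pipeline.py", "def run_pipeline(prompt): ...")],
)
print(prompt)
```

Labeling each chunk with its file path inside the context is what lets the model cite paths at all; the contract only tells it to use them.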

What I would improve next (if I keep iterating)

These are not commitments, just the natural next steps for RepoRAG:

  1. Evaluation loop: a small curated Q/A set per repo (“golden questions”) + regression checks

  2. Hybrid retrieval: BM25 + vectors (code identifiers and exact strings matter a lot)

  3. Better citations: include file path + line ranges when possible

  4. Incremental indexing: only re-embed changed files

  5. Tracing: log retrieval hits, chunk IDs, prompt length, latency (so tuning becomes data-driven)
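
To give a feel for item 4, here is one possible shape for incremental indexing: keep a content hash per file and only re-embed what changed. None of this exists in RepoRAG today; `plan_reindex` and its inputs are hypothetical, and removing stale vectors from the index is left out.

```python
import hashlib

def file_digest(data: bytes) -> str:
    """Content hash used to detect changed files."""
    return hashlib.sha256(data).hexdigest()

def plan_reindex(previous: dict[str, str], current_files: dict[str, bytes]):
    """Compare saved hashes with current contents.

    Returns (paths to re-embed, paths to drop, new hash table).
    """
    current = {path: file_digest(data) for path, data in current_files.items()}
    to_embed = sorted(p for p, h in current.items() if previous.get(p) != h)
    to_drop = sorted(p for p in previous if p not in current)
    return to_embed, to_drop, current

# Made-up file contents, purely for illustration.
_, _, saved = plan_reindex({}, {"a.py": b"v1", "b.py": b"v1"})
to_embed, to_drop, saved = plan_reindex(saved, {"a.py": b"v1", "b.py": b"v2", "c.py": b"v1"})
print(to_embed, to_drop)  # ['b.py', 'c.py'] []
```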

Try it yourself

RepoRAG is on GitHub:
https://github.com/hisahoo009/RepoRAG

Quick start (local folder)

git clone https://github.com/hisahoo009/RepoRAG.git
cd RepoRAG

python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install -r requirements.txt

cp .env.example .env

python3 rag_cli.py ingest .
python3 rag_cli.py ask "Where is the backend entry point?"

Index a remote repo (example)

python3 rag_cli.py ingest-remote https://github.com/OpenPipe/ART.git --depth 1
python3 rag_cli.py ask "How do I run ART?"