RepoRAG: Repo-Native RAG for Local Folders & Remote Git Repos

When I am dropped into a new codebase, I don’t want “general advice.” I want specific, grounded answers:
Where is the main LLM pipeline entry point?
How does the end-to-end inference flow work?
Where is the retrieval feature implemented?
The problem is that codebases are large, fast-moving, and full of local conventions. Even good LLMs can hallucinate or give high-level answers unless you force them to retrieve evidence.
RepoRAG is my attempt at a pragmatic middle ground: a small CLI that builds a FAISS vector index from a local folder or a remote Git repo, then answers questions with retrieval-augmented generation and file-path citations so you can verify quickly.
This post explains the design, the tradeoffs, and what I learned building a repo-native RAG tool that’s meant to be practical, not magical.
What RepoRAG is (and what it isn’t)
RepoRAG is a CLI that:
Ingests a local folder into a FAISS index
Ingests a remote Git repo (clone → index → delete clone)
Answers questions against the saved index with grounded answers + source file paths
Uses file filters + smaller embedding models to speed up large repo ingest
It’s not trying to be a full agent framework, a PR reviewer, or an IDE plugin (yet). It’s intentionally a simple workflow you can run in a terminal when you need answers that are traceable back to the repo.
Why file-path citations are the “killer feature”
A lot of RAG demos show citations as chunks of text. For code, that’s not enough. What I actually want is:
a short answer, and
a shortlist of files that I should open next.
RepoRAG is optimized around that: answers are grounded and include source file paths (so I can jump into the code immediately). This changes the UX from “trust the model” to “use the model to triage where to look.”
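To make the "shortlist of files" idea concrete, here is a minimal sketch of turning retrieved chunks into citations. The RetrievedChunk type, the cite_paths helper, and the sample paths are hypothetical names for illustration, not RepoRAG's actual code; I also assume that a higher score means a better match.

```python
from dataclasses import dataclass


@dataclass
class RetrievedChunk:
    path: str     # source file the chunk came from
    text: str     # chunk contents
    score: float  # similarity score; assumption: higher is better


def cite_paths(chunks, limit=5):
    """Return unique file paths, best-scoring chunk first."""
    ranked = sorted(chunks, key=lambda c: c.score, reverse=True)
    paths = []
    for chunk in ranked:
        if chunk.path not in paths:  # several chunks may share one file
            paths.append(chunk.path)
    return paths[:limit]


hits = [
    RetrievedChunk("api/auth/middleware.py", "def require_login(...)", 0.81),
    RetrievedChunk("api/routes/login.py", "@app.post('/login')", 0.77),
    RetrievedChunk("api/auth/middleware.py", "class SessionStore", 0.70),
]
print(cite_paths(hits))  # ['api/auth/middleware.py', 'api/routes/login.py']
```

Deduplicating by path matters because several chunks often come from the same file, and what I want back is a list of files to open, not a list of fragments.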
High-Level Architecture
At a high level, RepoRAG is the standard RAG loop applied to code:
Collect documents (files from a local folder or a cloned repo)
Chunk documents
Embed chunks
Store in FAISS
On question: retrieve top-k chunks
Generate an answer constrained by retrieved context
Return answer + file-path citations
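The retrieval half of that loop can be sketched in a few lines of standard-library Python. This is a toy stand-in, not RepoRAG's implementation: embed() is a bag-of-words counter instead of a real embedding model, a plain list with a cosine scan replaces FAISS, and the file contents are made up. The generation step is left out because it needs an LLM call.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: word counts. A real setup would call an embedding model."""
    return Counter(re.findall(r"[a-z_]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def chunk(text, size=200):
    """Split text into fixed-size character windows (no overlap, for brevity)."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def build_index(files):
    """files maps path -> contents. Returns rows of (path, chunk_text, vector)."""
    return [(path, piece, embed(piece))
            for path, body in files.items()
            for piece in chunk(body)]


def retrieve(index, question, k=4):
    """Return the k rows most similar to the question."""
    q = embed(question)
    return sorted(index, key=lambda row: cosine(q, row[2]), reverse=True)[:k]


# Made-up file contents, for illustration only.
files = {
    "rag_cli.py": "def main(): parse args and dispatch ingest or ask commands",
    "reporag/remote.py": "def clone_repo(url, depth): run git clone into a temp dir",
}
index = build_index(files)
hits = retrieve(index, "where is git clone handled", k=1)
print([path for path, _, _ in hits])  # ['reporag/remote.py']
```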
Repo structure and why it matters
RepoRAG keeps the layout intentionally small and readable. The project structure (as documented) separates the CLI entry point, configuration, ingestion, retrieval, and remote handling.
rag_cli.py – CLI entry point
reporag/ – config, loaders, ingest, rag, remote
.reporag_index/ – the vector index, created after ingest
Workflow: local ingest, remote ingest, and ask
RepoRAG exposes three core user actions:
1. Index a local folder
python3 rag_cli.py ingest .
This is the “index whatever is here” workflow.
2. Index a remote repo (clone → index → delete clone)
python3 rag_cli.py ingest-remote https://github.com/langgenius/dify.git --depth 1
This is useful when you want to interrogate a repo you haven’t checked out. RepoRAG’s documented behavior is to delete the cloned repo directory after indexing. --depth 1 requests a shallow clone: Git fetches only the most recent commit (roughly, the newest snapshot) instead of the full history, so the clone is faster and smaller.
3. Ask questions (with k)
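The clone → index → delete pattern is easy to get right with a temporary directory. The sketch below is my illustration of the idea, not RepoRAG's code: ingest_remote and index_fn are hypothetical names, and the run parameter exists only so the git call can be swapped out in tests.

```python
import os
import subprocess
import tempfile


def ingest_remote(url, index_fn, depth=1, run=subprocess.run):
    """Shallow-clone url into a temp dir, call index_fn(path), then delete the clone."""
    with tempfile.TemporaryDirectory(prefix="reporag_clone_") as workdir:
        target = os.path.join(workdir, "repo")
        # --depth N limits the clone to the N most recent commits.
        run(["git", "clone", "--depth", str(depth), url, target], check=True)
        result = index_fn(target)
    # Leaving the with-block removes workdir, including the clone.
    return result


# Usage (needs git and network access), with a hypothetical indexing function:
# ingest_remote("https://github.com/langgenius/dify.git", my_index_fn, depth=1)
```

Because the cleanup happens when the with-block exits, the clone is also removed if indexing raises an exception.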
python3 rag_cli.py ask "Where is the LLM entry point?"
python3 rag_cli.py ask "How is authentication handled end-to-end?" --k 10
--k 10 sets the number of chunks RepoRAG retrieves from the FAISS index as context for the answer. It’s the simplest knob for trading off recall against noise.
Provider abstraction: OpenAI and local options
A core design goal is portability: the examples use OpenAI, but the LLM and embeddings provider can be swapped via reporag/config.py and environment variables.
An example .env using OpenAI includes:
LLM_PROVIDER=openai
OPENAI_API_KEY=...
OPENAI_EMBED_MODEL=text-embedding-3-small
# optional
OPENAI_EMBED_DIM=768
OPENAI_CHAT_MODEL=gpt-4o-mini
INDEX_DIR=.reporag_index
Even if you run locally (Ollama, etc.), this split stays the same conceptually: you want an embeddings model that’s fast and a chat model that’s good at “answer from evidence.”
The real engineering decisions in RepoRAG
Most RAG tools live or die on boring details. Here are the big ones for RepoRAG.
1. Chunking strategy: code isn’t prose
Chunking code is tricky because semantics are often non-local:
A function signature in one file, implementation elsewhere
A config key defined in a .yaml and read in Python
Auth flows spanning middleware + routes + handlers
A naive chunking strategy can either:
fragment important context, or
create huge chunks that embed slowly and retrieve noisily
RepoRAG explicitly calls out chunk size/overlap as a performance lever (increase chunk size / reduce overlap to reduce total chunks). That’s not just a performance question: changing chunking also affects retrieval accuracy. Bigger chunks are more likely to contain the full idea, but they can also increase irrelevant matches.
Practical heuristic: start with conservative chunking (smaller, more overlap) for correctness; once it works, tune chunking for speed.
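To see how much size and overlap move the chunk count, here is a minimal sliding-window chunker. It is a sketch under my own assumptions: it counts characters (RepoRAG may well split differently, for example by tokens or lines), and chunk_text and the numbers are illustrative only.

```python
def chunk_text(text, size=800, overlap=200):
    """Split text into windows of `size` characters that share `overlap` characters."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):  # this window already reached the end
            break
    return chunks


source = "x" * 10_000  # stand-in for a 10,000-character file
print(len(chunk_text(source, size=800, overlap=200)))   # 17
print(len(chunk_text(source, size=1600, overlap=100)))  # 7
```

On this input, doubling the size and halving the overlap cuts the number of chunks (and so the number of embedding calls) from 17 to 7, which is why these two settings are worth tuning once answers look correct.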
2. Retrieval k: the simplest knob that matters
RepoRAG supports increasing k (example shows --k 10).
Low k → cleaner context, but higher chance you miss the key file
High k → better recall, but more distractions (and longer prompts)
In practice, I like:
k=4–6 for “where is X implemented?”
k=8–12 for “explain end-to-end flow”
3. Index location and reproducibility
RepoRAG stores the index under a configurable directory (by default INDEX_DIR=.reporag_index).
That is a good default because:
it keeps the index near the repo
it’s easy to delete and rebuild
it doesn’t pollute global state
Note: You need to rebuild the index whenever the embedding model or dimensions change (delete .reporag_index and re-ingest). This is critical: vectors from different models or dimensions are not comparable, so a stale index can either fail at query time or quietly return poor matches.
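One way to catch a stale index early is to record the embedding settings next to it and compare them before querying. This is a sketch of that idea, not something RepoRAG is documented to do: the manifest.json file name and both helper functions are hypothetical.

```python
import json
import tempfile
from pathlib import Path

MANIFEST = "manifest.json"  # hypothetical file name, not part of RepoRAG


def write_manifest(index_dir, model, dim):
    """Record which embedding settings produced the index."""
    index_dir = Path(index_dir)
    index_dir.mkdir(parents=True, exist_ok=True)
    payload = {"embed_model": model, "embed_dim": dim}
    (index_dir / MANIFEST).write_text(json.dumps(payload))


def index_is_stale(index_dir, model, dim):
    """True if the index is missing a manifest or was built with other settings."""
    path = Path(index_dir) / MANIFEST
    if not path.exists():
        return True
    saved = json.loads(path.read_text())
    return saved.get("embed_model") != model or saved.get("embed_dim") != dim


with tempfile.TemporaryDirectory() as tmp:
    index_dir = Path(tmp) / ".reporag_index"
    print(index_is_stale(index_dir, "text-embedding-3-small", 768))   # True
    write_manifest(index_dir, "text-embedding-3-small", 768)
    print(index_is_stale(index_dir, "text-embedding-3-small", 768))   # False
    print(index_is_stale(index_dir, "text-embedding-3-small", 1536))  # True
```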
4. Large repo runtime performance: embeddings dominate
For large repos, embedding time dominates ingestion.
Suggested optimizations include:
Use a smaller embedding model (example: text-embedding-3-small)
Optionally reduce embedding dimensions (example: OPENAI_EMBED_DIM=768)
Restrict indexing to key folders (example list: ["api","web","docs"])
Tune chunk size / overlap to reduce chunk count
Security: your index is sensitive
RepoRAG includes a security note that’s easy to ignore but extremely important:
ingest-remote deletes the cloned repo directory after indexing
but the FAISS index contains embedded chunks of text used for retrieval
treat the index as sensitive if the repo is private
In practice, this implies:
Don’t casually upload .reporag_index to public storage
Don’t email the index around
Consider excluding it from git
Consider encrypting at rest if you store it somewhere shared
What “good” looks like: grounded answers + fast verification
When RepoRAG is working well, the interaction looks like:
I ask a question
RepoRAG returns a short explanation
It gives me a few file paths
I open those files and confirm quickly
These file-path citations are important. They turn the LLM from an oracle into a navigation assistant.
Failure modes I expect (and how I think about them)
Even without fancy evaluation harnesses, RAG systems like RepoRAG fail in predictable ways:
1. “The index doesn’t contain it”
You filtered out the folder that contains the answer, trading coverage for ingestion speed
The file type wasn’t loaded
The repo changed since ingest
Fix: re-ingest; relax filters; ensure key folders are included.
2. “Retrieved chunks are close, but not the answer”
Common with:
generic naming (utils.py, helpers.ts)
repeated patterns (auth middleware in multiple apps)
generated code
Fix: increase k; tighten chunking; add metadata signals (future improvement).
3. “LLM answers beyond the evidence”
This is the classic RAG hallucination problem.
Fix: prompt contract: answer only from retrieved context; if insufficient, say so and cite what you do have.
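A prompt contract like that can be as simple as a fixed set of rules placed ahead of the retrieved context. The wording and the build_prompt helper below are a sketch of the idea, not RepoRAG's actual prompt.

```python
def build_prompt(question, chunks):
    """Build a grounded prompt. chunks is a list of (file_path, text) pairs."""
    rules = (
        "Answer using ONLY the context below.\n"
        "If the context is insufficient, say so and cite what you do have.\n"
        "List the file paths you relied on under 'Sources:'."
    )
    if chunks:
        # Label every chunk with its file path so the model can cite it.
        context = "\n\n".join(f"--- {path} ---\n{text}" for path, text in chunks)
    else:
        context = "(no context retrieved)"
    return f"{rules}\n\nContext:\n{context}\n\nQuestion: {question}\nAnswer:"


prompt = build_prompt(
    "How is authentication handled?",
    [("api/auth.py", "def require_login(request): ...")],  # made-up chunk
)
print(prompt)
```

A contract like this narrows the failure mode but does not eliminate it, which is one more reason the file paths matter: they let me check the answer against the code.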
What I would improve next (if I keep iterating)
These are not commitments, just the natural next steps for RepoRAG:
Evaluation loop: a small curated Q/A set per repo (“golden questions”) + regression checks
Hybrid retrieval: BM25 + vectors (code identifiers and exact strings matter a lot)
Better citations: include file path + line ranges when possible
Incremental indexing: only re-embed changed files
Tracing: log retrieval hits, chunk IDs, prompt length, latency (so tuning becomes data-driven)
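For the hybrid retrieval item, one common way to merge a keyword ranking with a vector ranking is reciprocal rank fusion (RRF). I have not built this into RepoRAG; the sketch below only shows the fusion step, with made-up file lists standing in for real BM25 and FAISS results, and k=60 as the conventional smoothing constant.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of ids. Each list adds 1 / (k + rank) to an id's score."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


vector_hits = ["a.py", "b.py", "c.py"]   # stand-in for FAISS results
keyword_hits = ["c.py", "a.py", "d.py"]  # stand-in for BM25 results
print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
# ['a.py', 'c.py', 'b.py', 'd.py']
```

Files that appear in both lists rise to the top, which is the behavior I want when an exact identifier match and a semantic match agree.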
Try it yourself
RepoRAG is on GitHub:
https://github.com/hisahoo009/RepoRAG
Quick start (local folder)
git clone https://github.com/hisahoo009/RepoRAG.git
cd RepoRAG
python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install -r requirements.txt
cp .env.example .env
python3 rag_cli.py ingest .
python3 rag_cli.py ask "Where is the backend entry point?"
Index a remote repo (example)
python3 rag_cli.py ingest-remote https://github.com/OpenPipe/ART.git --depth 1
python3 rag_cli.py ask "How do I run ART?"
