Why I Built This Tutorial Series
Large Language Models are powerful, but they hallucinate. Retrieval-Augmented Generation (RAG) mitigates this by grounding answers in real documents. The problem? Most RAG guides stop at “embed, retrieve, generate” and leave you guessing whether your pipeline actually improved.
I wanted something different: a single, progressive codebase where each tutorial changes exactly one variable, runs against the same dataset, and produces comparable metrics. That way you can see the impact of semantic chunking, reranking, hybrid retrieval, and even agentic patterns — not just read about them.
The result is the all-things-rag repository: eight Jupyter notebooks, a shared Python library, and a common evaluation framework. This post walks you through every tutorial, explains the why behind each technique, and gives you everything you need to run the code yourself.
Who This Guide Is For
- Beginners who have heard of RAG but have never built one.
- Practitioners who have a basic pipeline and want to know what to improve next.
- Engineers curious about how agentic patterns (ReAct, reflection, state management) extend RAG.
No prior experience with embeddings or vector databases is required. Familiarity with Python and basic command-line usage is enough to follow along.
How the Tutorials Are Organized
The series is split into two parts. Part 1 (Tutorials 1–5) covers RAG fundamentals — the retrieval pipeline itself. Part 2 (Tutorials 6–8) covers the agent extension — wrapping the RAG pipeline as a tool that an autonomous agent can call.
Part 1: RAG Fundamentals
Tutorial 1 → Tutorial 2 → Tutorial 3 → Tutorial 4 → Tutorial 5
(Baseline) (Chunking) (Reranking) (Hybrid) (Benchmark)
Part 2: Agent Extension
Tutorial 6 → Tutorial 7 → Tutorial 8
(ReAct) (Reflection) (State Mgmt)
Every RAG tutorial reuses the same:
- Domain scenario — an international work policy handbook.
- Query set — intentionally varied questions that stress-test retrieval.
- Evaluation metrics — Recall@k, MRR, groundedness, and latency.
This means the only thing that changes between tutorials is the technique, making results directly comparable.
Setting Up Your Environment
Before diving into the tutorials, let’s get the repository running locally. You will need:
- Python 3.11 (the repo pins 3.11.13)
- uv — a fast Python package manager (install guide)
- An OpenAI API key — used for embeddings and answer generation
Step-by-step setup
```bash
# 1. Clone the repository
git clone https://github.com/nmadhire-agents/all-things-rag.git
cd all-things-rag

# 2. Create your environment file
cp .env.example .env
# Edit .env and set your OPENAI_API_KEY

# 3. Install dependencies
uv sync

# 4. Generate the shared dataset (only needed once)
uv run python scripts/generate_data.py

# 5. Launch Jupyter
uv run jupyter lab
```

Open the `tutorials/` folder inside Jupyter and work through the notebooks in order.
What the dataset looks like
The generator creates two files:
| File | Contents |
|---|---|
| `data/documents.jsonl` | Handbook sections with fields `doc_id`, `title`, `section`, and `text` |
| `data/queries.jsonl` | Evaluation queries with `question`, `target_doc_id`, `target_section`, and `rationale` |
Every tutorial reads from these same files, so results are always apples-to-apples.
Part 1: RAG Fundamentals
Tutorial 1 — Dense Retrieval Baseline (Fixed Chunks)
Notebook: tutorials/01_basic_rag.ipynb
What you build: A complete, end-to-end RAG pipeline — the simplest version that actually works.
The core idea: RAG has three stages: chunk the documents, embed them into vectors, and retrieve the most relevant chunks when a user asks a question. An LLM then generates an answer grounded in the retrieved context.
How it works step by step:
- Load documents — The handbook text is parsed into section-level documents.
- Fixed-width chunking — Each document is split into 260-character segments. This is the simplest chunking strategy: just slice the text at regular intervals.
- Embed chunks — Each chunk is sent to OpenAI’s `text-embedding-3-small` model, which returns a 1536-dimensional vector. The notebook shows you the actual matrix shape and sample vector values.
- Index in ChromaDB — The vectors are stored in a persistent ChromaDB collection so you can query them later.
- Retrieve — For a user question, the question is embedded with the same model, and cosine similarity finds the closest chunks.
- Generate — The top-k retrieved chunks are formatted into a prompt, and an LLM produces a grounded answer with chunk citations.
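To make the mechanics concrete, here is a minimal, dependency-free sketch of steps 2 and 5. The function names are illustrative, not the repo's actual API, and a real pipeline would get its vectors from the embedding model rather than take them as lists:

```python
import math

def fixed_width_chunks(text: str, width: int = 260) -> list[str]:
    # Slice the text every `width` characters, ignoring sentence boundaries.
    return [text[i:i + width] for i in range(0, len(text), width)]

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Cosine similarity = dot product divided by the product of vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    denom = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / denom if denom else 0.0

def top_k(query_vec: list[float], chunk_vecs: list[list[float]], k: int = 3):
    # Retrieval = rank all chunk vectors by similarity to the query vector.
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    return sorted(scored, key=lambda s: s[1], reverse=True)[:k]
```

The same cosine scoring is what ChromaDB performs internally; writing it out once is the fastest way to demystify “semantic search.”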
What you see in the notebook:
- The embedding matrix shape (e.g., `(42, 1536)`) and what individual vector values look like.
- A cosine similarity scoring example that demystifies “semantic search.”
- A top-k retrieval table showing chunk IDs, similarity scores, and text snippets.
What breaks and why: Fixed-width chunking blindly cuts text at character boundaries. A policy rule and its exception might end up in different chunks. When the user asks “What is the policy for working from another country?”, the pipeline over-retrieves generic remote-work chunks instead of the specific international policy section.
Key takeaway: This baseline works, but chunk boundary problems hurt retrieval quality. That sets up Tutorial 2.
Tutorial 2 — Semantic Chunking
Notebook: tutorials/02_semantic_chunking.ipynb
What changes: Only the chunking strategy. Everything else (embeddings, vector store, LLM) stays the same.
The core idea: Instead of slicing text at arbitrary character counts, group sentences that belong together semantically. If a policy rule and its exception are in consecutive sentences, keep them in the same chunk.
How the semantic chunker works:
The implementation takes a lightweight, practical approach:
- Split the document text on sentence boundaries (`.` delimiters).
- If there are two or fewer sentences, keep them as a single chunk.
- Otherwise, group the first two sentences into one chunk and the remaining sentences into another.
This is intentionally simple — the goal is to show the impact of keeping related content together, not to build the most sophisticated chunker possible.
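The chunker described above is simple enough to reproduce in a few lines. This is a hypothetical reimplementation based on those rules, not the repo's exact code:

```python
def semantic_chunks(text: str) -> list[str]:
    # Split on sentence boundaries so related statements stay together.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    if len(sentences) <= 2:
        # Two or fewer sentences: keep everything as a single chunk.
        return [". ".join(sentences) + "."]
    # Otherwise: first two sentences form one chunk, the rest form another.
    return [". ".join(sentences[:2]) + ".",
            ". ".join(sentences[2:]) + "."]
```

A rule stated in sentence one and its exception in sentence two now always land in the same chunk, which is exactly the failure mode Tutorial 1 exposed.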
What you see in the notebook:
- A side-by-side visualization of fixed vs. semantic chunk boundaries on the same handbook section.
- Retrieval results for the same queries, now with better coverage of policy rules and their exceptions.
Expected improvement: Questions where rules and exceptions must stay together see noticeable retrieval gains. The fragmentation problem from Tutorial 1 is reduced.
Key takeaway: Better chunking improves retrieval quality without changing anything else in the pipeline. But even with good chunks, the initial ranking can still be noisy — which leads to Tutorial 3.
Tutorial 3 — Two-Stage Retrieval with Reranking
Notebook: tutorials/03_reranking.ipynb
What changes: A second-stage reranker is added after dense retrieval. Chunking and embedding remain the same.
The core idea: Dense retrieval (cosine similarity on embeddings) is fast but approximate. It casts a wide net and retrieves candidates that are roughly relevant. A cross-encoder reranker then scores each candidate pairwise against the query — much more expensive, but much more precise.
Think of it as a two-round interview process: the first round screens résumés quickly, the second round does in-depth interviews on the shortlisted candidates.
How the reranker works:
The tutorial uses a local cross-encoder model (cross-encoder/ms-marco-MiniLM-L-6-v2) from the sentence-transformers library:
- Dense retrieval returns the top candidates (e.g., top 20).
- The cross-encoder scores each `(query, chunk)` pair using full attention — it sees the query and chunk together, not as separate embeddings.
- Results are re-sorted by cross-encoder score, and the new top-k is returned.
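The two-stage flow can be sketched without downloading the model. Here a toy token-overlap scorer stands in for the cross-encoder; in the notebook, `CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2").predict` from sentence-transformers produces the real scores:

```python
def stub_cross_encoder_score(query: str, chunk: str) -> float:
    # Placeholder for the real cross-encoder: fraction of query tokens in the chunk.
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / max(len(q), 1)

def rerank(query: str, candidates: list[str], top_k: int = 3) -> list[str]:
    # Stage 2: score every (query, chunk) pair, then re-sort the candidates.
    scored = [(stub_cross_encoder_score(query, chunk), chunk) for chunk in candidates]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]
```

The structure is what matters: a cheap first stage over-fetches (e.g., 20 candidates), and an expensive pairwise scorer re-orders only that shortlist.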
What you see in the notebook:
- A before/after rank movement table showing how chunks shift position after reranking.
- The quality-latency tradeoff: reranking is slower but produces better top-ranked context.
Expected improvement: The most relevant chunk is more likely to appear at rank 1 after reranking, even when initial vector search rankings are fuzzy.
Key takeaway: First-stage retrieval gets you into the right neighborhood; reranking gets you to the right door. But there is still a class of queries that dense retrieval misses entirely — Tutorial 4 addresses that.
Tutorial 4 — Hybrid Retrieval (Dense + BM25)
Notebook: tutorials/04_hybrid_search.ipynb
What changes: BM25 keyword retrieval is added alongside dense retrieval, and results are fused using Reciprocal Rank Fusion (RRF).
The core idea: Dense (embedding-based) retrieval understands meaning but can miss exact terms. Keyword (BM25) retrieval matches exact tokens but misses semantics. Hybrid retrieval combines both to get the best of each.
A concrete example: if a policy mentions “Form A-12”, a user searching for that exact form ID needs keyword matching. Embedding models may not reliably distinguish Form A-12 from Form B-7 because their semantic meaning is similar. BM25, on the other hand, matches the exact token string.
How hybrid retrieval works:
- Dense path — Same embedding-based retrieval as before.
- Keyword path — A BM25 index (via the `rank-bm25` library) is built from chunk texts. Queries are tokenized and scored using the Okapi BM25 algorithm.
- Reciprocal Rank Fusion — Dense and keyword rankings are combined using the RRF formula:
\[\text{RRF\_score}(d) = \sum_{r \in \text{rankings}} \frac{1}{k + \text{rank}_r(d)}\]
where \(k = 60\) is a smoothing constant. This gives credit to chunks that appear high in either ranking, without needing to normalize scores across different retrieval methods.
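Translated into code, RRF is only a few lines. This is a generic sketch over lists of chunk IDs, with the `k = 60` default from the formula above:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    # rankings: one ordered list of chunk IDs per retrieval method.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # A document earns 1/(k + rank) from each ranking it appears in.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)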
What you see in the notebook:
- Dense vs. keyword vs. hybrid comparisons on the `Form A-12` example.
- How RRF rebalances rankings when one method finds something the other misses.
Expected improvement: Queries involving exact identifiers, codes, or form names are now handled reliably alongside conceptual questions.
Key takeaway: No single retrieval method is best for all queries. Hybrid fusion provides robustness across query types.
Tutorial 5 — Benchmark All Four Variants
Notebook: tutorials/05_rag_comparison.ipynb
What changes: Nothing new is introduced. This tutorial runs all four RAG variants side by side on the same query set and produces a single benchmark table.
The core idea: Architecture decisions should be driven by measured outcomes, not intuition. This tutorial gives you the numbers.
Evaluation metrics explained:
| Metric | What it measures | How it’s computed |
|---|---|---|
| Recall@k | Did the correct document appear in the top-k results? | Binary: 1 if the target document’s chunk is in the top-k, 0 otherwise |
| MRR | How high did the correct document rank? | 1 / rank of the first correct hit |
| Groundedness | Is the generated answer grounded in the retrieved context? | Lexical overlap between answer tokens and context tokens |
| Latency (ms) | How fast is the pipeline end to end? | Wall-clock time for retrieval + generation |
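These metrics are straightforward to implement. The following are hypothetical versions consistent with the table's definitions; the repo's metrics module may differ in detail:

```python
def recall_at_k(ranked_ids: list[str], target_id: str, k: int) -> int:
    # Binary: 1 if the target document appears in the top-k results, else 0.
    return int(target_id in ranked_ids[:k])

def reciprocal_rank(ranked_ids: list[str], target_id: str) -> float:
    # 1 / rank of the first correct hit; 0.0 if it never appears.
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == target_id:
            return 1.0 / rank
    return 0.0

def groundedness(answer: str, context: str) -> float:
    # Lexical overlap: fraction of answer tokens also present in the context.
    answer_tokens = set(answer.lower().split())
    context_tokens = set(context.lower().split())
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & context_tokens) / len(answer_tokens)
```

MRR is then the mean of `reciprocal_rank` over the whole query set, which is why a variant that ranks the right chunk first scores 1.0 and one that ranks it second scores only 0.5.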
What you see in the notebook:
- A comparison table with all four variants scored across all four metrics.
- Plots that visualize the tradeoffs (e.g., reranking improves quality but adds latency).
- A clear, data-driven basis for choosing which variant to deploy.
Key takeaway: Tutorial 5 is where you stop and decide which RAG pipeline to take forward. The remaining tutorials build on top of whichever variant you choose.
Part 2: Agent Extension
The agent tutorials wrap the RAG pipeline as a tool that an autonomous agent can call. Instead of the user deciding when and what to retrieve, the agent makes those decisions using structured reasoning patterns.
Tutorial 6 — ReAct Agent
Notebook: tutorials/06_react_agent.ipynb
What you build: An agent that follows the Reason + Act pattern (ReAct) — it thinks about what to do, calls a tool, observes the result, and repeats until it has an answer.
The core idea: In a standard RAG pipeline, the flow is rigid: embed the query → retrieve → generate. A ReAct agent, by contrast, decides at each step whether it needs to retrieve more information, and what query to send.
How the ReAct loop works:
┌─────────────────────────────────────────┐
│ User asks a question │
└──────────────┬──────────────────────────┘
▼
┌─────────────────────────────────────────┐
│ THINK: "I need to look up the │
│ international work policy" │
└──────────────┬──────────────────────────┘
▼
┌─────────────────────────────────────────┐
│ ACT: Call the retrieve tool with │
│ "international work policy" │
└──────────────┬──────────────────────────┘
▼
┌─────────────────────────────────────────┐
│ OBSERVE: Read the retrieved chunks │
└──────────────┬──────────────────────────┘
▼
┌─────────────────────────────────────────┐
│ THINK: "I have enough info to answer" │
│ ACT: finish │
└─────────────────────────────────────────┘
At each step, the agent outputs a JSON object with `thought`, `action`, and `action_input` fields. The orchestration loop parses this, calls the named tool, and feeds the observation back.
What the agent code does:
- The agent is connected to a set of named tools — any callable that takes a string and returns a string. The RAG retrieval function is the canonical example.
- The LLM is prompted to respond in structured JSON at every turn.
- The loop runs for at most `max_steps` cycles before forcing a finish.
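A stripped-down version of such an orchestration loop might look like this. The `llm` and tool callables are stand-ins, and the JSON schema follows the description above rather than the repo's exact prompts:

```python
import json

def react_loop(llm, tools: dict, question: str, max_steps: int = 5) -> str:
    # llm: callable taking the transcript so far, returning a JSON string
    # with "thought", "action", and "action_input" keys.
    history = [f"Question: {question}"]
    for _ in range(max_steps):
        step = json.loads(llm("\n".join(history)))
        if step["action"] == "finish":
            return step["action_input"]
        # ACT: call the named tool; OBSERVE: append the result to the transcript.
        observation = tools[step["action"]](step["action_input"])
        history.append(f"Thought: {step['thought']}")
        history.append(f"Action: {step['action']}({step['action_input']})")
        history.append(f"Observation: {observation}")
    return "Stopped after max_steps without a finish action."
```

Because the tools dictionary maps names to plain string-to-string callables, the RAG retrieval function drops in with no special wiring.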
What you see in the notebook:
- The full Thought → Action → Observation trace for each question.
- How the agent decides to reformulate its query when the first retrieval is insufficient.
Key takeaway: The agent adds flexibility — it can decide what to search for and when to stop, rather than following a rigid pipeline.
Tutorial 7 — Reflection and Self-Correction
Notebook: tutorials/07_reflection_agent.ipynb
What you build: A Worker-Critic system where one LLM drafts an answer and another reviews it, sending feedback for revision if needed.
The core idea: Even with good retrieval, the LLM might produce an answer that is incomplete or not fully grounded in the context. A second LLM (the Critic) acts as a quality gate, checking the answer against the retrieved context and providing actionable feedback.
How the Worker-Critic loop works:
┌──────────────────────────────────────────┐
│ Worker: Generate draft answer from │
│ retrieved context │
└──────────────┬───────────────────────────┘
▼
┌──────────────────────────────────────────┐
│ Critic: Review the draft │
│ - Is it accurate? │
│ - Is it complete? │
│ - Is it grounded in the context? │
└──────────────┬───────────────────────────┘
▼
┌───────┴────────┐
│ Approved? │
└───┬────────┬───┘
YES NO
│ │
▼ ▼
Return Send feedback
answer to Worker →
Revise and
re-submit
How it works in the code:
- The Worker receives the question + context (and any prior Critic feedback) and generates an answer.
- The Critic receives the question + context + draft answer and outputs JSON of the form `{approved: bool, feedback: str}`.
- If not approved, the Worker gets the feedback and tries again.
- The loop runs for at most `max_rounds` (default: 3) to prevent infinite cycles.
- If the Critic’s response can’t be parsed, it defaults to approved to avoid infinite loops — a practical safety guard.
What you see in the notebook:
- The full round-by-round history: each draft answer, the Critic’s verdict, and the specific feedback given.
- How answers improve across rounds as the Worker incorporates Critic feedback.
Key takeaway: Reflection catches errors that retrieval alone cannot fix. It adds a quality-assurance layer before the answer reaches the user.
Tutorial 8 — State Management (Checkpoints and Time Travel)
Notebook: tutorials/08_state_management.ipynb
What you build: A checkpoint system that lets you save, inspect, and rewind agent runs to any previous step.
The core idea: Agent runs are multi-step processes. When something goes wrong at step 5, you want to go back to step 3 and try a different path — without re-running the entire pipeline. Checkpoints make this possible.
Key concepts:
| Concept | What it does |
|---|---|
| AgentState | Captures the full snapshot of an agent run: the question, all steps so far, current answer, and status |
| Checkpoint | Wraps an AgentState with metadata: a unique ID, step number, and human-readable label |
| StateManager | Stores multiple checkpoints and supports save_checkpoint, load_checkpoint, list_checkpoints, and rewind_to |
How checkpointing works in the code:
```python
from rag_tutorials.agent_state import AgentState, StateManager

manager = StateManager()
state = AgentState(question="How many days of remote work are allowed?")

# After step 1 — save a checkpoint
state.steps.append({"thought": "...", "action": "retrieve", "observation": "..."})
checkpoint_id = manager.save_checkpoint(state, label="after_step_1")

# After step 2 — something went wrong
state.steps.append({"thought": "...", "action": "retrieve", "observation": "bad result"})

# Rewind to the checkpoint taken after step 1
state = manager.rewind_to(checkpoint_id)
# state.steps now has only the step 1 entry — you can try a different path
```

Every checkpoint is a deep copy of the state, so rewinding never corrupts the original data.
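Under the hood, a manager like this can be built on `copy.deepcopy`. The following is a hypothetical sketch, not the repo's actual `agent_state` module, so field names and the checkpoint-ID format are assumptions:

```python
import copy
import uuid
from dataclasses import dataclass, field

@dataclass
class AgentState:
    question: str
    steps: list = field(default_factory=list)
    answer: str = ""
    status: str = "running"

class StateManager:
    def __init__(self) -> None:
        self._checkpoints: dict[str, AgentState] = {}

    def save_checkpoint(self, state: AgentState, label: str = "") -> str:
        # Deep-copy so later mutations never touch the saved snapshot.
        checkpoint_id = f"{label or 'ckpt'}-{uuid.uuid4().hex[:8]}"
        self._checkpoints[checkpoint_id] = copy.deepcopy(state)
        return checkpoint_id

    def rewind_to(self, checkpoint_id: str) -> AgentState:
        # Hand back a fresh copy so the stored checkpoint stays pristine too.
        return copy.deepcopy(self._checkpoints[checkpoint_id])
```

Deep-copying on both save and load is the whole trick: no shared references means no way for a later step to silently mutate history.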
What you see in the notebook:
- Saving checkpoints at each agent step.
- Listing all checkpoints with their step numbers and labels.
- Rewinding to a previous state and inspecting the difference.
- Replaying from a checkpoint with a modified strategy.
Key takeaway: State management turns agent debugging from a black box into a transparent, reproducible process. You can pause, inspect, and replay any part of the agent’s decision-making.
The Technical Stack
Here’s what powers the tutorials under the hood:
| Component | Technology | Purpose |
|---|---|---|
| Embeddings | OpenAI `text-embedding-3-small` | Convert text to vectors |
| Vector store | ChromaDB (persistent) | Store and query embeddings |
| Keyword retrieval | `rank-bm25` (Okapi BM25) | Exact-token matching |
| Reranking | `sentence-transformers` cross-encoder | Pairwise relevance scoring |
| Answer generation | OpenAI `gpt-4.1-mini` | LLM for answers and agent reasoning |
| Evaluation | Custom metrics module | Recall@k, MRR, groundedness, latency |
| Agent patterns | Custom ReAct, reflection, and state modules | Agentic reasoning |
How to Get the Most Out of This Series
Run the notebooks in order. Each tutorial builds on what came before. Skipping ahead means missing the motivation for why a technique exists.
Read the failure analysis cells. Every notebook includes cells that explain what worked, what failed, and why the next tutorial exists. These “learning checkpoints” are the most valuable part.
Study the evaluation tables. Don’t just glance at the final metrics — look at which queries each variant gets right or wrong. The per-query breakdown reveals patterns that aggregate numbers hide.
Experiment. Change the chunk size in Tutorial 1. Swap the reranker model in Tutorial 3. Adjust the RRF smoothing constant in Tutorial 4. The codebase is designed to make experimentation easy.
Use the benchmarks for decision-making. Tutorial 5 exists so you can make an informed choice about which variant to deploy. The best variant depends on your query distribution, latency budget, and quality requirements.
What You’ll Walk Away With
After completing all eight tutorials, you will have:
- Built a working RAG pipeline from scratch
- Understood four retrieval strategies and their measurable tradeoffs
- Implemented a ReAct agent that uses retrieval as a tool
- Added self-correction with a Worker-Critic feedback loop
- Built a checkpoint system for debugging and replaying agent runs
- A reusable Python library (`rag_tutorials`) you can adapt for your own projects
Get Started
The full source code, notebooks, and documentation are available at:
github.com/nmadhire-agents/all-things-rag
Clone the repo, set your API key, and run Tutorial 1. Every technique in this post has a runnable notebook waiting for you.