
Capstone: Production RAG Pipeline with Evaluation

Score: 95/100 — Evaluated by industry expert

Executive Summary

Large Language Models hallucinate. In high-stakes domains such as healthcare, finance, and legal, a hallucinated answer is not just wrong, it's dangerous. Retrieval-Augmented Generation (RAG) grounds LLM responses in verified documents, reducing hallucination from ~30% to under 15% in this project's evaluation (faithfulness rose from 0.72 to 0.87). This capstone builds a production-grade RAG pipeline from scratch and systematically improves it from a naive baseline to an optimized system, measuring every change with RAGAS and DeepEval metrics.

Business Problem

An enterprise knowledge management system needs to answer domain-specific questions using a corpus of internal documents.

Current state:

Goal: Build a RAG pipeline that achieves >85% faithfulness (answers grounded in retrieved context) and >80% answer relevance, with a clear measurement framework.

System Architecture

Query → Embedding → Hybrid Retrieval (FAISS + BM25)
                         ↓
                  Cross-Encoder Re-ranking
                         ↓
                  Context Assembly + Prompt Optimization
                         ↓
                  LLM Generation (GPT-4 / Llama)
                         ↓
                  RAGAS + DeepEval Evaluation
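
To make the sparse side of the hybrid retrieval stage concrete, here is a minimal pure-Python sketch of Okapi BM25 scoring. It is not the project's implementation: the tokenizer, the `k1` and `b` defaults, and the sample documents are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (a naive tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one Okapi BM25 score per document for the query."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Smoothed IDF; the "+ 1" keeps it positive for very common terms.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Made-up documents, only to show exact-term matching on an identifier.
docs = [
    "Policy P-1042 covers outpatient imaging.",
    "Imaging procedures require prior authorization.",
    "Refunds are processed within ten business days.",
]
scores = bm25_scores("policy P-1042", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, [round(s, 3) for s in scores])
```

Only the first document contains the query terms, so it is the only one with a non-zero score. This is the kind of exact identifier match that the dense retriever may miss.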

Component Deep Dive

| Component | Implementation | Why This Choice |
|---|---|---|
| Chunking | Semantic chunking with sentence boundaries | Respects document structure; avoids mid-sentence splits |
| Dense Retrieval | FAISS with e5-large embeddings | Fast ANN search; e5-large is optimized for retrieval tasks |
| Sparse Retrieval | BM25 (Elasticsearch-style) | Catches exact keyword matches that embeddings miss |
| Fusion | Reciprocal Rank Fusion (k=60) | Score-agnostic merging; robust across different retriever score scales |
| Re-ranking | Cross-encoder (ms-marco-MiniLM) | Re-scores top candidates for higher precision at low latency cost (context precision 0.80 to 0.84 in Results) |
| Evaluation | RAGAS + DeepEval | Industry-standard RAG metrics: faithfulness, relevance, precision |
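
The fusion row above can be sketched in a few lines. Reciprocal Rank Fusion scores each document as the sum of 1 / (k + rank) over the ranked lists it appears in, so it uses only rank positions and never the raw retriever scores. The function name and the document ids below are hypothetical and not taken from the project code.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc ids: score(d) = sum of 1 / (k + rank_of_d)."""
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by doc id for determinism.
    return sorted(fused, key=lambda d: (-fused[d], d))


dense_ranking = ["d3", "d1", "d7"]   # hypothetical dense (FAISS) result order
sparse_ranking = ["d1", "d9", "d3"]  # hypothetical sparse (BM25) result order
fused_ranking = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
print(fused_ranking)  # ['d1', 'd3', 'd9', 'd7']
```

Documents found by both retrievers (`d1`, `d3`) rise to the top, while the single-list hits keep their relative rank order.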

Results

| Pipeline Configuration | Faithfulness | Answer Relevance | Context Precision |
|---|---|---|---|
| Naive RAG (baseline) | 0.72 | 0.68 | 0.65 |
| + Semantic chunking | 0.78 | 0.73 | 0.72 |
| + Hybrid retrieval (dense + BM25) | 0.82 | 0.79 | 0.80 |
| + Cross-encoder re-ranking | 0.85 | 0.81 | 0.84 |
| + Prompt optimization | 0.87 | 0.83 | 0.85 |
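
As a quick arithmetic check, the per-step and cumulative gains can be derived directly from the table. The snippet below copies the scores verbatim and compares the final configuration against the stated goal of >0.85 faithfulness and >0.80 answer relevance. The variable names are illustrative only.

```python
# Scores copied from the results table:
# (configuration, faithfulness, answer relevance, context precision)
results = [
    ("Naive RAG (baseline)", 0.72, 0.68, 0.65),
    ("+ Semantic chunking", 0.78, 0.73, 0.72),
    ("+ Hybrid retrieval (dense + BM25)", 0.82, 0.79, 0.80),
    ("+ Cross-encoder re-ranking", 0.85, 0.81, 0.84),
    ("+ Prompt optimization", 0.87, 0.83, 0.85),
]


def diff(curr, prev):
    """Absolute change per metric, rounded to two decimals."""
    return tuple(round(c - p, 2) for c, p in zip(curr, prev))


# Gain of each configuration over the one immediately before it.
step_gains = {
    name: diff(scores, results[i][1:])
    for i, (name, *scores) in enumerate(results[1:])
}
# Gain of the final configuration over the baseline.
total_gain = diff(results[-1][1:], results[0][1:])

final_faithfulness, final_relevance = results[-1][1], results[-1][2]
meets_goal = final_faithfulness > 0.85 and final_relevance > 0.80

for name, gain in step_gains.items():
    print(f"{name}: {gain}")
print("total:", total_gain, "| meets goal:", meets_goal)
```

The totals come to +0.15 faithfulness, +0.15 answer relevance, and +0.20 context precision over the baseline, and the final configuration clears both goal thresholds.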

Improvement Over Baseline

What Each Step Contributed

Gains are absolute changes across the three metrics in the Results table (1 point = 0.01).

  1. Semantic chunking (+5 to +7 points): Stopped breaking context mid-thought; the retriever now returns complete, coherent passages
  2. Hybrid retrieval (+4 to +8 points): BM25 catches exact terminology (drug names, policy numbers) that embeddings miss
  3. Cross-encoder re-ranking (+2 to +4 points): Highest-ROI improvement; a significant precision gain for minimal added complexity
  4. Prompt optimization (+1 to +2 points): Structured prompts with explicit grounding instructions reduce hallucination
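
The first step in the list, chunking on sentence boundaries, can be illustrated with a small sketch. It packs whole sentences into chunks up to a soft character limit, so no chunk ends mid-sentence. This is a simplified stand-in, not the project's chunker: the regex splitter, the `max_chars` parameter, and the sample text are assumptions, and a production system would more likely use a proper sentence tokenizer plus embedding similarity to place boundaries.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def chunk_by_sentences(text, max_chars=200):
    """Pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole rather than split,
    so max_chars is a soft limit.
    """
    chunks, current = [], []
    for sentence in split_sentences(text):
        # Close the current chunk if adding this sentence would overflow it.
        if current and len(" ".join(current + [sentence])) > max_chars:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


# Made-up sample text for illustration.
text = (
    "RAG grounds answers in documents. It retrieves passages first. "
    "Then the model answers using them. Evaluation measures faithfulness."
)
chunks = chunk_by_sentences(text, max_chars=70)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

With a 70-character limit the four sentences are grouped into two chunks of two sentences each, and joining the chunks with spaces reproduces the original text.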

Evaluation Criteria (Graded)

| Criterion | Score | Notes |
|---|---|---|
| Feature Engineering | 96% | Three chunking strategies, hybrid retrieval, embedding selection |
| Baseline Model Training | 95% | Clean naive RAG with cosine similarity |
| Model Evaluation | 97% | RAGAS + DeepEval + custom faithfulness/relevance/precision metrics |
| Improvements over Baseline | 94% | Systematic improvement across five configurations (baseline plus four steps), with metrics at each stage |
| Final Running Code | 93% | Full pipeline, reproducible, supports custom document corpora |

Limitations and Honest Assessment

Deployment Strategy

Phase 1: Internal Pilot

Phase 2: Scaling

Phase 3: Production Hardening

How to Run

pip install -r requirements.txt

# Full analysis (recommended)
jupyter notebook rag_pipeline_evaluation.ipynb

# Or run individual pipeline components
python src/chunking.py            # Compare chunking strategies
python src/retrieval.py           # Run hybrid retrieval demo
python src/evaluate.py            # Run RAGAS-style evaluation
python src/compare_configs.py     # Compare all pipeline configurations

Tech Stack

Python, NumPy, FAISS, sentence-transformers, RAGAS, DeepEval, LangChain, Matplotlib