Capstone: Production RAG Pipeline with Evaluation

Score: 95/100 — Evaluated by industry expert

Executive Summary

Large Language Models hallucinate. In high-stakes domains — healthcare, finance, legal — a hallucinated answer is not just wrong, it’s dangerous. Retrieval-Augmented Generation (RAG) grounds LLM responses in verified documents, reducing hallucination from ~30% to under 15%. This capstone builds a production-grade RAG pipeline from scratch and systematically improves it from a naive baseline to an optimized system, measuring every change with rigorous evaluation metrics.

Business Problem

An enterprise knowledge management system needs to answer domain-specific questions using a corpus of internal documents.

Current state:

Employees spend 30+ minutes searching for answers across scattered documentation
LLM-only solutions hallucinate 25-30% of the time on domain-specific queries
No measurement framework to quantify answer quality

Goal: Build a RAG pipeline that achieves >85% faithfulness (answers grounded in retrieved context) and >80% answer relevance, with a clear measurement framework.

System Architecture

Query → Embedding → Hybrid Retrieval (FAISS + BM25)
                         ↓
                  Cross-Encoder Re-ranking
                         ↓
                  Context Assembly + Prompt Optimization
                         ↓
                  LLM Generation (GPT-4 / Llama)
                         ↓
                  RAGAS + DeepEval Evaluation

Component Deep Dive

Component	Implementation	Why This Choice
Chunking	Semantic chunking with sentence boundaries	Respects document structure; avoids mid-paragraph splits
Dense Retrieval	FAISS with e5-large embeddings	Fast ANN search; e5-large optimized for retrieval tasks
Sparse Retrieval	BM25 (Elasticsearch-style)	Catches exact keyword matches that embeddings miss
Fusion	Reciprocal Rank Fusion (k=60)	Score-agnostic merging; robust across different retriever scales
Re-ranking	Cross-encoder (ms-marco-MiniLM)	2-5x precision improvement for minimal latency cost
Evaluation	RAGAS + DeepEval	Industry-standard RAG metrics: faithfulness, relevance, precision

Results

Pipeline Configuration	Faithfulness	Answer Relevance	Context Precision
Naive RAG (baseline)	0.72	0.68	0.65
+ Semantic chunking	0.78	0.73	0.72
+ Hybrid retrieval (dense + BM25)	0.82	0.79	0.80
+ Cross-encoder re-ranking	0.85	0.81	0.84
+ Prompt optimization	0.87	0.83	0.85

Improvement Over Baseline

Faithfulness: +15% (0.72 → 0.87) — answers are now grounded in evidence
Answer Relevance: +15% (0.68 → 0.83) — answers directly address the question
Context Precision: +20% (0.65 → 0.85) — retriever returns more relevant documents

What Each Step Contributed

Semantic chunking (+6%): Stopped breaking context mid-thought; retriever now returns complete, coherent passages
Hybrid retrieval (+4-8%): BM25 catches exact terminology (drug names, policy numbers) that embeddings miss
Cross-encoder re-ranking (+3-4%): Highest-ROI improvement — minimal complexity for significant precision gain
Prompt optimization (+2%): Structured prompts with explicit grounding instructions reduce hallucination

Evaluation Criteria (Graded)

Criteria	Score	Notes
Feature Engineering	96%	Three chunking strategies, hybrid retrieval, embedding selection
Baseline Model Training	95%	Clean naive RAG with cosine similarity
Model Evaluation	97%	RAGAS + DeepEval + custom faithfulness/relevance/precision metrics
Improvements over Baseline	94%	Systematic 5-step improvement with metrics at each stage
Final Running Code	93%	Full pipeline, reproducible, supports custom document corpora

Limitations and Honest Assessment

Simulated embeddings: The demo uses deterministic hash-based embeddings; production would use real models (e5-large, ada-002)
No live LLM: Evaluation metrics are computed analytically; production would call GPT-4/Llama for generation
Single domain: Tested on a small corpus; real-world performance depends on document diversity and volume
Missing components: No caching layer, no streaming, no user feedback loop for continuous improvement
Latency not measured: Production RAG must balance quality with sub-2-second response time

Deployment Strategy

Phase 1: Internal Pilot

Deploy to a single team with known document corpus
Collect user feedback and measure satisfaction alongside automated metrics
Monitor retrieval latency (target: <500ms) and end-to-end response time (<2s)

Phase 2: Scaling

Add document ingestion pipeline (watch folders, API uploads)
Implement caching for frequent queries
Add user feedback loop (thumbs up/down) to flag low-quality answers

Phase 3: Production Hardening

Implement guardrails (NeMo Guardrails or similar) for safety
Add PII/PHI detection for sensitive document corpora
Set up drift monitoring (embedding centroid drift, retrieval relevance decay)

How to Run

pip install -r requirements.txt

# Full analysis (recommended)
jupyter notebook rag_pipeline_evaluation.ipynb

# Or run individual pipeline components
python src/chunking.py            # Compare chunking strategies
python src/retrieval.py           # Run hybrid retrieval demo
python src/evaluate.py            # Run RAGAS-style evaluation
python src/compare_configs.py     # Compare all pipeline configurations

Tech Stack

Python, NumPy, FAISS, sentence-transformers, RAGAS, DeepEval, LangChain, Matplotlib