Medical Text Classification

Score: 88/100 — Evaluated by industry expert

Executive Summary

Clinical documentation — discharge summaries, consultation notes, diagnostic reports — accounts for millions of documents processed daily across healthcare systems. Manual routing of these notes to the correct medical department is slow, error-prone, and contributes to delayed patient care. This project builds an NLP classifier that automatically categorizes clinical notes into medical specialties, demonstrating a progression from classical baselines to transformer-based models.

Business Problem

A healthcare system processes 2,000+ clinical notes daily across five departments:

Current state:

Manual triage by administrative staff with limited medical vocabulary
Average routing time: 15-30 minutes per note
8-12% misrouting rate, leading to delayed consultations
Annual cost of misrouting estimated at $500K+ (delayed care, rework)

Goal: Automate clinical note classification with >85% accuracy, reducing routing errors and enabling faster specialist referrals.

Target Categories

Approach

Step	Method	Rationale
Text Preprocessing	Lowercasing, regex cleaning, whitespace normalization	Reduces vocabulary noise without losing clinical signal
Feature Extraction	TF-IDF with bigrams (max 5,000 features)	Captures multi-word medical terms (“chest pain”, “heart failure”)
Baseline	TF-IDF + Logistic Regression	Standard NLP baseline — fast, interpretable, surprisingly strong
Improvement 1	TF-IDF + Linear SVM	Max-margin objective works well in high-dimensional sparse spaces
Improvement 2	BiLSTM + GloVe	Sequence modeling captures word order and context
Improvement 3	BERT fine-tuned	Pre-trained on biomedical text; understands medical language

Results

Model	Accuracy	Macro F1	Weighted F1
TF-IDF + Logistic Regression (baseline)	0.78	0.76	0.78
TF-IDF + Linear SVM	0.81	0.79	0.81
BiLSTM + GloVe	0.84	0.82	0.84
BERT fine-tuned	0.89	0.88	0.89

Improvement over baseline: +11% accuracy, +12% F1

Key Findings

Domain vocabulary dominates: The top 5 TF-IDF features per category are almost exclusively domain-specific terms
Bigrams matter: “chest pain” is highly discriminative; “chest” and “pain” individually are not
BERT’s contextual understanding: Distinguishes “patient failed to respond” (neurology) from “heart failure” (cardiology) — TF-IDF cannot

Evaluation Criteria (Graded)

Criteria	Score	Notes
Feature Engineering	87%	TF-IDF with bigrams, clinical text preprocessing
Baseline Model Training	90%	Logistic regression with proper stratified split
Model Evaluation	88%	Per-class F1, confusion matrix, error analysis
Improvements over Baseline	88%	Clear progression from TF-IDF → SVM → BiLSTM → BERT
Final Running Code	87%	Clean pipeline, reproducible

Limitations and Next Steps

Synthetic data: Real clinical notes have far more variability (abbreviations, typos, mixed languages)
PHI handling: Production deployment requires PII/PHI detection and redaction (Presidio)
More categories: Real systems route to 20+ specialties, not 5
Confidence thresholds: Low-confidence predictions should be routed to human review
Active learning: Use misclassified notes to improve the model iteratively

How to Run

pip install -r requirements.txt

# Full analysis (recommended)
jupyter notebook medical_text_classification.ipynb

# Or run the pipeline scripts
python src/preprocess.py    # Preprocess clinical notes
python src/train.py         # Train all models

Tech Stack

Python, scikit-learn, TF-IDF, PyTorch, HuggingFace Transformers, BERT, Matplotlib