Notebooks for evaluating the retrieval quality of Contextual AI RAG agents using a synthetic multi-hop question dataset.
This pipeline provides an end-to-end solution for rigorously evaluating RAG systems by:
- Generating realistic multi-hop questions from your documents
- Matching ground-truth evidence to retrieved chunks automatically
- Evaluating retrieval performance using both traditional IR and modern RAG-specific metrics
A video walkthrough of these notebooks is available.
```
📁 RAG Evaluation Pipeline/
├── 📁 Data                              # Sample PDF documents
├── 📓 Retrieval_GenerateMultiHop.ipynb  # Step 1: Generate multi-hop QA pairs
├── 📓 Retrieval_Matching.ipynb          # Step 2: Match evidence to chunks
├── 📓 Retrieval_Evaluation.ipynb        # Step 3: Evaluate retrieval performance
└── 📄 README.md                         # This file
```
- API Keys Required (see the setup sketch after this list):
  - Claude API key (for question generation)
  - Contextual AI API key (for retrieval evaluation)
- Python Environment: Google Colab or Jupyter with internet access
- Documents: at least 2 PDF files for optimal multi-hop question generation (multi-hop questions combine evidence from multiple passages)
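A minimal setup sketch for supplying both keys, assuming the notebooks read them from environment variables; the variable names `ANTHROPIC_API_KEY` and `CONTEXTUAL_API_KEY` are illustrative, so check the first cell of each notebook for the names it actually expects:

```python
import os
from getpass import getpass

# Prompt for keys at runtime instead of hard-coding them in the notebook.
# Works in both Google Colab and local Jupyter.
os.environ["ANTHROPIC_API_KEY"] = getpass("Claude API key: ")
os.environ["CONTEXTUAL_API_KEY"] = getpass("Contextual AI API key: ")
```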
All three notebooks are designed to run on Google Colab.
Step 1: Retrieval_GenerateMultiHop.ipynb
- Input: PDF documents
- Output: qa_pairs_YYYYMMDD_HHMMSS.xlsx (structured Q&A pairs with supporting evidence)
- Runtime: ~15-30 minutes for 100 questions
- Provided Example: qa_pairs_multi_row_20250616_174936.xlsx (a loading sketch follows below)
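The Excel output is easy to inspect before moving on to Step 2. A minimal loading sketch with pandas (requires openpyxl; the schema printed will be whatever the notebook actually wrote, which this sketch does not assume):

```python
import pandas as pd

# Load the Step 1 output; the timestamp in the filename differs per run.
qa = pd.read_excel("qa_pairs_multi_row_20250616_174936.xlsx")

print(qa.shape)              # one row per (question, evidence) entry
print(qa.columns.tolist())   # inspect the actual column schema before Step 2
print(qa.head())
```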
Step 2: Retrieval_Matching.ipynb
- Input: Excel file from Step 1
- Output: matched_retrieval_Jun20.csv (evidence strings matched to chunk IDs; a matching sketch follows below)
- Runtime: ~2-3 seconds per evidence string
- Provided Example: matched_retrievals
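At its core, this step aligns each ground-truth evidence string with the retrieved chunk that contains it. The notebook's exact matching logic may differ; the sketch below shows one way to do it with the standard-library difflib, where `best_chunk_match` and the example chunks are hypothetical:

```python
from difflib import SequenceMatcher

def best_chunk_match(evidence: str, chunks: dict[str, str], threshold: float = 0.8):
    """Return the ID of the chunk that best contains the evidence string.

    Scores each chunk by its longest common substring with the evidence,
    normalized by the evidence length; returns None below the threshold.
    """
    ev = evidence.lower()
    best_id, best_score = None, 0.0
    for chunk_id, text in chunks.items():
        tx = text.lower()
        match = SequenceMatcher(None, ev, tx).find_longest_match(0, len(ev), 0, len(tx))
        score = match.size / max(len(ev), 1)
        if score > best_score:
            best_id, best_score = chunk_id, score
    return best_id if best_score >= threshold else None

# Example with two hypothetical chunks:
chunks = {"chunk-1": "Revenue grew 12% in 2023.", "chunk-2": "The CEO joined in 2019."}
print(best_chunk_match("revenue grew 12%", chunks))  # chunk-1
```

Any approach of this shape scores each evidence string against every candidate chunk, which is why the runtime above is quoted per evidence string.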
Step 3: Retrieval_Evaluation.ipynb
- Input: CSV file from Step 2
- Output: eval_results_final.csv (comprehensive retrieval metrics; a metric sketch follows below)
- Provided Example: eval_results_final.csv
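For intuition about what this step computes, recall@k and MRR are representative of the traditional IR metrics mentioned above; both follow directly from comparing gold evidence chunk IDs against the retrieved ranking. A minimal sketch (the exact metric set in eval_results_final.csv may differ):

```python
def recall_at_k(gold: set[str], retrieved: list[str], k: int) -> float:
    """Fraction of gold evidence chunks found in the top-k retrieved chunks."""
    if not gold:
        return 0.0
    return len(gold & set(retrieved[:k])) / len(gold)

def mrr(gold: set[str], retrieved: list[str]) -> float:
    """Reciprocal rank of the first retrieved chunk that is gold evidence."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in gold:
            return 1.0 / rank
    return 0.0

# Example: two gold evidence chunks, first hit at rank 2.
print(recall_at_k({"c3", "c7"}, ["c1", "c3", "c9", "c7"], k=3))  # 0.5
print(mrr({"c3", "c7"}, ["c1", "c3", "c9", "c7"]))               # 0.5
```

Precision@k and RAG-specific variants would follow the same pattern, computed over the matched chunk IDs produced in Step 2.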