Capstone Research Project | University of Virginia, School of Data Science
Authors: Vishwanath Guruvayur, Luke Napolitano, Doruk Ozar, Bereket Tafesse
Sponsors & Mentors: Dr. Brian Wright (UVA), Lucas McCabe (LMI Inc.), Dr. Brant Horio (LMI Inc.), Ali Rivera (UVA)
This project explores how multimodal retrieval (text, images, and audio) can enhance the performance of Retrieval-Augmented Generation (RAG) models, specifically within the context of undergraduate machine learning education.
We investigated:
- Does adding image-based data improve RAG performance?
- What are the computational trade-offs?
- How does query specificity impact retrieval and response quality?
Our work contributes toward smarter, more context-aware educational AI systems.
- Support active learning by enhancing AI chatbot capabilities for educational settings.
- Apply multimodal RAG (Retrieval-Augmented Generation) on a real-world dataset — course materials from DS3001: Foundations of Machine Learning.
- Evaluate whether adding visual elements to text retrieval improves context recall, faithfulness, and factual correctness.
Sources:
- DS3001 lecture slides (text + images)
- Audio lectures (converted to text)
- ML research papers and textbooks (text + images)
Embeddings:
- Text: SentenceTransformers (all-mpnet-base-v2) → 768-dimensional vectors
- Image: OpenAI CLIP (clip-vit-base-patch32) → 512-dimensional vectors
- Stored embeddings in Pinecone DB and images in MongoDB.
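Because the text vectors (768 dimensions) and image vectors (512 dimensions) live in different spaces, they cannot share one similarity search. The sketch below illustrates that separation with a toy in-memory index. It is not the project's Pinecone or MongoDB code: `ToyIndex`, `retrieve`, and the idea of embedding the query once per space are all assumptions made for illustration.

```python
import math

# Dimensions reported above; everything else here is a hypothetical stand-in.
TEXT_DIM = 768   # all-mpnet-base-v2
IMAGE_DIM = 512  # clip-vit-base-patch32


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class ToyIndex:
    """In-memory stand-in for one vector index with a fixed dimension."""

    def __init__(self, dim):
        self.dim = dim
        self.items = {}  # item id -> vector

    def upsert(self, item_id, vector):
        if len(vector) != self.dim:
            raise ValueError(f"expected {self.dim} dims, got {len(vector)}")
        self.items[item_id] = list(vector)

    def query(self, vector, top_k):
        if len(vector) != self.dim:
            raise ValueError(f"expected {self.dim} dims, got {len(vector)}")
        scored = [(cosine(vector, v), item_id) for item_id, v in self.items.items()]
        scored.sort(reverse=True)  # highest similarity first
        return [item_id for _, item_id in scored[:top_k]]


# One index per modality, mirroring the separate text and image pipelines.
text_index = ToyIndex(TEXT_DIM)
image_index = ToyIndex(IMAGE_DIM)


def retrieve(text_query_vec, image_query_vec, n_text, n_image):
    """Query each index separately and return the two ID lists side by side."""
    return {
        "text": text_index.query(text_query_vec, n_text) if n_text else [],
        "image": image_index.query(image_query_vec, n_image) if n_image else [],
    }
```

Keeping one index per modality means a vector of the wrong size is rejected at insert time instead of silently corrupting similarity scores.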
- Storage Pipelines: Separated pipelines for text and image embedding & retrieval.
- User Pipelines: Designed for flexible retrieval with different input configurations (text-only, text+images).
- Experimental Setup:
  - Zero-shot manual prompting and automated evaluation
  - Clustered embeddings using HDBSCAN and KMeans after PCA
  - Retrieval configurations: 10 text vectors; 5 text + 5 image vectors; 10 text + 10 image vectors
- Evaluation Metrics (via RAGAS):
  - Context Recall
  - Faithfulness
  - Factual Correctness
- Bootstrapped Testing: 50 questions × 10 iterations for robustness.
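The bootstrapped testing step can be sketched roughly as follows. This README does not say how the 10 iterations were drawn, so resampling the 50 per-question scores with replacement is an assumption here, and the scores are random placeholders, not project results. `bootstrap_means` is a hypothetical helper name.

```python
import random
import statistics


def bootstrap_means(scores, n_iterations=10, seed=0):
    """Resample per-question scores with replacement; return the mean of each resample."""
    rng = random.Random(seed)
    n = len(scores)
    means = []
    for _ in range(n_iterations):
        sample = [scores[rng.randrange(n)] for _ in range(n)]
        means.append(statistics.fmean(sample))
    return means


# Placeholder metric scores in [0, 1] for 50 questions (NOT project results).
placeholder_rng = random.Random(42)
toy_scores = [placeholder_rng.random() for _ in range(50)]

means = bootstrap_means(toy_scores, n_iterations=10)
summary = {
    "mean": statistics.fmean(means),
    "stdev": statistics.stdev(means),  # spread across the 10 iterations
    "min": min(means),
    "max": max(means),
}
print(summary)
```

The spread across iterations gives a rough sense of how stable a metric such as context recall is for one retrieval configuration before comparing it with another.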
- Adding images improved LLM response quality for general questions.
- Specific questions benefited more from tightly scoped context than from extra images.
- Too many images could decrease recall slightly, suggesting a need for careful balancing.
- Zero-shot models performed reasonably well on generic ML questions.
Smart Agentic RAG (Future Direction):
- Dynamically analyze query specificity.
- Adjust number of text and image vectors retrieved based on need.
- Integrate knowledge graphs to cluster concepts and further improve retrieval relevance.
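As a rough illustration of the first two points, the sketch below scores a query's specificity with a keyword heuristic and maps it to a retrieval budget. The cue words, the scoring rule, the threshold, and the `(n_text, n_image)` budgets are all hypothetical; a real agent would more likely use a classifier or an LLM call, and the knowledge-graph step is not covered.

```python
import re

# Hypothetical cue words that hint at a narrowly scoped question.
SPECIFIC_CUES = {
    "derive", "compute", "calculate", "formula", "equation",
    "slide", "lecture", "parameter", "hyperparameter",
}


def specificity_score(query):
    """Crude score in [0, 1]: share of three signals that suggest a specific question."""
    tokens = re.findall(r"[a-z0-9_]+", query.lower())
    if not tokens:
        return 0.0
    has_cue = any(t in SPECIFIC_CUES for t in tokens)
    has_number = any(any(ch.isdigit() for ch in t) for t in tokens)
    is_long = len(tokens) >= 12
    return (has_cue + has_number + is_long) / 3


def retrieval_budget(query, threshold=0.3):
    """Return (n_text, n_image): how many vectors of each kind to retrieve."""
    if specificity_score(query) >= threshold:
        return (10, 0)  # specific question: scoped text context, no images
    return (5, 5)       # general question: mix text and images


for q in ["What is machine learning?", "Derive the gradient update from lecture 7"]:
    print(q, "->", retrieval_budget(q))
```

The two budgets reuse configurations from the experiments above (10 text vectors, and 5 text + 5 image vectors); which budget suits which kind of query is a guess based on the findings, not a tested policy.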
Challenges Ahead:
- Evaluation remains computationally expensive.
- Smart context pruning needs more exploration.
We are grateful to our mentors and sponsors:
- Dr. Brian Wright (University of Virginia)
- Lucas McCabe (LMI Inc.)
- Dr. Brant Horio (LMI Inc.)
- Ali Rivera (University of Virginia)
Special thanks to the UVA School of Data Science and LMI Inc. for supporting this research!
➡️ Interactive Embedding Visualization of Vector DB Retrieval