🧠 Using Educational Data to Explore Multimodal (Audio, Visual, & Textual) LLM Retrieval Techniques

Capstone Research Project | University of Virginia, School of Data Science

Authors: Vishwanath Guruvayur, Luke Napolitano, Doruk Ozar, Bereket Tafesse
Sponsors & Mentors: Dr. Brian Wright (UVA), Lucas McCabe (LMI Inc.), Dr. Brant Horio (LMI Inc.), Ali Rivera (UVA)

📚 Project Overview

This project explores whether multimodal retrieval (text, images, and audio) improves the performance of Retrieval-Augmented Generation (RAG) systems, specifically in the context of undergraduate machine learning education.

We investigated:

Does adding image-based data improve RAG performance?
What are the computational trade-offs?
How does query specificity impact retrieval and response quality?

Our work contributes toward smarter, more context-aware educational AI systems.

🎯 Purpose and Motivation

Support active learning by enhancing AI chatbot capabilities for educational settings.
Apply multimodal RAG to a real-world dataset, the course materials from DS3001: Foundations of Machine Learning.
Evaluate whether adding visual elements to text retrieval leads to better context recall, faithfulness, and factual correctness.

📊 Dataset and Modalities

Sources:

DS3001 lecture slides (text + images)
Lecture audio (transcribed to text)
ML research papers and textbooks (text + images)

Embeddings:

Text: SentenceTransformers (all-mpnet-base-v2) → 768-dimensional vectors
Image: OpenAI CLIP (clip-vit-base-patch32) → 512-dimensional vectors
Storage: embeddings in Pinecone DB, images in MongoDB
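
Because the text and image embeddings differ in dimensionality (768 vs. 512), they cannot share a single similarity index, which is one reason to keep the two modalities in separate indexes. The sketch below illustrates that idea with a small in-memory stand-in. It is not the project's code and not the Pinecone client API: `InMemoryIndex`, `upsert`, and `query` are hypothetical names chosen for illustration, and cosine similarity is an assumed metric.

```python
import math
from dataclasses import dataclass, field


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class InMemoryIndex:
    """Hypothetical stand-in for one vector index with a fixed dimension."""
    dim: int
    items: dict = field(default_factory=dict)

    def _check(self, vector):
        if len(vector) != self.dim:
            raise ValueError(f"expected {self.dim} dims, got {len(vector)}")

    def upsert(self, item_id, vector):
        """Store (or overwrite) a vector under item_id."""
        self._check(vector)
        self.items[item_id] = list(vector)

    def query(self, vector, top_k):
        """Return the top_k (item_id, score) pairs, best match first."""
        self._check(vector)
        scored = [(item_id, cosine(vector, v)) for item_id, v in self.items.items()]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]


# One index per modality, sized to the embedding models listed above.
text_index = InMemoryIndex(dim=768)
image_index = InMemoryIndex(dim=512)
```

Under this layout, a text-only query would hit `text_index` alone, while a text+image configuration would query both indexes with separate `top_k` values and merge the results before prompting the LLM.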

🔬 Research Methodology

Storage Pipelines: Separate pipelines for embedding and retrieving text and images.
User Pipelines: Designed for flexible retrieval with different input configurations (text-only, text+images).
Experimental Setup:
- Zero-shot manual prompting and automated evaluation
- Clustered embeddings using HDBSCAN and k-means after PCA
- Multiple retrieval configurations: 10 text vectors; 5 text + 5 image vectors; 10 text + 10 image vectors
Evaluation Metrics (via RAGAS):
- Context Recall
- Faithfulness
- Factual Correctness
Bootstrapped Testing: 50 questions × 10 iterations for robustness.
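
The bootstrapped testing step can be sketched as follows. This is a minimal illustration under an assumption: each iteration resamples 50 questions with replacement from a question pool and averages a per-question metric score. The source does not spell out the resampling scheme, and `bootstrap_scores`, `score_fn`, and the toy scores are hypothetical. In the project, the per-question scores would come from RAGAS rather than a lookup table.

```python
import random
import statistics


def bootstrap_scores(questions, score_fn, n_questions=50, n_iterations=10, seed=0):
    """Return one mean score per bootstrap iteration.

    Each iteration draws n_questions questions with replacement and
    averages score_fn(question) over that draw.
    """
    rng = random.Random(seed)  # seeded so runs are reproducible
    means = []
    for _ in range(n_iterations):
        sample = rng.choices(questions, k=n_questions)
        means.append(statistics.fmean(score_fn(q) for q in sample))
    return means


# Toy stand-in for metric output: hypothetical per-question scores in [0, 1].
toy_scores = {f"q{i}": (i % 10) / 10 for i in range(50)}

means = bootstrap_scores(list(toy_scores), toy_scores.get)
# Mean and spread across the 10 iterations summarize robustness.
summary = (statistics.fmean(means), statistics.stdev(means))
```

Reporting the spread across iterations alongside the mean makes it easier to tell whether a difference between two retrieval configurations is larger than the sampling noise.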

📈 Key Results

Adding images improved LLM response quality for general questions.
Specific questions benefited more from tightly scoped context than from additional images.
Retrieving too many images could slightly decrease recall, which suggests the text/image mix needs careful balancing.
Zero-shot models performed reasonably well on generic ML questions.

💡 Innovations and Future Work

Smart Agentic RAG (Future Direction):

Dynamically analyze query specificity.
Adjust the number of text and image vectors retrieved based on need.
Integrate knowledge graphs to cluster concepts and further improve retrieval relevance.
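
One way to picture the first two points is a function that maps a query-specificity score to a retrieval budget. The sketch is purely illustrative: this direction is future work, so the scoring input, the threshold, and the names `RetrievalBudget` and `choose_budget` are all assumptions. The two budgets it switches between are borrowed from configurations used in the experiments (10 text vectors; 5 text + 5 image vectors), and mapping specific queries to text-only retrieval is itself an assumption loosely based on the key results.

```python
from typing import NamedTuple


class RetrievalBudget(NamedTuple):
    """How many vectors to retrieve from each modality."""
    text_k: int
    image_k: int


def choose_budget(specificity: float, threshold: float = 0.5) -> RetrievalBudget:
    """Pick a retrieval budget from a specificity score in [0, 1].

    The score is assumed to come from some upstream query analyzer
    (hypothetical; not defined in the source). The threshold is arbitrary.
    """
    if not 0.0 <= specificity <= 1.0:
        raise ValueError("specificity must be in [0, 1]")
    if specificity >= threshold:
        # Specific question: favor text context and skip images.
        return RetrievalBudget(text_k=10, image_k=0)
    # General question: mix image vectors into the context.
    return RetrievalBudget(text_k=5, image_k=5)
```

A real agent would likely use more than two budgets and learn the mapping from evaluation data; the point here is only that retrieval counts become a function of the query rather than fixed constants.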

Challenges Ahead:

Evaluation remains computationally expensive.
Smart context pruning needs more exploration.

🙏 Acknowledgements

We are grateful to our mentors and sponsors:

Dr. Brian Wright (University of Virginia)
Lucas McCabe (LMI Inc.)
Dr. Brant Horio (LMI Inc.)
Ali Rivera (University of Virginia)

Special thanks to the UVA School of Data Science and LMI Inc. for supporting this research!

🌐 Live Visualization

➡️ Interactive Embedding Visualization of Vector DB Retrieval
