MSDS-Capstone-Project/MultiModalRAG

🧠 Using Educational Data to Explore Multimodal (Audio, Visual, & Textual) LLM Retrieval Techniques

Capstone Research Project | University of Virginia, School of Data Science

Authors: Vishwanath Guruvayur, Luke Napolitano, Doruk Ozar, Bereket Tafesse
Sponsors & Mentors: Dr. Brian Wright (UVA), Lucas McCabe (LMI Inc.), Dr. Brant Horio (LMI Inc.), Ali Rivera (UVA)


📚 Project Overview

This project explores how multimodal retrieval (text, images, and audio) can enhance the performance of Retrieval-Augmented Generation (RAG) models, specifically within the context of undergraduate machine learning education.

We investigated:

  • Does adding image-based data improve RAG performance?
  • What are the computational trade-offs?
  • How does query specificity impact retrieval and response quality?

Our work contributes toward smarter, more context-aware educational AI systems.


🎯 Purpose and Motivation

  • Support active learning by enhancing AI chatbot capabilities for educational settings.
  • Apply multimodal RAG to a real-world dataset: course materials from DS3001: Foundations of Machine Learning.
  • Evaluate whether adding visual elements to text retrieval improves context recall, faithfulness, and factual correctness.

📊 Dataset and Modalities

Sources:

  • DS3001 lecture slides (text + images)
  • Audio lectures (converted to text)
  • ML research papers and textbooks (text + images)

Embeddings:

  • Text: SentenceTransformers (all-mpnet-base-v2) → 768-dimensional vectors
  • Image: OpenAI CLIP (clip-vit-base-patch32) → 512-dimensional vectors
  • Storage: embeddings in Pinecone, images in MongoDB.
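
Because the text and image embeddings have different dimensionalities (768 vs. 512), they cannot share a single similarity index. The sketch below illustrates that constraint with a toy in-memory index and cosine-similarity search. `VectorIndex` and `cosine` are hypothetical helpers written for illustration; they stand in for a vector database and are not the Pinecone API or the project's actual code.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class VectorIndex:
    """Toy in-memory stand-in for one vector-database index (hypothetical)."""

    def __init__(self, dim):
        self.dim = dim
        self.items = {}  # item_id -> vector

    def _check(self, vector):
        if len(vector) != self.dim:
            raise ValueError(f"expected {self.dim} dimensions, got {len(vector)}")

    def upsert(self, item_id, vector):
        """Insert a vector, or overwrite the one already stored under item_id."""
        self._check(vector)
        self.items[item_id] = list(vector)

    def query(self, vector, top_k):
        """Return the top_k (item_id, score) pairs by cosine similarity."""
        self._check(vector)
        scored = [(item_id, cosine(vector, v)) for item_id, v in self.items.items()]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]


# Dimensions taken from the embedding models listed above.
TEXT_DIM = 768   # all-mpnet-base-v2
IMAGE_DIM = 512  # clip-vit-base-patch32

# One index per modality, since the vector sizes differ.
text_index = VectorIndex(TEXT_DIM)
image_index = VectorIndex(IMAGE_DIM)
```

Keeping one index per modality also lets the number of text and image results be chosen independently at query time.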

🔬 Research Methodology

  • Storage Pipelines: Separate pipelines for text and image embedding and retrieval.

  • User Pipelines: Designed for flexible retrieval with different input configurations (text-only, text+images).

  • Experimental Setup:

    • Zero-shot manual prompting and automated evaluation
    • Embeddings clustered with HDBSCAN and K-Means after PCA
    • Multiple configurations: 10 text vectors; 5 text + 5 image vectors; 10 text + 10 image vectors
  • Evaluation Metrics (via RAGAS):

    • Context Recall
    • Faithfulness
    • Factual Correctness
  • Bootstrapped Testing: 50 questions × 10 iterations for robustness.
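
The bootstrapped testing step can be sketched as follows. This is one plausible reading of "50 questions × 10 iterations", assumed here for illustration: in each iteration, the per-question scores are resampled with replacement and averaged. The function name `bootstrap_means` and the placeholder scores are hypothetical; the scores are random numbers, not project results.

```python
import random
import statistics


def bootstrap_means(scores, iterations=10, seed=0):
    """Resample scores with replacement and return one mean per iteration."""
    rng = random.Random(seed)
    n = len(scores)
    means = []
    for _ in range(iterations):
        sample = rng.choices(scores, k=n)  # sampling with replacement
        means.append(statistics.fmean(sample))
    return means


# Placeholder per-question metric scores in [0, 1] (random, NOT real results).
placeholder_rng = random.Random(42)
scores = [placeholder_rng.random() for _ in range(50)]

means = bootstrap_means(scores, iterations=10)
estimate = statistics.fmean(means)  # central estimate of the metric
spread = statistics.stdev(means)    # variability across iterations
```

A small spread across iterations suggests the metric estimate does not depend heavily on which questions happened to be sampled.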


📈 Key Results

  • Adding images improved LLM response quality for general questions.
  • Specific questions benefited more from highly scoped contexts than from extra images.
  • Retrieving too many images could slightly decrease recall, so the balance of text and image vectors needs careful tuning.
  • Zero-shot models performed reasonably well on generic ML questions.

💡 Innovations and Future Work

Smart Agentic RAG (Future Direction):

  • Dynamically analyze query specificity.
  • Adjust the number of text and image vectors retrieved to suit each query.
  • Integrate knowledge graphs to cluster concepts and further improve retrieval relevance.
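
As a rough illustration of the first two ideas, the sketch below scores a query's specificity with a simple keyword heuristic and picks a retrieval budget accordingly. Everything here is a hypothetical stand-in: the term list, the threshold, the function names, and the mapping from specificity to vector counts are assumptions made for illustration, not part of the project. A real agentic system would likely use an LLM or a trained classifier for this step.

```python
import re

# Hypothetical budgets, loosely based on the configurations tested above.
GENERAL_BUDGET = {"text": 5, "image": 5}    # broad questions: mix in images
SPECIFIC_BUDGET = {"text": 10, "image": 0}  # narrow questions: scoped text only

# Hypothetical vocabulary of narrow technical terms.
SPECIFIC_TERMS = {"pca", "hdbscan", "k-means", "kmeans", "clip", "backpropagation"}


def specificity(query):
    """Fraction of query tokens that appear in the narrow-term vocabulary."""
    tokens = re.findall(r"[a-z0-9\-]+", query.lower())
    if not tokens:
        return 0.0
    hits = sum(token in SPECIFIC_TERMS for token in tokens)
    return hits / len(tokens)


def choose_budget(query, threshold=0.15):
    """Pick how many text and image vectors to retrieve for a query."""
    if specificity(query) >= threshold:
        return dict(SPECIFIC_BUDGET)
    return dict(GENERAL_BUDGET)
```

For example, `choose_budget("What is machine learning?")` falls back to the general budget, while a query that names several narrow terms crosses the threshold and receives the text-only budget.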

Challenges Ahead:

  • Evaluation remains computationally expensive.
  • Smart context pruning needs more exploration.

🙏 Acknowledgements

We are grateful to our mentors and sponsors:

  • Dr. Brian Wright (University of Virginia)
  • Lucas McCabe (LMI Inc.)
  • Dr. Brant Horio (LMI Inc.)
  • Ali Rivera (University of Virginia)

Special thanks to the UVA School of Data Science and LMI Inc. for supporting this research!


🌐 Live Visualization

➡️ Interactive Embedding Visualization of Vector DB Retrieval
