codewithfourtix/semantic-search-benchmark

🔍 Semantic Search Benchmark

Python 3.8+ · ChromaDB · Sentence Transformers · MIT License

⚡ High-Performance Semantic Search Comparison Framework

Compare different embedding models side-by-side with real benchmarks and metrics

Quick Start · Results · Models · Customization


📊 What This Project Does

This is a complete benchmarking framework for comparing different semantic search embedding models. Instead of guessing which model works best for semantic search, get real data:

  • Speed Metrics: Document indexing time and query latency
  • 🎯 Relevance Scores: Semantic distance measurements
  • 📈 Comparative Analysis: Side-by-side performance tables
  • 🧪 Real-World Testing: 20 diverse documents across 4 categories
  • 🔄 Reproducible Results: Same test suite every run
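
The speed metrics listed above can be gathered with a small wall-clock timing harness. The sketch below is illustrative only and is not the code in benchmark.py: `time_call`, `run_benchmark`, and the toy list-based store are hypothetical names, standing in for whatever indexing and query calls the real benchmark times.

```python
import time
from statistics import mean
from typing import Callable, Dict, List, Sequence


def time_call(fn: Callable[[], object]) -> float:
    """Return the wall-clock seconds taken by a single call to fn."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


def run_benchmark(
    add_documents: Callable[[Sequence[str]], object],
    query: Callable[[str], object],
    documents: Sequence[str],
    queries: Sequence[str],
) -> Dict[str, object]:
    """Time one bulk insert, then time each query individually."""
    add_seconds = time_call(lambda: add_documents(documents))
    # q=q binds the current query to each lambda.
    query_seconds = [time_call(lambda q=q: query(q)) for q in queries]
    return {
        "add_seconds": add_seconds,
        "query_seconds": query_seconds,
        "avg_query_seconds": mean(query_seconds),
    }


# Toy stand-in for a vector store (assumption): a list with keyword matching.
store: List[str] = []
report = run_benchmark(
    add_documents=store.extend,
    query=lambda text: [
        d for d in store if any(w in d.lower() for w in text.lower().split())
    ],
    documents=[
        "Ronaldo scored an incredible goal last night",
        "The stock market crashed badly this week",
    ],
    queries=["football player scored a goal", "money and market crash"],
)
print(f"Documents added in: {report['add_seconds']:.3f}s")
print(f"Avg query time: {report['avg_query_seconds']:.4f}s")
```

Swapping the two callables for real collection calls keeps the measurement logic unchanged, so every model is timed the same way.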

🎯 Models Compared

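| Model | Size | Speed | Accuracy | Best For |
| --- | --- | --- | --- | --- |
| 🔵 Default | Lightweight | ⚡⚡⚡ | ⭐⭐ | Quick prototypes |
| 🟢 MiniLM-L6-v2 | ~22M params | ⚡⚡⚡⚡ | ⭐⭐⭐⭐ | Production (Balanced) |
| 🟣 MPNet-base-v2 | ~109M params | ⚡⚡ | ⭐⭐⭐⭐⭐ | High-accuracy search |

Speed and accuracy ratings here are qualitative. The measured results differ in places: Default, for example, had the slowest queries in this run.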

📈 Benchmark Results

Performance Comparison

Testing: default
Documents added in: 3.573s
Avg query time: 0.4861s
Avg relevance distance: 0.8008 (lower is better)

Query                          Time (ms)    Distance
Sport-related query            640.50       0.9197
Tech-related query             455.72       0.8826
Finance-related query          434.48       0.6427
Health-related query           478.39       0.6865
Programming query              421.42       0.8725

─────────────────────────────────────────────────────────
Testing: all-MiniLM-L6-v2
Documents added in: 0.649s
Avg query time: 0.0313s
Avg relevance distance: 0.4004 (lower is better)

Query                          Time (ms)    Distance
Sport-related query            61.40        0.4599
Tech-related query             23.72        0.4413
Finance-related query          26.98        0.3214
Health-related query           26.36        0.3433
Programming query              18.19        0.4363

─────────────────────────────────────────────────────────
Testing: all-mpnet-base-v2
Documents added in: 1.396s
Avg query time: 0.1359s
Avg relevance distance: 0.4164 (lower is better)

Query                          Time (ms)    Distance
Sport-related query            156.71       0.5530
Tech-related query             140.52       0.4529
Finance-related query          160.53       0.3032
Health-related query           104.48       0.3088
Programming query              117.06       0.4640
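
In the results above, each Default distance is almost exactly twice the matching all-MiniLM-L6-v2 distance (0.9197 vs 0.4599, 0.8826 vs 0.4413, 0.6427 vs 0.3214, and so on). One possible explanation, which is an assumption and not confirmed by this repository, is that both runs use the same underlying model with different distance metrics. Chroma's documentation describes its default embedding function as all-MiniLM-L6-v2 and its default distance as squared L2. For unit-length embeddings, squared L2 distance is exactly twice cosine distance. If that is what happened here, the Default and MiniLM relevance numbers are not directly comparable. The sketch below checks the identity on arbitrary vectors:

```python
import math
from typing import List, Sequence


def normalize(v: Sequence[float]) -> List[float]:
    """Scale v to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity, assuming a and b are already unit length."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def squared_l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of squared coordinate differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Arbitrary example vectors (not real embeddings).
a = normalize([0.2, -1.3, 0.7, 0.4])
b = normalize([1.1, 0.5, -0.3, 0.9])

cos_d = cosine_distance(a, b)
l2_sq = squared_l2_distance(a, b)
print(f"cosine distance:     {cos_d:.4f}")
print(f"squared L2 distance: {l2_sq:.4f}")
print("squared L2 == 2 * cosine:", math.isclose(l2_sq, 2 * cos_d))
```

Using one distance metric for every collection would make the per-model distances comparable.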

🏆 Summary Comparison

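| Model | Add Time | Avg Query Time | Avg Distance (lower is better) | Speed Rank | Accuracy Rank |
| --- | --- | --- | --- | --- | --- |
| Default | 3.57s | 486ms | 0.8008 | 🥉 | 🥉 |
| MiniLM-L6-v2 | 0.65s | 31ms | 0.4004 | 🥇 | 🥇 |
| MPNet-base-v2 | 1.40s | 136ms | 0.4164 | 🥈 | 🥈 |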

🎖️ Winners

| Category | Winner | Metric |
| --- | --- | --- |
| ⚡ Fastest Document Indexing | all-MiniLM-L6-v2 | 0.649s |
| 🚀 Fastest Query Time | all-MiniLM-L6-v2 | 31ms avg |
| 🎯 Best Relevance Score | all-MiniLM-L6-v2 | 0.4004 distance |
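
The summary and winners can be recomputed from the per-query numbers in the Performance Comparison output. The sketch below copies those figures from this README and averages them. `RESULTS`, `summarize`, and `winner` are hypothetical names and not part of benchmark.py.

```python
from statistics import mean
from typing import Dict, List, Tuple

# (query time in ms, distance) per query, copied from the output above.
RESULTS: Dict[str, Dict[str, object]] = {
    "default": {
        "add_s": 3.573,
        "queries": [(640.50, 0.9197), (455.72, 0.8826), (434.48, 0.6427),
                    (478.39, 0.6865), (421.42, 0.8725)],
    },
    "all-MiniLM-L6-v2": {
        "add_s": 0.649,
        "queries": [(61.40, 0.4599), (23.72, 0.4413), (26.98, 0.3214),
                    (26.36, 0.3433), (18.19, 0.4363)],
    },
    "all-mpnet-base-v2": {
        "add_s": 1.396,
        "queries": [(156.71, 0.5530), (140.52, 0.4529), (160.53, 0.3032),
                    (104.48, 0.3088), (117.06, 0.4640)],
    },
}


def summarize(results: Dict[str, Dict[str, object]]) -> Dict[str, Dict[str, float]]:
    """Average the per-query times and distances for each model."""
    summary: Dict[str, Dict[str, float]] = {}
    for model, data in results.items():
        queries: List[Tuple[float, float]] = data["queries"]  # type: ignore[assignment]
        summary[model] = {
            "add_s": float(data["add_s"]),  # type: ignore[arg-type]
            "avg_query_ms": mean(t for t, _ in queries),
            "avg_distance": mean(d for _, d in queries),
        }
    return summary


def winner(summary: Dict[str, Dict[str, float]], metric: str) -> str:
    """Return the model with the lowest value for metric (lower is better)."""
    return min(summary, key=lambda model: summary[model][metric])


summary = summarize(RESULTS)
for model, row in summary.items():
    print(f"{model:20s} {row['add_s']:.2f}s  "
          f"{row['avg_query_ms']:.0f}ms  {row['avg_distance']:.4f}")
for metric in ("add_s", "avg_query_ms", "avg_distance"):
    print(f"best {metric}: {winner(summary, metric)}")
```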

🚀 Quick Start

📦 Installation

pip install -r requirements.txt

▶️ Run Benchmark

python benchmark.py

Expected runtime: ~2-3 minutes (the first run also downloads the embedding models)

📂 Project Structure

semantic-search-benchmark/
├── benchmark.py         # Main benchmark runner
├── benchmark_data.py    # Dataset & test queries
├── test.py              # Basic in-memory example
├── test_v2.py           # Persistent storage example
├── test_v3.py           # Semantic search demo
├── requirements.txt     # Dependencies
├── .gitignore           # Git ignore rules
└── README.md            # This file

🔍 Test Dataset

📚 20 Documents Across 4 Categories

⚽ Sports (5 docs)

  • Ronaldo scored an incredible goal last night
  • Messi won the World Cup with Argentina
  • Real Madrid won the Champions League final
  • Liverpool defeated Manchester City
  • Nadal won his 14th tennis grand slam

💻 Technology (5 docs)

  • Python is a great programming language for beginners
  • Machine learning is used in stock price prediction
  • TensorFlow and PyTorch are popular deep learning frameworks
  • Artificial intelligence is revolutionizing software development
  • Cloud computing provides scalable infrastructure

💰 Finance (5 docs)

  • The stock market crashed badly this week
  • Interest rates are rising due to inflation
  • Bitcoin reached a new all-time high
  • The Federal Reserve raised interest rates
  • Real estate prices continue to rise in major cities

🏥 Health (5 docs)

  • Regular exercise improves cardiovascular health
  • COVID-19 vaccines have saved millions of lives
  • Mental health awareness is becoming increasingly important
  • Healthy diet and sleep patterns prevent chronic diseases
  • Meditation reduces stress and anxiety

🎯 Test Queries (5 Total)

  1. ⚽ "football player scored a goal"
  2. 🧠 "machine learning and artificial intelligence"
  3. 📊 "money and market crash"
  4. 💪 "exercise and health"
  5. 🖥️ "programming languages and software"

📊 Key Insights

💡 Learning Points

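| Model | Speed | Accuracy | Use Case |
| --- | --- | --- | --- |
| Default | Slowest in this run (486ms avg query) | Low (0.8008 avg distance) | Quick tests, low-stakes |
| MiniLM | ⚡ Fastest (31ms avg query) | ⭐⭐⭐⭐ High (0.4004 avg distance, best in this run) | RECOMMENDED - Most scenarios |
| MPNet | Slower (136ms avg query) | 🎯 High (0.4164 avg distance, close second in this run) | Accuracy-critical tasks |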

🎓 Recommendations

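  • 🏃 Need speed? → Use MiniLM (fastest indexing and queries in this benchmark)
  • 🎯 Need accuracy? → Consider MPNet, the larger model, but verify on your own data: it was slower here, and its average distance (0.4164) was slightly behind MiniLM's (0.4004)
  • 🚀 Production ready? → Use MiniLM (best all-around)
  • 🧪 Just testing? → Use Default (fastest to set up)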

🔧 Customization

➕ Add More Embedding Models

Edit benchmark.py:

models_to_test = [
    ("default", "./benchmark_db_default"),
    ("all-MiniLM-L6-v2", "./benchmark_db_minilm"),
    ("all-mpnet-base-v2", "./benchmark_db_mpnet"),
    ("your-model-name", "./benchmark_db_custom"),  # Add here
]
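
Each `(model name, storage path)` pair might be turned into a Chroma collection as sketched below. `open_collection` is a hypothetical helper, the collection name `"benchmark"` and the choice of cosine distance are assumptions, and the real benchmark.py may be organized differently. The Chroma calls used are `PersistentClient`, `get_or_create_collection`, `DefaultEmbeddingFunction`, and `SentenceTransformerEmbeddingFunction`.

```python
from typing import List, Tuple

# Same shape as models_to_test in benchmark.py: (model name, storage path).
models_to_test: List[Tuple[str, str]] = [
    ("default", "./benchmark_db_default"),
    ("all-MiniLM-L6-v2", "./benchmark_db_minilm"),
    ("all-mpnet-base-v2", "./benchmark_db_mpnet"),
]


def open_collection(model_name: str, db_path: str):
    """Open a persistent Chroma collection embedded with the given model.

    Hypothetical helper, shown only to illustrate how a model name could be
    mapped to an embedding function.
    """
    # Imported lazily so this module loads even if chromadb is not installed.
    import chromadb
    from chromadb.utils import embedding_functions

    if model_name == "default":
        # Chroma's built-in embedding function.
        embed = embedding_functions.DefaultEmbeddingFunction()
    else:
        # Any Sentence Transformers model name, e.g. "all-mpnet-base-v2".
        embed = embedding_functions.SentenceTransformerEmbeddingFunction(
            model_name=model_name
        )

    client = chromadb.PersistentClient(path=db_path)
    return client.get_or_create_collection(
        name="benchmark",  # assumed collection name
        embedding_function=embed,
        # Assumption: one metric for every model keeps distances comparable.
        metadata={"hnsw:space": "cosine"},
    )
```

A typical loop would call `open_collection(name, path)` for each entry, add the documents, and run the test queries against the returned collection.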

➕ Add More Test Queries

Edit benchmark_data.py:

QUERY_TESTS = [
    {
        "query": "your query here",
        "expected_category": "category",
        "description": "Your description"
    },
]
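
Because each test carries an `expected_category`, one additional check is whether the top result for a query belongs to that category. The sketch below is a minimal illustration: `top1_hit_rate` and `fake_retrieve` are hypothetical, and the category labels `"sports"` and `"finance"` are assumed spellings that may not match benchmark_data.py.

```python
from typing import Callable, Dict, List

# Two entries in the QUERY_TESTS shape; the category strings are assumptions.
QUERY_TESTS: List[Dict[str, str]] = [
    {
        "query": "football player scored a goal",
        "expected_category": "sports",
        "description": "Sport-related query",
    },
    {
        "query": "money and market crash",
        "expected_category": "finance",
        "description": "Finance-related query",
    },
]


def top1_hit_rate(
    query_tests: List[Dict[str, str]],
    retrieve: Callable[[str], List[str]],
) -> float:
    """Fraction of queries whose top-ranked document has the expected category.

    retrieve(query) must return the categories of the retrieved documents,
    best match first.
    """
    hits = 0
    for test in query_tests:
        categories = retrieve(test["query"])
        if categories and categories[0] == test["expected_category"]:
            hits += 1
    return hits / len(query_tests)


def fake_retrieve(query: str) -> List[str]:
    """Stand-in retriever: right for the sports query, wrong for finance."""
    if "goal" in query:
        return ["sports", "sports"]
    return ["technology", "finance"]


print(top1_hit_rate(QUERY_TESTS, fake_retrieve))
```

In a real run, `retrieve` would query the collection and read each result's category from its metadata. Unlike raw distances, a hit rate can be compared across models that use different distance scales.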

➕ Modify Test Documents

Edit benchmark_data.py DOCUMENTS list to test different domains.


📚 Resources & References

| Resource | Link | Description |
| --- | --- | --- |
| 📖 ChromaDB Docs | trychroma.com | Official documentation |
| 🤗 Sentence Transformers | sbert.net | Embedding models & guide |
| 🏆 Model Leaderboard | huggingface.co/spaces/mteb | Compare all models |
| 🔬 MTEB Benchmark | github.com/embeddings-benchmark | Industry benchmarks |

🛠️ Requirements

  • Python 3.8+
  • chromadb >= 0.4.0
  • sentence-transformers >= 2.2.0
  • numpy >= 1.21.0

📝 Files Overview

| File | Purpose |
| --- | --- |
| benchmark.py | 🎯 Core benchmarking logic & runner |
| benchmark_data.py | 📚 Test dataset & queries |
| test.py | 📖 Basic embedding example |
| test_v2.py | 💾 Persistent storage demonstration |
| test_v3.py | 🔍 Semantic search demo |
| requirements.txt | 📦 Python dependencies |
| .gitignore | 🚫 Git exclusion rules |

🚦 Getting Started

Step 1: Install Dependencies

pip install -r requirements.txt

Step 2: Run Benchmark

python benchmark.py

Step 3: Review Results

  • Check terminal output for detailed metrics
  • Compare models in the summary table
  • Identify winner for your use case

Step 4: Customize (Optional)

  • Add your own documents to benchmark_data.py
  • Test additional embedding models
  • Modify query tests for your domain

📊 Performance Characteristics

⚡ Speed Profile

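  • Default: No extra setup, but the slowest indexing and queries in this run (3.57s to index, 486ms avg query)
  • MiniLM: Best performance (fastest overall: 0.65s to index, 31ms avg query)
  • MPNet: Higher latency (136ms avg query), with relevance close to MiniLM's in this run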

🎯 Accuracy Profile

  • Default: Basic semantic understanding
  • MiniLM: Good understanding of document relationships
  • MPNet: Excellent semantic comprehension

💾 Memory Usage

  • Default: Minimal footprint
  • MiniLM: ~384MB loaded (~22M params)
  • MPNet: ~1.2GB loaded (~109M params)
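
A rough lower bound on model memory is the size of the weights alone: parameter count times bytes per parameter. The sketch below assumes float32 weights (4 bytes each) and roughly 22.7M parameters for all-MiniLM-L6-v2, which is an assumption taken from the public model card, and 109M for all-mpnet-base-v2. `weights_megabytes` is a hypothetical helper. Loaded footprints like those listed above also include the framework runtime, tokenizer, and activation buffers, so they will be larger than this estimate.

```python
def weights_megabytes(num_params: int, bytes_per_param: int = 4) -> float:
    """Estimate weight storage in MB (1 MB = 1,000,000 bytes)."""
    return num_params * bytes_per_param / 1_000_000


# Parameter counts are approximate; the MiniLM figure is an assumption.
PARAM_COUNTS = {
    "all-MiniLM-L6-v2": 22_700_000,
    "all-mpnet-base-v2": 109_000_000,
}

for model, params in PARAM_COUNTS.items():
    print(f"{model}: ~{weights_megabytes(params):.0f} MB of float32 weights")
```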

🤝 Contributing

Have improvements? Found a bug? Want to add:

  • More embedding models for comparison?
  • Different benchmark datasets?
  • Additional metrics?

Feel free to fork and submit pull requests!


📄 License

MIT License - Feel free to use this in your projects!


⭐ If This Helped You, Consider Giving It a Star! ⭐

Built with ❤️ for semantic search enthusiasts

Made to compare, benchmark, and optimize ChromaDB embedding models

Questions? Open an issue or check the docs 📚
