S3 Vectors Benchmark

A comprehensive benchmark suite for comparing Amazon S3 Vectors against FAISS, NMSLib, and brute-force search methods at scale (10K to 10M vectors).

Overview

This project provides a complete framework for benchmarking vector similarity search performance across different methods:

  • Amazon S3 Vectors - AWS managed vector database
  • FAISS - Facebook AI Similarity Search (HNSW index)
  • NMSLib - Non-Metric Space Library (HNSW index)
  • Brute-force - Baseline cosine similarity search
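The brute-force baseline is conceptually simple: score every stored vector against the query by cosine similarity and keep the top K. A minimal pure-Python sketch of the idea (the project's actual implementation lives in src/vector_dbs/bruteforce.py and is likely vectorized):

```python
import math
import heapq

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def bruteforce_search(query, vectors, top_k=5):
    """Return (index, similarity) pairs for the top_k most similar vectors."""
    scored = ((i, cosine_similarity(query, v)) for i, v in enumerate(vectors))
    return heapq.nlargest(top_k, scored, key=lambda pair: pair[1])

# Example: three 3-d vectors, query identical to vector 0
db = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
top = bruteforce_search([1.0, 0.0, 0.0], db, top_k=2)
```

Unlike the HNSW-based methods, this scan is exact but O(n) per query, which is why it only serves as the accuracy baseline.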

The benchmark evaluates:

  • Query latency across different vector counts
  • Search accuracy (Recall@K) using UKBench dataset
  • Scalability from 10K to 10M vectors
  • Memory efficiency and resource usage
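Recall@K measures what fraction of a query's ground-truth neighbors appear in the top-K results. UKBench groups its 10,200 images into sets of four near-duplicates, so each image's ground truth is its own group. A hedged sketch of the computation (the project's actual logic lives in src/evaluate.py and may differ in detail):

```python
def ukbench_ground_truth(image_id):
    """UKBench images come in groups of 4; IDs in the same group
    (including the query itself) are the ground-truth matches."""
    group = image_id // 4
    return set(range(group * 4, group * 4 + 4))

def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of relevant IDs found among the top-k retrieved IDs."""
    hits = len(set(retrieved_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids)

relevant = ukbench_ground_truth(5)   # images 4..7
retrieved = [5, 4, 99, 6, 7]         # a hypothetical top-5 result list
score = recall_at_k(retrieved, relevant, k=5)  # all 4 group members found
```

This grouping is also why 10200 appears as the smallest vector count in the benchmark commands below.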

Features

  • 🚀 Multiple Vector Databases: Support for S3 Vectors, FAISS, NMSLib, and brute-force
  • 📊 Comprehensive Metrics: Query latency, recall, precision, and more
  • 📈 Visualization: Automatic chart generation for results analysis
  • 🔄 Resume Capability: Checkpoint support for long-running benchmarks
  • 💾 Embedding Caching: Efficient storage and retrieval of embeddings
  • 🎯 UKBench Dataset: Standard evaluation dataset with ground truth
  • ⚙️ Configurable: YAML-based configuration for all parameters

Prerequisites

  • Python: 3.9 or higher
  • uv: Fast Python package installer (recommended) or pip as fallback
  • AWS Account: For S3 Vectors testing
  • Storage: Sufficient disk space for datasets (~10-50 GB)
  • Memory: 8 GB+ RAM recommended
  • GPU (optional): For faster embedding generation

Installation

1. Clone the repository

git clone https://github.com/Siddhant-K-code/s3-vectors-benchmark.git
cd s3-vectors-benchmark

2. Install uv (if not already installed)

# macOS and Linux
curl -LsSf https://astral.sh/uv/install.sh | sh

# Windows
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"

# Or with pip
pip install uv

3. Install dependencies with uv

Option A: Using uv sync (recommended)

# Install project and dependencies (creates .venv automatically)
uv sync

# Or with dev dependencies for testing
uv sync --dev

Option B: Manual virtual environment

# Create virtual environment
uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install project in editable mode with dependencies
uv pip install -e .

# Or install with dev dependencies
uv pip install -e ".[dev]"

4. Configure AWS credentials

aws configure

Or set environment variables:

export AWS_ACCESS_KEY_ID=your_access_key
export AWS_SECRET_ACCESS_KEY=your_secret_key
export AWS_DEFAULT_REGION=us-east-1

5. Create configuration file

cp config.yaml.example config.yaml
# Edit config.yaml with your settings

Required settings in config.yaml:

  • AWS region and S3 bucket name
  • Dataset cache directories
  • Benchmark parameters
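A minimal config.yaml might look like the sketch below. The aws, s3_vectors, and benchmark keys follow the fragments shown in the Configuration section later in this README; the data block is an assumption — check config.yaml.example for the actual key names.

```yaml
aws:
  region: us-east-1
  profile: default
  bucket_name: your-vector-bucket-name

s3_vectors:
  index_name: benchmark-index
  metric_type: cosine
  batch_size: 500

data:
  cache_dir: data/        # dataset cache directory (assumed key name)

benchmark:
  vector_counts: [10200, 100000, 1000000]
  dimensions: [384]
  topk: 5
  num_queries: 100
  num_repeats: 3
```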

Quick Start

1. Download and prepare datasets

uv run python src/main.py prepare-data --dataset ukbench --download
uv run python src/main.py prepare-data --dataset coco --download

2. Generate embeddings

uv run python src/main.py generate-embeddings --model vit-s --dimension 384

This will:

  • Load images from UKBench dataset
  • Generate embeddings using DINOv2-small model
  • Cache embeddings to HDF5 file

3. Run benchmark

uv run python src/main.py benchmark \
  --embeddings data/embeddings/vit-s/embeddings_384d.h5 \
  --vectors 10200 100000 1000000 \
  --methods s3_vectors faiss nmslib \
  --quick

The --quick flag runs a smaller test with fewer queries.

4. Generate visualizations

uv run python src/main.py visualize --latest

This generates three charts:

  • processing_time_ratio.png - Processing time normalized to smallest dataset
  • search_accuracy.png - Recall@K across different vector counts
  • processing_time_ms.png - Query latency in milliseconds (S3 Vectors)

Usage

Data Preparation

Download and prepare datasets:

# UKBench only
uv run python src/main.py prepare-data --dataset ukbench --download

# COCO only
uv run python src/main.py prepare-data --dataset coco --download

# All datasets
uv run python src/main.py prepare-data --dataset all --download

Embedding Generation

Generate embeddings with different models:

# DINOv2-small (384-dim)
uv run python src/main.py generate-embeddings --model vit-s

# DINOv2-base (768-dim)
uv run python src/main.py generate-embeddings --model vit-b

# DINOv2-large (1024-dim)
uv run python src/main.py generate-embeddings --model vit-l

Benchmark Execution

Run full benchmark suite:

uv run python src/main.py benchmark \
  --embeddings data/embeddings/vit-s/embeddings_384d.h5 \
  --vectors 10200 100000 500000 1000000 10000000 \
  --dimensions 384 \
  --methods s3_vectors faiss nmslib bruteforce \
  --output results/full_benchmark.json

Options:

  • --vectors: Vector counts to test (multiple values)
  • --dimensions: Vector dimensions to test
  • --methods: Methods to benchmark (s3_vectors, faiss, nmslib, bruteforce)
  • --quick: Quick test with fewer queries
  • --dry-run: Estimate time/resources without running

Testing S3 Vectors

Test S3 Vectors connection and basic operations:

uv run python src/main.py test-s3 \
  --embeddings data/embeddings/vit-s/embeddings_384d.h5 \
  --vectors 1000000 \
  --dimension 384

Visualization

Generate charts from results:

# Use latest results file
uv run python src/main.py visualize --latest

# Specify results file
uv run python src/main.py visualize \
  --results-file results/benchmark_results_20250102_120000.json \
  --output-dir results/charts

Project Structure

s3-vectors-benchmark/
├── README.md                 # This file
├── pyproject.toml            # Modern Python package configuration (uv)
├── requirements.txt          # Legacy pip requirements (for reference)
├── setup.py                  # Package setup (legacy)
├── config.yaml.example       # Configuration template
├── .gitignore                # Git ignore rules
├── .python-version           # Python version specification
├── src/                      # Source code
│   ├── __init__.py
│   ├── main.py               # CLI entry point
│   ├── config.py             # Configuration management
│   ├── data_loader.py        # Dataset loading
│   ├── embeddings.py         # Embedding generation
│   ├── benchmark.py          # Benchmark orchestration
│   ├── evaluate.py           # Accuracy evaluation
│   ├── visualize.py          # Chart generation
│   ├── utils.py              # Utility functions
│   └── vector_dbs/           # Vector database implementations
│       ├── base.py           # Abstract base class
│       ├── s3_vectors.py     # S3 Vectors implementation
│       ├── faiss_db.py       # FAISS implementation
│       ├── nmslib_db.py      # NMSLib implementation
│       └── bruteforce.py     # Brute-force baseline
├── tests/                    # Test suite
├── notebooks/                # Jupyter notebooks
├── docs/                     # Documentation
├── data/                     # Datasets (git-ignored)
└── results/                  # Results (git-ignored)

Configuration

Edit config.yaml to customize:

AWS Configuration

aws:
  region: us-east-1
  profile: default
  bucket_name: your-vector-bucket-name

S3 Vectors Configuration

s3_vectors:
  index_name: benchmark-index
  metric_type: cosine  # or euclidean
  batch_size: 500

Benchmark Configuration

benchmark:
  vector_counts: [10200, 100000, 500000, 1000000, 10000000]
  dimensions: [384, 768, 1024]
  topk: 5
  num_queries: 100
  num_repeats: 3
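These parameters multiply: each (method, vector count, dimension) configuration runs num_queries queries, num_repeats times. A quick back-of-the-envelope check, assuming the full cross-product of the values above is actually run (the benchmark may skip some combinations):

```python
# Rough query-volume estimate for the configuration shown above.
vector_counts = [10200, 100000, 500000, 1000000, 10000000]
dimensions = [384, 768, 1024]
methods = ["s3_vectors", "faiss", "nmslib", "bruteforce"]
num_queries, num_repeats = 100, 3

configs = len(vector_counts) * len(dimensions) * len(methods)
total_queries = configs * num_queries * num_repeats
print(configs, total_queries)  # 60 configurations, 18000 timed queries
```

Trimming vector_counts or dimensions is the easiest way to keep a first run short.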

Results Analysis

After running benchmarks, results are saved to JSON files with:

  • Raw measurements: Query latency, result IDs, similarities
  • Evaluation metrics: Recall@K, Precision@K, aggregated statistics
  • Metadata: Configuration, timestamps, vector counts

Reading Results

Results are stored in JSON format:

import json

with open("results/benchmark_results_20250102.json", "r") as f:
    results = json.load(f)

# Access evaluation metrics
evaluation = results["evaluation"]
for config_key, metrics in evaluation.items():
    print(f"{config_key}:")
    print(f"  Recall@K: {metrics['recall_at_k']['mean']:.3f}")
    print(f"  Query time: {metrics['query_time_ms']['mean']:.2f} ms")

Charts

Charts are automatically generated showing:

  1. Processing Time Ratio: Normalized query latency across methods
  2. Search Accuracy: Recall@K across vector counts
  3. Processing Time (ms): S3 Vectors query latency

Troubleshooting

AWS Credentials Not Found

# Verify credentials
aws sts get-caller-identity

# Or set environment variables
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...

S3 Bucket Not Found

Ensure the bucket exists and is accessible:

aws s3 ls s3://your-bucket-name

Out of Memory

For large datasets:

  • Reduce batch_size in embeddings config
  • Process in smaller chunks
  • Use GPU for embedding generation
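"Process in smaller chunks" can be as simple as iterating over the data in fixed-size batches instead of loading everything at once. A generic sketch — batch size and iteration style are illustrative, not the project's actual code:

```python
def iter_batches(items, batch_size):
    """Yield successive batch_size-sized slices of items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# e.g. embed or upload 10 items 4 at a time instead of all at once
batches = list(iter_batches(list(range(10)), batch_size=4))
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Peak memory then scales with the batch size rather than the dataset size.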

Dataset Download Fails

Some datasets may require manual download. Check:

  • Network connectivity
  • Sufficient disk space
  • Dataset availability

uv Installation Issues

If uv is not found:

  • Ensure it's in your PATH
  • Restart your terminal after installation
  • Use pip install uv as fallback

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests (run with uv run pytest)
  5. Submit a pull request

License

MIT License - see LICENSE file for details

Citation

If you use this benchmark in your research, please cite:

@software{s3_vectors_benchmark,
  title={S3 Vectors Benchmark},
  author={Siddhant Khare},
  year={2025},
  url={https://github.com/Siddhant-K-code/s3-vectors-benchmark}
}

Support

For issues and questions:

  • Open an issue on GitHub
  • Check documentation in docs/ directory
  • Review example notebooks in notebooks/
