This repository contains code and resources for our paper "Self-Supervised Representation Learning as Mutual Information Maximization" (arXiv:2510.01345).
Our work unifies SSRL methods under two distinct optimization paradigms:
Figure 1: Canonical forms of our proposed paradigms: (a) SDMI alternates updates between two encoders using stop-gradients, while (b) JMI jointly updates both views with shared gradients.
- A Unified MI Maximization View: We formulate a general MI maximization perspective under the Donsker–Varadhan (DV) bound, showing that existing SSRL methods implicitly follow one of two optimization paradigms: Self-Distillation MI (SDMI) or Joint MI (JMI).
- Explaining Architectural Components: We show that design elements such as stop-gradients, exponential moving average targets, predictor networks, and statistical regularizers are not heuristics but theoretical necessities under MI-based objectives, providing a formal explanation for common design choices.
- Unifying Existing SSRL Methods: We show that many well-known SSRL approaches (e.g., SimCLR, BYOL, SimSiam) map directly onto our two paradigms, unifying the field under a shared theoretical lens and offering guidance for future method design.
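For reference, the DV bound states that I(X;Z) ≥ E_P[T] − log E_{P×Q}[e^T] for any critic T, where P is the joint distribution over views and P×Q the product of marginals. A minimal NumPy sketch of this estimator (the toy dot-product critic and synthetic data below are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def dv_bound(scores_joint, scores_marginal):
    """Donsker-Varadhan lower bound on mutual information.

    scores_joint:    critic scores T(x, z) on paired (joint) samples
    scores_marginal: critic scores on shuffled (product-of-marginals) samples
    """
    return scores_joint.mean() - np.log(np.exp(scores_marginal).mean())

rng = np.random.default_rng(0)
x = rng.normal(size=(1024, 8))
z = x + 0.1 * rng.normal(size=(1024, 8))   # correlated second view
z_shuf = z[rng.permutation(len(z))]        # break the pairing

critic = lambda a, b: (a * b).sum(axis=1)  # toy dot-product critic
mi_est = dv_bound(critic(x, z), critic(x, z_shuf))  # positive for correlated views
```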
All dependencies are listed in requirements.txt.

```shell
git clone https://github.com/layer6ai-labs/ssrl-mi-maximization.git
cd ssrl-mi-maximization
pip install -r requirements.txt
```

All experimental scripts are provided in the scripts/ directory. The codebase supports training on CIFAR-10/100, TinyImageNet, and ImageNet100 datasets with various SSRL methods.
Each script is self-contained and includes all necessary hyperparameters. Run from the project root directory:
```shell
# Run specific experiments from the project root
bash scripts/CIFAR10/SDMI.sh
bash scripts/CIFAR10/JMI.sh
bash scripts/CIFAR100/BarlowTwins.sh
bash scripts/ImageNet100/SimCLR.sh
```

Alternatively, call main.py directly with explicit hyperparameters, for example:

```shell
python3 main.py \
    --model_name SDMI \
    --dataset ImageNet100 \
    --architecture ResNet50 \
    --num_classes 100 \
    --epochs 800 \
    --warmup_epochs 10 \
    --batch_size 64 \
    --initial_lr 0.05 \
    --weight_decay 0.0001 \
    --temperature 0.1 \
    --num_runs 3 \
    --augmentation \
    --feature_dim 512 \
    --projection_dim 256 \
    --projection_layer 3 \
    --model_save_interval 50 \
    --model_evaluation_interval 1000 \
    --num_workers 16
```

Implemented methods:
- SDMI Prototype: Our canonical Self-Distillation MI implementation
- JMI Prototype: Our canonical Joint MI implementation
- Baseline Methods: SimCLR, BYOL, SimSiam, MoCo-v3, Barlow Twins, VICReg
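The self-distillation methods (SDMI, BYOL, MoCo-v3) maintain a target encoder gξ as an exponential moving average (EMA) of the online encoder fθ, with gradients stopped through the target branch. A minimal NumPy sketch of that update rule (the parameter shapes and τ value are illustrative, not the repo's settings):

```python
import numpy as np

def ema_update(theta, xi, tau=0.99):
    """Target params xi track online params theta: xi <- tau*xi + (1-tau)*theta.

    No gradient ever flows into xi -- in the paper's terms, this is the
    stop-gradient on the target branch of the SDMI paradigm.
    """
    return tau * xi + (1.0 - tau) * theta

theta = np.ones(4)       # toy online-encoder parameters (fixed here for clarity)
xi = np.zeros(4)         # toy target-encoder parameters
for _ in range(100):     # simulate 100 online steps
    xi = ema_update(theta, xi)
# after k steps with fixed theta, xi = 1 - tau**k elementwise
```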
Supported datasets:
- CIFAR-10/100
- TinyImageNet
- ImageNet100
- Synthetic Gaussian Mixture (for controlled experiments)
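A synthetic Gaussian mixture of this kind can be generated along the following lines (five clusters, matching Figure 3; the function name, dimensionality, and spread are illustrative assumptions, not the repo's API):

```python
import numpy as np

def gaussian_mixture(n_per_cluster=200, n_clusters=5, dim=2, spread=0.2, seed=0):
    """Sample points from n_clusters isotropic Gaussians with random centers."""
    rng = np.random.default_rng(seed)
    centers = rng.normal(size=(n_clusters, dim))          # cluster centers
    labels = np.repeat(np.arange(n_clusters), n_per_cluster)
    points = centers[labels] + spread * rng.normal(size=(len(labels), dim))
    return points, labels, centers

X, y, C = gaussian_mixture()   # 5 clusters x 200 points in 2-D
```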
Figure 2: Estimated MI over CIFAR10 training for SDMI-based (top row) and JMI-based (bottom row) methods, using three estimators (cos–DV, InfoNCE and JSD; left to right). Both paradigms exhibit consistent MI growth: SDMI curves feature early fluctuations before trending upward, while JMI estimates rise more uniformly, and to much higher levels.
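Of the three estimators in Figure 2, InfoNCE lower-bounds MI by log N minus the contrastive cross-entropy over a batch of N paired embeddings. A minimal NumPy version (the cosine-similarity critic and temperature are illustrative choices, not the repo's exact estimator):

```python
import numpy as np

def infonce_bound(za, zb, temperature=0.1):
    """InfoNCE MI lower bound for paired embeddings (za[i], zb[i])."""
    za = za / np.linalg.norm(za, axis=1, keepdims=True)   # L2-normalize views
    zb = zb / np.linalg.norm(zb, axis=1, keepdims=True)
    logits = za @ zb.T / temperature                      # (N, N) similarities
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return np.log(len(za)) + np.diag(log_probs).mean()    # log N - cross-entropy

rng = np.random.default_rng(0)
a = rng.normal(size=(256, 16))
b = a + 0.05 * rng.normal(size=(256, 16))  # highly correlated second view
mi = infonce_bound(a, b)                   # positive, capped at log(256)
```

Note the estimate can never exceed log N, which is why Figure 2's InfoNCE curves saturate at much lower levels than DV-based estimates for large true MI.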
Figure 3: Embedding trajectories of the five Gaussian cluster centers. Opacity increases over training. SDMI separates centers more distinctly than analogous methods.
| Model | CIFAR10 | CIFAR100 | TinyImageNet | ImageNet100 |
|---|---|---|---|---|
| SDMI prototype (fθ) | 88.61 ± 0.13 | 57.37 ± 0.38 | 33.30 ± 0.58 | 70.73 ± 0.57 |
| SDMI prototype (gξ) | 88.59 ± 0.35 | 57.85 ± 0.32 | 32.94 ± 0.71 | 70.83 ± 0.16 |
| SimSiam | 89.72 ± 0.18 | 60.45 ± 0.60 | 19.19 ± 0.69 | 78.23 ± 0.58 |
| BYOL | 91.28 ± 0.16 | 63.11 ± 0.21 | 32.77 ± 0.10 | 81.09 ± 0.61 |
| MoCo-v3 | 91.10 ± 0.16 | 58.90 ± 0.32 | 32.18 ± 0.55 | 76.86 ± 0.74 |
| JMI prototype | 88.01 ± 0.48 | 57.22 ± 0.56 | 32.23 ± 0.52 | 73.41 ± 0.36 |
| SimCLR | 87.24 ± 0.37 | 55.32 ± 0.46 | 33.79 ± 0.31 | 75.31 ± 0.76 |
| Barlow Twins | 85.56 ± 0.71 | 51.91 ± 0.49 | 30.26 ± 0.12 | 78.96 ± 0.30 |
| VICReg | 85.49 ± 1.03 | 54.00 ± 0.34 | 32.03 ± 0.32 | 78.86 ± 0.23 |
Table 1: Linear probing accuracy (%) on four datasets. Mean ± std over 3 runs.
- Monotonic MI Growth: Both paradigms demonstrate consistent mutual information increase during training
- Competitive Performance: Canonical forms achieve performance comparable to established methods
- Theoretical Alignment: Empirical behavior matches theoretical predictions
To reproduce the main results from our paper, run each of the scripts in the scripts/ directory.
The repository is organized as follows:

```
.
├── main.py                   # Entry point for training
├── README.md
├── requirements.txt
├── assets/                   # Figures, logos, and result plots for README/paper
├── controlled_experiments/   # Experiments using smaller networks for the Gaussian dataset
├── scripts/                  # Training scripts for all datasets and methods
│   ├── CIFAR10/
│   ├── CIFAR100/
│   ├── TinyImageNet/
│   └── ImageNet100/
└── src/
    ├── trainer.py            # Core training loop
    ├── evaluation.py         # Linear probing and metrics
    ├── models/               # SDMI, JMI, and baseline implementations
    └── utils/                # Losses, data loaders, checkpointing, logging
```
If you use any part of this repository in your research, please cite the associated paper with the following BibTeX entry:

```bibtex
@article{sabby2025ssl,
  title={Self-Supervised Representation Learning as Mutual Information Maximization},
  author={Sabby, Akhlaqur Rahman and Sui, Yi and Wu, Tongzi and Cresswell, Jesse C and Wu, Ga},
  journal={arXiv:2510.01345},
  year={2025}
}
```






