This repository contains code and resources for our paper "Self-Supervised Representation Learning as Mutual Information Maximization" (arXiv:2510.01345).
Our work unifies SSRL methods under two distinct optimization paradigms:
Figure 1: Canonical forms of our proposed paradigms: (a) SDMI alternates updates between two encoders using stop-gradients, while (b) JMI jointly updates both views with shared gradients.
- A Unified MI Maximization View: We formulate a general MI maximization perspective under the Donsker–Varadhan (DV) bound, showing that existing SSRL methods implicitly follow one of two optimization paradigms: Self-Distillation MI (SDMI) or Joint MI (JMI).
- Explaining Architectural Components: We show that design elements such as stop-gradients, exponential moving average targets, predictor networks, and statistical regularizers are not heuristics but theoretical necessities under MI-based objectives, providing a formal explanation for common design choices.
- Unifying Existing SSRL Methods: We show that many well-known SSRL approaches (e.g., SimCLR, BYOL, SimSiam) map directly onto our two paradigms, unifying the field under a shared theoretical lens and offering guidance for future method design.
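For reference, the DV bound states that I(X;Z) ≥ E_P[T] − log E_{P×Q}[e^T] for any critic T, where P is the joint distribution over views and P×Q the product of marginals. A minimal NumPy sketch of this estimator (the toy dot-product critic and synthetic data below are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def dv_bound(scores_joint, scores_marginal):
    """Donsker-Varadhan lower bound on mutual information.

    scores_joint:    critic scores T(x, z) on paired (joint) samples
    scores_marginal: critic scores on shuffled (product-of-marginals) samples
    """
    return scores_joint.mean() - np.log(np.exp(scores_marginal).mean())

rng = np.random.default_rng(0)
x = rng.normal(size=(1024, 8))
z = x + 0.1 * rng.normal(size=(1024, 8))   # correlated second view
z_shuf = z[rng.permutation(len(z))]        # break the pairing

critic = lambda a, b: (a * b).sum(axis=1)  # toy dot-product critic
mi_est = dv_bound(critic(x, z), critic(x, z_shuf))  # positive for correlated views
```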
All dependencies are listed in requirements.txt.

```shell
git clone https://github.com/layer6ai-labs/ssrl-mi-maximization.git
cd ssrl-mi-maximization
pip install -r requirements.txt
```

All experimental scripts are provided in the scripts/ directory. The codebase supports training on CIFAR-10/100, TinyImageNet, and ImageNet100 datasets with various SSRL methods.
Each script is self-contained and includes all necessary hyperparameters. Run from the project root directory:
```shell
# Run specific experiments from the project root
bash scripts/CIFAR10/SDMI.sh
bash scripts/CIFAR10/JMI.sh
bash scripts/CIFAR100/BarlowTwins.sh
bash scripts/ImageNet100/SimCLR.sh
```

Alternatively, call main.py directly with explicit hyperparameters, for example:

```shell
python3 main.py \
    --model_name SDMI \
    --dataset ImageNet100 \
    --architecture ResNet50 \
    --num_classes 100 \
    --epochs 800 \
    --warmup_epochs 10 \
    --batch_size 64 \
    --initial_lr 0.05 \
    --weight_decay 0.0001 \
    --temperature 0.1 \
    --num_runs 3 \
    --augmentation \
    --feature_dim 512 \
    --projection_dim 256 \
    --projection_layer 3 \
    --model_save_interval 50 \
    --model_evaluation_interval 1000 \
    --num_workers 16
```

Implemented methods:
- SDMI Prototype: Our canonical Self-Distillation MI implementation
- JMI Prototype: Our canonical Joint MI implementation
- Baseline Methods: SimCLR, BYOL, SimSiam, MoCo-v3, Barlow Twins, VICReg
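The self-distillation methods (SDMI, BYOL, MoCo-v3) maintain a target encoder gξ as an exponential moving average (EMA) of the online encoder fθ, with gradients stopped through the target branch. A minimal NumPy sketch of that update rule (the parameter shapes and τ value are illustrative, not the repo's settings):

```python
import numpy as np

def ema_update(theta, xi, tau=0.99):
    """Target params xi track online params theta: xi <- tau*xi + (1-tau)*theta.

    No gradient ever flows into xi -- in the paper's terms, this is the
    stop-gradient on the target branch of the SDMI paradigm.
    """
    return tau * xi + (1.0 - tau) * theta

theta = np.ones(4)       # toy online-encoder parameters (fixed here for clarity)
xi = np.zeros(4)         # toy target-encoder parameters
for _ in range(100):     # simulate 100 online steps
    xi = ema_update(theta, xi)
# after k steps with fixed theta, xi = 1 - tau**k elementwise
```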
Supported datasets:
- CIFAR-10/100
- TinyImageNet
- ImageNet100
- Synthetic Gaussian Mixture (for controlled experiments)
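A synthetic Gaussian mixture of this kind can be generated along the following lines (five clusters, matching Figure 3; the function name, dimensionality, and spread are illustrative assumptions, not the repo's API):

```python
import numpy as np

def gaussian_mixture(n_per_cluster=200, n_clusters=5, dim=2, spread=0.2, seed=0):
    """Sample points from n_clusters isotropic Gaussians with random centers."""
    rng = np.random.default_rng(seed)
    centers = rng.normal(size=(n_clusters, dim))          # cluster centers
    labels = np.repeat(np.arange(n_clusters), n_per_cluster)
    points = centers[labels] + spread * rng.normal(size=(len(labels), dim))
    return points, labels, centers

X, y, C = gaussian_mixture()   # 5 clusters x 200 points in 2-D
```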
Figure 2: Estimated MI over CIFAR10 training for SDMI-based (top row) and JMI-based (bottom row) methods, using three estimators (cos–DV, InfoNCE and JSD; left to right). Both paradigms exhibit consistent MI growth: SDMI curves feature early fluctuations before trending upward, while JMI estimates rise more uniformly, and to much higher levels.
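Of the three estimators in Figure 2, InfoNCE lower-bounds MI by log N minus the contrastive cross-entropy over a batch of N paired embeddings. A minimal NumPy version (the cosine-similarity critic and temperature are illustrative choices, not the repo's exact estimator):

```python
import numpy as np

def infonce_bound(za, zb, temperature=0.1):
    """InfoNCE MI lower bound for paired embeddings (za[i], zb[i])."""
    za = za / np.linalg.norm(za, axis=1, keepdims=True)   # L2-normalize views
    zb = zb / np.linalg.norm(zb, axis=1, keepdims=True)
    logits = za @ zb.T / temperature                      # (N, N) similarities
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return np.log(len(za)) + np.diag(log_probs).mean()    # log N - cross-entropy

rng = np.random.default_rng(0)
a = rng.normal(size=(256, 16))
b = a + 0.05 * rng.normal(size=(256, 16))  # highly correlated second view
mi = infonce_bound(a, b)                   # positive, capped at log(256)
```

Note the estimate can never exceed log N, which is why Figure 2's InfoNCE curves saturate at much lower levels than DV-based estimates for large true MI.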
Figure 3: Embedding trajectories of the five Gaussian cluster centers. Opacity increases over training. SDMI separates centers more distinctly than analogous methods.
| Model | CIFAR10 | CIFAR100 | TinyImageNet | ImageNet100 |
|---|---|---|---|---|
| SDMI prototype (fθ) | 88.61 ± 0.13 | 57.37 ± 0.38 | 33.30 ± 0.58 | 70.73 ± 0.57 |
| SDMI prototype (gξ) | 88.59 ± 0.35 | 57.85 ± 0.32 | 32.94 ± 0.71 | 70.83 ± 0.16 |
| SimSiam | 89.72 ± 0.18 | 60.45 ± 0.60 | 19.19 ± 0.69 | 78.23 ± 0.58 |
| BYOL | 91.28 ± 0.16 | 63.11 ± 0.21 | 32.77 ± 0.10 | 81.09 ± 0.61 |
| MoCo-v3 | 91.10 ± 0.16 | 58.90 ± 0.32 | 32.18 ± 0.55 | 76.86 ± 0.74 |
| JMI prototype | 88.01 ± 0.48 | 57.22 ± 0.56 | 32.23 ± 0.52 | 73.41 ± 0.36 |
| SimCLR | 87.24 ± 0.37 | 55.32 ± 0.46 | 33.79 ± 0.31 | 75.31 ± 0.76 |
| Barlow Twins | 85.56 ± 0.71 | 51.91 ± 0.49 | 30.26 ± 0.12 | 78.96 ± 0.30 |
| VICReg | 85.49 ± 1.03 | 54.00 ± 0.34 | 32.03 ± 0.32 | 78.86 ± 0.23 |
Table 1: Linear probing accuracy (%) on four datasets. Mean ± std over 3 runs.
- Monotonic MI Growth: Both paradigms demonstrate consistent mutual information increase during training
- Competitive Performance: Canonical forms achieve performance comparable to established methods
- Theoretical Alignment: Empirical behavior matches theoretical predictions
To reproduce the main results from our paper, run each of the scripts in the scripts/ directory.
The repository is organized as follows:

```
.
├── main.py                   # Entry point for training
├── README.md
├── requirements.txt
├── assets/                   # Figures, logos, and result plots for README/paper
├── controlled_experiments/   # Experiments using smaller networks for the Gaussian dataset
├── scripts/                  # Training scripts for all datasets and methods
│   ├── CIFAR10/
│   ├── CIFAR100/
│   ├── TinyImageNet/
│   └── ImageNet100/
└── src/
    ├── trainer.py            # Core training loop
    ├── evaluation.py         # Linear probing and metrics
    ├── models/               # SDMI, JMI, and baseline implementations
    └── utils/                # Losses, data loaders, checkpointing, logging
```
If you use any part of this repository in your research, please cite the associated paper with the following BibTeX entry:

```bibtex
@article{sabby2025ssl,
  title={Self-Supervised Representation Learning as Mutual Information Maximization},
  author={Sabby, Akhlaqur Rahman and Sui, Yi and Wu, Tongzi and Cresswell, Jesse C and Wu, Ga},
  journal={arXiv:2510.01345},
  year={2025}
}
```






