- Here we collect the works that may be useful for writing our paper.
- We group these works by topic to structure the review.
> **Note:** This review table will be updated; it is not a final version.
| Topic | Title | Year | Authors | Paper | Code | Summary |
|---|---|---|---|---|---|---|
| Muon | Old Optimizer, New Norm: An Anthology | 2024 | Jeremy Bernstein | Paper | - | The paper shows that several popular optimizers reduce to basic steepest-descent methods under different norms once EMA and accumulation are removed: Adam becomes sign descent, Shampoo becomes spectral (SVD-based) descent, and Prodigy becomes sign descent with an adaptive step-size rule. The main point is that the choice of norm, i.e. the geometry placed on the parameters, has a large impact on how the optimizer behaves. |
| Muon | Muon: An optimizer for hidden layers in neural networks | 2024 | Keller Jordan | Blog | - | In this post, the author explains that Muon is closely related to Shampoo: with Shampoo's preconditioner accumulation removed, its update amounts to an orthogonalized gradient. Muon achieves a similar result more efficiently by orthogonalizing the momentum update with a fast Newton-Schulz procedure instead of costly matrix decompositions. |
| Muon | A Note on the Convergence of Muon | 2025 | Jiaxiang Li | Paper | - | This paper studies how the Muon optimizer, which was designed for LLM pretraining, converges. It connects Muon to a specific steepest-descent method, where the update direction is found by minimizing a quadratic approximation with a spectral-norm constraint. The paper presents convergence results for two versions of Muon and briefly explains what these results mean for when and how Muon can be used in practice. |
| Muon | Deriving Muon | 2025 | Jeremy Bernstein | Blog | - | The main idea behind Muon is simple: for dense linear layers, update the weights in a way that bounds how much the layer's outputs can change, instead of just following the raw gradient. This leads to an update direction obtained by stripping the scale information from the gradient, i.e. replacing it with its nearest semi-orthogonal matrix. The post also shows how to compute this direction efficiently with a fast Newton–Schulz iteration, avoiding a costly decomposition. |
| Muon | Muon is Scalable for LLM Training | 2025 | Jingyuan Liu | Paper | - | This paper looks at how to scale the Muon optimizer, which is based on matrix orthogonalization, to large language models. The authors find that two changes let Muon work well at scale without much tuning: adding weight decay, and adjusting the update scale per parameter so updates stay consistent across different matrix shapes. With these changes, Muon can be dropped into large-scale training easily. Experiments show that Muon is about twice as computationally efficient as AdamW for compute-optimal training. The authors also present Moonlight, a Mixture-of-Experts model with 3B activated / 16B total parameters trained with Muon on 5.7 trillion tokens, which they report improves the performance-versus-FLOPs Pareto frontier. They have released a distributed, memory-efficient implementation of Muon and model checkpoints to support further research. |
| Orthogonalization | An Iterative Algorithm for Computing the Best Estimate of an Orthogonal Matrix | 1971 | Åke Björck | Paper | - | This classic paper (Björck, 1971) introduces an iterative method for finding the orthogonal matrix closest, in the least-squares (Frobenius-norm) sense, to a given, usually non-orthogonal, matrix. The algorithm gradually adjusts a nearly orthogonal matrix until it becomes exactly orthogonal, converging to the orthogonal factor of the matrix's polar decomposition. The paper also examines when the iteration converges and how quickly. |
| Orthogonalization | A Schur-Newton Method for the Matrix pth Root and its Inverse | 2006 | Guo, Chun-Hua and Higham, Nicholas J. | Paper | - | This paper (Guo & Higham, 2006) introduces practical algorithms that reliably compute the principal matrix pth root and its inverse. It analyzes a Newton iteration that uses only matrix multiplication to find the inverse pth root, characterizes the region where this method converges quickly, and explains how to choose the best scaling, especially when the eigenvalues are real and positive. The authors point out that the basic inverse Newton method can be unstable, so they propose a more stable coupled iteration, which also yields a new method for the pth root itself. For general matrices, they present a hybrid Schur–Newton algorithm: first perform a Schur decomposition, then take repeated square roots to reach a fast-converging regime, and finally apply the coupled Newton iteration. This approach often beats purely Schur-based methods, especially when p is large and not highly composite. |
| Orthogonalization | Some Iterative Methods for Improving Orthonormality | 1970 | Zdislav Kovarik | Paper | - | The paper (Kovářík, 1970) introduces straightforward iterative methods for improving orthonormality. Starting from linearly independent vectors, or a matrix with nearly orthonormal columns, the method repeatedly applies a matrix correction to drive the columns toward exact orthonormality. A key result is an iteration that converges quadratically given a sufficiently good start, though it requires solving with, or inverting, a symmetric positive definite matrix at each step. Later research explores polynomial, inversion-free versions of this approach. |
| Muon | Stochastic Spectral Descent for Discrete Graphical Models | 2016 | David Carlson | Paper | - | This paper introduces Stochastic Spectral Descent (SSD), an optimizer that needs minimal tuning to train discrete probabilistic graphical models. SSD works with both undirected models, such as RBMs and MRFs, and directed models. Unlike standard SGD, which assumes Euclidean geometry, SSD relies on a non-Euclidean geometry based on the spectral (Schatten-∞) norm, motivated by majorization-minimization bounds. This adapts update directions to the matrix structure of the model parameters, while keeping the extra computation low relative to the cost of estimating gradients with methods like MCMC or contrastive divergence. The authors give convergence conditions and present experiments showing that SSD trains models much faster, often requiring up to ten times fewer iterations, and achieves better predictive performance than SGD-based methods. |
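
Several entries above (the Muon posts, Björck 1971, Kovářík 1970) revolve around the same primitive: iteratively pulling a matrix toward its nearest orthogonal matrix, i.e. the orthogonal factor of its polar decomposition. A minimal sketch of the cubic Newton-Schulz variant, assuming NumPy; the function name, normalization, and step count are illustrative choices, not taken from any of the papers (Muon's released code uses a tuned quintic polynomial with far fewer steps):

```python
import numpy as np

def newton_schulz_orthogonalize(g, steps=15):
    """Approximate the orthogonal polar factor of g (the U @ Vt from its SVD)
    via the cubic Newton-Schulz iteration  X <- 1.5*X - 0.5*X @ X.T @ X.

    Dividing by the Frobenius norm (an upper bound on the spectral norm)
    puts every singular value in (0, 1], inside the iteration's
    convergence region (0, sqrt(3)), for any full-rank input."""
    if g.shape[0] > g.shape[1]:
        # Work in the wide orientation so x @ x.T is the smaller Gram matrix.
        return newton_schulz_orthogonalize(g.T, steps).T
    x = g / (np.linalg.norm(g) + 1e-12)
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x
```

Each iteration maps every singular value s to 1.5s - 0.5s³, which has an attracting fixed point at 1, so all singular values are driven to 1 while the singular vectors are untouched; that is exactly the "remove the scale, keep the direction" operation the Muon posts describe, computed with matrix multiplications only.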
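
The Moonlight entry describes Muon's practical recipe: momentum accumulation, orthogonalization of the momentum buffer, weight decay, and a shape-dependent rescaling. A hypothetical single-step sketch, assuming NumPy; the names `orthogonalize` and `muon_step`, the `sqrt(max(rows, cols))` scale factor, and all hyperparameter defaults are illustrative assumptions, not the released implementation:

```python
import numpy as np

def orthogonalize(m, steps=20):
    # Cubic Newton-Schulz iteration toward the orthogonal polar factor of m
    # (an assumed stand-in for Muon's tuned quintic iteration).
    x = m / (np.linalg.norm(m) + 1e-12)
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x

def muon_step(w, grad, buf, lr=0.02, beta=0.95, weight_decay=0.0):
    """One Muon-style update on a 2-D weight matrix (a sketch, not the paper's code)."""
    buf = beta * buf + grad                # momentum accumulation
    update = orthogonalize(buf)            # orthogonalized momentum direction
    scale = max(w.shape) ** 0.5            # assumed shape-aware rescaling
    w = (1.0 - lr * weight_decay) * w - lr * scale * update
    return w, buf
```

The shape-aware `scale` illustrates the idea reported in the scaling paper, keeping the per-entry update magnitude consistent across matrices of different dimensions; the exact constant used in Moonlight may differ.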