- Here we collect the works that may be useful for writing our paper.
- We group these works by topic to structure the review.
> **Note:** This review table will be updated; it is not a final version.
| Topic | Title | Year | Authors | Paper | Code | Summary |
|---|---|---|---|---|---|---|
| Muon | Old Optimizer, New Norm: An Anthology | 2024 | Jeremy Bernstein | Paper | - | The paper shows that several popular optimizers reduce to basic steepest-descent methods under different norms once EMA and accumulation are removed: Adam becomes sign descent, Shampoo becomes spectral (SVD-based) descent, and Prodigy becomes sign descent with an adaptive step-size rule. The main point is that the choice of norm, i.e. the geometry placed on the parameters, has a large impact on how the optimizer behaves. |
| Muon | Muon: An optimizer for hidden layers in neural networks | 2024 | Keller Jordan | Blog | - | In this post, the author explains that Muon is closely related to Shampoo: with Shampoo's preconditioner accumulation removed, its update amounts to an orthogonalized gradient. Muon achieves a similar result more efficiently by orthogonalizing the momentum update with a fast Newton-Schulz procedure instead of costly matrix decompositions. |
| Muon | A Note on the Convergence of Muon | 2025 | Jiaxiang Li | Paper | - | This paper studies how the Muon optimizer, which was designed for LLM pretraining, converges. It connects Muon to a specific steepest-descent method, where the update direction is found by minimizing a quadratic approximation with a spectral-norm constraint. The paper presents convergence results for two versions of Muon and briefly explains what these results mean for when and how Muon can be used in practice. |
| Muon | Deriving Muon | 2025 | Jeremy Bernstein | Blog | - | The main idea behind Muon is simple: for dense linear layers, update the weights in a way that bounds how much the layer's outputs can change, instead of just following the raw gradient. This leads to an update direction obtained by stripping the scale information from the gradient, i.e. replacing it with its nearest semi-orthogonal matrix. The post also shows how to compute this direction efficiently with a fast Newton–Schulz iteration, avoiding a costly decomposition. |
| Muon | Muon is Scalable for LLM Training | 2025 | Jingyuan Liu | Paper | - | This paper looks at how to scale the Muon optimizer, which is based on matrix orthogonalization, to large language models. The authors find that two changes let Muon work well at scale without much tuning: adding weight decay, and adjusting the update scale per parameter so updates stay consistent across different matrix shapes. With these changes, Muon can be dropped into large-scale training easily. Experiments show that Muon is about twice as computationally efficient as AdamW for compute-optimal training. The authors also present Moonlight, a Mixture-of-Experts model with 3B activated / 16B total parameters trained with Muon on 5.7 trillion tokens, which they report improves the performance-versus-FLOPs Pareto frontier. They have released a distributed, memory-efficient implementation of Muon and model checkpoints to support further research. |
| Orthogonalization | An Iterative Algorithm for Computing the Best Estimate of an Orthogonal Matrix | 1971 | Åke Björck | Paper | - | This classic paper (Björck, 1971) introduces an iterative method for finding the orthogonal matrix closest, in the least-squares (Frobenius-norm) sense, to a given, usually non-orthogonal, matrix. The algorithm gradually adjusts a nearly orthogonal matrix until it becomes exactly orthogonal, converging to the orthogonal factor of the matrix's polar decomposition. The paper also examines when the iteration converges and how quickly. |
| Orthogonalization | A Schur-Newton Method for the Matrix pth Root and its Inverse | 2006 | Guo, Chun-Hua and Higham, Nicholas J. | Paper | - | This paper (Guo & Higham, 2006) introduces practical algorithms that reliably compute the principal matrix pth root and its inverse. It analyzes a Newton iteration that uses only matrix multiplication to find the inverse pth root, characterizes the region where this method converges quickly, and explains how to choose the best scaling, especially when the eigenvalues are real and positive. The authors point out that the basic inverse Newton method can be unstable, so they propose a more stable coupled iteration, which also yields a new method for the pth root itself. For general matrices, they present a hybrid Schur–Newton algorithm: first perform a Schur decomposition, then take repeated square roots to reach a fast-converging regime, and finally apply the coupled Newton iteration. This approach often beats purely Schur-based methods, especially when p is large and not highly composite. |
| Orthogonalization | Some Iterative Methods for Improving Orthonormality | 1970 | Zdislav Kovarik | Paper | - | The paper (Kovářík, 1970) introduces straightforward iterative methods for improving orthonormality. Starting from linearly independent vectors, or a matrix with nearly orthonormal columns, the method repeatedly applies a matrix correction to drive the columns toward exact orthonormality. A key result is an iteration that converges quadratically given a sufficiently good start, though it requires solving with, or inverting, a symmetric positive definite matrix at each step. Later research explores polynomial, inversion-free versions of this approach. |
| Muon | Stochastic Spectral Descent for Discrete Graphical Models | 2016 | David Carlson | Paper | - | This paper introduces Stochastic Spectral Descent (SSD), an optimizer that needs minimal tuning to train discrete probabilistic graphical models. SSD works with both undirected models, such as RBMs and MRFs, and directed models. Unlike standard SGD, which assumes Euclidean geometry, SSD relies on a non-Euclidean geometry based on the spectral (Schatten-∞) norm, motivated by majorization-minimization bounds. This adapts update directions to the matrix structure of the model parameters, while keeping the extra computation low relative to the cost of estimating gradients with methods like MCMC or contrastive divergence. The authors give convergence conditions and present experiments showing that SSD trains models much faster, often requiring up to ten times fewer iterations, and achieves better predictive performance than SGD-based methods. |
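
Several entries above (the Muon posts, Björck 1971, Kovářík 1970) revolve around the same primitive: iteratively pulling a matrix toward its nearest orthogonal matrix, i.e. the orthogonal factor of its polar decomposition. A minimal sketch of the cubic Newton-Schulz variant, assuming NumPy; the function name, normalization, and step count are illustrative choices, not taken from any of the papers (Muon's released code uses a tuned quintic polynomial with far fewer steps):

```python
import numpy as np

def newton_schulz_orthogonalize(g, steps=15):
    """Approximate the orthogonal polar factor of g (the U @ Vt from its SVD)
    via the cubic Newton-Schulz iteration  X <- 1.5*X - 0.5*X @ X.T @ X.

    Dividing by the Frobenius norm (an upper bound on the spectral norm)
    puts every singular value in (0, 1], inside the iteration's
    convergence region (0, sqrt(3)), for any full-rank input."""
    if g.shape[0] > g.shape[1]:
        # Work in the wide orientation so x @ x.T is the smaller Gram matrix.
        return newton_schulz_orthogonalize(g.T, steps).T
    x = g / (np.linalg.norm(g) + 1e-12)
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x
```

Each iteration maps every singular value s to 1.5s - 0.5s³, which has an attracting fixed point at 1, so all singular values are driven to 1 while the singular vectors are untouched; that is exactly the "remove the scale, keep the direction" operation the Muon posts describe, computed with matrix multiplications only.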
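
The Moonlight entry describes Muon's practical recipe: momentum accumulation, orthogonalization of the momentum buffer, weight decay, and a shape-dependent rescaling. A hypothetical single-step sketch, assuming NumPy; the names `orthogonalize` and `muon_step`, the `sqrt(max(rows, cols))` scale factor, and all hyperparameter defaults are illustrative assumptions, not the released implementation:

```python
import numpy as np

def orthogonalize(m, steps=20):
    # Cubic Newton-Schulz iteration toward the orthogonal polar factor of m
    # (an assumed stand-in for Muon's tuned quintic iteration).
    x = m / (np.linalg.norm(m) + 1e-12)
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x

def muon_step(w, grad, buf, lr=0.02, beta=0.95, weight_decay=0.0):
    """One Muon-style update on a 2-D weight matrix (a sketch, not the paper's code)."""
    buf = beta * buf + grad                # momentum accumulation
    update = orthogonalize(buf)            # orthogonalized momentum direction
    scale = max(w.shape) ** 0.5            # assumed shape-aware rescaling
    w = (1.0 - lr * weight_decay) * w - lr * scale * update
    return w, buf
```

The shape-aware `scale` illustrates the idea reported in the scaling paper, keeping the per-entry update magnitude consistent across matrices of different dimensions; the exact constant used in Moonlight may differ.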