Optimizer taxonomy + benchmark

Scaling for Scaling

Taxonomy, Geometry, and Benchmarking of Modern Optimizers

A unified meta-pipeline, LMO-grounded geometry, and cross-domain benchmark for choosing modern optimizers under compute, memory, stability, robustness, and generalization constraints.

Siyuan Li1,3,†, Jiabao Pan1,2,†, Yumou Liu1,4,†, Zhuoli Ouyang7,†, Xin Jin3, Xinglong Xu5, Jingxuan Wei5, Shengye Pang2,*, Jintao Chen6, Xuanhe Zhou4, Conghui He1, Cheng Tan1,*

1Shanghai Artificial Intelligence Laboratory · 2Shanghai University · 3Westlake University · 4Shanghai Jiao Tong University · 5UCAS · 6Zhejiang University · 7Southern University of Science and Technology

Equal contribution. *Corresponding authors.

Overview teaser showing the Scaling for Scaling taxonomy and benchmark workflow
Optimizer design has moved from a single AdamW-style default into a diversified mechanism space. This overview is not a leaderboard; it frames why the paper needs a shared taxonomy, geometry, and benchmark protocol.
100+optimizer methods organized
24benchmarked optimizers
60M-1Blanguage model scales
4 arch.standard and linear-attention models
Framework

One optimizer step as a five-stage transformation.

The paper treats optimizers as structured transformations across parameter routing, gradient transformation, state evolution, update reconstruction, and finalization. Most methods do meaningful work in only one or two stages, making comparison and composition easier.

Universal meta-pipeline for a modern optimizer update
Universal meta-pipeline for one optimizer step. The figure is shown full-width because the stage labels and tensor routes are the central visual evidence in this section.
S0

Signal acquisition

Receives first-order gradients, variance-reduced signals, or curvature-augmented estimates from the training system.

S1

Parameter routing

Partitions tensors by shape and module type so matrices, vectors, heads, or layers can follow different update routes.

S2

Gradient transform

Applies the mechanism that changes direction space: identity maps, sign maps, spectral orthogonalization, Kronecker transforms, or low-rank projection.

S3

State evolution

Maintains moment, curvature, factorized, quantized, or variance-reduced states before a direction is formed.

S4

Reconstruction

Returns transformed or compressed directions to the full parameter space through inverse rotations, projections, or approximations.

S5

Finalization

Writes the update with learning rate, weight decay, clipping, trust ratios, masks, or sharpness-aware corrections.

Bridge to the LMO / four-axis geometry

The meta-pipeline locates where an optimizer intervenes. The geometric view explains what direction that intervention creates: state estimation happens before geometry, and the LMO or preconditioner consumes the estimated state to form the update.

Axis I

Update domain

Where the update lives: full parameter space, matrix space, rotated coordinates, or a low-rank subspace.

Axis II

State estimator

How momentum, second moments, Gram/Hessian proxies, variance reduction, and projection state are produced.

Axis III

Geometry operator

How the state becomes a direction through an LMO constraint set or a Hessian-style preconditioner.

Axis IV

Finalization wrapper

How learning rate, decay, projection-back, routing fallbacks, refresh schedules, and clipping commit the direction.

Representative optimizer families viewed through the universal meta-pipeline. Active stages carry the defining mechanism; remaining stages reduce to identity maps or standard defaults.
Method Active stages Core mechanism Family Pipeline constraint
AdamW S3, S5 First/second-order moment EMAs; decoupled weight decay T1 S2, S4 identity; S1 uniform element-wise
Muon S1, S2 Matrix routing; Newton-Schulz spectral orthogonalization T2.1 S4 trivial (dimension-preserving); S3 standard momentum
GaLore S1-S4 Low-rank projection; subspace Adam state; inverse projection T2.3 Dual S2/S4 via basis P_t; S5 standard
Lion S2, S3 Sign discretization of momentum-interpolated gradient T3 S4 identity; no second-order state; fixed-magnitude update
SAM S0, S5 Perturbation-induced gradient; neighborhood-regularized write-back T5.1 S1-S4 execute element-wise defaults
Dual taxonomy

Method families meet effect objectives.

The page preserves both taxonomy axes: mechanism families T1-T5 and effect objectives O1-O6. This is the map used to interpret benchmark tradeoffs rather than a leaderboard-only view.

T1

Element-wise adaptive moment

Adam-style scalar control and moment estimation.

T2

Matrix-structured methods

Spectral, Kronecker, and subspace update directions.

T3

Discretized directions

Sign-like and quantized update geometry.

T4/T5

Compression and geometry

State reduction, curvature, perturbation, and trust-region controls.

Dimension-B effect objectives and measurement sources.
ID Name Definition Data source Extra cost Typical outputs
O1 Convergence Efficiency Loss reduction or target-quality attainment under fixed step, token, wall-clock, or compute budget Train/validation loss logs None Loss at step T; steps-to-threshold; token efficiency
O2 Step cost Extra optimizer computation, synchronization, or forward/backward cost relative to a baseline Timers, profiler, analytic FLOP model Recorded during training Step time; optimizer FLOPs; extra backward count
O3 Memory Memory from optimizer states, gradient buffers, temporary factors, projection bases, and quantization buffers Memory profiler, analytic byte model Recorded during training Peak memory; persistent state bytes; temporary buffer bytes
O4 Stability Robustness to spikes, divergence, overflow, and gradient fluctuations Loss and gradient-norm time series Offline post-processing Spike frequency; gradient CV; divergence rate; completion rate
O5 Hparam robustness Sensitivity across learning rate, schedule, weight decay, momentum, batch size, and method-specific knobs Multiple training runs Search or transfer experiments Acceptable LR interval; performance variance; tuning burden
O6 Generalization Quality outside the training objective, from validation to downstream, OOD, or scale-transfer evaluation Validation, downstream, OOD, scale-transfer evaluation Low for validation; high for full evaluation Validation loss; train-val gap; downstream score; OOD retention
Compact cross-matrix between method families and effect objectives. ``++'' denotes a primary target, ``+'' a common secondary target, ``0'' protocol-dependent neutrality, and ``-'' a likely cost.
Family O1 O2 O3 O4 O5 O6
T1 Element-wise adaptive moment and scalar control ++ 0 - + + 0/+
T2 Matrix-level structural methods ++ - -/0 + 0/+ 0/+
T3 Discretized directions +/0 + + +/0 0 0
T4 State compression 0/- 0/+ ++ 0/- 0 0/-
T5 Geometry regularization +/0 - 0/- ++ + ++
Taxonomy and four-axis coverage of the 24 benchmarked optimizers. For each optimizer we list its subclass, its four-axis coordinates (Axes I--IV of Section [ref]), and the meta-pipeline stage(s) with a non-identity operation. Axis III is shown through its two equivalent readings, the LMO constraint geometry and the preconditioner, which are two faces of the same direction operator _t. State estimation (Axis II) absorbs momentum, the second-moment / Hessian proxy H_t, variance reduction (FO/VR/STORM), and any state compression, while the update domain (Axis I) records the space in which the update lives. ``full'' denotes the original coordinate space.
Optimizer Sub. Axis I domain Axis II state Axis III LMO Axis III precond. Axis IV finalize Pipeline stages
T1: Element-wise Adaptive Moment Estimation
AdamW T1.1 full m_t,v_t adapt.\ _ box diag(v_t), 12 LR + dec.\ WD S3/S5
RAdam T1.1 full m_t,v_t rect. adapt.\ _ box diag(v_t), 12 rect.\ warmup + WD S3/S5
NAdam T1.1 full Nesterov m_t,v_t adapt.\ _ box diag(v_t), 12 lookahead + WD S3/S5
AdaBelief T1.1 full m_t, belief s_t adapt.\ _ box diag(s_t), 12 LR + WD S3/S5
Adan T1.1 full m_t,v_t grad-diff (VR) adapt.\ _ box diag(v_t), 12 LR + WD S3/S5
MARS-AdamW T1.2 full c_t,m_t,v_t (STORM) adapt.\ _ box diag(v_t), 12 LR + dec.\ WD S0/S3
Prodigy T1.3 full m_t,v_t,d_t adapt.\ _ box diag(v_t), 12 auto-LR + WD S3/S5
T2: Matrix-Structured Methods
Muon T2.1 matrix (U,V) M_t, H_t=M_tM_t^ spectral polar UV^ H_t, 12 LR + matrix routing S1/S2
RMNP T2.1 matrix (row-wise) M_t, H_t=diag(M_tM_t^) row normalization H_t, 12 LR + matrix routing S1/S2
Shampoo T2.2 Kron (Q_L,Q_R) m_t, Kron L_t,R_t metric-ball steepest Kron-Fisher, 14 LR + damping S1/S2/S3/S4
SOAP T2.2 Kron eigen (Q_L,Q_R) m_t,v_t, Kron box in eigenbasis rotated diag, 12 LR + WD S1/S2/S3/S4
MARS-Shampoo T2.2 Kron (Q_L,Q_R) c_t,m_t, Kron (STORM) metric-ball (VR) rotated diag, 12 LR + damping S0/S1/S2/S3/S4
GaLore T2.3 subspace (Q_L,Q_R) m_t, v_t (proj.) projected adapt.\ box diag( v_t), 12 proj-back + WD S2/S3/S4
T3: Discretized & Quantized Directions
Lion T3 full sign-mom.\ EMAs fixed _ box sign self-norm (I,0) LR + WD S2/S3
MARS-Lion T3 full c_t, sign-mom (STORM) fixed _ box (VR) sign self-norm (I,0) LR + WD S0/S2/S3
T4: State-Compressed Optimization
Adafactor T4.1 full, factored row/col 2nd-mom factors adapt.\ coord box factored diag, 12 LR + factored upd. S3
CAME T4.1 full, factored factors + confidence adapt.\ coord box factored diag + conf. LR + conf.\ corr. S3
Adam8bit T4.2 full, INT8 state quant.\ m_t,v_t adapt.\ _ box diag(v_t) INT8, 12 dequant + WD S3
Adam-mini T4.3 block-struct. block-shared m_t,v_t block adapt.\ box block-mean diag, 12 LR + WD S1/S3
APOLLO T4.3 rand-proj state projected estimator projected adapt.\ box proj.\ diag, 12 proj-back + LR S1/S3
Conda T4.3 block-struct. block-shared m_t,v_t block adapt.\ box block-mean diag, 12 LR + WD S1/S3
T5: Curvature-Aware & Geometry-Regularized
Sophia T5.2 full m_t, Hutch.\ HVP clipped local geom. Hutchinson, 1 LR + WD S3/S5
AdamP T5.3 full m_t,v_t adapt.\ _ box diag(v_t), 12 radial proj. + WD S5
LAMB T5.4 full m_t,v_t adapt.\ _ box diag(v_t), 12 trust-ratio + WD S1/S5
Benchmark evidence

Cross-scenario evidence, not one-setting ranking.

The webpage foregrounds the evidence used by the paper's optimizer-choice argument: Stage-1 C4 quality/runtime/memory trade-offs, Stage-2 FineWeb-Edu long-context transfer, and auxiliary O4/O5 stability and robustness probes.

Stage-1 Pareto frontiers for 1B models comparing perplexity with runtime and optimizer memory
Stage-1 1B Pareto frontiers: lower-left is better for PPL versus runtime and optimizer-state memory.
Quality frontier

APOLLO, Muon, MARS-Shampoo, and RMNP are competitive on short-context C4, but they occupy different cost regions.

Runtime frontier

Lion and AdamW are cheap; RMNP is the practical matrix-structured exception close to the efficient frontier.

Memory frontier

Adafactor, APOLLO, and GaLore reduce optimizer state, but memory wins do not automatically transfer to harder regimes.

Transfer frontier

FineWeb-Edu 32k turns optimizer choice into a cross-architecture stability question rather than an absolute PPL ranking.

Optimizer-level heatmap of C4 perplexity runtime and memory for 24 optimizers
Stage-1 optimizer-level heatmap of C4 PPL, runtime, and memory. It shows why the short-context screen is multi-objective rather than a flat leaderboard.
Rank stability across optimizers and architectures in FineWeb-Edu long-context scenarios
Stage-2 cross-scenario rank stability under FineWeb-Edu 32k. SOAP transfers most consistently; APOLLO's short-context strength does not survive this setting.
Findings

No optimizer dominates every objective frontier.

Structured-matrix methods transfer stably but can be expensive; state-compressed methods can win memory under short contexts but degrade as input complexity grows; rankings cross systematically across domains.

SOAP

Strongest long-context cross-scenario quality, but its runtime and optimizer-state memory make it a quality ceiling, not a default.

RMNP / Muon

Matrix geometry is useful but architecture-aware: RMNP is the balanced option, while Muon is mechanistically interpretable.

APOLLO / Adafactor

State compression is attractive under memory pressure, but APOLLO's short-context win collapses under long context.

AdamW / Lion

AdamW remains the stable reference anchor; Lion is cheap for exploration but carries an expected quality gap.

Family-level objective summary over six optimizer objectives
Family-level objective summary over O1-O6, connecting measured quality, runtime, memory, stability, robustness, and generalization back to the optimizer-family taxonomy.
Gradient-norm stability heatmap across optimizers and architectures
Auxiliary O4 stability analysis from gradient-norm dynamics: smooth training, final PPL, and transfer are related but distinct objectives.
Learning-rate perturbation robustness panels for optimizer sensitivity
O5 learning-rate perturbation robustness: tuned quality can hide sharp sensitivity when the learning rate is misspecified.
Mechanistic ablation of Muon on C4 350M showing core operations gain operations and ordering constraints
Mechanistic ablation of Muon on C4 350M. The full-width placement keeps the core recovery, gain design, and operator-order blocks readable.
Cross-scale/architecture validation of Muon's gain operations. Standard Muon, symmetric two-way LR scaling, post-NS Nesterov, and their combination; lower PPL is better. Gains stack on the standard Transformer but not on Gated DeltaNet.
Scenario Standard Muon Symmetric LR Scaling Post-NS Nesterov Both combined Best config.
Standard Transformer: gains are stackable
C4-LLaMA3, 350M 16.60 16.52 16.57 16.51 Both combined
C4-LLaMA3, 1B 13.72 13.64 13.64 13.58 Both combined
Linear attention: stacking effect disappears
FineWeb-Edu-32k, GDN-340M 24.26 24.02 24.12 24.12 Symmetric LR Scaling
Decision guide

Choose the optimizer by the binding constraint.

The paper's conclusion is not a single global winner. It is a constraint-matched decision rule: start from AdamW, then move only when quality, runtime, memory, stability, or cross-scenario transfer demands a different mechanism.

Reference anchor

AdamW

Use as the default baseline for general-purpose LLM pretraining: stable, inexpensive, interpretable, and the reference every other optimizer should beat.

Quality-efficiency balance

RMNP

Best practical alternative when a matrix-structured method is needed without the prohibitive runtime and memory cost of heavier preconditioners.

Quality ceiling

SOAP

Strongest long-context cross-architecture quality profile, useful when final quality dominates and compute/memory are not the bottleneck.

Mechanism analysis

Muon

Strong and transparent matrix-structured optimizer, but its behavior is topology-dependent and should be validated on the target architecture.

Memory pressure

Adafactor / APOLLO

Adafactor is the safer low-memory baseline. APOLLO is high reward but high risk: strong at short context, weak under long-context transfer.

Exploration budget

Lion

Cheap exploratory option with low per-step overhead, but the paper treats the quality gap as expected rather than incidental.

Tiered classification of the benchmarked optimizers.
Tier Optimizers
Tier I Muon, RMNP, AdamW
Tier II MARS-Lion, MARS-Shampoo, APOLLO, Conda, AdamP, MARS-AdamW, SOAP, Adan, Lion
Tier III RAdam, NAdam, Prodigy, AdaBelief, GaLore, Shampoo, Adam8bit, CAME, Adafactor, Adam-mini, LAMB, Sophia
Full evidence tables

Central benchmark tables are preserved in full.

Wide paper tables are rendered as responsive HTML tables with all source rows and columns. On narrow screens, rows turn into labeled cards instead of requiring horizontal scroll.

Stage-1 screening on C4 (LLaMA-3, seq.\ 256). C4 validation PPL, optimizer-state memory (Mem), and per-step runtime (T) at four scales; lower is better. Grouped by family, sorted by 1B PPL.
Optimizer Venue 60M PPL 60M Mem GB 60M T ms 130M PPL 130M Mem GB 130M T ms 350M PPL 350M Mem GB 350M T ms 1B PPL 1B Mem GB 1B T ms
Element-wise Adaptive Moment Estimation
Adan TPAMI'23 30.25 0.433 2.32 22.84 1.000 4.72 17.29 2.742 12.06 14.35 9.977 39.67
RAdam ICLR'20 30.12 0.217 1.53 23.22 0.500 3.07 17.34 1.371 7.64 14.47 4.989 23.79
AdamW ICLR'19 30.08 0.217 1.14 23.18 0.500 2.31 17.32 1.371 5.97 14.48 4.989 18.62
NAdam ICLR'18 33.72 0.217 3.45 24.51 0.500 4.93 17.90 1.371 9.96 14.67 4.989 20.91
MARS-AdamW ICML'25 30.01 0.325 7.62 22.86 0.750 11.05 16.95 2.057 22.12 14.90 7.483 34.70
Prodigy ICML'23 33.44 0.433 8.36 24.13 1.000 12.29 18.27 2.742 24.30 15.61 9.977 36.78
AdaBelief NeurIPS'20 30.08 0.433 5.76 23.45 1.000 8.55 17.61 2.742 19.10 16.79 9.977 55.48
Matrix-Structured Methods
MARS-Shampoo ICML'25 30.03 0.325 26.27 22.56 0.750 37.94 16.82 2.057 78.71 13.72 7.483 513.7
Muon arXiv'25 28.26 0.109 21.01 21.81 0.250 30.48 16.61 0.686 61.66 13.73 2.495 379.0
RMNP ICML'26 29.88 0.109 3.26 22.54 0.250 4.63 16.85 0.686 9.32 13.87 2.495 16.94
SOAP arXiv'24 29.47 0.731 50.58 22.67 2.214 110.4 17.14 7.465 302.5 14.04 29.299 1371.5
GaLore ICML'24 34.56 0.062 4.21 25.32 0.199 5.88 19.18 0.426 11.85 14.29 0.790 15.29
Shampoo arXiv'18 30.22 0.217 22.36 22.56 0.500 33.27 17.03 1.371 66.05 14.29 4.989 389.4
Discretized & Quantized Directions
MARS-Lion ICML'25 32.41 0.325 5.72 25.68 0.750 8.49 18.78 2.057 17.11 15.73 7.483 24.77
Lion arXiv'23 35.94 0.109 2.07 25.56 0.250 3.01 19.30 0.686 5.80 17.02 2.494 12.48
State-Compressed Optimization
APOLLO MLSys'25 30.86 0.062 8.62 22.74 0.149 12.65 16.43 0.426 26.21 13.53 0.790 28.65
Conda arXiv'25 28.65 0.245 4.88 21.91 0.595 7.11 16.45 1.703 13.90 14.25 6.317 62.33
Adam8bit ICLR'22 30.46 0.110 4.11 23.30 0.254 7.27 17.67 0.697 16.89 14.53 2.534 42.38
CAME ACL'23 31.40 0.218 14.99 23.79 0.502 21.76 17.60 1.376 44.89 14.53 4.997 87.46
Adafactor ICML'18 30.00 0.001 9.90 22.94 0.002 14.63 17.85 0.003 29.70 14.92 0.004 56.46
Adam-mini ICLR'25 30.50 0.109 5.68 23.62 0.251 8.31 18.12 0.686 16.68 15.51 2.495 20.81
Curvature-Aware & Geometry-Regularized
AdamP ICLR'21 30.21 0.217 12.82 23.07 0.500 19.13 17.39 1.371 39.98 14.57 4.989 64.69
LAMB ICLR'20 30.03 0.217 9.14 23.40 0.500 13.17 17.25 1.371 26.62 16.09 4.989 44.18
Sophia arXiv'23 36.27 0.217 3.92 25.76 0.500 5.66 18.86 1.371 11.06 16.45 4.989 20.05

Full Stage-1 C4 screening table: all optimizer families, four scales, and PPL/memory/runtime values are retained.

Stage-2 cross-architecture generalization (FineWeb-Edu, 32k). WikiText test PPL at 340M and 1B across four architectures. As absolute PPL is not comparable across architectures, the last columns give each optimizer's mean rank and range over the eight scenarios (lower is better). Per-scenario best in green; gray = AdamW.
Optimizer Transformer++ 340M Transformer++ 1B Gated DeltaNet 340M Gated DeltaNet 1B DeltaNet 340M DeltaNet 1B GLA 340M GLA 1B Mean rank Rank range
T1: Element-wise Adaptive Moment
MARS-AdamW 24.57 18.94 24.17 20.04 26.79 20.67 28.28 21.89 3.12 [2-5]
AdamW 24.62 18.90 24.47 20.33 27.16 20.66 28.67 22.06 4.62 [2-6]
Adan 25.55 19.41 24.78 20.55 27.28 20.88 29.00 22.51 7.12 [5-9]
T2: Matrix-Structured Methods
SOAP 23.90 18.72 23.85 19.86 26.02 20.38 27.04 20.62 1.12 [1-2]
RMNP 24.37 19.40 23.65 20.26 26.80 21.06 28.60 22.23 4.00 [1-7]
Muon 25.05 19.86 24.34 20.32 27.18 21.18 27.47 21.54 5.25 [2-8]
MARS-Shampoo 26.43 19.74 25.99 24.87 28.26 21.25 29.20 21.53 8.25 [2-11]
T3: Discretized Directions
Lion 26.02 20.26 24.76 20.38 28.20 21.44 29.47 22.40 8.25 [7-10]
MARS-Lion 26.20 21.17 25.24 22.20 28.25 22.72 29.67 23.79 10.00 [9-11]
T4: State-Compressed
Conda 28.30 19.86 26.11 21.07 29.09 21.75 37.38 22.89 10.25 [9-11]
APOLLO 34.08 25.29 30.36 29.29 34.73 25.58 37.75 27.78 12.00 [12-12]
T5: Curvature-Aware
AdamP 24.68 19.04 24.32 20.29 26.77 20.68 28.66 21.86 4.00 [2-5]

Full Stage-2 long-context cross-architecture table with all eight scenario values and rank summaries.

Citation

BibTeX

@article{li2025scaling,
  title = {{Scaling for Scaling: Taxonomy, Geometry, and Benchmarking of Modern Optimizers}},
  author = {Siyuan Li and Jiabao Pan and Yumou Liu and Zhuoli Ouyang and Xin Jin and Xinglong Xu and Jingxuan Wei and Shengye Pang and Jintao Chen and Xuanhe Zhou and Conghui He and Cheng Tan},
	  year = {2025}
	}
Expanded selected paper figure