optim¶
- class olmo_core.optim.OptimConfig(group_overrides=None, compile=False, fixed_fields=('initial_lr',), *, type=None)[source]¶
Bases:
Config, Registrable, Generic[Opt]
Base class for Optimizer configs.
-
group_overrides:
Optional[List[OptimGroupOverride]] = None¶ Use this to pull out groups of parameters into separate param groups with their own options.
-
compile:
bool = False¶ Compile the optimizer step.
Warning
Optimizer step compilation is still in beta and may not work with some optimizers. You may also see unexpected behavior and very poor performance if you turn this feature on in the middle of a run that was previously trained without compiling the optimizer, because the LR is then restored as a float instead of a tensor.
-
fixed_fields:
Tuple[str,...] = ('initial_lr',)¶ These are fields that should not be overridden by the value in a checkpoint after loading optimizer state.
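The effect of fixed_fields can be pictured with a small sketch. This is not the library's loading code: the helper name restore_fixed_fields and the plain-dict param groups are assumptions made for illustration. It only shows the idea that, after checkpoint state is loaded, the fields listed in fixed_fields keep the value from the current configuration.

```python
def restore_fixed_fields(current_groups, loaded_groups, fixed_fields=("initial_lr",)):
    """Return the loaded param groups, keeping fixed fields from the current config.

    Hypothetical helper for illustration only.
    """
    merged = []
    for current, loaded in zip(current_groups, loaded_groups):
        group = dict(loaded)  # start from the values stored in the checkpoint
        for field in fixed_fields:
            if field in current:
                group[field] = current[field]  # the configured value wins
        merged.append(group)
    return merged


configured = [{"lr": 3e-4, "initial_lr": 3e-4, "weight_decay": 0.1}]
from_checkpoint = [{"lr": 1.2e-4, "initial_lr": 1e-3, "weight_decay": 0.1}]
restored = restore_fixed_fields(configured, from_checkpoint)
```

Here the restored group keeps the checkpoint's current lr but takes initial_lr from the new configuration, which is what lets a resumed run change its peak learning rate.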
- class olmo_core.optim.MatrixAwareOptimConfig(group_overrides=None, compile=False, fixed_fields=('initial_lr',), *, type=None)[source]¶
Bases:
OptimConfig, Generic[Opt]
Configuration class for building a matrix-aware optimizer.
- default_group_overrides(model)[source]¶
Default group overrides for matrix-aware optimizers.
- Return type:
- class olmo_core.optim.SkipStepOptimizer(params, defaults, rolling_interval_length=128, sigma_factor=6)[source]¶
Bases:
Optimizer
A SkipStepOptimizer is an optimizer that can skip updates when the loss or gradient norm for a step is above a certain threshold of standard deviations computed over a rolling interval.
Important
When using a SkipStepOptimizer you must always set latest_loss and latest_grad_norm to the current loss and grad norm, respectively, before calling step().
The TransformerTrainModule will automatically set the latest_loss and latest_grad_norm whenever its optimizer is a subclass of SkipStepOptimizer.
Tip
When implementing a SkipStepOptimizer you should be careful to avoid host-device syncs. You can use get_step_factor() within your step() method to do this. See the implementation of SkipStepLion for an example.
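The skipping rule described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the class name RollingSkipRule is hypothetical, the choice to compare the latest value against the window before appending it is an assumption, and the sketch uses Python floats, whereas a real SkipStepOptimizer would keep these statistics as tensors to avoid host-device syncs.

```python
from collections import deque
from statistics import mean, stdev


class RollingSkipRule:
    """Hypothetical sketch of a rolling mean + sigma_factor * std threshold."""

    def __init__(self, rolling_interval_length=128, sigma_factor=6):
        self.sigma_factor = sigma_factor
        # deque(maxlen=...) drops the oldest entry automatically.
        self.losses = deque(maxlen=rolling_interval_length)
        self.grad_norms = deque(maxlen=rolling_interval_length)

    def _within_threshold(self, history, latest):
        if len(history) < 2:
            return True  # not enough history to estimate a standard deviation
        return latest <= mean(history) + self.sigma_factor * stdev(history)

    def step_factor(self, latest_loss, latest_grad_norm):
        """Return 1.0 to apply the update and 0.0 to skip it."""
        ok = self._within_threshold(self.losses, latest_loss) and self._within_threshold(
            self.grad_norms, latest_grad_norm
        )
        self.losses.append(latest_loss)
        self.grad_norms.append(latest_grad_norm)
        return 1.0 if ok else 0.0


rule = RollingSkipRule(rolling_interval_length=8, sigma_factor=3)
history = [(2.0, 1.0), (2.1, 1.1), (1.9, 0.9), (2.0, 1.0), (2.05, 1.05), (10.0, 1.0)]
factors = [rule.step_factor(loss, grad_norm) for loss, grad_norm in history]
```

Returning a multiplicative factor rather than branching is the design choice that makes a sync-free version possible: the update can be scaled by the factor without the host ever inspecting its value.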
- class olmo_core.optim.AdamWConfig(group_overrides=None, compile=False, fixed_fields=('initial_lr',), lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.01, foreach=None, fused=None, *, type=None)[source]¶
Bases:
OptimConfig[AdamW]
Configuration class for building a torch.optim.AdamW optimizer.
- registered_base¶
alias of OptimConfig
- class olmo_core.optim.SkipStepAdamWConfig(group_overrides=None, compile=False, fixed_fields=('initial_lr',), lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.01, dtype=None, foreach=True, step_increment_bugfix=True, rolling_interval_length=128, sigma_factor=6, *, type=None)[source]¶
Bases:
OptimConfig[SkipStepAdamW]
Configuration class for building a SkipStepAdamW optimizer.
-
foreach:
bool = True¶ Whether to use multi-tensor (foreach) kernels for the AdamW update. Faster than the non-foreach version.
-
step_increment_bugfix:
bool = True¶ Whether to fix the step-incrementing bug discovered in SkipStepAdamW.
If this flag is set to False, the step is not incremented, which gives the optimizer an effective lr 2.2x higher than the specified lr and means that no bias correction is applied.
- registered_base¶
alias of OptimConfig
-
rolling_interval_length:
int = 128¶ The length of the rolling interval to use for computing the mean and standard deviation of the loss.
- class olmo_core.optim.SkipStepAdamW(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.01, rolling_interval_length=128, sigma_factor=6, dtype=None, foreach=False, step_increment_bugfix=True)[source]¶
Bases:
SkipStepOptimizer
A “skip step” version of AdamW.
- class olmo_core.optim.AdamConfig(group_overrides=None, compile=False, fixed_fields=('initial_lr',), lr=0.001, betas=(0.9, 0.999), eps=1e-08, foreach=None, fused=None, *, type=None)[source]¶
Bases:
OptimConfig[Adam]
Configuration class for building a torch.optim.Adam optimizer.
- registered_base¶
alias of OptimConfig
- class olmo_core.optim.LionConfig(group_overrides=None, compile=False, fixed_fields=('initial_lr',), lr=0.0001, betas=(0.9, 0.99), weight_decay=0.0, *, type=None)[source]¶
Bases:
OptimConfig[Lion]
Configuration class for building a Lion optimizer.
- registered_base¶
alias of OptimConfig
- class olmo_core.optim.Lion(params, lr=0.0001, betas=(0.9, 0.99), weight_decay=0.0)[source]¶
Bases:
Optimizer
An implementation of the Lion optimizer.
- class olmo_core.optim.MuonConfig(group_overrides=None, compile=False, fixed_fields=('initial_lr',), lr=0.01, mu=0.95, betas=(0.9, 0.95), weight_decay=0.1, cautious_wd=False, nesterov=False, adjust_lr='rms_norm', flatten=False, use_triton=False, *, type=None)[source]¶
Bases:
MatrixAwareOptimConfig
Configuration class for building a Muon optimizer.
Muon internally runs standard SGD-momentum, and then performs an orthogonalization post-processing step, in which each 2D parameter’s update is replaced with the nearest orthogonal matrix.
Muon is only used for hidden weight layers. The input embedding, final output layer, and any internal gains or biases are optimized using AdamW.
Muon supports FSDP and HSDP parallelism strategies. Flattened mesh dimensions (e.g. “dp_ep” and “dp_cp”) can be supported but are currently not implemented.
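The orthogonalization step can be illustrated with the classical cubic Newton-Schulz iteration, X <- 1.5 X - 0.5 (X X^T) X. This is a sketch under stated assumptions, not the optimizer's actual kernel: practical Muon implementations typically use a tuned higher-order polynomial and run on tensors, and the helper names below are hypothetical. The sketch assumes a non-zero, full-rank update.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def orthogonalize(update, steps=30):
    """Approximate the nearest orthogonal matrix to a full-rank 2D update."""
    # Dividing by the Frobenius norm puts every singular value in (0, 1],
    # which is inside the convergence region of the cubic iteration.
    norm = math.sqrt(sum(v * v for row in update for v in row))
    x = [[v / norm for v in row] for row in update]
    for _ in range(steps):
        xxt = matmul(x, transpose(x))  # symmetric Gram matrix X X^T
        xxtx = matmul(xxt, x)
        # Each singular value s is mapped to 1.5 * s - 0.5 * s**3, which tends to 1.
        x = [[1.5 * xv - 0.5 * yv for xv, yv in zip(xr, yr)] for xr, yr in zip(x, xxtx)]
    return x


update = [[2.0, 1.0], [0.0, 1.0]]
q = orthogonalize(update)
gram = matmul(q, transpose(q))  # close to the identity if q is orthogonal
```

Because X X^T is symmetric, only one triangle of it needs to be computed, which is the saving that specialized kernels for this product exploit.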
-
lr:
float = 0.01¶ Base learning rate. For Muon, this will be scaled based on the matrix dimensions. For AdamW, this is the actual learning rate and no additional scaling is done.
-
cautious_wd:
bool = False¶ Whether to apply weight decay only where update and parameter signs align.
-
adjust_lr:
Optional[MuonAdjustLRStrategy] = 'rms_norm'¶ How to adjust the learning rate for Muon updates.
-
flatten:
bool = False¶ Whether to flatten 3D+ tensors to 2D for Muon updates. Use this for convolutional layers.
- registered_base¶
alias of OptimConfig
-
use_triton:
bool = False¶ Whether to use optimized Triton kernels for Newton-Schulz iteration. Because the result of X @ X.T is symmetric, we can avoid computing the upper triangular part of the matrix output. See: https://www.lakernewhouse.com/assets/writing/faster-symmul-with-thunderkittens.pdf
- default_group_overrides(model)[source]¶
Split the model parameters into Adam and Muon groups. Only internal parameters with two or more dimensions are meant to be optimized with Muon.
- Return type:
- build_parallelism_config()[source]¶
Prepare device mesh for Muon optimizer based on the parallelism configuration.
Muon requires a single 1D DeviceMesh for distributed training:
- Single-device: Returns None
- FSDP: Returns the DP mesh (parameter sharding mesh)
- HSDP: Returns the DP shard mesh (the 1D sharded sub-mesh)
Note: TP is not directly supported by Muon. For TP configurations, you may need to handle tensor parallelism separately.
- Return type:
- Returns:
1D DeviceMesh for distributed Muon, or None for single-device.
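For the cautious_wd option of this config, a minimal elementwise sketch of "decay only where update and parameter signs align" might look as follows. The function name and the flat-list representation are assumptions, and which quantity counts as the "update" (and its sign convention) is not specified in this page, so treat this as an illustration of the masking idea only.

```python
def cautious_weight_decay(params, updates, lr, weight_decay):
    """Apply decoupled weight decay only where sign(update) == sign(param).

    Hypothetical helper; real optimizers do this on tensors with a mask.
    """
    decayed = []
    for p, u in zip(params, updates):
        if p * u > 0:  # signs align, so this entry is decayed
            p = p - lr * weight_decay * p
        decayed.append(p)
    return decayed


params = [1.0, -2.0, 3.0, -4.0]
updates = [0.5, 0.5, -0.5, -0.5]
result = cautious_weight_decay(params, updates, lr=0.1, weight_decay=0.1)
```

Only the first and last entries are shrunk here; the two entries whose signs disagree with their update are left untouched.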
- class olmo_core.optim.NorMuonConfig(group_overrides=None, compile=False, fixed_fields=('initial_lr',), lr=0.01, mu=0.95, betas=(0.9, 0.95), weight_decay=0.1, cautious_wd=False, nesterov=False, adjust_lr='rms_norm', flatten=False, use_triton=False, *, type=None)[source]¶
Bases:
MuonConfig
Configuration class for building a NorMuon optimizer.
NorMuon is a variant of Muon that adds neuron-wise adaptive learning rates. See https://arxiv.org/abs/2510.05491 for more details.
- registered_base¶
alias of OptimConfig
- class olmo_core.optim.DionConfig(group_overrides=None, compile=False, fixed_fields=('initial_lr',), lr=0.01, mu=0.95, betas=(0.9, 0.95), weight_decay=0.1, rank_fraction=1.0, rank_multiple_of=1, *, type=None)[source]¶
Bases:
MatrixAwareOptimConfig
Configuration class for building a Dion optimizer.
Dion is a Muon-like optimizer designed to scale to DP-replicated, DP-sharded, and TP-sharded models. See https://arxiv.org/abs/2504.05295 for more details.
Dion supports FSDP, HSDP, and TP parallelism strategies. Flattened mesh dimensions (e.g. “dp_ep” and “dp_cp”) can be supported but are currently not implemented.
-
lr:
float = 0.01¶ Base learning rate. For Dion, this will be scaled based on the matrix dimensions. For AdamW, this is the actual learning rate and no additional scaling is done.
- registered_base¶
alias of OptimConfig
-
rank_multiple_of:
int = 1¶ Round up the low-rank dimension to a multiple of this number. This may be useful to ensure even sharding.
- build_parallelism_config()[source]¶
Prepare device meshes for Dion optimizer based on the parallelism configuration.
Supports:
- Single-device: All meshes are None
- FSDP: outer_shard_mesh = DP mesh, replicate_mesh = None
- HSDP: replicate_mesh = DP replicate mesh, outer_shard_mesh = DP shard mesh
- TP: inner_shard_mesh = TP mesh (can be combined with FSDP or HSDP)
- Return type:
- Returns:
Dictionary with ‘replicate_mesh’, ‘outer_shard_mesh’, and ‘inner_shard_mesh’ keys.
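The rounding behaviour of rank_multiple_of can be sketched with a little arithmetic. The interpretation of rank_fraction as a fraction of the smaller matrix dimension is an assumption (this page does not describe that field), the function name is hypothetical, and whether the result is clamped to the matrix dimensions is not stated here, so the sketch leaves it unclamped.

```python
import math


def low_rank_dim(rows, cols, rank_fraction=1.0, rank_multiple_of=1):
    """Hypothetical low-rank dimension: a fraction of min(rows, cols), rounded up."""
    rank = math.ceil(rank_fraction * min(rows, cols))
    # Round up to the next multiple of rank_multiple_of.
    return rank_multiple_of * math.ceil(rank / rank_multiple_of)


full = low_rank_dim(768, 768)
quarter = low_rank_dim(1024, 4096, rank_fraction=0.25)
rounded = low_rank_dim(1000, 4096, rank_fraction=0.25, rank_multiple_of=64)
```

In the last call the raw rank of 250 is rounded up to 256, which divides evenly across 2, 4, 8, or 16 shards.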
- class olmo_core.optim.SkipStepLionConfig(group_overrides=None, compile=False, fixed_fields=('initial_lr',), lr=0.0001, betas=(0.9, 0.99), weight_decay=0.0, rolling_interval_length=128, sigma_factor=6, *, type=None)[source]¶
Bases:
OptimConfig[SkipStepLion]
Configuration class for building a SkipStepLion optimizer.
- registered_base¶
alias of OptimConfig
- class olmo_core.optim.SkipStepLion(params, lr=0.0001, betas=(0.9, 0.99), weight_decay=0.0, rolling_interval_length=128, sigma_factor=6)[source]¶
Bases:
SkipStepOptimizer
A “skip step” version of Lion.
- class olmo_core.optim.NoOpConfig(group_overrides=None, compile=False, fixed_fields=('initial_lr',), lr=0.001, rolling_interval_length=128, sigma_factor=6, *, type=None)[source]¶
Bases:
OptimConfig[NoOpOptimizer]
Configuration class for building a NoOpOptimizer.
This optimizer performs no parameter updates but maintains step skipping logic for gathering statistics during training.
-
rolling_interval_length:
int = 128¶ The length of the rolling interval to use for computing the mean and standard deviation of the loss and gradient norm.
- registered_base¶
alias of OptimConfig
-
sigma_factor:
int = 6¶ The number of standard deviations above the mean loss/grad norm at which a step is skipped.
- class olmo_core.optim.NoOpOptimizer(params, lr=0.001, rolling_interval_length=128, sigma_factor=6)[source]¶
Bases:
SkipStepOptimizer
A no-op optimizer that performs no parameter updates but maintains all step skipping logic.
This optimizer is useful for gathering statistics from training without actually modifying the model parameters. It tracks losses and gradient norms, computes step factors based on rolling statistics, but does not apply any updates to the model.
- class olmo_core.optim.Scheduler(lr_field='lr', initial_lr_field='initial_lr', units='steps', *, type=None)[source]¶
Bases:
Config, Registrable
Learning rate scheduler base class.
- class olmo_core.optim.ConstantScheduler(lr_field='lr', initial_lr_field='initial_lr', units='steps', *, type=None)[source]¶
Bases:
Scheduler
Constant learning rate schedule, basically a no-op.
- class olmo_core.optim.ConstantWithWarmup(lr_field='lr', initial_lr_field='initial_lr', units='steps', warmup=None, warmup_steps=None, warmup_fraction=None, warmup_min_lr=0.0, *, type=None)[source]¶
Bases:
Scheduler
Constant learning rate schedule with a warmup.
- class olmo_core.optim.CosWithWarmup(lr_field='lr', initial_lr_field='initial_lr', units='steps', warmup=None, warmup_steps=None, warmup_fraction=None, alpha_f=0.1, t_max=None, warmup_min_lr=0.0, *, type=None)[source]¶
Bases:
Scheduler
Cosine learning rate schedule with a warmup.
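A cosine schedule with warmup can be sketched as below. This is an illustration rather than the class's actual get_lr: it assumes a linear warmup from warmup_min_lr, assumes that alpha_f is the final LR expressed as a fraction of initial_lr, and takes warmup as an already-resolved step count. The function name is hypothetical.

```python
import math


def cos_with_warmup(initial_lr, current, t_max, warmup, alpha_f=0.1, warmup_min_lr=0.0):
    """Hypothetical sketch: linear warmup, then cosine decay to initial_lr * alpha_f."""
    eta_min = initial_lr * alpha_f  # assumed meaning of alpha_f
    if current < warmup:
        return warmup_min_lr + (initial_lr - warmup_min_lr) * current / warmup
    if current >= t_max:
        return eta_min
    progress = (current - warmup) / (t_max - warmup)
    return eta_min + (initial_lr - eta_min) * (1 + math.cos(math.pi * progress)) / 2


lr_start = cos_with_warmup(1e-3, 0, t_max=1000, warmup=100)
lr_peak = cos_with_warmup(1e-3, 100, t_max=1000, warmup=100)
lr_end = cos_with_warmup(1e-3, 1000, t_max=1000, warmup=100)
```

Under these assumptions the LR starts at warmup_min_lr, reaches initial_lr at the end of warmup, and ends at one tenth of the peak with the default alpha_f.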
- class olmo_core.optim.CosWithWarmupAndLinearDecay(lr_field='lr', initial_lr_field='initial_lr', units='steps', warmup=None, warmup_steps=None, warmup_fraction=None, alpha_f=0.1, t_max=None, warmup_min_lr=0.0, decay=None, decay_steps=None, decay_fraction=0.1, decay_min_lr=0.0, *, type=None)[source]¶
Bases:
CosWithWarmup
Cosine learning rate schedule with a warmup, cut short at the end and followed by a linear decay.
- class olmo_core.optim.ExponentialScheduler(lr_field='lr', initial_lr_field='initial_lr', units='steps', lr_min=1e-09, *, type=None)[source]¶
Bases:
Scheduler
Exponential learning rate schedule that increases from a minimum LR to a maximum LR. Thus:
- lr(0) = lr_min
- lr(t_max) = initial_lr
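One curve that satisfies both endpoint conditions is a geometric interpolation between lr_min and initial_lr. The sketch below assumes that form; the exact expression used by the class is not given on this page, and the function name is hypothetical.

```python
def exponential_lr(initial_lr, current, t_max, lr_min=1e-9):
    """Hypothetical sketch: geometric interpolation from lr_min up to initial_lr."""
    fraction = min(max(current / t_max, 0.0), 1.0)
    # fraction = 0 gives lr_min, fraction = 1 gives initial_lr.
    return lr_min * (initial_lr / lr_min) ** fraction


lr_first = exponential_lr(1e-3, 0, t_max=1000)
lr_half = exponential_lr(1e-3, 500, t_max=1000)
lr_last = exponential_lr(1e-3, 1000, t_max=1000)
```

With this form the LR grows by a constant factor per step, so the midpoint of the run sits at the geometric mean of the two endpoints. A sweep like this is commonly used for learning rate range tests.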
- class olmo_core.optim.HalfCosWithWarmup(lr_field='lr', initial_lr_field='initial_lr', units='steps', warmup=None, warmup_steps=None, warmup_fraction=None, alpha_f=0.1, t_max=None, warmup_min_lr=0.0, *, type=None)[source]¶
Bases:
Scheduler
Second half of a cosine learning rate schedule, with a warmup before that.
Note: This assumes that the peak LR set is for the full cosine schedule.
- class olmo_core.optim.InvSqrtWithWarmup(lr_field='lr', initial_lr_field='initial_lr', units='steps', alpha_f=0.1, warmup=None, warmup_steps=None, warmup_fraction=None, warmup_min_lr=0.0, *, type=None)[source]¶
Bases:
Scheduler
Inverse square root learning rate (LR) schedule with a warmup.
- class olmo_core.optim.LinearWithWarmup(lr_field='lr', initial_lr_field='initial_lr', units='steps', alpha_f=0.1, t_max=None, warmup=None, warmup_steps=None, warmup_fraction=None, warmup_min_lr=0.0, *, type=None)[source]¶
Bases:
Scheduler
Linear learning rate schedule with a warmup.
- class olmo_core.optim.SequentialScheduler(lr_field='lr', initial_lr_field='initial_lr', units='steps', schedulers=<factory>, schedulers_max=None, schedulers_max_steps=None, override_decay=None, *, type=None)[source]¶
Bases:
Scheduler
A scheduler that runs a sequence of schedulers one after another during the optimization process. The initial LR of each scheduler in the sequence is set to the final LR of the previous one.
-
schedulers_max:
Optional[List[int]] = None¶ A list of the steps or token counts for which each scheduler runs. The last scheduler is assumed to run until the end of training, so any value provided for it is ignored.
-
override_decay:
Optional[OverrideDecay] = None¶ Optional late-stage override. When current >= override_decay.start, the sub-scheduler sequence is bypassed and the LR decays from “whatever the main sequence would have produced at start” to the override’s target over duration (linear or cosine). After start + duration, the LR is held at the override’s end LR.
Note
While the override is active, t_max is ignored; the override is defined absolutely by start and duration.
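The override behaviour described above can be sketched as a standalone function. The names override_decay_lr, lr_at_start, end_lr, and shape are hypothetical and do not reflect the actual fields of OverrideDecay; only start and duration come from the description.

```python
import math


def override_decay_lr(current, start, duration, lr_at_start, end_lr, shape="linear"):
    """Hypothetical sketch of a late-stage decay defined only by start and duration.

    lr_at_start stands for whatever the main sequence would have produced at `start`.
    """
    if current < start:
        raise ValueError("the override is only active once current >= start")
    progress = min((current - start) / duration, 1.0)  # held at 1.0 after the end
    if shape == "cosine":
        weight = 0.5 * (1 + math.cos(math.pi * progress))
    else:
        weight = 1.0 - progress
    return end_lr + (lr_at_start - end_lr) * weight


at_start = override_decay_lr(900, start=900, duration=100, lr_at_start=4e-4, end_lr=0.0)
halfway = override_decay_lr(950, start=900, duration=100, lr_at_start=4e-4, end_lr=0.0)
held = override_decay_lr(5000, start=900, duration=100, lr_at_start=4e-4, end_lr=0.0)
```

Note that t_max does not appear anywhere in the function, which mirrors the statement that the override is defined absolutely by start and duration.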
- class olmo_core.optim.WSD(lr_field='lr', initial_lr_field='initial_lr', units='steps', warmup=None, warmup_steps=None, warmup_fraction=None, decay=None, decay_steps=None, decay_fraction=0.1, warmup_min_lr=0.0, decay_min_lr=0.0, *, type=None)[source]¶
Bases:
Scheduler
Warmup-stable-decay scheduler.
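The three phases of a warmup-stable-decay schedule can be sketched as follows. The linear shape of both ramps is an assumption, warmup and decay are taken as already-resolved step counts, and the function name is hypothetical; this is not the class's own get_lr.

```python
def wsd_lr(initial_lr, current, t_max, warmup, decay, warmup_min_lr=0.0, decay_min_lr=0.0):
    """Hypothetical sketch: linear warmup, constant plateau, linear decay tail."""
    if current < warmup:
        return warmup_min_lr + (initial_lr - warmup_min_lr) * current / warmup
    decay_start = t_max - decay
    if current >= decay_start:
        remaining = max(t_max - current, 0)
        return decay_min_lr + (initial_lr - decay_min_lr) * remaining / decay
    return initial_lr  # stable phase


schedule = [wsd_lr(1.0, step, t_max=1000, warmup=100, decay=100) for step in (50, 500, 950, 1000)]
```

The four sampled points fall in the warmup, the plateau, the middle of the decay tail, and the final step.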
- class olmo_core.optim.WSDS(lr_field='lr', initial_lr_field='initial_lr', units='steps', period_lengths=<factory>, period_lr_multipliers=None, warmup=None, warmup_fraction=None, decay=None, decay_fraction=None, warmup_min_lr=0.0, decay_min_lr=0.0, *, type=None)[source]¶
Bases:
Scheduler
Warmup-Stable-Decay-Simplified (WSD-S) scheduler for continual pretraining. Reference: https://arxiv.org/abs/2410.05192
- class olmo_core.optim.PowerLR(lr_field='lr', initial_lr_field='initial_lr', units='steps', b=-0.51, warmup=None, warmup_steps=None, warmup_fraction=None, warmup_min_lr=0.0, decay=None, decay_steps=None, decay_fraction=0.1, decay_min_lr=0.0, *, type=None)[source]¶
Bases:
Scheduler
Power learning-rate schedule with:
- Linear warm-up to a reference peak LR (initial_lr) during the first warmup steps/tokens.
- Power phase where the LR decays following a power law, lr = initial_lr * (current / warmup) ** b. This makes the LR independent of the eventual training horizon.
- Optional linear decay tail during the last decay steps/tokens to smoothly anneal to decay_min_lr.
Notes
- b should be negative (e.g. -0.51); its magnitude controls how quickly the LR decays in the power phase.
- If both warmup and warmup_fraction (or both decay and decay_fraction) are specified, an OLMoConfigurationError is raised to mirror the behaviour of other schedulers in this file.
- get_lr(initial_lr, current, t_max)[source]¶
Compute the learning rate for the given current step/token count.
- Linear warm-up: lr = warmup_min_lr + (initial_lr - warmup_min_lr) * current / warmup
- Power phase: lr = initial_lr * (current / warmup) ** b
- Linear decay tail (last decay steps/tokens): lr is linearly annealed from the power-phase value at the start of the tail to decay_min_lr.
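The three phases listed above translate directly into a small standalone function. This is a sketch that follows the formulas as written, not the class's source: warmup and decay are assumed to be already-resolved step counts, the tail is assumed to start after the warmup ends, and the function name power_lr is hypothetical.

```python
def power_lr(initial_lr, current, t_max, warmup, decay=0, b=-0.51,
             warmup_min_lr=0.0, decay_min_lr=0.0):
    """Hypothetical sketch of the warm-up, power-law, and linear-tail phases."""

    def power_phase(t):
        return initial_lr * (t / warmup) ** b

    if current < warmup:
        return warmup_min_lr + (initial_lr - warmup_min_lr) * current / warmup
    decay_start = t_max - decay
    if decay > 0 and current >= decay_start:
        # Anneal linearly from the power-phase value at the start of the tail.
        start_lr = power_phase(decay_start)
        progress = min((current - decay_start) / decay, 1.0)
        return start_lr + (decay_min_lr - start_lr) * progress
    return power_phase(current)


warm = power_lr(1.0, 50, t_max=1000, warmup=100)
peak = power_lr(1.0, 100, t_max=1000, warmup=100)
later = power_lr(1.0, 400, t_max=1000, warmup=100, b=-0.5)
final = power_lr(1.0, 1000, t_max=1000, warmup=100, decay=100)
```

Outside the optional tail, the value at a given step depends only on warmup and b, not on t_max, which is the sense in which the schedule is independent of the training horizon.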