optim

class olmo_core.optim.OptimConfig(group_overrides=None, compile=False, fixed_fields=('initial_lr',), *, type=None)[source]

Bases: Config, Registrable, Generic[Opt]

Base class for Optimizer configs.

group_overrides: Optional[List[OptimGroupOverride]] = None

Use this to pull out groups of parameters into separate param groups with their own options.

compile: bool = False

Compile the optimizer step.

Warning

Optimizer step compilation is still in beta and may not work with some optimizers. You may also see unexpected behavior and very poor performance if you turn this feature on in the middle of a run that was previously trained without compiling the optimizer, because the LR is restored as a float instead of a tensor.

fixed_fields: Tuple[str, ...] = ('initial_lr',)

These are fields that should not be overridden by the value in a checkpoint after loading optimizer state.
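As a rough illustration of what "fixed" means here, the sketch below merges a checkpointed param group into the current one while keeping the fixed fields from the current config. The helper name restore_group and the dict layout are hypothetical; this is not the library's loading code.

```python
from typing import Any, Dict, Tuple


def restore_group(
    current: Dict[str, Any],
    checkpointed: Dict[str, Any],
    fixed_fields: Tuple[str, ...] = ("initial_lr",),
) -> Dict[str, Any]:
    """Hypothetical helper: load checkpointed options but keep fixed fields."""
    merged = dict(checkpointed)
    for field in fixed_fields:
        if field in current:
            # The value from the current config wins over the checkpoint.
            merged[field] = current[field]
    return merged


current = {"lr": 3e-4, "initial_lr": 3e-4, "weight_decay": 0.1}
checkpointed = {"lr": 1.2e-4, "initial_lr": 1e-3, "weight_decay": 0.1}
restored = restore_group(current, checkpointed)
```

In this sketch the scheduled lr comes back from the checkpoint, while initial_lr keeps the newly configured value.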

build_groups(model, strict=True)[source]

Build parameter groups.

Parameters:
  • model (Module) – The model to optimize.

  • strict (bool, default: True) – If True an error is raised if a pattern in group_overrides doesn’t match any parameter.

Return type:

Union[Iterable[Tensor], List[Dict[str, Any]]]

abstract classmethod optimizer()[source]

Get the optimizer class associated with this config.

Return type:

Type[TypeVar(Opt, bound= Optimizer)]

build(model, strict=True)[source]

Build the optimizer. This default implementation is suitable for standard, point-wise optimizers such as AdamW, Lion, etc.

Parameters:

strict (bool, default: True) – If True an error is raised if a pattern in group_overrides doesn’t match any parameter.

Return type:

TypeVar(Opt, bound= Optimizer)

class olmo_core.optim.MatrixAwareOptimConfig(group_overrides=None, compile=False, fixed_fields=('initial_lr',), *, type=None)[source]

Bases: OptimConfig, Generic[Opt]

Configuration class for building a matrix-aware optimizer.

classmethod optimizer()[source]

Get the optimizer class associated with this config.

Return type:

Type[TypeVar(Opt, bound= Optimizer)]

default_group_overrides(model)[source]

Default group overrides for matrix-aware optimizers.

Return type:

List[OptimGroupOverride]

build_groups(model, strict=True)[source]

Build parameter groups.

Parameters:
  • model (Module) – The model to optimize.

  • strict (bool, default: True) – If True an error is raised if a pattern in group_overrides doesn’t match any parameter.

Return type:

Union[Iterable[Tensor], list[dict[str, Any]]]

create_optimizer(model, strict=True)[source]

Create the optimizer.

Return type:

TypeVar(Opt, bound= Optimizer)

build(model, strict=True)[source]

Build the optimizer.

Parameters:

strict (bool, default: True) – If True an error is raised if a pattern in group_overrides doesn’t match any parameter.

Return type:

TypeVar(Opt, bound= Optimizer)

class olmo_core.optim.OptimGroupOverride(params, opts)[source]

Bases: Config

params: List[str]

A list of fully qualified parameter names (FQNs) or wildcard patterns that match FQNs.

opts: Dict[str, Any]

Options to set in the corresponding param group.
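The sketch below shows one way wildcard patterns could split parameter names into groups with their own options. It is a stand-alone illustration: it uses fnmatch from the standard library, plain tuples in place of OptimGroupOverride, and ValueError for the strict check, all of which are assumptions rather than the library's actual matching code.

```python
from fnmatch import fnmatch
from typing import Any, Dict, List, Tuple

# Stand-in for an override: (params patterns, opts).
Override = Tuple[List[str], Dict[str, Any]]


def split_into_groups(
    names: List[str], overrides: List[Override], strict: bool = True
) -> List[Dict[str, Any]]:
    remaining = list(names)
    groups: List[Dict[str, Any]] = []
    for patterns, opts in overrides:
        matched: List[str] = []
        for pattern in patterns:
            hits = [n for n in names if fnmatch(n, pattern)]
            if strict and not hits:
                raise ValueError(f"pattern {pattern!r} matched no parameter")
            matched.extend(n for n in hits if n in remaining and n not in matched)
        remaining = [n for n in remaining if n not in matched]
        groups.append({"params": matched, **opts})
    # Parameters not claimed by any override form the default group.
    groups.insert(0, {"params": remaining})
    return groups


names = [
    "embeddings.weight",
    "blocks.0.attention.w_q.weight",
    "blocks.0.attention_norm.weight",
    "lm_head.w_out.weight",
]
overrides = [(["embeddings.weight", "*_norm.weight"], {"weight_decay": 0.0})]
groups = split_into_groups(names, overrides)
```

Here the embedding and norm weights end up in a group with weight_decay set to 0.0, and the remaining parameters stay in the default group.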

class olmo_core.optim.SkipStepOptimizer(params, defaults, rolling_interval_length=128, sigma_factor=6)[source]

Bases: Optimizer

A SkipStepOptimizer is an optimizer that can skip updates when the loss or gradient norm for a step is above a certain threshold of standard deviations computed over a rolling interval.

Important

When using a SkipStepOptimizer you must always set latest_loss and latest_grad_norm to the current loss and grad norm, respectively, before calling step().

The TransformerTrainModule will automatically set the latest_loss and latest_grad_norm whenever its optimizer is a subclass of SkipStepOptimizer.

Tip

When implementing a SkipStepOptimizer you should be careful to avoid host-device syncs. You can use get_step_factor() within your step() method to do this. See the implementation of SkipStepLion for an example.

get_step_factor()[source]

Returns a float tensor which will be 1.0 if the optimizer should proceed with the step and 0.0 if the optimizer should skip the step.

The tensor can be used within the optimizer’s step computation to essentially skip a step without a host-device sync.

Return type:

Tensor

property step_skipped: Tensor

Returns a float tensor which will be 1.0 if the step was skipped and 0.0 otherwise.
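The skip criterion described for this class can be sketched on the host with plain floats. The class below is a hypothetical helper, not the library's implementation: the real optimizer works with tensors to avoid host-device syncs and also tracks the gradient norm, and details such as whether an outlier is added to the rolling window are assumptions made here.

```python
import statistics
from collections import deque


class RollingSkipRule:
    """Hypothetical float-based version of the rolling skip criterion."""

    def __init__(self, rolling_interval_length: int = 128, sigma_factor: int = 6):
        self.history = deque(maxlen=rolling_interval_length)
        self.sigma_factor = sigma_factor

    def step_factor(self, latest_loss: float) -> float:
        factor = 1.0
        if len(self.history) >= 2:
            mean = statistics.fmean(self.history)
            std = statistics.stdev(self.history)
            # Skip when the loss is more than sigma_factor deviations above the mean.
            if latest_loss > mean + self.sigma_factor * std:
                factor = 0.0
        # Assumption: every observed loss enters the window, including outliers.
        self.history.append(latest_loss)
        return factor


rule = RollingSkipRule(rolling_interval_length=8, sigma_factor=6)
factors = [rule.step_factor(loss) for loss in [2.0, 2.1, 1.9, 2.0, 50.0, 2.0]]
```

An optimizer can multiply its update by this factor, so that a factor of 0.0 leaves the parameters unchanged without branching on the host.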

class olmo_core.optim.AdamWConfig(group_overrides=None, compile=False, fixed_fields=('initial_lr',), lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.01, foreach=None, fused=None, *, type=None)[source]

Bases: OptimConfig[AdamW]

Configuration class for building a torch.optim.AdamW optimizer.

classmethod optimizer()[source]

Get the optimizer class associated with this config.

Return type:

Type[AdamW]

registered_base

alias of OptimConfig

class olmo_core.optim.SkipStepAdamWConfig(group_overrides=None, compile=False, fixed_fields=('initial_lr',), lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.01, dtype=None, foreach=True, step_increment_bugfix=True, rolling_interval_length=128, sigma_factor=6, *, type=None)[source]

Bases: OptimConfig[SkipStepAdamW]

Configuration class for building a SkipStepAdamW optimizer.

foreach: bool = True

Whether to use multi-tensor (foreach) kernels for the AdamW update. Faster than the non-foreach version.

step_increment_bugfix: bool = True

Whether or not to fix the step-incrementing bug discovered in SkipStepAdamW.

If this flag is set to False, the step will not be incremented, which gives the optimizer an effective lr that is 2.2x higher than the specified lr, and no bias correction is applied.

registered_base

alias of OptimConfig

rolling_interval_length: int = 128

The length of the rolling interval to use for computing the mean and standard deviation of the loss.

sigma_factor: int = 6

The number of standard deviations above the mean loss to skip a step.

classmethod optimizer()[source]

Get the optimizer class associated with this config.

Return type:

Type[SkipStepAdamW]

class olmo_core.optim.SkipStepAdamW(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.01, rolling_interval_length=128, sigma_factor=6, dtype=None, foreach=False, step_increment_bugfix=True)[source]

Bases: SkipStepOptimizer

A “skip step” version of AdamW.

property step_skipped: Tensor

Returns a float tensor which will be 1.0 if the step was skipped and 0.0 otherwise.

step(closure=None)[source]

Perform a single optimization step to update the parameters.

Parameters:

closure (Callable) – A closure that reevaluates the model and returns the loss. Optional for most optimizers.

Return type:

None

class olmo_core.optim.AdamConfig(group_overrides=None, compile=False, fixed_fields=('initial_lr',), lr=0.001, betas=(0.9, 0.999), eps=1e-08, foreach=None, fused=None, *, type=None)[source]

Bases: OptimConfig[Adam]

Configuration class for building a torch.optim.Adam optimizer.

classmethod optimizer()[source]

Get the optimizer class associated with this config.

Return type:

Type[Adam]

registered_base

alias of OptimConfig

class olmo_core.optim.LionConfig(group_overrides=None, compile=False, fixed_fields=('initial_lr',), lr=0.0001, betas=(0.9, 0.99), weight_decay=0.0, *, type=None)[source]

Bases: OptimConfig[Lion]

Configuration class for building a Lion optimizer.

classmethod optimizer()[source]

Get the optimizer class associated with this config.

Return type:

Type[Lion]

registered_base

alias of OptimConfig

class olmo_core.optim.Lion(params, lr=0.0001, betas=(0.9, 0.99), weight_decay=0.0)[source]

Bases: Optimizer

An implementation of the Lion optimizer.
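For orientation, the published Lion update rule can be written out on plain Python floats as below. This is a sketch of the general algorithm under the default hyperparameters listed above, not a copy of this class's step() method; the function name lion_step is hypothetical.

```python
from typing import List, Tuple


def sign(x: float) -> float:
    return float((x > 0) - (x < 0))


def lion_step(
    params: List[float],
    grads: List[float],
    exp_avg: List[float],
    lr: float = 1e-4,
    betas: Tuple[float, float] = (0.9, 0.99),
    weight_decay: float = 0.0,
) -> Tuple[List[float], List[float]]:
    beta1, beta2 = betas
    new_params, new_exp_avg = [], []
    for p, g, m in zip(params, grads, exp_avg):
        # Decoupled weight decay.
        p = p * (1.0 - lr * weight_decay)
        # The update is the sign of an interpolation between momentum and gradient.
        update = sign(beta1 * m + (1.0 - beta1) * g)
        p = p - lr * update
        # Momentum is refreshed with the second beta after the parameter step.
        m = beta2 * m + (1.0 - beta2) * g
        new_params.append(p)
        new_exp_avg.append(m)
    return new_params, new_exp_avg


params, exp_avg = lion_step([1.0, -2.0], [0.5, -0.25], [0.0, 0.0], lr=0.1)
```

Because only the sign of the interpolated momentum is used, every parameter moves by exactly lr per step, which is one reason Lion is usually run with a smaller learning rate than AdamW.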

step(closure=None)[source]

Perform a single optimization step to update the parameters.

Parameters:

closure (Callable) – A closure that reevaluates the model and returns the loss. Optional for most optimizers.

Return type:

None

class olmo_core.optim.MuonConfig(group_overrides=None, compile=False, fixed_fields=('initial_lr',), lr=0.01, mu=0.95, betas=(0.9, 0.95), weight_decay=0.1, cautious_wd=False, nesterov=False, adjust_lr='rms_norm', flatten=False, use_triton=False, *, type=None)[source]

Bases: MatrixAwareOptimConfig

Configuration class for building a Muon optimizer.

Muon internally runs standard SGD-momentum, and then performs an orthogonalization post-processing step, in which each 2D parameter’s update is replaced with the nearest orthogonal matrix.

Muon is only used for hidden weight layers. The input embedding, final output layer, and any internal gains or biases are optimized using AdamW.

Muon supports FSDP and HSDP parallelism strategies. Flattened mesh dimensions (e.g. “dp_ep” and “dp_cp”) could be supported but are not currently implemented.
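The orthogonalization step mentioned above is typically approximated with a Newton-Schulz iteration. The sketch below uses the classic cubic variant on small list-of-lists matrices to show the idea; it assumes a nonzero, full-rank input and is not the tuned iteration or the Triton kernel used by the optimizer itself.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def orthogonalize(g: Matrix, steps: int = 20) -> Matrix:
    # Scale by the Frobenius norm so that all singular values are at most 1.
    norm = sum(v * v for row in g for v in row) ** 0.5
    x = [[v / norm for v in row] for row in g]
    for _ in range(steps):
        # Cubic Newton-Schulz update: X <- 1.5 X - 0.5 (X X^T) X.
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(ra, rb)] for ra, rb in zip(x, xxt_x)]
    return x


q = orthogonalize([[2.0, 1.0], [0.0, 1.0]])
qqt = matmul(q, transpose(q))
```

Each iteration pushes the singular values of X toward 1 while leaving the singular vectors unchanged, so the result approaches the orthogonal factor of the input.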

lr: float = 0.01

Base learning rate. For Muon, this will be scaled based on the matrix dimensions. For AdamW, this is the actual learning rate and no additional scaling is done.

mu: float = 0.95

Momentum for Muon

betas: Tuple[float, float] = (0.9, 0.95)

Betas for AdamW

weight_decay: float = 0.1

Weight decay factor for non-embedding parameters

cautious_wd: bool = False

Whether to apply weight decay only where update and parameter signs align.

nesterov: bool = False

Whether to use Nesterov momentum.

adjust_lr: Optional[MuonAdjustLRStrategy] = 'rms_norm'

How to adjust the learning rate for Muon updates.

flatten: bool = False

Whether to flatten 3D+ tensors to 2D for Muon updates. Use this for convolutional layers.

registered_base

alias of OptimConfig

use_triton: bool = False

Whether to use optimized Triton kernels for Newton-Schulz iteration. Because the result of X @ X.T is symmetric, we can avoid computing the upper triangular part of the matrix output. See: https://www.lakernewhouse.com/assets/writing/faster-symmul-with-thunderkittens.pdf

classmethod optimizer()[source]

Get the optimizer class associated with this config.

Return type:

type

default_group_overrides(model)[source]

Split the model parameters into Adam and Muon groups. Only >=2d, internal parameters are meant to be optimized with Muon.

Return type:

list[OptimGroupOverride]

build_groups(model, strict=True)[source]

Build parameter groups.

Parameters:
  • model (Module) – The model to optimize.

  • strict (bool, default: True) – If True an error is raised if a pattern in group_overrides doesn’t match any parameter.

Return type:

Union[Iterable[Tensor], list[dict[str, Any]]]

build_parallelism_config()[source]

Prepare device mesh for Muon optimizer based on the parallelism configuration.

Muon requires a single 1D DeviceMesh for distributed training:
  • Single-device: Returns None

  • FSDP: Returns the DP mesh (parameter sharding mesh)

  • HSDP: Returns the DP shard mesh (the 1D sharded sub-mesh)

Note: TP is not directly supported by Muon. For TP configurations, you may need to handle tensor parallelism separately.

Return type:

dict[str, Optional[DeviceMesh]]

Returns:

1D DeviceMesh for distributed Muon, or None for single-device.

create_optimizer(model, strict=True, **kwargs)[source]

Create the optimizer.

class olmo_core.optim.NorMuonConfig(group_overrides=None, compile=False, fixed_fields=('initial_lr',), lr=0.01, mu=0.95, betas=(0.9, 0.95), weight_decay=0.1, cautious_wd=False, nesterov=False, adjust_lr='rms_norm', flatten=False, use_triton=False, *, type=None)[source]

Bases: MuonConfig

Configuration class for building a NorMuon optimizer.

NorMuon is a variant of Muon that adds neuron-wise adaptive learning rates. https://arxiv.org/abs/2510.05491

registered_base

alias of OptimConfig

muon_beta2: float = 0.95

Beta2 for Muon

classmethod optimizer()[source]

Get the optimizer class associated with this config.

Return type:

type

class olmo_core.optim.DionConfig(group_overrides=None, compile=False, fixed_fields=('initial_lr',), lr=0.01, mu=0.95, betas=(0.9, 0.95), weight_decay=0.1, rank_fraction=1.0, rank_multiple_of=1, *, type=None)[source]

Bases: MatrixAwareOptimConfig

Configuration class for building a Dion optimizer.

Dion is a Muon-like optimizer that is designed to be scalable for DP-replicated, DP-sharded, and TP-sharded models. See https://arxiv.org/abs/2504.05295 for more details.

Dion supports FSDP, HSDP, and TP parallelism strategies. Flattened mesh dimensions (e.g. “dp_ep” and “dp_cp”) could be supported but are not currently implemented.

lr: float = 0.01

Base learning rate. For Dion, this will be scaled based on the matrix dimensions. For AdamW, this is the actual learning rate and no additional scaling is done.

mu: float = 0.95

Momentum for Dion

betas: Tuple[float, float] = (0.9, 0.95)

Betas for AdamW

weight_decay: float = 0.1

Weight decay for non-embedding parameters

rank_fraction: float = 1.0

Rank fraction for Dion. Set to 1.0 for full-rank optimization.

registered_base

alias of OptimConfig

rank_multiple_of: int = 1

Round up the low-rank dimension to a multiple of this number. This may be useful to ensure even sharding.
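The rounding described above amounts to a ceiling to the next multiple. The sketch below also assumes that the target rank is rank_fraction times the smaller matrix dimension; that formula and the helper name low_rank_dim are assumptions for illustration, not taken from the Dion implementation.

```python
import math


def low_rank_dim(
    rows: int, cols: int, rank_fraction: float = 1.0, rank_multiple_of: int = 1
) -> int:
    # Assumption: the target rank is a fraction of the smaller matrix dimension.
    rank = math.ceil(rank_fraction * min(rows, cols))
    # Round up to the next multiple of rank_multiple_of.
    return math.ceil(rank / rank_multiple_of) * rank_multiple_of


full_rank = low_rank_dim(512, 512)
quarter_rank = low_rank_dim(4096, 1024, rank_fraction=0.25)
rounded_rank = low_rank_dim(4096, 1000, rank_fraction=0.25, rank_multiple_of=64)
```

In the last call the unrounded rank is 250, which is rounded up to 256 so that it divides evenly across 64 shards.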

classmethod optimizer()[source]

Get the optimizer class associated with this config.

Return type:

type

default_group_overrides(model)[source]

Apply Dion’s parameter grouping rules.

Return type:

list[OptimGroupOverride]

build_parallelism_config()[source]

Prepare device meshes for Dion optimizer based on the parallelism configuration.

Supports:
  • Single-device: All meshes are None

  • FSDP: outer_shard_mesh = DP mesh, replicate_mesh = None

  • HSDP: replicate_mesh = DP replicate mesh, outer_shard_mesh = DP shard mesh

  • TP: inner_shard_mesh = TP mesh (can be combined with FSDP or HSDP)

Return type:

dict[str, Optional[DeviceMesh]]

Returns:

Dictionary with ‘replicate_mesh’, ‘outer_shard_mesh’, and ‘inner_shard_mesh’ keys.

create_optimizer(model, strict=True, **kwargs)[source]

Create the optimizer.

class olmo_core.optim.SkipStepLionConfig(group_overrides=None, compile=False, fixed_fields=('initial_lr',), lr=0.0001, betas=(0.9, 0.99), weight_decay=0.0, rolling_interval_length=128, sigma_factor=6, *, type=None)[source]

Bases: OptimConfig[SkipStepLion]

Configuration class for building a SkipStepLion optimizer.

classmethod optimizer()[source]

Get the optimizer class associated with this config.

Return type:

Type[SkipStepLion]

registered_base

alias of OptimConfig

class olmo_core.optim.SkipStepLion(params, lr=0.0001, betas=(0.9, 0.99), weight_decay=0.0, rolling_interval_length=128, sigma_factor=6)[source]

Bases: SkipStepOptimizer

A “skip step” version of Lion.

property step_skipped: Tensor

Returns a float tensor which will be 1.0 if the step was skipped and 0.0 otherwise.

step(closure=None)[source]

Perform a single optimization step to update the parameters.

Parameters:

closure (Callable) – A closure that reevaluates the model and returns the loss. Optional for most optimizers.

Return type:

None

class olmo_core.optim.NoOpConfig(group_overrides=None, compile=False, fixed_fields=('initial_lr',), lr=0.001, rolling_interval_length=128, sigma_factor=6, *, type=None)[source]

Bases: OptimConfig[NoOpOptimizer]

Configuration class for building a NoOpOptimizer.

This optimizer performs no parameter updates but maintains step skipping logic for gathering statistics during training.

lr: float = 0.001

Learning rate (not used for updates, but maintained for compatibility).

rolling_interval_length: int = 128

The length of the rolling interval to use for computing the mean and standard deviation of the loss and gradient norm.

registered_base

alias of OptimConfig

sigma_factor: int = 6

The number of standard deviations above the mean loss/grad norm to skip a step.

classmethod optimizer()[source]

Get the optimizer class associated with this config.

Return type:

Type[NoOpOptimizer]

class olmo_core.optim.NoOpOptimizer(params, lr=0.001, rolling_interval_length=128, sigma_factor=6)[source]

Bases: SkipStepOptimizer

A no-op optimizer that performs no parameter updates but maintains all step skipping logic.

This optimizer is useful for gathering statistics from training without actually modifying the model parameters. It tracks losses and gradient norms, computes step factors based on rolling statistics, but does not apply any updates to the model.

property step_skipped: Tensor

Returns a float tensor which will be 1.0 if the step was skipped and 0.0 otherwise.

step(closure=None)[source]

Perform a single optimization step. For this optimizer, no parameter updates are applied.

Parameters:

closure (Callable) – A closure that reevaluates the model and returns the loss. Optional for most optimizers.

Return type:

None

class olmo_core.optim.Scheduler(lr_field='lr', initial_lr_field='initial_lr', units='steps', *, type=None)[source]

Bases: Config, Registrable

Learning rate scheduler base class.

abstract get_lr(initial_lr, current, t_max)[source]

Get the learning rate given the initial/max learning rate, current step/token count, and the maximum number of steps/tokens.

Return type:

Union[float, Tensor]

set_lr(group, trainer)[source]

Set the learning rate on an optimizer param group given a trainer’s state.

Return type:

Union[float, Tensor]

class olmo_core.optim.SchedulerUnits(value)[source]

Bases: StrEnum

An enumeration.

class olmo_core.optim.ConstantScheduler(lr_field='lr', initial_lr_field='initial_lr', units='steps', *, type=None)[source]

Bases: Scheduler

Constant learning rate schedule, basically a no-op.

get_lr(initial_lr, current, t_max)[source]

Get the learning rate given the initial/max learning rate, current step/token count, and the maximum number of steps/tokens.

Return type:

Union[float, Tensor]

registered_base

alias of Scheduler

class olmo_core.optim.ConstantWithWarmup(lr_field='lr', initial_lr_field='initial_lr', units='steps', warmup=None, warmup_steps=None, warmup_fraction=None, warmup_min_lr=0.0, *, type=None)[source]

Bases: Scheduler

Constant learning rate schedule with a warmup.

get_lr(initial_lr, current, t_max)[source]

Get the learning rate given the initial/max learning rate, current step/token count, and the maximum number of steps/tokens.

Return type:

Union[float, Tensor]

registered_base

alias of Scheduler

class olmo_core.optim.CosWithWarmup(lr_field='lr', initial_lr_field='initial_lr', units='steps', warmup=None, warmup_steps=None, warmup_fraction=None, alpha_f=0.1, t_max=None, warmup_min_lr=0.0, *, type=None)[source]

Bases: Scheduler

Cosine learning rate schedule with a warmup.
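A minimal sketch of such a schedule is shown below. It assumes a linear warmup from warmup_min_lr and that alpha_f gives the final LR as a fraction of initial_lr; the function is a stand-alone illustration rather than this class's get_lr().

```python
import math


def cos_with_warmup(
    initial_lr: float,
    current: int,
    t_max: int,
    warmup: int,
    alpha_f: float = 0.1,
    warmup_min_lr: float = 0.0,
) -> float:
    # Assumption: alpha_f is the final LR expressed as a fraction of initial_lr.
    eta_min = alpha_f * initial_lr
    if current < warmup:
        # Linear warmup from warmup_min_lr to initial_lr.
        return warmup_min_lr + (initial_lr - warmup_min_lr) * current / warmup
    if current >= t_max:
        return eta_min
    progress = (current - warmup) / (t_max - warmup)
    return eta_min + (initial_lr - eta_min) * (1 + math.cos(math.pi * progress)) / 2


lrs = [cos_with_warmup(1.0, t, t_max=1000, warmup=100) for t in (0, 50, 100, 550, 1000)]
```

With these settings the LR rises linearly for 100 steps, peaks at initial_lr, passes the midpoint between the peak and the floor halfway through the cosine phase, and ends at alpha_f * initial_lr.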

get_lr(initial_lr, current, t_max)[source]

Get the learning rate given the initial/max learning rate, current step/token count, and the maximum number of steps/tokens.

Return type:

Union[float, Tensor]

registered_base

alias of Scheduler

class olmo_core.optim.CosWithWarmupAndLinearDecay(lr_field='lr', initial_lr_field='initial_lr', units='steps', warmup=None, warmup_steps=None, warmup_fraction=None, alpha_f=0.1, t_max=None, warmup_min_lr=0.0, decay=None, decay_steps=None, decay_fraction=0.1, decay_min_lr=0.0, *, type=None)[source]

Bases: CosWithWarmup

Cosine learning rate schedule with a warmup, cut short at the end and followed by a linear decay.

get_lr(initial_lr, current, t_max)[source]

Get the learning rate given the initial/max learning rate, current step/token count, and the maximum number of steps/tokens.

Return type:

Union[float, Tensor]

registered_base

alias of Scheduler

class olmo_core.optim.ExponentialScheduler(lr_field='lr', initial_lr_field='initial_lr', units='steps', lr_min=1e-09, *, type=None)[source]

Bases: Scheduler

Exponential learning rate schedule that increases from a minimum LR to a maximum LR. Thus:
  • lr(0) = lr_min

  • lr(t_max) = initial_lr
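One formula that satisfies both endpoints is a geometric interpolation between lr_min and initial_lr, sketched below. This is an assumption about the shape of the curve; the source only states the two boundary conditions.

```python
def exponential_lr(
    initial_lr: float, current: int, t_max: int, lr_min: float = 1e-9
) -> float:
    # Geometric interpolation: the LR grows by a constant factor per step.
    return lr_min * (initial_lr / lr_min) ** (current / t_max)


start = exponential_lr(1e-3, 0, 1000)
middle = exponential_lr(1e-3, 500, 1000)
end = exponential_lr(1e-3, 1000, 1000)
```

A schedule of this shape sweeps many orders of magnitude of LR in a single run, which is the usual setup for an LR range test.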

registered_base

alias of Scheduler

get_lr(initial_lr, current, t_max)[source]

Get the learning rate given the initial/max learning rate, current step/token count, and the maximum number of steps/tokens.

Return type:

Union[float, Tensor]

class olmo_core.optim.HalfCosWithWarmup(lr_field='lr', initial_lr_field='initial_lr', units='steps', warmup=None, warmup_steps=None, warmup_fraction=None, alpha_f=0.1, t_max=None, warmup_min_lr=0.0, *, type=None)[source]

Bases: Scheduler

Second half of a cosine learning rate schedule, preceded by a warmup. Note: This assumes that the configured peak LR is the peak of the full cosine schedule.

get_lr(initial_lr, current, t_max)[source]

Get the learning rate given the initial/max learning rate, current step/token count, and the maximum number of steps/tokens.

Return type:

Union[float, Tensor]

registered_base

alias of Scheduler

class olmo_core.optim.InvSqrtWithWarmup(lr_field='lr', initial_lr_field='initial_lr', units='steps', alpha_f=0.1, warmup=None, warmup_steps=None, warmup_fraction=None, warmup_min_lr=0.0, *, type=None)[source]

Bases: Scheduler

Inverse square root learning rate (LR) schedule with a warmup.

get_lr(initial_lr, current, t_max)[source]

Get the learning rate given the initial/max learning rate, current step/token count, and the maximum number of steps/tokens.

Return type:

Union[float, Tensor]

registered_base

alias of Scheduler

class olmo_core.optim.LinearWithWarmup(lr_field='lr', initial_lr_field='initial_lr', units='steps', alpha_f=0.1, t_max=None, warmup=None, warmup_steps=None, warmup_fraction=None, warmup_min_lr=0.0, *, type=None)[source]

Bases: Scheduler

Linear learning rate schedule with a warmup.

get_lr(initial_lr, current, t_max)[source]

Get the learning rate given the initial/max learning rate, current step/token count, and the maximum number of steps/tokens.

Return type:

Union[float, Tensor]

registered_base

alias of Scheduler

class olmo_core.optim.SequentialScheduler(lr_field='lr', initial_lr_field='initial_lr', units='steps', schedulers=<factory>, schedulers_max=None, schedulers_max_steps=None, override_decay=None, *, type=None)[source]

Bases: Scheduler

A scheduler that runs a sequence of schedulers one after another during training. The initial LR of each scheduler in the sequence is set to the final LR of the previous scheduler.

schedulers_max: Optional[List[int]] = None

A list of the steps or token counts for which each scheduler runs. The last scheduler is assumed to run until the end of training, so any value provided for it is ignored.

override_decay: Optional[OverrideDecay] = None

Optional late-stage override. When current >= override_decay.start, the sub-scheduler sequence is bypassed and the LR decays from “whatever the main sequence would have produced at start” to the override’s target over duration (linear or cosine). After start + duration, the LR is held at the override’s end LR.

Note

While the override is active, t_max is ignored — the override is defined absolutely by start and duration.

get_lr(initial_lr, current, t_max)[source]

Get the learning rate given the initial/max learning rate, current step/token count, and the maximum number of steps/tokens.

Return type:

Union[float, Tensor]

registered_base

alias of Scheduler

class olmo_core.optim.WSD(lr_field='lr', initial_lr_field='initial_lr', units='steps', warmup=None, warmup_steps=None, warmup_fraction=None, decay=None, decay_steps=None, decay_fraction=0.1, warmup_min_lr=0.0, decay_min_lr=0.0, *, type=None)[source]

Bases: Scheduler

Warmup-stable-decay (WSD) scheduler.
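The three phases can be sketched as below. The example assumes a linear warmup, a constant plateau at initial_lr, and a linear decay over the last decay steps down to decay_min_lr; it is an illustration with hypothetical names, not this class's get_lr().

```python
def wsd_lr(
    initial_lr: float,
    current: int,
    t_max: int,
    warmup: int,
    decay: int,
    warmup_min_lr: float = 0.0,
    decay_min_lr: float = 0.0,
) -> float:
    if current < warmup:
        # Warmup: linear ramp from warmup_min_lr to initial_lr.
        return warmup_min_lr + (initial_lr - warmup_min_lr) * current / warmup
    if current <= t_max - decay:
        # Stable: hold the peak LR.
        return initial_lr
    # Decay: assumed linear anneal to decay_min_lr over the last `decay` steps.
    remaining = max(t_max - current, 0)
    return decay_min_lr + (initial_lr - decay_min_lr) * remaining / decay


lrs = [wsd_lr(1.0, t, t_max=1000, warmup=100, decay=100) for t in (50, 500, 950, 1000)]
```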

get_lr(initial_lr, current, t_max)[source]

Get the learning rate given the initial/max learning rate, current step/token count, and the maximum number of steps/tokens.

Return type:

Union[float, Tensor]

registered_base

alias of Scheduler

class olmo_core.optim.WSDS(lr_field='lr', initial_lr_field='initial_lr', units='steps', period_lengths=<factory>, period_lr_multipliers=None, warmup=None, warmup_fraction=None, decay=None, decay_fraction=None, warmup_min_lr=0.0, decay_min_lr=0.0, *, type=None)[source]

Bases: Scheduler

Warmup–Stable–Decay—Simplified (WSD‑S) scheduler for continual pretraining. Reference: https://arxiv.org/abs/2410.05192

get_lr(initial_lr, current, t_max)[source]

Get the learning rate given the initial/max learning rate, current step/token count, and the maximum number of steps/tokens.

Return type:

Union[float, Tensor]

registered_base

alias of Scheduler

class olmo_core.optim.PowerLR(lr_field='lr', initial_lr_field='initial_lr', units='steps', b=-0.51, warmup=None, warmup_steps=None, warmup_fraction=None, warmup_min_lr=0.0, decay=None, decay_steps=None, decay_fraction=0.1, decay_min_lr=0.0, *, type=None)[source]

Bases: Scheduler

Power learning‑rate schedule with

  1. Linear warm‑up to a reference peak LR (initial_lr) during the first warmup steps/tokens.

  2. Power phase where the LR decays following a power‑law lr = initial_lr * (current / warmup) ** b. This makes the LR independent of the eventual training horizon.

  3. Optional linear decay tail during the last decay steps/tokens to smoothly anneal to decay_min_lr.

Notes

  • b should be negative (e.g. ‑0.51); magnitude controls how quickly the LR decays in the power phase.

  • If both warmup and warmup_fraction (or both decay and decay_fraction) are specified, an OLMoConfigurationError is raised to mirror the behaviour of other schedulers in this file.

get_lr(initial_lr, current, t_max)[source]

Compute the learning rate for the given current step/token count.

Return type:

Union[float, Tensor]

Linear warm‑up:

lr = warmup_min_lr + (initial_lr - warmup_min_lr) * current / warmup

Power phase:

lr = initial_lr * (current / warmup) ** b

Linear decay tail (last decay steps/tokens):

lr is linearly annealed from the power‑phase value at the start of the tail to decay_min_lr.
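The three phases above can be combined into one function, sketched here with plain floats. The function name and the handling of edge cases (for example, treating decay <= 0 as "no tail" and assuming the tail starts after the warmup) are assumptions for illustration.

```python
def power_lr(
    initial_lr: float,
    current: int,
    t_max: int,
    warmup: int,
    decay: int,
    b: float = -0.51,
    warmup_min_lr: float = 0.0,
    decay_min_lr: float = 0.0,
) -> float:
    if current < warmup:
        # Linear warm-up to the reference peak LR.
        return warmup_min_lr + (initial_lr - warmup_min_lr) * current / warmup

    def power_phase(t: int) -> float:
        return initial_lr * (t / warmup) ** b

    decay_start = t_max - decay
    if decay <= 0 or current <= decay_start:
        return power_phase(current)
    # Linear tail from the power-phase value at the start of the tail.
    start_lr = power_phase(decay_start)
    frac = min((current - decay_start) / decay, 1.0)
    return start_lr + (decay_min_lr - start_lr) * frac


lrs = [
    power_lr(1.0, t, t_max=1000, warmup=100, decay=100, b=-0.5)
    for t in (50, 100, 400, 950, 1000)
]
```

With b = -0.5 the LR halves each time the step count quadruples after the warmup, regardless of t_max, until the decay tail begins.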

registered_base

alias of Scheduler