model_ladder

class olmo_core.model_ladder.ModelLadder(*, name, dir, project=None, sizes, max_devices, device_type, model_configurator, run_configurator, data_loader, instance_sources, sequence_length=8192, tokenizer, seed=42, backend='cpu:gloo,cuda:nccl')[source]

Bases: Config

Represents a complete model ladder of runs.

name: str

A name to assign to the ladder.

dir: str

A unique directory where ladder run results and intermediate checkpoints should be saved.

project: Optional[str] = None

An optional project name to associate with the ladder runs. Defaults to name. This is used by some logging backends (e.g. Weights & Biases).

sizes: list[str]

A list of model size specs to run as part of the ladder.

max_devices: int

The number of accelerator devices available to use for each run.

device_type: str

The type of accelerator device available to use for each run (e.g. “NVIDIA H100 80GB HBM3”).

model_configurator: ModelConfigurator

The model configurator to use.

run_configurator: RunConfigurator

The run configurator to use.

data_loader: ComposableDataLoaderConfig

The data loader configuration to use for each run.

instance_sources: list[InstanceSourceConfig]

The instance sources to use for each run.

sequence_length: int = 8192

The sequence length to train each run on.

tokenizer: TokenizerConfig

The tokenizer to use.

seed: int = 42

The initial random seed to use for all runs in the ladder.

backend: str = 'cpu:gloo,cuda:nccl'

The distributed backend to use for each run.

dry_run(size_spec, show_plot=True, save_plot=None)[source]

Do a dry run, which prints the relevant hyperparameters and the required number of devices, and displays a plot of the learning rate schedule.

run(size_spec, for_benchmarking=False)[source]

Execute a particular model run of the experiment locally and store the results.

run_benchmark(size_spec)[source]

Do a benchmarking run for a model of the given size spec. This is just like run(), but with benchmarking-specific settings (no checkpoints, no evals, hard stop).

get_model_config(size_spec)[source]

Get the model config for a model of the given size spec.

Return type:

ModelConfig

get_num_params(size_spec)[source]

Get the actual number of non-embedding parameters for a model of the given size spec.

get_num_devices(size_spec)[source]

Get the number of devices that would be used for a run of the given size spec.

Return type:

int

get_save_folder(size_spec)[source]

Get the training save folder for a run of the given size spec.

Return type:

str

get_checkpoints(size_spec, download_metrics=False, discover_all=False, alternative_dirs=None)[source]

Get the list of ordered checkpoints from the run for the given size spec.

Parameters:
  • size_spec (str) – The size specification for the model run.

  • download_metrics (bool, default: False) – If True, download metrics files to local cache.

  • discover_all (bool, default: False) – If True, discover all checkpoints that exist in the save folder rather than only checking at the intervals defined by RunConfigurator.configure_checkpoint_intervals().

  • alternative_dirs (Optional[list[Union[Path, PathLike, str]]], default: None) – Optional list of alternative root directories to search for checkpoints. The size_spec is appended to each directory. For each checkpoint, the primary save directory is checked first, then each alternative directory in order until found.

Return type:

list[RunCheckpointInfo]
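The lookup order described for alternative_dirs can be illustrated with a minimal, self-contained sketch. It does not call olmo_core. The function name find_checkpoint, the checkpoint folder name "step1000", and the size spec "190M" are hypothetical, and the real on-disk layout may differ.

```python
import tempfile
from pathlib import Path
from typing import Optional, Sequence, Union

PathOrStr = Union[Path, str]


def find_checkpoint(
    primary_save_folder: PathOrStr,
    checkpoint_name: str,
    size_spec: str,
    alternative_dirs: Optional[Sequence[PathOrStr]] = None,
) -> Optional[Path]:
    """Return the first existing checkpoint directory, or None.

    Assumption: the primary save folder is checked first, then each
    alternative root with ``size_spec`` appended, in the order given.
    """
    candidates = [Path(primary_save_folder) / checkpoint_name]
    for root in alternative_dirs or []:
        candidates.append(Path(root) / size_spec / checkpoint_name)
    for candidate in candidates:
        if candidate.is_dir():
            return candidate
    return None


# Small demonstration in a throwaway directory (all names are hypothetical).
with tempfile.TemporaryDirectory() as tmp:
    primary = Path(tmp) / "primary" / "190M"
    alt_root = Path(tmp) / "alt"
    primary.mkdir(parents=True)
    (alt_root / "190M" / "step1000").mkdir(parents=True)
    found = find_checkpoint(primary, "step1000", "190M", [alt_root])
    print(found is not None and found.parent.name)
```

Under these assumptions, a checkpoint that exists in both places resolves to the primary copy, since the primary folder is tried first.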

get_metrics(size_spec, prefix=None, discover_all=False, alternative_dirs=None)[source]

Get the metrics from the run of the given size spec.

Parameters:
  • size_spec (str) – The size specification for the model run.

  • prefix (Optional[str], default: None) – If provided, only include metrics with keys starting with this prefix.

  • discover_all (bool, default: False) – If True, discover all checkpoints that exist in the save folder rather than only checking at the intervals defined by RunConfigurator.configure_checkpoint_intervals().

  • alternative_dirs (Optional[list[Union[Path, PathLike, str]]], default: None) – Optional list of alternative root directories to search for checkpoints and metrics files. The size_spec is appended to each directory.

Return type:

Optional[DataFrame]
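The effect of the prefix parameter can be sketched on a plain dictionary. This is only an illustration of prefix filtering, not the library's implementation. The helper name filter_metrics and the metric keys are hypothetical, and the real method returns a DataFrame rather than a dict.

```python
from typing import Dict, Optional


def filter_metrics(metrics: Dict[str, float], prefix: Optional[str] = None) -> Dict[str, float]:
    """Keep only the metrics whose key starts with ``prefix``.

    With ``prefix=None`` every metric is kept, mirroring the documented default.
    """
    if prefix is None:
        return dict(metrics)
    return {key: value for key, value in metrics.items() if key.startswith(prefix)}


# Hypothetical metric names, for illustration only.
example = {"train/loss": 2.5, "eval/loss": 2.75, "eval/accuracy": 0.5}
print(filter_metrics(example, prefix="eval/"))
```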

class olmo_core.model_ladder.ModelConfigurator[source]

Bases: Config, Generic[M]

Defines how to configure a model of a particular size.

abstract configure_model(*, size_spec, sequence_length, tokenizer, device_type)[source]

Configure the model for the given size spec.

Return type:

TypeVar(M, bound= ModelConfig)

abstract configure_rank_microbatch_size(*, size_spec, sequence_length, device_type)[source]

Configure the training per-device micro-batch size in tokens for a model of this size.

Return type:

int

abstract configure_minimal_device_mesh_spec(*, size_spec, sequence_length, device_type)[source]

Configure the minimal device mesh spec needed to train a model of this size.

Return type:

DeviceMeshSpec

abstract build_train_module(*, size_spec, sequence_length, rank_microbatch_size, model_config, optim_config, scheduler, device_type)[source]

Build the train module for the given model and optimizer configs.

Return type:

TrainModule

class olmo_core.model_ladder.RunConfigurator[source]

Bases: Config

Defines how to configure a run for a model of a particular size.

abstract configure_target_batch_size(num_params)[source]

Get the target global batch size in tokens for a model of this size. The actual batch size used may be slightly different to ensure it’s a multiple of the data parallel world size times the device micro-batch size.

Return type:

int
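The adjustment mentioned above, where the actual batch size becomes a multiple of the data parallel world size times the device micro-batch size, can be sketched as follows. The function name round_batch_size is hypothetical, and rounding to the nearest multiple is an assumption, since the reference does not say whether the library rounds up, down, or to nearest.

```python
def round_batch_size(target_tokens: int, dp_world_size: int, rank_microbatch_size: int) -> int:
    """Round a target global batch size (in tokens) to a usable value.

    The result is a positive multiple of ``dp_world_size * rank_microbatch_size``.
    Assumption: round to the nearest multiple, with a floor of one multiple.
    """
    unit = dp_world_size * rank_microbatch_size
    multiple = max(1, round(target_tokens / unit))
    return multiple * unit


# A target of 1,000,000 tokens on 8 data parallel ranks with 16,384 tokens per micro-batch.
actual = round_batch_size(1_000_000, dp_world_size=8, rank_microbatch_size=16_384)
print(actual, actual % (8 * 16_384) == 0)
```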

abstract configure_duration(num_params, batch_size)[source]

Get the training duration for a given model and batch size.

Return type:

Duration

abstract configure_optimizer(num_params, batch_size)[source]

Get the optimizer config for a given model and batch size.

Return type:

OptimConfig

abstract configure_lr_scheduler(num_params, batch_size)[source]

Get the learning rate scheduler for a given model and batch size.

Return type:

Scheduler

abstract configure_checkpoint_intervals(num_params, batch_size)[source]

Get the checkpoint intervals for a given model and batch size. Returns a list of (checkpoint interval, checkpoint description) tuples.

Return type:

list[tuple[Duration, str]]

abstract plot_lr_schedule(num_params, batch_size, *, show=True, save_path=None)[source]

Render a plot of the learning rate schedule.

Return type:

Union[Path, PathLike, str, None]

class olmo_core.model_ladder.RunCheckpointInfo(name, step, tokens, path, metrics_path, exists)[source]

Bases: object

Describes a checkpoint from a model run.

name: str

A descriptive name for the checkpoint, assigned by the RunConfigurator.

step: int

The training step number of the checkpoint.

tokens: int

The number of training tokens processed up to this checkpoint.

path: Union[Path, PathLike, str]

A path to the checkpoint directory.

metrics_path: Union[Path, PathLike, str, None]

A path to the metrics JSON file for this checkpoint, if it exists.

exists: bool

Whether the checkpoint actually exists.

display()[source]

Get a rich-formatted string representation of the checkpoint info.

Return type:

str
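As a rough illustration of how the fields above might be used, the sketch below defines a stand-in dataclass with the same field names and picks the latest checkpoint that actually exists. CheckpointInfoSketch and latest_existing are hypothetical names, and the example values are invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CheckpointInfoSketch:
    """Stand-in mirroring the documented RunCheckpointInfo fields (not the real class)."""

    name: str
    step: int
    tokens: int
    path: str
    metrics_path: Optional[str]
    exists: bool


def latest_existing(checkpoints: List[CheckpointInfoSketch]) -> Optional[CheckpointInfoSketch]:
    """Return the existing checkpoint with the highest step, or None if none exist."""
    existing = [ckpt for ckpt in checkpoints if ckpt.exists]
    return max(existing, key=lambda ckpt: ckpt.step) if existing else None


# Invented example values.
ckpts = [
    CheckpointInfoSketch("first", 1000, 1_000_000, "/tmp/run/step1000", None, True),
    CheckpointInfoSketch("second", 2000, 2_000_000, "/tmp/run/step2000", None, False),
]
print(latest_existing(ckpts).name)
```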

class olmo_core.model_ladder.DeviceMeshSpec(world_size: int, dp_world_size: int | None)[source]

Bases: NamedTuple

Describes the relevant dimensions of a device mesh needed to train a model of a certain size.

world_size: int

The minimum number of devices required.

dp_world_size: Optional[int]

The minimum size of the data parallel group. This can be set to None if the data parallel world size should equal the world size. This, along with the per-device micro-batch size, is needed to determine the right global batch size.
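The None convention for dp_world_size can be sketched with a stand-in NamedTuple. MeshSpecSketch and both helper functions are hypothetical names used only for illustration. Treating dp_world_size times the micro-batch size as the smallest batch-size unit follows the description above, but the exact calculation in the library is an assumption.

```python
from typing import NamedTuple, Optional


class MeshSpecSketch(NamedTuple):
    """Stand-in with the same two fields as the documented DeviceMeshSpec."""

    world_size: int
    dp_world_size: Optional[int]


def effective_dp_world_size(spec: MeshSpecSketch) -> int:
    """Resolve ``dp_world_size=None`` to the full world size."""
    return spec.world_size if spec.dp_world_size is None else spec.dp_world_size


def batch_size_unit(spec: MeshSpecSketch, rank_microbatch_size: int) -> int:
    """Smallest global batch size (in tokens) in which every data parallel rank gets one micro-batch."""
    return effective_dp_world_size(spec) * rank_microbatch_size


print(batch_size_unit(MeshSpecSketch(world_size=8, dp_world_size=None), 8192))
print(batch_size_unit(MeshSpecSketch(world_size=16, dp_world_size=4), 8192))
```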

class olmo_core.model_ladder.WSDSChinchillaRunConfigurator(*, chinchilla_multiple, decay_fraction=0.1, tokens_per_param=20, lr_multiplier=1.0, stepped_schedule=False)[source]

Bases: RunConfigurator

A run configurator that uses WSD-S learning rate scheduling and Chinchilla scaling laws.

Note

You may need to tune the tokens_per_param value to your dataset and optimizer.

chinchilla_multiple: float

How long to train each run, expressed as a multiple of the Chinchilla-optimal duration. The multiple must be a power of 2.

decay_fraction: float = 0.1

The duration of each decay as a fraction of the period. Must be at least 10%.
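To make the role of decay_fraction concrete, here is a minimal sketch of the learning rate inside a single period. It assumes a constant peak followed by a linear decay to zero over the final decay_fraction of the period, and it ignores warmup. The decay shape and the function name lr_within_period are assumptions, not taken from the library.

```python
def lr_within_period(step: int, period_steps: int, peak_lr: float, decay_fraction: float = 0.1) -> float:
    """Learning rate at ``step`` within one period of ``period_steps`` steps.

    Assumption: the rate stays at ``peak_lr`` and then decays linearly to zero
    over the last ``decay_fraction`` of the period.
    """
    if not 0 <= step <= period_steps:
        raise ValueError("step must lie within the period")
    decay_steps = int(round(decay_fraction * period_steps))
    decay_start = period_steps - decay_steps
    if step <= decay_start:
        return peak_lr
    return peak_lr * (period_steps - step) / decay_steps


for step in (0, 900, 950, 1000):
    print(step, lr_within_period(step, period_steps=1000, peak_lr=1e-3))
```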

tokens_per_param: int = 20

The number of tokens per parameter to use for Chinchilla calculations.
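The interaction between chinchilla_multiple and tokens_per_param can be sketched as a token budget. The formula used here, chinchilla_multiple * tokens_per_param * num_params, is the usual Chinchilla rule of thumb and is an assumption about how the configurator combines these values. Whether fractional powers of 2 such as 0.5 are accepted by the library is also an assumption, and both function names are hypothetical.

```python
import math


def is_power_of_two(value: float) -> bool:
    """True for values such as 0.5, 1, 2, 4 (fractional powers are allowed in this sketch)."""
    return value > 0 and math.log2(value).is_integer()


def chinchilla_token_budget(num_params: int, chinchilla_multiple: float, tokens_per_param: int = 20) -> int:
    """Total training tokens under the assumed rule ``multiple * tokens_per_param * num_params``."""
    if not is_power_of_two(chinchilla_multiple):
        raise ValueError("chinchilla_multiple must be a power of 2")
    return int(chinchilla_multiple * tokens_per_param * num_params)


# A hypothetical model with 190M non-embedding parameters trained at 2x the Chinchilla-optimal duration.
print(chinchilla_token_budget(190_000_000, chinchilla_multiple=2.0))
```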

lr_multiplier: float = 1.0

A multiplier to apply to the learning rate calculated from Chinchilla scaling laws.

stepped_schedule: bool = False

If True, use a stepped schedule for the peak learning rate instead of a constant one: the peak learning rate is scaled by 1 / sqrt(D) during each stage, where D is the target Chinchilla multiple of that stage. This assumes that the base learning rate is optimal for 1xC.
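The 1 / sqrt(D) scaling can be sketched directly. The function name stepped_peak_lr and the example base learning rate are hypothetical. The sketch applies only the scaling rule stated above and says nothing about how stages are delimited in the actual scheduler.

```python
import math


def stepped_peak_lr(base_lr: float, stage_multiple: float) -> float:
    """Peak learning rate for a stage whose target Chinchilla multiple is ``stage_multiple``.

    ``base_lr`` is taken to be the peak learning rate that is optimal at 1xC.
    """
    if stage_multiple <= 0:
        raise ValueError("stage_multiple must be positive")
    return base_lr / math.sqrt(stage_multiple)


# Peak learning rate per stage for stages targeting 1xC, 2xC, and 4xC.
for multiple in (1, 2, 4):
    print(multiple, stepped_peak_lr(1e-3, multiple))
```

With D = 4 the peak is half the base rate, and with D = 1 the base rate is used unchanged.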

configure_target_batch_size(num_params)[source]

Get the target global batch size in tokens for a model of this size. The actual batch size used may be slightly different to ensure it’s a multiple of the data parallel world size times the device micro-batch size.

Return type:

int

configure_duration(num_params, batch_size)[source]

Get the training duration for a given model and batch size.

Return type:

Duration

configure_optimizer(num_params, batch_size)[source]

Get the optimizer config for a given model and batch size.

Return type:

OptimConfig

configure_lr_scheduler(num_params, batch_size)[source]

Get the learning rate scheduler for a given model and batch size.

Return type:

Scheduler

configure_checkpoint_intervals(num_params, batch_size)[source]

Get the checkpoint intervals for a given model and batch size. Returns a list of (checkpoint interval, checkpoint description) tuples.

Return type:

list[tuple[Duration, str]]

plot_lr_schedule(num_params, batch_size, *, show=True, save_path=None)[source]

Render a plot of the learning rate schedule.

Return type:

Union[Path, PathLike, str, None]

class olmo_core.model_ladder.TransformerModelConfigurator(*, rank_microbatch_size=None)[source]

Bases: ModelConfigurator[TransformerConfig]

Generic model configurator for transformer models.

rank_microbatch_size: Optional[int] = None

Optional fixed rank micro-batch size. If set, this value is used directly instead of computing it based on model size and device type.

configure_rank_microbatch_size(*, size_spec, sequence_length, device_type)[source]

Configure the training per-device micro-batch size in tokens for a model of this size.

Return type:

int

configure_minimal_device_mesh_spec(*, size_spec, sequence_length, device_type)[source]

Configure the minimal device mesh spec needed to train a model of this size.

Return type:

DeviceMeshSpec

build_train_module(*, size_spec, sequence_length, rank_microbatch_size, model_config, optim_config, scheduler, device_type)[source]

Build the train module for the given model and optimizer configs.

Return type:

TransformerTrainModule

class olmo_core.model_ladder.Olmo3ModelConfigurator(*, rank_microbatch_size=None, model_construction_kwargs=<factory>)[source]

Bases: TransformerModelConfigurator

Model configurator for Olmo 3 transformer models.

model_construction_kwargs: dict[str, Any]

Keyword arguments to pass to the model constructor.

configure_model(*, size_spec, sequence_length, tokenizer, device_type)[source]

Configure the model for the given size spec.

Return type:

TransformerConfig

class olmo_core.model_ladder.TransformerSize(value)[source]

Bases: StrEnum

An enumeration.