`train.callbacks`¶

Trainer Callback implementations.

class olmo_core.train.callbacks.Callback[source]¶

Bases: Stateful

Trainer callback base class.

Callbacks can be used to modify and extend the behavior of the trainer loop. This module contains a number of useful Callback implementations, but you can always add your own.

priority: ClassVar[int] = 0¶: Priority of the callback. Determines the order in which callbacks run relative to each other. The higher the priority, the earlier a callback runs.
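The ordering rule can be illustrated with a minimal sketch. The classes below are stand-ins written for this example, not the real olmo_core classes. The priority values are the ones documented on this page, and breaking ties by insertion order is an assumption of the sketch.

```python
from typing import ClassVar, List


class Callback:
    """Stand-in for the documented base class (not the real olmo_core class)."""

    priority: ClassVar[int] = 0


class ModelMerge(Callback):
    priority: ClassVar[int] = 2


class Checkpointer(Callback):
    priority: ClassVar[int] = 1


class SpeedMonitor(Callback):
    priority: ClassVar[int] = -2


def run_order(callbacks: List[Callback]) -> List[Callback]:
    # Higher priority runs earlier. sorted() is stable, so callbacks with equal
    # priority keep their insertion order (an assumption of this sketch).
    return sorted(callbacks, key=lambda cb: cb.priority, reverse=True)


ordered = run_order([SpeedMonitor(), Callback(), Checkpointer(), ModelMerge()])
order_names = [type(cb).__name__ for cb in ordered]
print(order_names)
```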

state_dict()[source]¶

Get the state dict to save.

Return type:: Dict[str, Any]

load_state_dict(state_dict)[source]¶: Load a state dict.

block_ephemeral_checkpoints()[source]¶: Register this callback as blocking ephemeral checkpoint saves. Ephemeral saves are blocked as long as at least one callback is registered.

unblock_ephemeral_checkpoints()[source]¶: Unregister this callback from blocking ephemeral checkpoint saves.
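The blocking rule ("blocked as long as at least one callback is registered") behaves like a set of registered blockers. The sketch below is only an illustration of that rule. The class name `EphemeralSaveGate` and its methods are hypothetical and are not part of olmo_core.

```python
from typing import Set


class EphemeralSaveGate:
    """Hypothetical registry of callbacks that currently block ephemeral saves."""

    def __init__(self) -> None:
        self._blockers: Set[str] = set()

    def block(self, callback_name: str) -> None:
        # Registering twice is harmless because blockers are stored in a set.
        self._blockers.add(callback_name)

    def unblock(self, callback_name: str) -> None:
        # discard() does not raise if the callback was never registered.
        self._blockers.discard(callback_name)

    @property
    def blocked(self) -> bool:
        # Ephemeral saves stay blocked while at least one blocker remains.
        return bool(self._blockers)


gate = EphemeralSaveGate()
gate.block("merge")
gate.block("eval")
gate.unblock("merge")
still_blocked = gate.blocked  # "eval" is still registered
gate.unblock("eval")
finally_blocked = gate.blocked
```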

post_attach()[source]¶: Called right after the callback is attached to the Trainer.

post_checkpoint_loaded(path)[source]¶

Called when a checkpoint is successfully loaded.

Parameters:: path (Union[Path, PathLike, str]) – The path/URL to the checkpoint.

pre_train()[source]¶: Runs before the training loop starts.

pre_epoch()[source]¶: Runs before the start of a new epoch.

pre_load_batch()[source]¶: Runs right before the next batch is fetched from the data loader.

pre_step(batch)[source]¶: Runs right before a training batch is processed.

pre_optim_step()[source]¶: Runs right after the forward-backward passes, right before the optimizer step.

post_train_batch()[source]¶: Runs after a training batch is processed.

post_step()[source]¶: Runs after a complete step (potentially including evals and checkpointing).

post_checkpoint_saved(path)[source]¶

Called when a checkpoint is successfully saved.

Parameters:: path (Union[Path, PathLike, str]) – The path/URL to the checkpoint.

pre_log_metrics(step, metrics)[source]¶: Called when metrics have been gathered for a given step (possibly a previous step), but right before log_metrics(). This can be used to modify, add, or remove metrics by updating the metrics dict in-place.
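As a rough illustration of in-place modification, the sketch below defines a callback-like class with the documented hook signature. It deliberately does not subclass the real Callback so that it runs standalone. The metric names `debug/scratch` and `train/perplexity` are made up for this example.

```python
import math
from typing import Dict


class DerivedMetrics:
    """Hypothetical callback-like class; only the hook signature follows the docs."""

    def pre_log_metrics(self, step: int, metrics: Dict[str, float]) -> None:
        # Mutate the dict in place so the caller sees the changes.
        metrics.pop("debug/scratch", None)  # remove a metric if present
        if "train/CE loss" in metrics:
            # Add a derived metric next to the existing one.
            metrics["train/perplexity"] = math.exp(metrics["train/CE loss"])


step_metrics = {"train/CE loss": 0.0, "debug/scratch": 1.0}
DerivedMetrics().pre_log_metrics(step=10, metrics=step_metrics)
```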

log_metrics(step, metrics)[source]¶: Called when metrics have been gathered for a given step (possibly a previous step).

post_epoch()[source]¶: Runs at the end of a complete epoch.

post_train()[source]¶: Runs after the training loop successfully completes.

on_error(exc)[source]¶: Called when the training loop exits with an error.

close()[source]¶: Always called right before Trainer.fit() exits, even on an error.

class olmo_core.train.callbacks.CallbackConfig[source]¶

Bases: Callback, Config

An alternative way to define callbacks when the callback class itself can’t be serialized.

abstract build(trainer)[source]¶

Build the actual Callback.

Return type:: Optional[Callback]

class olmo_core.train.callbacks.CheckpointerCallback(save_interval=250, ephemeral_save_interval=None, pre_train_checkpoint=None, save_async=None, remove='ephemeral_only', ephemeral_cooldown=None, fixed_steps=None, max_checkpoints=3, enabled=True, _latest_checkpoint_step=-1, _latest_checkpoint_path='', _checkpoints=<factory>, _ephemeral_checkpoints=<factory>, _checkpoints_to_remove=<factory>)[source]¶

Bases: Callback

Manages checkpointing during training, including writing checkpoints at set intervals determined by save_interval and ephemeral_save_interval, as well as removing old checkpoints found in the save folder as determined by the remove setting.

Important

This callback gets added automatically if you don’t explicitly configure it. If you want to override this callback you should subclass it.

priority: ClassVar[int] = 1¶: Priority of the callback. Determines the order in which callbacks run relative to each other. The higher the priority, the earlier a callback runs.

save_interval: Optional[int] = 250¶: The interval, in steps, with which to save permanent checkpoints.

ephemeral_save_interval: Optional[int] = None¶

The interval, in steps, with which to save temporary checkpoints. These checkpoints are removed each time a new checkpoint is saved.

It can be useful to set this to a relatively frequent interval for preemptible jobs.

pre_train_checkpoint: Optional[bool] = None¶: Save a pretrain checkpoint. Defaults to True unless the trainer resumes from a checkpoint.

save_async: Optional[bool] = None¶: Save checkpoints asynchronously. Requires a separate CPU-only backend. Defaults to True if there is one.

remove: CheckpointRemovalStrategy = 'ephemeral_only'¶: The strategy for removing old checkpoints found in the save folder.

ephemeral_cooldown: Optional[int] = None¶: The number of steps to wait after saving a checkpoint before saving another ephemeral one is allowed.

fixed_steps: Optional[List[int]] = None¶: A list of fixed steps at which to save additional permanent checkpoints.

max_checkpoints: Optional[int] = 3¶: Maximum number of permanent checkpoints to keep. When a new permanent checkpoint is saved and the count exceeds this limit, the oldest is removed. Set to None to keep all checkpoints (previous behavior).

Note

Checkpoints saved at fixed_steps are counted toward this limit.
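The retention rule for max_checkpoints can be pictured as a bounded queue of saved steps. This is a simplified sketch of the behavior described above, not the callback's actual bookkeeping. The function name is hypothetical.

```python
from collections import deque
from typing import Deque, List, Optional, Tuple


def track_permanent_checkpoints(
    save_steps: List[int], max_checkpoints: Optional[int] = 3
) -> Tuple[List[int], List[int]]:
    """Return (kept, removed) steps after saving at each step in order."""
    kept: Deque[int] = deque()
    removed: List[int] = []
    for step in save_steps:
        kept.append(step)
        # When the count exceeds the limit, the oldest checkpoint is removed.
        if max_checkpoints is not None and len(kept) > max_checkpoints:
            removed.append(kept.popleft())
    return list(kept), removed


kept, removed = track_permanent_checkpoints([250, 500, 750, 1000, 1250])
kept_all, removed_none = track_permanent_checkpoints([250, 500, 750, 1000], None)
```

With the default limit of 3, only the three most recent permanent checkpoints survive. Passing None keeps everything.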

class olmo_core.train.callbacks.CheckpointRemovalStrategy(value)[source]¶

Bases: StrEnum

An enumeration of the different strategies for removing old checkpoints found in the save folder.

ephemeral_only = 'ephemeral_only'¶: Only remove checkpoints that were saved at the CheckpointerCallback.ephemeral_save_interval.

all_non_permanent = 'all_non_permanent'¶: Remove all non-permanent checkpoints found, including ephemeral checkpoints and also any other checkpoints that were not saved at the CheckpointerCallback.save_interval.

never = 'never'¶: Never remove any old checkpoints found in the save folder.

class olmo_core.train.callbacks.CometCallback(enabled=True, name=None, project=None, workspace=None, tags=None, config=None, cancel_tags=<factory>, cancel_check_interval=None, notifications='none', failure_tag='failed', auto_resume=False, _exp_key=None, _finalized=False)[source]¶

Bases: Callback

Logs metrics to Comet.ml from rank 0.

Important

Requires the comet_ml package and the environment variable COMET_API_KEY.

Note

This callback logs metrics from every single step to Comet.ml, regardless of the value of Trainer.metrics_collect_interval.

enabled: bool = True¶: Set to false to disable this callback.

name: Optional[str] = None¶: The name to give the Comet.ml experiment.

project: Optional[str] = None¶: The Comet.ml project to use.

workspace: Optional[str] = None¶: The name of the Comet.ml workspace to use.

tags: Optional[List[str]] = None¶: Tags to assign the experiment.

config: Optional[Dict[str, Any]] = None¶: The config to save to Comet.ml.

cancel_tags: Optional[List[str]]¶: If you add any of these tags to an experiment on Comet.ml, the run will cancel itself. Defaults to ["cancel", "canceled", "cancelled"].

cancel_check_interval: Optional[int] = None¶: Check for cancel tags every this many steps. Defaults to olmo_core.train.Trainer.cancel_check_interval.

notifications: CometNotificationSetting = 'none'¶: The notification settings.

failure_tag: str = 'failed'¶: The tag to assign to failed experiments.

auto_resume: bool = False¶: If True, an existing experiment will be resumed from a checkpoint if the experiment name matches.

class olmo_core.train.callbacks.CometNotificationSetting(value)[source]¶

Bases: StrEnum

Defines the notifications settings for the Comet.ml callback.

all = 'all'¶: Send all types of notifications.

end_only = 'end_only'¶: Only send a notification when the experiment ends (successfully or with a failure).

failure_only = 'failure_only'¶: Only send a notification when the experiment fails.

none = 'none'¶: Don’t send any notifications.

class olmo_core.train.callbacks.ConfigSaverCallback(fname='config.json', save_data_paths=None, data_paths_fname=None, _config=None)[source]¶

Bases: Callback

A callback that writes an arbitrary JSON-serializable config dictionary (config) to every checkpoint directory written during training. It will also set the config to save for other callbacks, including the WandBCallback, CometCallback, and others, if not already set.

Important

The config should be set after initializing the trainer and attaching all other callbacks.

property config: Dict[str, Any] | None¶: The JSON config dictionary to record.

class olmo_core.train.callbacks.ConsoleLoggerCallback(log_interval=1, metrics_log_interval=None, metrics=<factory>)[source]¶

Bases: Callback

Logs progress and a subset of metrics to the console.

Important

This callback gets added automatically if you don’t explicitly configure it. If you want to override this callback you should subclass it.

log_interval: int = 1¶: How often, in steps, to log progress to the console.

metrics_log_interval: Optional[int] = None¶: How often, in steps, to log metrics to the console. If not set, defaults to log_interval.

metrics: List[str]¶: Metrics to log to the console. Wildcards are supported.

class olmo_core.train.callbacks.EvaluatorCallback(evaluators=<factory>, eval_interval=1000, fixed_steps=None, eval_on_startup=False, eval_on_finish=False, cancel_after_first_eval=False, eval_duration=<factory>, log_interval=5)[source]¶

Bases: Callback

Runs in-loop evaluations for a TransformerTrainModule periodically during training.

evaluators: List[Evaluator]¶: The evaluators to run.

eval_interval: Optional[int] = 1000¶: The interval (in steps) with which to run the evaluators.

fixed_steps: Optional[List[int]] = None¶: A list of fixed steps at which to run the evaluators.

eval_on_startup: bool = False¶: Whether to run an evaluation when the trainer starts up.

eval_on_finish: bool = False¶: Whether to run an evaluation when training finishes.

cancel_after_first_eval: bool = False¶: If True, cancel the run after running evals for the first time. This combined with eval_on_startup=True is useful if you just want to run in-loop evals without training any longer.

eval_duration: Duration¶: The duration to run each evaluator for.

log_interval: int = 5¶: How often to log eval progress to the console during an eval loop.

class olmo_core.train.callbacks.LMEvaluatorCallbackConfig(eval_dataset, eval_interval=1000, fixed_steps=None, eval_on_startup=False, eval_on_finish=False, cancel_after_first_eval=False, eval_duration=<factory>, log_interval=5, deterministic=True, enabled=True)[source]¶: Bases: CallbackConfig

class olmo_core.train.callbacks.DownstreamEvaluatorCallbackConfig(tasks, tokenizer, eval_interval=1000, fixed_steps=None, eval_duration=<factory>, eval_on_startup=False, eval_on_finish=False, cancel_after_first_eval=False, log_interval=5, lazy=False, enabled=True)[source]¶: Bases: CallbackConfig

class olmo_core.train.callbacks.GAPMonitorCallback(enabled=True, monitor=None, interval=1, dump_gradients=None, dump_gradients_start_step=0, dump_gradients_end_step=None, dump_gradients_step_interval=1, dump_gradients_save_first_n=None, _handles=None, _local_batch_size_instances=1, _dry_run_complete=False)[source]¶

Bases: Callback

Gradient, activation, and parameter (GAP) monitoring callback.

This callback logs fine-grained statistics on all gradients, activations, and parameters.

It can also dump raw gradient tensors to disk for offline analysis. Set dump_gradients=True and configure the dump_gradients_* fields to control when and how gradients are saved.

enabled: bool = True¶: Master switch. When False, all monitoring and gradient dumping is disabled.

monitor: Optional[bool] = None¶: Whether to run GAP monitoring (forward/backward hooks, per-tensor stats). Only takes effect when enabled=True. Defaults to True when enabled=True.

interval: int = 1¶: How often (in steps) to measure statistics. Default is every step.

dump_gradients: Optional[bool] = None¶: Whether to dump raw gradient tensors to disk for offline analysis. Only takes effect when enabled=True. Defaults to False when enabled=True.

dump_gradients_start_step: int = 0¶: Step at which to begin dumping gradients. Inclusive.

dump_gradients_end_step: Optional[int] = None¶: Step at which to stop dumping gradients. Inclusive. If None, runs until training ends.

dump_gradients_step_interval: int = 1¶: How often (in steps) to dump gradients. Must be positive.

dump_gradients_save_first_n: Optional[int] = None¶: If set, gather the full gradient to rank 0 and save only the first N elements of each dimension, storing as a single safetensors file. If None, saves the full distributed gradient via distributed checkpoint. Must be positive if set.

class olmo_core.train.callbacks.GarbageCollectorCallback(gc_interval=1000, enabled=True, _start_state=None)[source]¶

Bases: Callback

Disables automatic garbage collection during training and runs gen1 collection on a set schedule instead.

Important

This callback gets added automatically in a distributed training setting if you don’t explicitly configure it. If you want to override this callback you should subclass it.
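The general technique (turn off automatic collection, then collect on a fixed step schedule) can be sketched with the standard `gc` module. The class below is an illustration under assumptions: its name and method names are hypothetical, and the real callback's hook points are not shown on this page.

```python
import gc
from typing import Optional


class ScheduledGC:
    """Hypothetical helper: manual generation-1 collection every gc_interval steps."""

    def __init__(self, gc_interval: int = 1000) -> None:
        self.gc_interval = gc_interval
        self._start_state: Optional[bool] = None

    def start(self) -> None:
        # Remember whether automatic GC was on so it can be restored later.
        self._start_state = gc.isenabled()
        gc.disable()

    def after_step(self, step: int) -> bool:
        # Collect generations 0 and 1 on the schedule; return True if we collected.
        if step % self.gc_interval == 0:
            gc.collect(1)
            return True
        return False

    def stop(self) -> None:
        # Restore the original setting.
        if self._start_state:
            gc.enable()
```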

class olmo_core.train.callbacks.GPUMemoryMonitorCallback(device_id=None, _num_alloc_retries=0)[source]¶

Bases: Callback

Adds metrics for GPU memory statistics.

priority: ClassVar[int] = -1¶: Priority of the callback. Determines the order in which callbacks run relative to each other. The higher the priority, the earlier a callback runs.

class olmo_core.train.callbacks.HFConverterCallback(enabled=True, output_folder=None, dtype='bfloat16', validate=False, debug=False, tokenizer_id=None, max_sequence_length=None, device=None, moe_capacity_factor=None)[source]¶

Bases: Callback

Converts the final saved checkpoint to HuggingFace format at the end of a training job.

This callback runs after training completes and uses olmo_core.nn.hf.convert_checkpoint_to_hf() to convert the final OLMo Core checkpoint to a HuggingFace-compatible format.

Note

This callback requires the transformers library to be installed.

Warning

In distributed training, ALL ranks must participate in this callback because gathering the full model state dict from FSDP requires collective operations. Only rank 0 performs the actual HF conversion and saving.

priority: ClassVar[int] = -1¶: Priority of the callback. Determines the order in which callbacks run relative to each other. The higher the priority, the earlier a callback runs.

enabled: bool = True¶: Whether this callback is enabled. Set to False to disable HF conversion.

output_folder: Optional[str] = None¶: The folder to save the HuggingFace checkpoint to. If not specified, defaults to {checkpoint_path}-hf where checkpoint_path is the final checkpoint path.

dtype: Optional[DType] = 'bfloat16'¶: The dtype to save the HuggingFace model weights as. Defaults to bfloat16.

validate: bool = False¶: Whether to validate the converted model against the original model. Validation loads both models and compares their outputs.

debug: bool = False¶: Whether to output debug information during validation. Only has an effect if validate is True.

tokenizer_id: Optional[str] = None¶: The HuggingFace tokenizer identifier to save with the model. If not specified, uses the tokenizer from the experiment config.

max_sequence_length: Optional[int] = None¶: The maximum sequence length for the model. If not specified, uses the tokenizer’s default max length.

device: Optional[str] = None¶: The device to use for conversion. Defaults to CPU.

moe_capacity_factor: Optional[float] = None¶: The MoE capacity factor. Higher values can decrease validation false negatives but may cause OOM errors. Only relevant for MoE models.

class olmo_core.train.callbacks.ProfilerCallback(skip_first=0, wait=1, warmup=5, active=3, repeat=1, with_stack=True, profile_memory=False, enable_cuda_sync_events=False, enabled=True, ranks=None, _first_batch=True)[source]¶

Bases: Callback

Enables profiling/tracing of training steps using torch.profiler. Saves the results to a subdirectory of the save folder named “profiler”.

skip_first: int = 0¶: Ignore this many steps before profiling cycles.

wait: int = 1¶: Idle for this many steps before activating.

warmup: int = 5¶: Start tracing, but discard the results, for this many steps.

active: int = 3¶: Actively trace this many steps.

repeat: int = 1¶: Repeat the cycle this many times, starting at the wait steps.

with_stack: bool = True¶: Whether to record source information (file and line number) for the ops.

profile_memory: bool = False¶: Whether to track tensor memory allocation/deallocation.

enable_cuda_sync_events: bool = False¶: Whether to enable recording of CUDA sync events. Useful for critical-path analysis with https://hta.readthedocs.io/en/latest/source/features/lightweight_critical_path_analysis.html

enabled: bool = True¶: Set to False to disable profiling.

ranks: Optional[str] = None¶

Ranks to profile. Can be:

None: Only rank 0 is profiled
String shortcuts:
- "dp": Profile one rank (local rank 0) in each data parallel group
- "tp": Profile one rank (local rank 0) in each tensor parallel group
- "cp": Profile one rank (local rank 0) in each context parallel group
- "pp": Profile one rank (local rank 0) in each pipeline parallel group
- "ep": Profile one rank (local rank 0) in each expert parallel group
- "all": Profile all ranks

Useful in conjunction with https://github.com/facebookresearch/HolisticTraceAnalysis to analyze traces from a distributed training job.
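To see how skip_first, wait, warmup, active, and repeat interact, the pure-Python sketch below maps a step index to a phase. It assumes the fields follow the same semantics as `torch.profiler.schedule`, including that repeat=0 means the cycle repeats indefinitely. The function name and phase labels are made up for this example.

```python
def profiler_phase(
    step: int,
    skip_first: int = 0,
    wait: int = 1,
    warmup: int = 5,
    active: int = 3,
    repeat: int = 1,
) -> str:
    """Return the assumed profiler phase for a zero-based step index."""
    if step < skip_first:
        return "skipped"
    step -= skip_first
    cycle_len = wait + warmup + active
    # After `repeat` full cycles the profiler stays idle (repeat=0: never stops).
    if repeat > 0 and step // cycle_len >= repeat:
        return "done"
    pos = step % cycle_len
    if pos < wait:
        return "wait"
    if pos < wait + warmup:
        return "warmup"  # tracing, but results are discarded
    return "active"  # results are recorded


phases = [profiler_phase(s) for s in range(10)]
print(phases)
```

With the defaults shown in the class signature, step 0 is idle, steps 1 to 5 are warm-up, steps 6 to 8 are recorded, and nothing is traced afterwards.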

class olmo_core.train.callbacks.SlackNotifierCallback(name=None, notifications='end_only', enabled=True, webhook_url=None)[source]¶

Bases: Callback

name: Optional[str] = None¶: A name to give the run.

notifications: SlackNotificationSetting = 'end_only'¶: The notification settings.

enabled: bool = True¶: Set to false to disable this callback.

webhook_url: Optional[str] = None¶: The webhook URL to post. If not set, will check the environment variable SLACK_WEBHOOK_URL.

class olmo_core.train.callbacks.SlackNotificationSetting(value)[source]¶

Bases: StrEnum

Defines the notifications settings for the Slack notifier callback.

all = 'all'¶: Send all types of notifications.

end_only = 'end_only'¶: Only send a notification when the experiment ends (successfully or with a failure).

failure_only = 'failure_only'¶: Only send a notification when the experiment fails.

none = 'none'¶: Don’t send any notifications.

class olmo_core.train.callbacks.SequenceLengthSchedulerCallback(min_sequence_length=128, warmup_steps=2000, truncate=False, keep_multiple_of=128, enabled=True, _og_rank_microbatch_size=None, _last_seq_len=None)[source]¶

Bases: Callback

A Callback for introducing a linear sequence-length warm-up schedule over the course of warmup_steps starting from min_sequence_length and ending at the configured training sequence length (olmo_core.data.NumpyFSLDataset.sequence_length).

When truncate is False the scheduler works by splitting each instance in a batch into a larger number of shorter instances while maintaining the same number of tokens in each batch and micro-batch. In this case the sequence length set during the warm-up will always be min_sequence_length multiplied by a power of 2, and therefore the training sequence length must also be min_sequence_length multiplied by a power of 2.

Otherwise the scheduler simply truncates the instances in the batch to the desired sequence length, throwing out the extra tokens. The scheduler will ensure the sequence length during the warm-up is always a multiple of keep_multiple_of.
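The two modes can be compared with a small sketch. The rounding rules below are assumptions chosen to satisfy the constraints stated above (a power-of-2 multiple of min_sequence_length when splitting, a multiple of keep_multiple_of when truncating). The callback's exact formula is not given on this page, and the function name is hypothetical.

```python
def warmup_sequence_length(
    step: int,
    *,
    target: int,
    min_sequence_length: int = 128,
    warmup_steps: int = 2000,
    truncate: bool = False,
    keep_multiple_of: int = 128,
) -> int:
    """Assumed sequence length at `step` during a linear warm-up to `target`."""
    if step >= warmup_steps:
        return target
    # Linear interpolation between the minimum and the target length.
    linear = min_sequence_length + (target - min_sequence_length) * step / warmup_steps
    if truncate:
        # Round down to a multiple of keep_multiple_of.
        rounded = int(linear) // keep_multiple_of * keep_multiple_of
        return min(target, max(keep_multiple_of, rounded))
    # Round down to min_sequence_length times a power of 2.
    seq_len = min_sequence_length
    while seq_len * 2 <= linear:
        seq_len *= 2
    return min(seq_len, target)


halfway_split = warmup_sequence_length(1000, target=4096)
halfway_truncated = warmup_sequence_length(1000, target=4096, truncate=True)
```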

Important

This callback is only compatible with a NumpyFSLDataLoader training data_loader.

Note

The “total tokens” recorded by the trainer and SpeedMonitorCallback will still include tokens truncated by this callback for bookkeeping purposes.

class olmo_core.train.callbacks.SpeedMonitorCallback(num_flops_per_token=None, num_params=None, device_peak_flops_per_second=None, _total_steps=0, _total_tokens=0, _total_flops=0, _start_time=0.0, _first_step=True, _step_last_logged=0.0, _batch_load_start=0.0, _batch_load_time=0.0, _step_tokens=0, _step_seq_len=0, _step_flops=0, _parallel_degree=1, _bps_avg=None, _tps_avg=None, _mfu_avg=None)[source]¶

Bases: Callback

Monitors throughput.

Important

This callback gets added automatically if you don’t explicitly configure it. If you want to override this callback you should subclass it.

priority: ClassVar[int] = -2¶: Priority of the callback. Determines the order in which callbacks run relative to each other. The higher the priority, the earlier a callback runs.

class olmo_core.train.callbacks.StabilityMonitorCallback(window_size=128, rolling_window=10000, threshold_std=6.0, enabled=True, loss_metric_name='train/CE loss', grad_norm_metric_name='optim/total grad norm', _loss_history=<factory>, _grad_norm_history=<factory>, _spike_history=<factory>, _total_spike_count=0, _total_step_count=0)[source]¶

Bases: Callback

Monitors training stability by tracking “spikes” in loss and gradient norm.

A spike is detected when a value exceeds the running mean of the last window_size values by more than threshold_std standard deviations. This helps identify training instability.

Metrics recorded:

spike/SpikeScore: Running spike rate over the last rolling_window steps. Only recorded once the rolling window is full.
spike/SpikeScore (total): Cumulative spike rate (total spikes / total steps).
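The spike rule and the cumulative score can be sketched as follows. This is an illustration only: whether the callback uses the population or sample standard deviation, and how it treats the first few steps, are assumptions here. The class name is hypothetical.

```python
from collections import deque
from statistics import mean, pstdev
from typing import Deque


class SpikeDetector:
    """Hypothetical detector: flags values far above the recent running mean."""

    def __init__(self, window_size: int = 128, threshold_std: float = 6.0) -> None:
        self.history: Deque[float] = deque(maxlen=window_size)
        self.threshold_std = threshold_std
        self.total_spikes = 0
        self.total_steps = 0

    def update(self, value: float) -> bool:
        is_spike = False
        # Need at least two past values for a meaningful standard deviation.
        if len(self.history) >= 2:
            mu = mean(self.history)
            sigma = pstdev(self.history)  # population std: an assumption
            is_spike = value > mu + self.threshold_std * sigma
        self.history.append(value)
        self.total_steps += 1
        self.total_spikes += int(is_spike)
        return is_spike

    @property
    def total_spike_score(self) -> float:
        # Cumulative spike rate: total spikes / total steps.
        return self.total_spikes / self.total_steps if self.total_steps else 0.0


detector = SpikeDetector(window_size=8)
flags = [detector.update(v) for v in [2.0, 2.1, 1.9, 2.0, 2.1, 1.9, 2.0, 2.1, 10.0]]
```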

window_size: int = 128¶: Number of recent values to use for computing mean and std for spike detection.

rolling_window: int = 10000¶: Number of recent steps to use for computing running SpikeScore.

threshold_std: float = 6.0¶: Number of standard deviations above the mean to consider a spike.

enabled: bool = True¶: Whether this callback is enabled.

class olmo_core.train.callbacks.WandBCallback(enabled=True, name=None, project=None, entity=None, group=None, tags=None, notes=None, config=None, cancel_tags=<factory>, cancel_check_interval=None, _finalized=False)[source]¶

Bases: Callback

Logs metrics to Weights & Biases from rank 0.

Important

Requires the wandb package and the environment variable WANDB_API_KEY.

Note

This callback logs metrics from every single step to W&B, regardless of the value of Trainer.metrics_collect_interval.

enabled: bool = True¶: Set to false to disable this callback.

name: Optional[str] = None¶: The name to give the W&B run.

project: Optional[str] = None¶: The W&B project to use.

entity: Optional[str] = None¶: The W&B entity to use.

group: Optional[str] = None¶: The W&B group to use.

tags: Optional[List[str]] = None¶: Tags to assign the run.

notes: Optional[str] = None¶: A note/description of the run.

config: Optional[Dict[str, Any]] = None¶: The config to load to W&B.

cancel_tags: Optional[List[str]]¶: If you add any of these tags to a run on W&B, the run will cancel itself. Defaults to ["cancel", "canceled", "cancelled"].

cancel_check_interval: Optional[int] = None¶: Check for cancel tags every this many steps. Defaults to olmo_core.train.Trainer.cancel_check_interval.

class olmo_core.train.callbacks.BeakerCallback(experiment_id=None, update_interval=None, description=None, enabled=None, config=None, result_dir='/results', _url=None, _last_update=None)[source]¶

Bases: Callback

Adds metadata to the Beaker experiment description when running as a Beaker batch job.

priority: ClassVar[int] = -1¶: Priority of the callback. Determines the order in which callbacks run relative to each other. The higher the priority, the earlier a callback runs.

config: Optional[Dict[str, Any]] = None¶: A JSON-serializable config to save to the results dataset as config.json.

result_dir: str = '/results'¶: The directory of the Beaker results dataset where the config and other data will be saved.

class olmo_core.train.callbacks.BatchSizeSchedulerCallback(batch_sizes=<factory>, schedule=<factory>, enabled=True)[source]¶

Bases: Callback

A callback for setting a batch size scheduler over the course of a training run. Also adjusts the base learning rate with Adam optimizers for transformer train modules by a factor of sqrt(new_batch_size / current_batch_size).
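The square-root scaling rule mentioned above amounts to the following arithmetic. The function name is hypothetical; only the factor sqrt(new_batch_size / current_batch_size) comes from the description.

```python
import math


def scaled_learning_rate(
    base_lr: float, current_batch_size: int, new_batch_size: int
) -> float:
    # Scale the base learning rate by sqrt(new_batch_size / current_batch_size).
    return base_lr * math.sqrt(new_batch_size / current_batch_size)


# Quadrupling the batch size doubles the learning rate.
new_lr = scaled_learning_rate(1e-3, current_batch_size=256, new_batch_size=1024)
```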

batch_sizes: List[int]¶: Defines the batch sizes to apply, in order.

schedule: List[Duration]¶: Defines the schedule at which to apply each batch size.

class olmo_core.train.callbacks.MonkeyPatcherCallback[source]¶

Bases: Callback

While looking into performance issues with OLMo3 training, we discovered that DeviceMesh.__getitem__() can become a bottleneck because it gets called very often by FSDP and creates a new sub-mesh object each time. So this callback patches that method to cache the sub-meshes.
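The caching idea can be shown on a toy class. `FakeMesh` below is a stand-in written for this example and is not torch's DeviceMesh. The patch function and the cache attribute name are hypothetical, and the real callback's patch may differ.

```python
from typing import Any, Dict


class FakeMesh:
    """Stand-in whose __getitem__ builds a new object on every call."""

    def __init__(self) -> None:
        self.created = 0

    def __getitem__(self, name: str) -> Any:
        self.created += 1
        return ("submesh", name, self.created)


def patch_getitem_with_cache(cls: type) -> None:
    original = cls.__getitem__

    def cached_getitem(self: Any, name: str) -> Any:
        # Keep one cache per instance, keyed by the requested sub-mesh name.
        cache: Dict[str, Any] = self.__dict__.setdefault("_submesh_cache", {})
        if name not in cache:
            cache[name] = original(self, name)
        return cache[name]

    # Replace the method on the class so every instance uses the cached version.
    cls.__getitem__ = cached_getitem


patch_getitem_with_cache(FakeMesh)
mesh = FakeMesh()
first, second = mesh["dp"], mesh["dp"]
```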

class olmo_core.train.callbacks.MetricSaverCallback(step_metrics_fname='metrics_step{step}.json', final_metrics_fname='metrics.json', metrics_to_capture=None, save_interval=None, fixed_steps=None, enabled=True, _metrics=None, _metrics_step=0)[source]¶

Bases: Callback

A callback that captures the latest metrics on rank 0 and saves to a JSON file in the trainer’s save_folder.

step_metrics_fname: str = 'metrics_step{step}.json'¶: The filename to save the step metrics to, with {step} as a placeholder for the step number.

final_metrics_fname: str = 'metrics.json'¶: The filename to save the final metrics to.

metrics_to_capture: Optional[List[str]] = None¶: An optional list of glob patterns to filter which metrics to capture. If None, all metrics are captured.
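A minimal sketch of glob-based filtering and of the `{step}` placeholder follows. It assumes the patterns behave like the standard library's `fnmatch` globs. The helper name and the sample metric values are made up for this example.

```python
from fnmatch import fnmatchcase
from typing import Any, Dict, List, Optional


def filter_metrics(
    metrics: Dict[str, Any], patterns: Optional[List[str]]
) -> Dict[str, Any]:
    # None means "capture everything".
    if patterns is None:
        return dict(metrics)
    return {
        name: value
        for name, value in metrics.items()
        if any(fnmatchcase(name, pattern) for pattern in patterns)
    }


all_metrics = {"train/CE loss": 2.5, "optim/total grad norm": 0.9, "throughput/tps": 1.0}
captured = filter_metrics(all_metrics, ["train/*", "optim/*"])

# The default step filename uses {step} as a str.format-style placeholder.
step_fname = "metrics_step{step}.json".format(step=500)
```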

save_interval: Optional[int] = None¶: An optional interval (in steps) at which to save the metrics.

fixed_steps: Optional[List[int]] = None¶: An optional list of fixed steps at which to save the metrics.

property metrics: Dict[str, Any] | None¶: The latest metrics recorded.

class olmo_core.train.callbacks.ModelMergeCallback(merge_step=<factory>, merge_interval=None, merge_last_n_steps=500, output_suffix='merged', enabled=False, _accumulators=<factory>, _accumulator_counts=<factory>, _merge_steps=<factory>, _completed_merges=<factory>)[source]¶

Bases: Callback

Averages model weights over the last merge_last_n_steps before each merge_step and saves the result as a merged checkpoint.

Ephemeral checkpoints are blocked during merge windows to ensure the full window is always re-accumulated on resume.
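Averaging weights over a window can be done with a running mean, so that only one extra copy of the weights is held. The sketch below uses plain Python lists in place of tensors. The exact window boundaries (here the last merge_last_n_steps steps up to and including merge_step) are an assumption, and all names are hypothetical.

```python
from typing import Dict, List


def in_merge_window(step: int, merge_step: int, merge_last_n_steps: int = 500) -> bool:
    # Assumed window: the last N steps ending at merge_step (inclusive).
    return merge_step - merge_last_n_steps < step <= merge_step


class RunningAverage:
    """Running mean of named weight vectors (lists stand in for tensors)."""

    def __init__(self) -> None:
        self.avg: Dict[str, List[float]] = {}
        self.count = 0

    def add(self, weights: Dict[str, List[float]]) -> None:
        # Assumes every call provides the same parameter names and shapes.
        self.count += 1
        for name, values in weights.items():
            if name not in self.avg:
                self.avg[name] = list(values)
                continue
            acc = self.avg[name]
            for i, v in enumerate(values):
                # Incremental mean update: avg += (x - avg) / n
                acc[i] += (v - acc[i]) / self.count


accumulator = RunningAverage()
accumulator.add({"w": [1.0, 2.0]})
accumulator.add({"w": [3.0, 4.0]})
```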

Warning

This callback should be enabled with intention and configured with your training schedule in mind. Merge steps should be configured outside of decay phases where possible to ensure the averaged weights reflect a stable training regime.

priority: ClassVar[int] = 2¶: Priority of the callback. Determines the order in which callbacks run relative to each other. The higher the priority, the earlier a callback runs.

merge_step: Union[int, List[int]]¶: The step(s) at which to save merged checkpoint(s).

merge_interval: Optional[int] = None¶: Merge every N steps. Alternative to explicit merge_step.

merge_last_n_steps: int = 500¶: Number of steps before each merge step to start accumulating the average.

output_suffix: str = 'merged'¶: Suffix for merged checkpoint directory.

class olmo_core.train.callbacks.ListCheckpointerCallback(save_interval=1000000000, ephemeral_save_interval=None, pre_train_checkpoint=None, save_async=None, remove='ephemeral_only', ephemeral_cooldown=None, fixed_steps=None, max_checkpoints=3, enabled=True, _latest_checkpoint_step=-1, _latest_checkpoint_path='', _checkpoints=<factory>, _ephemeral_checkpoints=<factory>, _checkpoints_to_remove=<factory>, save_steps=None)[source]¶

Bases: CheckpointerCallback

Save checkpoints only at specific steps provided in a list.

Pass ‘save_steps’ as a sorted list of step numbers (integers) at which to save. All other base behavior (async save, removal) is preserved.

This is useful for saving at predetermined milestones, such as:
- Period boundaries in WSD-S schedules (when LR = 0)
- Specific token budgets
- Other training milestones

Example

save_steps = [100, 500, 1000, 2000] # save at these exact steps

save_interval: int = 1000000000¶: The interval, in steps, with which to save permanent checkpoints.
