`distributed.checkpoint`

A high-level distributed checkpointing module with a unified API for saving and loading both local and remote checkpoints.

Features

Save with one distributed topology, seamlessly load with a different one. For example, with FSDP/FSDP2 you can save/load checkpoints with different world sizes or sharding strategies.
Save/load directly to/from a remote object store like S3 or GCS. When loading from a remote object store each rank only downloads the fraction of the data it needs for its local (potentially sharded) tensors.
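To make the two features above concrete, here is a minimal, library-independent sketch of why a rank only needs part of a checkpoint when the topology changes. It assumes a flat tensor split into even contiguous chunks per rank; `shard_range` and `reads_needed` are hypothetical helpers written for this illustration and are not part of `olmo_core`, whose actual sharding layout may differ.

```python
from typing import List, Tuple


def shard_range(numel: int, world_size: int, rank: int) -> Tuple[int, int]:
    """Half-open [start, end) slice of a flat tensor owned by `rank`.

    Assumes even contiguous chunking (an assumption of this sketch).
    """
    chunk = -(-numel // world_size)  # ceiling division
    start = min(rank * chunk, numel)
    end = min(start + chunk, numel)
    return start, end


def reads_needed(
    numel: int, saved_world_size: int, load_world_size: int, load_rank: int
) -> List[Tuple[int, int, int]]:
    """Return (saved_rank, start, end) pieces that `load_rank` has to read."""
    want_start, want_end = shard_range(numel, load_world_size, load_rank)
    pieces = []
    for saved_rank in range(saved_world_size):
        have_start, have_end = shard_range(numel, saved_world_size, saved_rank)
        # Intersect the slice this rank wants with the slice that was saved.
        lo, hi = max(want_start, have_start), min(want_end, have_end)
        if lo < hi:
            pieces.append((saved_rank, lo, hi))
    return pieces


# A 12-element tensor saved by 4 ranks and loaded by 3 ranks:
# rank 1 of 3 wants [4, 8), which overlaps the shards saved by ranks 1 and 2.
print(reads_needed(12, 4, 3, 1))  # [(1, 4, 6), (2, 6, 8)]
```

In this toy layout the loading rank touches only two of the four saved shards, and only part of each, which is the idea behind downloading just the needed fraction from a remote store.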

Overview

Use save_model_and_optim_state() to write a checkpoint with your model and optimizer’s state, then use load_model_and_optim_state() to load the checkpoint in-place.

You can unshard a checkpoint saved this way with unshard_checkpoint().
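The save/load round trip described above might look like the following sketch. The helper name `checkpoint_round_trip` is hypothetical, and the positional argument order of `load_model_and_optim_state()` is assumed to mirror that of `save_model_and_optim_state()`; check the API reference before relying on it. The sketch also assumes `olmo_core` is installed and that `torch.distributed` is initialized when running on more than one rank.

```python
from typing import Any


def checkpoint_round_trip(checkpoint_dir: str, model: Any, optim: Any) -> None:
    """Save model and optimizer state, then load it back in place."""
    # Imported lazily so this sketch can be defined without olmo_core present.
    from olmo_core.distributed.checkpoint import (
        load_model_and_optim_state,
        save_model_and_optim_state,
    )

    # `checkpoint_dir` may be a local path or a URL to a remote object store.
    save_model_and_optim_state(checkpoint_dir, model, optim, save_overwrite=True)

    # Loading is in-place: `model` and `optim` are updated, nothing is returned.
    # Assumption: the (dir, model, optim) order mirrors the save function.
    load_model_and_optim_state(checkpoint_dir, model, optim)
```

Every rank in the process group would call this function with the same `checkpoint_dir`, since saving and loading rely on distributed collectives.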

API Reference

olmo_core.distributed.checkpoint.save_state_dict(dir, state_dict, *, process_group=None, save_overwrite=False, thread_count=None, process_count=None, throttle_uploads=False, enable_plan_caching=False, _skip_prepare=False)

Save an arbitrary state dictionary to a distributed format that can be loaded again with a different distributed topology.

Important

Please use save_model_and_optim_state() to save model/optimizer state dicts instead unless you know what you’re doing.

Parameters:

dir (Union[Path, PathLike, str]) – Path/URL to save to.
state_dict (Dict[str, Any]) – The state dict to save.
process_group (Optional[ProcessGroup], default: None) – The process group to use for distributed collectives.
save_overwrite (bool, default: False) – Overwrite existing files.
thread_count (Optional[int], default: None) – Set this to override the number of threads used while writing data.
process_count (Optional[int], default: None) – Set this to use a process pool instead of a thread pool when possible (currently not compatible with throttle_uploads).
throttle_uploads (bool, default: False) – If this is set to True and dir is a URL then only one rank from each node will upload data at a time.

olmo_core.distributed.checkpoint.async_save_state_dict(dir, state_dict, *, process_group=None, save_overwrite=False, thread_count=None, process_count=None, throttle_uploads=False, enable_plan_caching=False, _skip_prepare=False)

An async version of save_state_dict().

This function first de-stages the state dict to CPU memory, then writes it in a separate thread.

Return type: Future[None]
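The stage-then-write pattern behind a `Future[None]` return value can be illustrated with the standard library alone. This is a simplified sketch, not the implementation of `async_save_state_dict()`: `async_save_sketch` is a hypothetical name, the snapshot is a plain `copy.deepcopy` rather than a device-to-CPU copy, and the data is written as JSON to a local file.

```python
import copy
import json
import os
import tempfile
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Dict

# A single worker thread keeps writes ordered (a choice made for this sketch).
_executor = ThreadPoolExecutor(max_workers=1)


def async_save_sketch(path: str, state_dict: Dict[str, Any]) -> "Future[None]":
    # Stage: snapshot the state now, so later mutations by the caller
    # do not leak into the file that is written in the background.
    staged = copy.deepcopy(state_dict)

    def _write() -> None:
        tmp_path = path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(staged, f)
        os.replace(tmp_path, path)  # publish the finished file in one step

    # Write: hand the staged copy to a background thread.
    return _executor.submit(_write)


state = {"step": 1, "weights": [0.1, 0.2]}
path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
future = async_save_sketch(path, state)
state["step"] = 2  # training can continue; the snapshot is unaffected
future.result()  # block until the write finishes; re-raises any write error
```

Calling `result()` on the returned future before starting the next save is a simple way to surface write errors and avoid overlapping saves.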

olmo_core.distributed.checkpoint.load_state_dict(dir, state_dict, *, process_group=None, pre_download=False, work_dir=None, thread_count=None)

Load an arbitrary state dict in-place from a checkpoint saved with save_state_dict().

Parameters:

dir (Union[Path, PathLike, str]) – Path/URL to the checkpoint saved via save_state_dict().
state_dict (Dict[str, Any]) – The state dict to load the state into.
process_group (Optional[ProcessGroup], default: None) – The process group to use for distributed collectives.
thread_count (Optional[int], default: None) – Set the number of threads used for certain operations.
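As a rough usage sketch of the two functions documented above, the helper below saves one state dict and loads it into another. The name `state_dict_round_trip` and the `thread_count=4` value are illustrative choices only. It assumes `olmo_core` is installed, that `target` already holds tensors of the right shapes to be filled in place, and that `torch.distributed` is initialized when running on more than one rank.

```python
from typing import Any, Dict


def state_dict_round_trip(
    checkpoint_dir: str, source: Dict[str, Any], target: Dict[str, Any]
) -> None:
    """Save `source` to `checkpoint_dir`, then load it into `target` in place."""
    # Imported lazily so this sketch can be defined without olmo_core present.
    from olmo_core.distributed.checkpoint import load_state_dict, save_state_dict

    # Keyword arguments are taken from the signatures documented above.
    save_state_dict(checkpoint_dir, source, save_overwrite=True, thread_count=4)

    # load_state_dict() returns nothing; it fills `target` with the saved values.
    load_state_dict(checkpoint_dir, target, thread_count=4)
```

Because loading is in-place, `target` can be sharded differently from `source`, which is how a checkpoint is reloaded under a different distributed topology.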

olmo_core.distributed.checkpoint.save_model_and_optim_state(dir, model, optim=None, *, process_group=None, save_overwrite=False, flatten_optimizer_state=False, thread_count=None, process_count=None, throttle_uploads=False, enable_plan_caching=False)

Save model and optimizer state dictionaries. The model can be sharded, in which case this function handles the optimizer state so that it can be loaded again with a different distributed topology through load_model_and_optim_state().
