launch
An API for launching experiments on various platforms.
Beaker
Launch experiments on Beaker.
- class olmo_core.launch.beaker.OLMoCoreBeakerImage(value)
Bases: StrEnum
Official Beaker images that work well for OLMo-core.
You can find the full list at beaker.org/orgs/ai2/workspaces/olmo-core/images, which includes versioned images that are published with each release of the OLMo-core package.
- stable = 'tylerr/olmo-core-tch291cu128-2025-11-25'
Built with the latest compatible stable version of PyTorch.
- stable_cu130 = 'tylerr/olmo-core-tch291cu130-2025-11-25'
The stable image with CUDA pinned to 13.0.
- stable_cu128 = 'tylerr/olmo-core-tch291cu128-2025-11-25'
The stable image with CUDA pinned to 12.8.
- tch291_cu129 = 'petew/olmo-core-tch291cu129-2026-01-24'
Built with torch 2.9.1 and CUDA 12.9. Comes with flash-attn 4 (CUTE implementation) and Quack kernels.
To rebuild, run: make beaker-image TORCH_VERSION=2.9.1 QUACK_VERSION=0.2.4 CUDA_VERSION=12.9.1
- tch291_cu128 = 'petew/olmo-core-tch291cu128-FA4'
Built with torch 2.9.1 and CUDA 12.8. Comes with flash-attn 4 (CUTE implementation).
- tch2100_cu128 = 'petew/olmo-core-tch2100cu128-2026-01-23'
Built with torch 2.10.0 and CUDA 12.8.
- tch280_cu128 = 'tylerr/olmo-core-tch280cu128-2025-11-25'
Built with torch 2.8.0 and CUDA 12.8.
- tch271_cu128 = 'tylerr/olmo-core-tch271cu128-2025-11-25'
Built with torch 2.7.1 and CUDA 12.8.
- tch270_cu128 = 'petew/olmo-core-tch270cu128-2025-05-16'
Built with torch 2.7.0 and CUDA 12.8. Battle-tested when training Olmo3 7B and 32B. No TransformerEngine or flash-attention-3.
- tch271_cu126 = 'petew/olmo-core-tch271cu126-2025-09-15'
Built with torch 2.7.1 and CUDA 12.6. No TransformerEngine or flash-attention-3.
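Note that stable and stable_cu128 carry the same image name. The sketch below uses a stand-in class (ImageSketch is hypothetical, not part of olmo-core) to show how a standard Python str/Enum mix-in treats such duplicates as aliases. It assumes that olmo-core's StrEnum behaves like the standard library enum in this respect, which the text above does not confirm.

```python
from enum import Enum


class ImageSketch(str, Enum):
    """Stand-in for OLMoCoreBeakerImage with a subset of its members (illustrative only)."""

    stable = "tylerr/olmo-core-tch291cu128-2025-11-25"
    stable_cu130 = "tylerr/olmo-core-tch291cu130-2025-11-25"
    # Same value as `stable`, so the enum machinery treats this name as an alias.
    stable_cu128 = "tylerr/olmo-core-tch291cu128-2025-11-25"


# Aliases resolve to the first member defined with that value.
same_member = ImageSketch.stable_cu128 is ImageSketch.stable

# Iteration skips aliases, so only the distinct images are listed.
distinct_names = [image.name for image in ImageSketch]

# Because members are also strings, they compare equal to the raw image name.
matches_raw_string = ImageSketch.stable == "tylerr/olmo-core-tch291cu128-2025-11-25"
```

Under that assumption, passing either name selects the same image, and a plain string with the image name can be compared against a member directly.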
- class olmo_core.launch.beaker.BeakerLaunchConfig(name, cmd, torchrun=None, budget=None, task_name='train', workspace=None, description=None, beaker_image='tylerr/olmo-core-tch291cu128-2025-11-25', num_nodes=1, num_gpus=8, shared_memory='10GiB', clusters=<factory>, hostnames=None, gpu_types=None, tags=None, shared_filesystem=False, priority='normal', preemptible=True, retries=None, env_vars=<factory>, env_secrets=<factory>, google_credentials_secret=None, aws_config_secret=None, aws_credentials_secret=None, weka_buckets=<factory>, allow_dirty=False, host_networking=None, git=<factory>, result_dir='/results', system_python=True, num_execution_units=None, follow=True, slack_notifications=None, launch_timeout=None, step_timeout=None, step_soft_timeout=None, pre_setup=None, post_setup=None)
Bases: Config
Config for launching experiments on Beaker.
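As a rough illustration of how the pieces fit together, the sketch below collects keyword arguments that mirror the documented signature and defers the olmo-core import until the config is actually built. Every value is a placeholder: the experiment name, workspace, and cluster are hypothetical, and passing cmd as a list of arguments is an assumption that the signature alone does not confirm.

```python
# Keyword arguments mirroring the documented signature. Every value is a placeholder.
launch_kwargs = dict(
    name="my-experiment",            # hypothetical experiment name
    cmd=["python", "train.py"],      # assumption: cmd is a list of arguments
    workspace="ai2/my-workspace",    # hypothetical workspace
    clusters=["ai2/my-cluster"],     # hypothetical cluster name
    num_nodes=1,
    num_gpus=8,
    shared_filesystem=False,
)


def build_config():
    """Build the config lazily so this module imports even without olmo-core installed."""
    from olmo_core.launch.beaker import BeakerLaunchConfig

    return BeakerLaunchConfig(**launch_kwargs)


def preview_then_launch():
    """Dry-run first, then submit and follow the logs. Deliberately not called here."""
    config = build_config()
    config.dry_run()
    return config.launch(follow=True)
```

Running dry_run() before launch() is a cautious habit rather than a requirement; both methods accept the same arguments.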
- torchrun: Optional[bool] = None
Launch the command with torchrun. Defaults to true for multi-GPU jobs.
- beaker_image: str = 'tylerr/olmo-core-tch291cu128-2025-11-25'
The Beaker image to use. Suitable images can be found at beaker.org/ws/ai2/OLMo-core/images.
- shared_memory = '10GiB'
The amount of shared memory to use.
- hostnames: Optional[list[str]] = None
Manual hostname constraints. Takes priority over clusters and other placement filters.
- shared_filesystem = False
Set this to true if the save folder and working directory for each node are part of a global shared filesystem (like Weka or NFS).
- env_vars: list[BeakerEnvVar]
Additional env vars to include.
- env_secrets: list[BeakerEnvSecret]
Environment variables to add from secrets.
- google_credentials_secret: Optional[str] = None
Name of the Beaker secret containing Google credentials JSON, if needed.
- aws_config_secret: Optional[str] = None
The name of the Beaker secret containing an AWS config file, if needed.
- aws_credentials_secret: Optional[str] = None
The name of the Beaker secret containing an AWS credentials file, if needed.
- weka_buckets: list[BeakerWekaBucket]
Weka buckets to attach and where to attach them.
- git: GitRepoState
Git configuration. Specifies where to clone your source code from and which commit to check out. If not set, this will be initialized automatically from your working directory.
- num_execution_units: Optional[int] = None
Number of “execution units”, defaults to 1. An “execution unit” is an abstraction for any node-using entity of which one or more copies are run, where each unit wants its nodes to come from colocated hardware (e.g., a model replica for large jobs, or a full distributed model for small jobs).
For example, when training with HSDP it would make sense to set num_execution_units to the replica degree of the device mesh.
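To make the HSDP example concrete, here is a minimal sketch of how nodes could be divided among execution units. The helper nodes_per_execution_unit is hypothetical and only illustrates the arithmetic; it assumes the node count divides evenly by the number of units and says nothing about how Beaker actually assigns hardware.

```python
def nodes_per_execution_unit(num_nodes: int, num_execution_units: int = 1) -> list[list[int]]:
    """Split node indices into equally sized, contiguous groups, one per execution unit."""
    if num_execution_units < 1:
        raise ValueError("num_execution_units must be at least 1")
    if num_nodes % num_execution_units != 0:
        # Assumption of this sketch: units are all the same size.
        raise ValueError("num_nodes must be divisible by num_execution_units")
    size = num_nodes // num_execution_units
    return [list(range(start, start + size)) for start in range(0, num_nodes, size)]


# HSDP with a replica degree of 2 on 8 nodes: two units of four nodes each.
groups = nodes_per_execution_unit(num_nodes=8, num_execution_units=2)
```

In this picture each inner list is one model replica, and the nodes within a list are the ones that would benefit from being colocated.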
- slack_notifications: Optional[bool] = None
Get Slack notifications for experiment status updates when following logs. Defaults to true if follow is true and the env var SLACK_WEBHOOK_URL is set.
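The default described for slack_notifications can be written out as a small function. This is a sketch of the documented rule only: resolve_slack_notifications is a hypothetical helper, and treating "set" as "present in the environment" is an assumption.

```python
import os
from typing import Mapping, Optional


def resolve_slack_notifications(
    slack_notifications: Optional[bool],
    follow: bool,
    env: Mapping[str, str] = os.environ,
) -> bool:
    """Hypothetical helper: an explicit value wins, otherwise apply the documented default."""
    if slack_notifications is not None:
        return slack_notifications
    # Documented default: on only when following logs and the webhook variable is set.
    return follow and "SLACK_WEBHOOK_URL" in env
```

Passing the environment as an argument keeps the rule easy to check without touching the real process environment.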
- launch_timeout: Optional[int] = None
A timeout in seconds to wait for the job to start after submission. If the job doesn’t start in time, a timeout error will be raised.
- step_timeout: Optional[int] = None
A timeout in seconds to wait for new steps (and new logs) when follow=True. If no new logs are detected in time, a timeout error will be raised.
- step_soft_timeout: Optional[int] = None
A soft timeout in seconds to wait for new steps (and new logs) when follow=True. If no new logs are detected in time, a warning will be issued.
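The difference between the hard and soft step timeouts can be sketched as follows. The function check_step_timeouts is hypothetical and is not how olmo-core implements log following; it only illustrates that exceeding step_timeout raises an error while exceeding step_soft_timeout merely warns.

```python
import warnings
from typing import Optional


def check_step_timeouts(
    seconds_since_last_log: float,
    step_timeout: Optional[int] = None,
    step_soft_timeout: Optional[int] = None,
) -> str:
    """Hypothetical check: raise on the hard timeout, warn on the soft one."""
    if step_timeout is not None and seconds_since_last_log > step_timeout:
        raise TimeoutError(f"No new logs for {seconds_since_last_log:.0f}s")
    if step_soft_timeout is not None and seconds_since_last_log > step_soft_timeout:
        warnings.warn(f"No new logs for {seconds_since_last_log:.0f}s")
        return "warned"
    return "ok"
```

A soft timeout shorter than the hard one gives an early signal that a run may be stalled before the follower gives up.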
- dry_run(follow=None, slack_notifications=None, launch_timeout=None, step_timeout=None, step_soft_timeout=None, torchrun=None)
Do a dry-run without actually launching the experiment. Arguments are the same as launch().
- Return type:
- launch(follow=None, slack_notifications=None, launch_timeout=None, step_timeout=None, step_soft_timeout=None, torchrun=None)
Launch a Beaker experiment using this config.
- Parameters:
  - follow (Optional[bool], default: None) – Stream the logs and follow the experiment until completion.
  - slack_notifications (Optional[bool], default: None) – If follow=True, send Slack notifications when the run launches, fails, or succeeds. This requires the env var SLACK_WEBHOOK_URL.
  - launch_timeout (Optional[int], default: None) – A timeout in seconds to wait for the job to start after submitting it. If the job doesn’t start in time, a timeout error will be raised.
  - step_timeout (Optional[int], default: None) – A timeout in seconds to wait for new steps (and new logs) when follow=True. If no new logs are detected in time, a timeout error will be raised.
  - step_soft_timeout (Optional[int], default: None) – A soft timeout in seconds to wait for new steps (and new logs) when follow=True. If no new logs are detected in time, a warning will be issued.
  - torchrun (Optional[bool], default: None) – Launch the target command with torchrun. This will default to True if num_gpus > 1 and False otherwise.
- Return type: Workload
- Returns: The Beaker workload.
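The torchrun default stated in the parameter list amounts to a one-line rule. The helper below, resolve_torchrun, is a hypothetical name used only to spell that rule out; it is not part of the olmo-core API.

```python
from typing import Optional


def resolve_torchrun(torchrun: Optional[bool], num_gpus: int) -> bool:
    """Hypothetical helper: keep an explicit choice, otherwise use torchrun for multi-GPU jobs."""
    if torchrun is not None:
        return torchrun
    # Documented default: True if num_gpus > 1, False otherwise.
    return num_gpus > 1
```

With the default of num_gpus=8 this resolves to True, so a single-GPU job is the main case where leaving torchrun unset changes behavior.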
-
torchrun: