`launch`¶

An API for launching experiments on various platforms.

Beaker¶

Launch experiments on Beaker.

class olmo_core.launch.beaker.OLMoCoreBeakerImage(value)[source]¶

Bases: StrEnum

Official Beaker images that work well for OLMo-core.

You can find the full list at beaker.org/orgs/ai2/workspaces/olmo-core/images, which includes versioned images that are published with each release of the OLMo-core package.

stable = 'tylerr/olmo-core-tch291cu128-2025-11-25'¶: Built with the latest compatible stable version of PyTorch.

stable_cu130 = 'tylerr/olmo-core-tch291cu130-2025-11-25'¶: The stable image with CUDA pinned to 13.0.

stable_cu128 = 'tylerr/olmo-core-tch291cu128-2025-11-25'¶: The stable image with CUDA pinned to 12.8.

tch291_cu129 = 'petew/olmo-core-tch291cu129-2026-01-24'¶

Built with torch 2.9.1 and CUDA 12.9. Comes with flash-attn 4 (CUTE implementation) and Quack kernels.

To rebuild: make beaker-image TORCH_VERSION=2.9.1 QUACK_VERSION=0.2.4 CUDA_VERSION=12.9.1.

tch291_cu128 = 'petew/olmo-core-tch291cu128-FA4'¶: Built with torch 2.9.1 and CUDA 12.8. Comes with flash-attn 4 (CUTE implementation).

tch2100_cu128 = 'petew/olmo-core-tch2100cu128-2026-01-23'¶: Built with torch 2.10.0 and CUDA 12.8.

tch280_cu128 = 'tylerr/olmo-core-tch280cu128-2025-11-25'¶: Built with torch 2.8.0 and CUDA 12.8.

tch271_cu128 = 'tylerr/olmo-core-tch271cu128-2025-11-25'¶: Built with torch 2.7.1 and CUDA 12.8.

tch270_cu128 = 'petew/olmo-core-tch270cu128-2025-05-16'¶: Built with torch 2.7.0 and CUDA 12.8. Battle tested when training Olmo3 7B and 32B. No TransformerEngine or flash-attention-3.

tch271_cu126 = 'petew/olmo-core-tch271cu126-2025-09-15'¶: Built with torch 2.7.1 and CUDA 12.6. No TransformerEngine or flash-attention-3.
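Because the class derives from StrEnum, each member can be used wherever an image name string is expected. The sketch below is a stand-in built on the standard library that copies three of the values listed above. It is not the library's class, and `BeakerImageSketch` is a hypothetical name. It also shows that `stable` and `stable_cu128` currently carry the same value, which in a standard Python enum makes the second an alias of the first.

```python
from enum import Enum


class BeakerImageSketch(str, Enum):
    """Illustrative stand-in mirroring three of the documented members."""

    stable = "tylerr/olmo-core-tch291cu128-2025-11-25"
    stable_cu130 = "tylerr/olmo-core-tch291cu130-2025-11-25"
    stable_cu128 = "tylerr/olmo-core-tch291cu128-2025-11-25"


# The raw image name is available through .value.
image_name = BeakerImageSketch.stable_cu130.value

# Two names with the same value resolve to a single member (an alias).
is_alias = BeakerImageSketch.stable_cu128 is BeakerImageSketch.stable

# Looking a member up by its value returns the first name defined for it.
looked_up = BeakerImageSketch("tylerr/olmo-core-tch291cu128-2025-11-25")
```

The real enum's base class may differ in details such as string formatting, so prefer `.value` when the plain image name is needed.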

class olmo_core.launch.beaker.BeakerLaunchConfig(name, cmd, torchrun=None, budget=None, task_name='train', workspace=None, description=None, beaker_image='tylerr/olmo-core-tch291cu128-2025-11-25', num_nodes=1, num_gpus=8, shared_memory='10GiB', clusters=<factory>, hostnames=None, gpu_types=None, tags=None, shared_filesystem=False, priority='normal', preemptible=True, retries=None, env_vars=<factory>, env_secrets=<factory>, google_credentials_secret=None, aws_config_secret=None, aws_credentials_secret=None, weka_buckets=<factory>, allow_dirty=False, host_networking=None, git=<factory>, result_dir='/results', system_python=True, num_execution_units=None, follow=True, slack_notifications=None, launch_timeout=None, step_timeout=None, step_soft_timeout=None, pre_setup=None, post_setup=None)[source]¶

Bases: Config

Config for launching experiments on Beaker.
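As a rough illustration of how the constructor arguments fit together, the sketch below collects a set of keyword arguments and builds the config lazily. The field names come from the signature above. Every value (script path, budget, workspace, cluster) is a placeholder, and `build_config` and `total_gpus` are hypothetical helpers, not part of the library.

```python
# Field names follow the documented BeakerLaunchConfig signature.
# All values are placeholders, not real Beaker resources.
launch_kwargs = dict(
    name="my-experiment",
    cmd=["src/scripts/train.py", "--config", "my_config.yaml"],  # hypothetical command
    budget="my-org/my-budget",        # placeholder budget group
    workspace="my-org/my-workspace",  # placeholder workspace
    clusters=["my-org/my-cluster"],   # placeholder cluster
    num_nodes=2,
    num_gpus=8,                       # GPUs per node
    shared_filesystem=True,
    priority="normal",
    preemptible=True,
)


def total_gpus(kwargs: dict) -> int:
    """Total GPU count: nodes times GPUs per node."""
    return kwargs["num_nodes"] * kwargs["num_gpus"]


def build_config(**overrides):
    """Build the config. Requires olmo_core, so the import is done lazily."""
    from olmo_core.launch.beaker import BeakerLaunchConfig

    return BeakerLaunchConfig(**{**launch_kwargs, **overrides})
```

With olmo_core installed and Beaker access configured, `build_config().dry_run()` would preview the experiment and `build_config().launch()` would submit it.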

name: str¶: A name to assign the Beaker experiment.

cmd: list[str]¶: The command to run in the container.

torchrun: Optional[bool] = None¶: Launch the command with torchrun. Defaults to true for multi-GPU jobs.

budget: Optional[str] = None¶: The budget group to assign.

task_name: str = 'train'¶: A name to assign the Beaker tasks created.

workspace: Optional[str] = None¶: The Beaker workspace to use.

description: Optional[str] = None¶: A description for the experiment.

beaker_image: str = 'tylerr/olmo-core-tch291cu128-2025-11-25'¶

The Beaker image to use.

Suitable images can be found at beaker.org/ws/ai2/OLMo-core/images.

num_nodes: int = 1¶: The number of nodes to use.

num_gpus: int = 8¶: The number of GPUs to use per node.

shared_memory: str = '10GiB'¶: The amount of shared memory to use.

clusters: list[str]¶: The allowed clusters to run on.

hostnames: Optional[list[str]] = None¶: Manual hostname constraints. Takes priority over clusters and other placement filters.

gpu_types: Optional[list[str]] = None¶: Cluster GPU type constraints.

tags: Optional[list[str]] = None¶: Cluster tag constraints.

shared_filesystem: bool = False¶: Set this to true if the save folder and working directory for each node are part of a global shared filesystem (like Weka or NFS).

priority: str = 'normal'¶: The job priority.

preemptible: bool = True¶: Whether the job should be preemptible.

retries: Optional[int] = None¶: The number of times to retry the experiment if it fails.

env_vars: list[BeakerEnvVar]¶: Additional env vars to include.

env_secrets: list[BeakerEnvSecret]¶: Environment variables to add from secrets.

google_credentials_secret: Optional[str] = None¶: Name of the Beaker secret containing Google credentials JSON, if needed.

aws_config_secret: Optional[str] = None¶: The name of the Beaker secret containing an AWS config file, if needed.

aws_credentials_secret: Optional[str] = None¶: The name of the Beaker secret containing an AWS credentials file, if needed.

weka_buckets: list[BeakerWekaBucket]¶: Weka buckets to attach and where to attach them.

allow_dirty: bool = False¶: Allow running with uncommitted changes.

host_networking: Optional[bool] = None¶: Enable host-networking.

git: GitRepoState¶: Git configuration, specifies where to clone your source code from and which commit to check out. If not set, this will be initialized automatically from your working directory.

result_dir: str = '/results'¶: The directory of the Beaker results dataset.

system_python: bool = True¶: Use the system Python installation in the Beaker image.

num_execution_units: Optional[int] = None¶

Number of “execution units”. Defaults to 1. An “execution unit” is an abstraction for any node-using entity of which one or more copies are run, where each unit wants its nodes to come from colocated hardware (e.g., a model replica for large jobs, or a full distributed model for small jobs).

For example, when training with HSDP it would make sense to set num_execution_units to the replica degree of the device mesh.
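To make the relationship between `num_nodes` and `num_execution_units` concrete, here is a minimal sketch that splits node indices into equally sized units. It assumes the nodes divide evenly and that each unit takes a contiguous block of indices. Neither assumption is confirmed by this page, and both helper functions are hypothetical.

```python
from typing import List, Optional


def nodes_per_execution_unit(num_nodes: int, num_execution_units: Optional[int] = None) -> int:
    """Nodes in each unit, treating None as the documented default of 1 unit."""
    units = 1 if num_execution_units is None else num_execution_units
    if units < 1:
        raise ValueError("num_execution_units must be at least 1")
    if num_nodes % units != 0:
        # Assumption: this sketch only handles even splits.
        raise ValueError("num_nodes must be divisible by num_execution_units")
    return num_nodes // units


def group_nodes(num_nodes: int, num_execution_units: Optional[int] = None) -> List[List[int]]:
    """Assign contiguous node indices to each execution unit."""
    size = nodes_per_execution_unit(num_nodes, num_execution_units)
    return [list(range(start, start + size)) for start in range(0, num_nodes, size)]
```

For an HSDP run on 8 nodes with a replica degree of 4, `group_nodes(8, 4)` gives four units of two nodes each, and each unit would ideally be placed on colocated hardware.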

follow: bool = True¶: Follow the experiment logs locally after launching.

slack_notifications: Optional[bool] = None¶: Get Slack notifications for experiment status updates when following logs. Defaults to true if follow is true and the env var SLACK_WEBHOOK_URL is set.

launch_timeout: Optional[int] = None¶: A timeout in seconds to wait for the job to start after submission. If the job doesn’t start in time, a timeout error will be raised.

step_timeout: Optional[int] = None¶: A timeout in seconds to wait for new steps (and new logs) when follow=True. If no new logs are detected in time, a timeout error will be raised.

step_soft_timeout: Optional[int] = None¶: A soft timeout in seconds to wait for new steps (and new logs) when follow=True. If no new logs are detected in time, a warning will be issued.

pre_setup: Optional[str] = None¶: A command to run before the setup steps.

post_setup: Optional[str] = None¶: A command to run after the setup steps.

property default_env_vars: list[tuple[str, str]]¶: Default env vars to add to the experiment.

dry_run(follow=None, slack_notifications=None, launch_timeout=None, step_timeout=None, step_soft_timeout=None, torchrun=None)[source]¶

Do a dry-run without actually launching the experiment. Arguments are the same as launch().

Return type: None

launch(follow=None, slack_notifications=None, launch_timeout=None, step_timeout=None, step_soft_timeout=None, torchrun=None)[source]¶

Launch a Beaker experiment using this config.

Parameters:

follow (Optional[bool], default: None) – Stream the logs and follow the experiment until completion.
slack_notifications (Optional[bool], default: None) – If follow=True, send Slack notifications when the run launches, fails, or succeeds. This requires the env var SLACK_WEBHOOK_URL.
launch_timeout (Optional[int], default: None) – A timeout in seconds to wait for the job to start after submitting it. If the job doesn’t start in time, a timeout error will be raised.
step_timeout (Optional[int], default: None) – A timeout in seconds to wait for new steps (and new logs) when follow=True. If no new logs are detected in time, a timeout error will be raised.
step_soft_timeout (Optional[int], default: None) – A soft timeout in seconds to wait for new steps (and new logs) when follow=True. If no new logs are detected in time, a warning will be issued.
torchrun (Optional[bool], default: None) – Launch the target command with torchrun. This will default to True if num_gpus > 1 and False otherwise.

Return type: Workload

Returns: The Beaker workload.
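The `None` defaults for `torchrun` and `slack_notifications` are resolved at launch time. The sketch below restates the documented rules as two small functions. It is not the library's implementation: the helper names are hypothetical, and treating an empty SLACK_WEBHOOK_URL as "not set" is an assumption.

```python
import os
from typing import Mapping, Optional


def resolve_torchrun(torchrun: Optional[bool], num_gpus: int) -> bool:
    """An explicit value wins; otherwise use torchrun only when num_gpus > 1."""
    if torchrun is not None:
        return torchrun
    return num_gpus > 1


def resolve_slack_notifications(
    slack_notifications: Optional[bool],
    follow: bool,
    env: Optional[Mapping[str, str]] = None,
) -> bool:
    """An explicit value wins; otherwise require follow and a webhook URL."""
    if slack_notifications is not None:
        return slack_notifications
    env = os.environ if env is None else env
    # Assumption: an empty value counts as unset.
    return follow and bool(env.get("SLACK_WEBHOOK_URL"))
```

Passing the environment as an argument keeps the second function easy to test without touching the real process environment.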

class olmo_core.launch.beaker.BeakerEnvVar(name, value)[source]¶: Bases: Config

class olmo_core.launch.beaker.BeakerEnvSecret(name, secret, required=True)[source]¶: Bases: Config

class olmo_core.launch.beaker.BeakerWekaBucket(bucket, mount)[source]¶: Bases: Config
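These three classes supply the `env_vars`, `env_secrets`, and `weka_buckets` lists on `BeakerLaunchConfig`. The sketch below builds one of each using only the documented constructor signatures. It assumes that `name` is the environment variable to set and `secret` is the Beaker secret it is read from, which this page does not spell out. When olmo_core is not installed, simple dataclass stand-ins are used so the snippet still runs. All names and paths are placeholders.

```python
from dataclasses import dataclass

try:
    from olmo_core.launch.beaker import BeakerEnvSecret, BeakerEnvVar, BeakerWekaBucket
except ImportError:
    # Stand-ins that mirror only the documented constructor signatures.
    @dataclass
    class BeakerEnvVar:
        name: str
        value: str

    @dataclass
    class BeakerEnvSecret:
        name: str
        secret: str
        required: bool = True

    @dataclass
    class BeakerWekaBucket:
        bucket: str
        mount: str


# A plain environment variable.
env_vars = [BeakerEnvVar(name="NCCL_DEBUG", value="WARN")]

# An environment variable filled from a Beaker secret (placeholder secret name).
env_secrets = [BeakerEnvSecret(name="WANDB_API_KEY", secret="MY_WANDB_API_KEY")]

# A Weka bucket and the path to attach it at (placeholder bucket and path).
weka_buckets = [BeakerWekaBucket(bucket="my-bucket", mount="/weka/my-bucket")]
```

The resulting lists can be passed to `BeakerLaunchConfig` through the fields of the same names.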

olmo_core.launch.beaker.is_running_in_beaker()[source]¶

Check if the current process is running inside of a Beaker job (batch or session).

Return type: bool

olmo_core.launch.beaker.is_running_in_beaker_batch_job()[source]¶

Check if the current process is running inside a Beaker batch job (as opposed to a session).

Return type: bool
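A common use for these checks is to branch on where a script is running. The sketch below picks a save folder under the results directory when inside Beaker and a temporary directory otherwise. `default_save_folder` is a hypothetical helper, and the fallback `is_running_in_beaker` exists only so the snippet runs without olmo_core. It does not reproduce the library's detection logic.

```python
import os
import tempfile

try:
    from olmo_core.launch.beaker import is_running_in_beaker
except ImportError:
    def is_running_in_beaker() -> bool:
        # Stand-in used only when olmo_core is not installed.
        return False


def default_save_folder(run_name: str, result_dir: str = "/results") -> str:
    """Write under result_dir inside Beaker, else under the system temp dir."""
    if is_running_in_beaker():
        return os.path.join(result_dir, run_name)
    return os.path.join(tempfile.gettempdir(), run_name)
```

The `/results` default matches the documented default of `result_dir` on `BeakerLaunchConfig`.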
