launch

An API for launching experiments on various platforms.

Beaker

Launch experiments on Beaker.

class olmo_core.launch.beaker.OLMoCoreBeakerImage(value)[source]

Bases: StrEnum

Official Beaker images that work well for OLMo-core.

You can find the full list at beaker.org/orgs/ai2/workspaces/olmo-core/images, which includes versioned images that are published with each release of the OLMo-core package.

stable = 'tylerr/olmo-core-tch291cu128-2025-11-25'

Built with the latest compatible stable version of PyTorch.

stable_cu130 = 'tylerr/olmo-core-tch291cu130-2025-11-25'

The stable image with CUDA pinned to 13.0.

stable_cu128 = 'tylerr/olmo-core-tch291cu128-2025-11-25'

The stable image with CUDA pinned to 12.8.

tch291_cu129 = 'petew/olmo-core-tch291cu129-2026-01-24'

Built with torch 2.9.1 and CUDA 12.9. Comes with flash-attn 4 (CUTE implementation) and Quack kernels.

To rebuild: make beaker-image TORCH_VERSION=2.9.1 QUACK_VERSION=0.2.4 CUDA_VERSION=12.9.1.

tch291_cu128 = 'petew/olmo-core-tch291cu128-FA4'

Built with torch 2.9.1 and CUDA 12.8. Comes with flash-attn 4 (CUTE implementation).

tch2100_cu128 = 'petew/olmo-core-tch2100cu128-2026-01-23'

Built with torch 2.10.0 and CUDA 12.8.

tch280_cu128 = 'tylerr/olmo-core-tch280cu128-2025-11-25'

Built with torch 2.8.0 and CUDA 12.8.

tch271_cu128 = 'tylerr/olmo-core-tch271cu128-2025-11-25'

Built with torch 2.7.1 and CUDA 12.8.

tch270_cu128 = 'petew/olmo-core-tch270cu128-2025-05-16'

Built with torch 2.7.0 and CUDA 12.8. Battle-tested when training Olmo3 7B and 32B. No TransformerEngine or flash-attention-3.

tch271_cu126 = 'petew/olmo-core-tch271cu126-2025-09-15'

Built with torch 2.7.1 and CUDA 12.6. No TransformerEngine or flash-attention-3.
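One detail worth noticing in the list above is that stable and stable_cu128 carry the same image name. The sketch below uses a local stand-in enum (ImageSketch is a hypothetical name, not part of olmo_core) with three values copied from the list to show how Python enums treat such duplicates as aliases. It assumes the real StrEnum follows standard enum semantics.

```python
from enum import Enum


class ImageSketch(str, Enum):
    """Local stand-in for OLMoCoreBeakerImage, with values copied from the list above."""

    stable = "tylerr/olmo-core-tch291cu128-2025-11-25"
    stable_cu130 = "tylerr/olmo-core-tch291cu130-2025-11-25"
    # Same value as `stable`, so Python's Enum treats this name as an alias.
    stable_cu128 = "tylerr/olmo-core-tch291cu128-2025-11-25"


# An alias resolves to the first member defined with that value.
same_member = ImageSketch.stable_cu128 is ImageSketch.stable

# Iteration skips aliases, so only the two distinct members are listed.
distinct_names = [member.name for member in ImageSketch]

# The plain string value is what a `beaker_image: str` field would receive.
image_name = ImageSketch.stable_cu130.value
```

In practice this means that choosing stable today is equivalent to pinning CUDA 12.8, but only the pinned name is likely to keep that meaning if stable is later repointed.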

class olmo_core.launch.beaker.BeakerLaunchConfig(name, cmd, torchrun=None, budget=None, task_name='train', workspace=None, description=None, beaker_image='tylerr/olmo-core-tch291cu128-2025-11-25', num_nodes=1, num_gpus=8, shared_memory='10GiB', clusters=<factory>, hostnames=None, gpu_types=None, tags=None, shared_filesystem=False, priority='normal', preemptible=True, retries=None, env_vars=<factory>, env_secrets=<factory>, google_credentials_secret=None, aws_config_secret=None, aws_credentials_secret=None, weka_buckets=<factory>, allow_dirty=False, host_networking=None, git=<factory>, result_dir='/results', system_python=True, num_execution_units=None, follow=True, slack_notifications=None, launch_timeout=None, step_timeout=None, step_soft_timeout=None, pre_setup=None, post_setup=None)[source]

Bases: Config

Config for launching experiments on Beaker.
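As a rough orientation, the sketch below shows how such a config might be assembled and dry-run. The field names come from the signature above; every value (experiment name, command, workspace, budget, cluster) is a hypothetical placeholder, and the helper build_launch_config is introduced here only for illustration. The import is deferred so the snippet loads even where olmo_core is not installed.

```python
# Hypothetical placeholder values; substitute your own workspace, budget, and cluster.
launch_kwargs = {
    "name": "my-experiment",
    "cmd": ["train.py", "--steps", "100"],  # placeholder command
    "workspace": "my-org/my-workspace",
    "budget": "my-org/my-budget",
    "clusters": ["my-org/my-cluster"],
    "num_nodes": 2,
    "num_gpus": 8,  # GPUs per node
}

# Total GPUs requested is nodes times GPUs per node.
total_gpus = launch_kwargs["num_nodes"] * launch_kwargs["num_gpus"]


def build_launch_config(kwargs):
    """Construct a BeakerLaunchConfig from keyword arguments (illustrative helper)."""
    # Imported lazily so this sketch can be loaded without olmo_core installed.
    from olmo_core.launch.beaker import BeakerLaunchConfig

    return BeakerLaunchConfig(**kwargs)


# With olmo_core installed and Beaker credentials configured, one might then run:
#   config = build_launch_config(launch_kwargs)
#   config.dry_run()   # inspect the experiment without launching it
#   config.launch()    # actually submit the experiment
```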

name: str

A name to assign the Beaker experiment.

cmd: list[str]

The command to run in the container.

torchrun: Optional[bool] = None

Launch the command with torchrun. Defaults to true if num_gpus > 1 and false otherwise.

budget: Optional[str] = None

The budget group to assign.

task_name: str = 'train'

A name to assign the Beaker tasks created.

workspace: Optional[str] = None

The Beaker workspace to use.

description: Optional[str] = None

A description for the experiment.

beaker_image: str = 'tylerr/olmo-core-tch291cu128-2025-11-25'

The Beaker image to use.

Suitable images can be found at beaker.org/ws/ai2/OLMo-core/images.

num_nodes: int = 1

The number of nodes to use.

num_gpus: int = 8

The number of GPUs to use per node.

shared_memory: str = '10GiB'

The amount of shared memory to use.

clusters: list[str]

The allowed clusters to run on.

hostnames: Optional[list[str]] = None

Manual hostname constraints. Takes priority over clusters and other placement filters.

gpu_types: Optional[list[str]] = None

Cluster GPU type constraints.

tags: Optional[list[str]] = None

Cluster tag constraints.

shared_filesystem: bool = False

Set this to true if the save folder and working directory for each node are part of a global shared filesystem (like Weka or NFS).

priority: str = 'normal'

The job priority.

preemptible: bool = True

Whether the job should be preemptible.

retries: Optional[int] = None

The number of times to retry the experiment if it fails.

env_vars: list[BeakerEnvVar]

Additional env vars to include.

env_secrets: list[BeakerEnvSecret]

Environment variables to add from secrets.

google_credentials_secret: Optional[str] = None

The name of the Beaker secret containing Google credentials JSON, if needed.

aws_config_secret: Optional[str] = None

The name of the Beaker secret containing an AWS config file, if needed.

aws_credentials_secret: Optional[str] = None

The name of the Beaker secret containing an AWS credentials file, if needed.

weka_buckets: list[BeakerWekaBucket]

Weka buckets to attach and where to attach them.

allow_dirty: bool = False

Allow running with uncommitted changes.

host_networking: Optional[bool] = None

Enable host networking.

git: GitRepoState

Git configuration, specifies where to clone your source code from and which commit to check out. If not set, this will be initialized automatically from your working directory.

result_dir: str = '/results'

The directory of the Beaker results dataset.

system_python: bool = True

Use the system Python installation in the Beaker image.

num_execution_units: Optional[int] = None

Number of “execution units”, defaults to 1. An “execution unit” is an abstraction for any node-using entity of which one or more copies are run, where each unit wants its nodes to come from colocated hardware (e.g., a model replica for large jobs, or a full distributed model for small jobs).

For example, when training with HSDP it would make sense to set num_execution_units to the replica degree of the device mesh.
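To make the arithmetic concrete, here is a minimal sketch of how nodes might be divided among execution units. The helper nodes_per_execution_unit is hypothetical, and the requirement that num_nodes divide evenly is an assumption of this sketch, not something the documentation states.

```python
from typing import Optional


def nodes_per_execution_unit(num_nodes: int, num_execution_units: Optional[int] = None) -> int:
    """Return how many colocated nodes each execution unit would need."""
    # The documentation says the number of execution units defaults to 1.
    units = 1 if num_execution_units is None else num_execution_units
    if units < 1:
        raise ValueError("num_execution_units must be at least 1")
    # Assumption of this sketch: nodes are split evenly across units.
    if num_nodes % units != 0:
        raise ValueError("num_nodes must be divisible by num_execution_units")
    return num_nodes // units


# HSDP with a replica degree of 2 on 8 nodes: each replica wants 4 colocated nodes.
hsdp_nodes_per_replica = nodes_per_execution_unit(num_nodes=8, num_execution_units=2)
```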

follow: bool = True

Follow the experiment logs locally after launching.

slack_notifications: Optional[bool] = None

Get Slack notifications for experiment status updates when following logs. Defaults to true if follow is true and the env var SLACK_WEBHOOK_URL is set.
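The default described above can be sketched as a small resolution function. This is an illustration of the documented rule, not the library's actual code; resolve_slack_notifications is a hypothetical name, and treating "set" as "present in the environment" is an assumption.

```python
import os
from typing import Mapping, Optional


def resolve_slack_notifications(
    follow: bool,
    slack_notifications: Optional[bool] = None,
    env: Optional[Mapping[str, str]] = None,
) -> bool:
    """Resolve the effective Slack setting from an explicit value or the documented default."""
    # An explicit True/False always wins over the default.
    if slack_notifications is not None:
        return slack_notifications
    env = os.environ if env is None else env
    # Documented default: on only when following logs and the webhook variable is set.
    return follow and "SLACK_WEBHOOK_URL" in env


enabled = resolve_slack_notifications(True, env={"SLACK_WEBHOOK_URL": "https://example.invalid/hook"})
disabled = resolve_slack_notifications(True, env={})
```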

launch_timeout: Optional[int] = None

A timeout in seconds to wait for the job to start after submission. If the job doesn’t start in time, a timeout error will be raised.

step_timeout: Optional[int] = None

A timeout in seconds to wait for new steps (and new logs) when follow=True. If no new logs are detected in time, a timeout error will be raised.

step_soft_timeout: Optional[int] = None

A soft timeout in seconds to wait for new steps (and new logs) when follow=True. If no new logs are detected in time, a warning will be issued.
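The difference between the hard and soft step timeouts can be illustrated with a small check over the time elapsed since the last log line. This is a sketch of the described behavior only; check_step_timeouts is a hypothetical helper, and how the library actually measures elapsed time or reports the warning is not specified here.

```python
import warnings
from typing import Optional


def check_step_timeouts(
    seconds_since_last_log: float,
    step_timeout: Optional[int] = None,
    step_soft_timeout: Optional[int] = None,
) -> str:
    """Raise on the hard timeout, warn on the soft timeout, otherwise report "ok"."""
    # Hard timeout: no new logs for too long is treated as an error.
    if step_timeout is not None and seconds_since_last_log > step_timeout:
        raise TimeoutError(f"no new logs for {seconds_since_last_log:.0f}s")
    # Soft timeout: only a warning is issued and following continues.
    if step_soft_timeout is not None and seconds_since_last_log > step_soft_timeout:
        warnings.warn(f"no new logs for {seconds_since_last_log:.0f}s")
        return "warned"
    return "ok"


status = check_step_timeouts(5, step_timeout=600, step_soft_timeout=300)
```

Setting the soft timeout lower than the hard one gives an early signal that a run may be stalled before the hard timeout raises.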

pre_setup: Optional[str] = None

A command to run before the setup steps.

post_setup: Optional[str] = None

A command to run after the setup steps.

property default_env_vars: list[tuple[str, str]]

Default env vars to add to the experiment.

dry_run(follow=None, slack_notifications=None, launch_timeout=None, step_timeout=None, step_soft_timeout=None, torchrun=None)[source]

Do a dry-run without actually launching the experiment. Arguments are the same as launch().

Return type:

None

launch(follow=None, slack_notifications=None, launch_timeout=None, step_timeout=None, step_soft_timeout=None, torchrun=None)[source]

Launch a Beaker experiment using this config.

Parameters:
  • follow (Optional[bool], default: None) – Stream the logs and follow the experiment until completion.

  • slack_notifications (Optional[bool], default: None) – If follow=True, send Slack notifications when the run launches, fails, or succeeds. This requires the env var SLACK_WEBHOOK_URL.

  • launch_timeout (Optional[int], default: None) – A timeout in seconds to wait for the job to start after submitting it. If the job doesn’t start in time, a timeout error will be raised.

  • step_timeout (Optional[int], default: None) – A timeout in seconds to wait for new steps (and new logs) when follow=True. If no new logs are detected in time, a timeout error will be raised.

  • step_soft_timeout (Optional[int], default: None) – A soft timeout in seconds to wait for new steps (and new logs) when follow=True. If no new logs are detected in time, a warning will be issued.

  • torchrun (Optional[bool], default: None) – Launch the target command with torchrun. This will default to True if num_gpus > 1 and False otherwise.

Return type:

Workload

Returns:

The Beaker workload.
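The torchrun default described in the parameter list can be written as a one-line rule. The function below is a sketch of that documented rule under a hypothetical name (resolve_torchrun); it is not taken from the library's source.

```python
from typing import Optional


def resolve_torchrun(num_gpus: int, torchrun: Optional[bool] = None) -> bool:
    """Return the effective torchrun setting for a given GPU count."""
    # An explicit True/False is respected as given.
    if torchrun is not None:
        return torchrun
    # Documented default: use torchrun only when there is more than one GPU.
    return num_gpus > 1


multi_gpu_default = resolve_torchrun(num_gpus=8)
single_gpu_default = resolve_torchrun(num_gpus=1)
```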

class olmo_core.launch.beaker.BeakerEnvVar(name, value)[source]

Bases: Config

class olmo_core.launch.beaker.BeakerEnvSecret(name, secret, required=True)[source]

Bases: Config

class olmo_core.launch.beaker.BeakerWekaBucket(bucket, mount)[source]

Bases: Config
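The three helper classes above are documented only by their constructor signatures. The sketch below mirrors those signatures with local stand-in dataclasses (the Sketch-suffixed names are hypothetical) to show the shape of the lists that the env_vars, env_secrets, and weka_buckets fields expect. Reading name as the environment variable and secret as the Beaker secret name is an assumption, as are all of the values.

```python
from dataclasses import dataclass


@dataclass
class EnvVarSketch:
    """Stand-in mirroring BeakerEnvVar(name, value)."""

    name: str
    value: str


@dataclass
class EnvSecretSketch:
    """Stand-in mirroring BeakerEnvSecret(name, secret, required=True)."""

    name: str
    secret: str
    required: bool = True


@dataclass
class WekaBucketSketch:
    """Stand-in mirroring BeakerWekaBucket(bucket, mount)."""

    bucket: str
    mount: str


# Hypothetical values showing how each list might be populated.
env_vars = [EnvVarSketch(name="LOG_LEVEL", value="info")]
env_secrets = [EnvSecretSketch(name="WANDB_API_KEY", secret="MY_WANDB_API_KEY")]
weka_buckets = [WekaBucketSketch(bucket="my-bucket", mount="/weka/my-bucket")]
```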

olmo_core.launch.beaker.is_running_in_beaker()[source]

Check if the current process is running inside of a Beaker job (batch or session).

Return type:

bool

olmo_core.launch.beaker.is_running_in_beaker_batch_job()[source]

Check if the current process is running inside a Beaker batch job (as opposed to a session).

Return type:

bool
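A typical use of these checks is to branch on the runtime environment. The sketch below wraps is_running_in_beaker() so that it falls back to False when olmo_core is not importable, then picks a save folder accordingly. Both helper names and both paths are hypothetical; "/results" simply mirrors the default result_dir documented above.

```python
def running_in_beaker() -> bool:
    """Return the library's answer, or False when olmo_core is not importable."""
    try:
        from olmo_core.launch.beaker import is_running_in_beaker
    except ImportError:
        return False
    return is_running_in_beaker()


def pick_save_folder(in_beaker: bool) -> str:
    """Choose a checkpoint folder based on where the process runs."""
    # Both paths are hypothetical; "/results" mirrors the default result_dir.
    return "/results/checkpoints" if in_beaker else "./checkpoints"


save_folder = pick_save_folder(running_in_beaker())
```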