nn.transformer

class olmo_core.nn.transformer.TransformerType(value)[source]

Bases: StrEnum

An enumeration of transformer implementations.

default = 'default'

➡️ Transformer

normalized = 'normalized'

➡️ NormalizedTransformer (nGPT)

moe = 'moe'

➡️ MoETransformer
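
Because the members are string-valued, a plain string such as 'moe' (for example, one read from a config file) can be converted to the matching member. The sketch below mirrors the three documented values using only the standard library. The class here is a stand-in for illustration, not the library's own StrEnum-based class.

```python
from enum import Enum


class TransformerType(str, Enum):
    """Stand-in mirroring the documented values (not the olmo_core class)."""

    default = "default"
    normalized = "normalized"
    moe = "moe"


# Look a member up by its string value, e.g. one read from a config file.
kind = TransformerType("moe")
print(kind.name, kind.value)
```

An unknown string raises ValueError, which tends to surface config typos early.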

class olmo_core.nn.transformer.TransformerConfig(d_model, vocab_size, n_layers, block, lm_head, embedding_norm=None, name='default', dtype='float32', init_method='normal', init_seed=0, init_std=0.02, embedding_init_std=None, freeze_params=None, block_pattern=None, block_overrides=None, embed_scale=None)[source]

Bases: ModelConfig

A config for easily building transformer models.

Parameters:

name (TransformerType, default: 'default') – The name of the implementation.

See Transformer for a description of the other parameters.

build(*, init_device='cpu')[source]

Build the model corresponding to this config.

Parameters:

init_device (str, default: 'cpu') – The device to put the parameters on during initialization. In a distributed setting it usually makes sense to set this to “meta”.

Return type:

Transformer

property num_params: int

The total number of parameters that a model from this config would have.

property num_active_params: int

The total number of active parameters that a model from this config would have.

property num_non_embedding_params: int

The number of parameters excluding embedding parameters.

property num_active_non_embedding_params: int

The number of active parameters excluding embedding parameters.

classmethod olmo2_100M(vocab_size, **kwargs)[source]

A 100M OLMo2 model config.

Return type:

TransformerConfig

classmethod olmo2_1B(vocab_size, **kwargs)[source]

A 1B OLMo2 model config.

This is different from the OLMo 1B from the old OLMo trainer.

Return type:

TransformerConfig

classmethod olmo2_1B_v2(vocab_size, **kwargs)[source]

A 1B OLMo2 model config.

This matches the OLMo 1B from the old OLMo trainer.

Return type:

TransformerConfig

classmethod olmo2_3B(vocab_size, **kwargs)[source]

A 3B OLMo2 model config.

Return type:

TransformerConfig

classmethod olmo2_7B(vocab_size, **kwargs)[source]

A 7B OLMo2 model config.

Return type:

TransformerConfig

classmethod olmo2_13B(vocab_size, **kwargs)[source]

A 13B OLMo2 model config.

Return type:

TransformerConfig

classmethod olmo2_32B(vocab_size, **kwargs)[source]

A 32B OLMo2 model config.

Return type:

TransformerConfig

classmethod olmo3_100M(vocab_size, **kwargs)[source]

A 100M OLMo3 model config.

Return type:

TransformerConfig

classmethod olmo3_190M(vocab_size, **kwargs)[source]

A 190M OLMo3 model config.

Return type:

TransformerConfig

classmethod olmo3_370M(vocab_size, **kwargs)[source]

A 370M OLMo3 model config.

Return type:

TransformerConfig

classmethod olmo3_600M(vocab_size, **kwargs)[source]

A 600M OLMo3 model config.

Return type:

TransformerConfig

classmethod olmo3_760M(vocab_size, **kwargs)[source]

A 760M OLMo3 model config.

Return type:

TransformerConfig

classmethod olmo3_1B(vocab_size, **kwargs)[source]

A 1B OLMo3 model config.

Return type:

TransformerConfig

classmethod olmo3_3B(vocab_size, **kwargs)[source]

A 3B OLMo3 model config.

Return type:

TransformerConfig

classmethod olmo3_7B(vocab_size, **kwargs)[source]

A 7B OLMo3 model config.

Return type:

TransformerConfig

classmethod olmo3_13B(vocab_size, **kwargs)[source]

A 13B OLMo3 model config.

Return type:

TransformerConfig

classmethod olmo3_32B(vocab_size, **kwargs)[source]

A 32B OLMo3 model config.

Return type:

TransformerConfig

classmethod ngpt_271M(vocab_size, **kwargs)[source]

A 271M nGPT model config.

Return type:

TransformerConfig

classmethod ngpt_1B(vocab_size, **kwargs)[source]

A 1B nGPT model config.

Return type:

TransformerConfig

classmethod llama2_271M(vocab_size, **kwargs)[source]

A 271M Llama2-like model config.

Return type:

TransformerConfig

classmethod llama2_1B(vocab_size, **kwargs)[source]

A 1B Llama2-like model config.

Note: Llama2 doesn’t have a 1B. We made this up.

Return type:

TransformerConfig

classmethod llama2_7B(vocab_size, **kwargs)[source]

A 7B Llama2-like model config.

Return type:

TransformerConfig

classmethod llama2_13B(vocab_size, **kwargs)[source]

A 13B Llama2-like model config.

Return type:

TransformerConfig

classmethod llama2_26B(vocab_size, **kwargs)[source]

A 26B Llama2-like model config.

Return type:

TransformerConfig

classmethod llama2_70B(vocab_size, **kwargs)[source]

A 70B Llama2-like model config.

Return type:

TransformerConfig

classmethod llama3_1B(vocab_size, **kwargs)[source]

A 1B Llama3-like model config.

Return type:

TransformerConfig

classmethod llama3_8B(vocab_size, **kwargs)[source]

An 8B Llama3-like model config.

Return type:

TransformerConfig

classmethod llama3_70B(vocab_size, **kwargs)[source]

A 70B Llama3-like model config.

Return type:

TransformerConfig

classmethod llama3_405B(vocab_size, **kwargs)[source]

A 405B Llama3-like model config.

Return type:

TransformerConfig

classmethod gemma3_1B(vocab_size=262208, **kwargs)[source]

Gemma 3 1B model config.

Return type:

TransformerConfig

classmethod gemma3_4B(vocab_size=262208, **kwargs)[source]

Gemma 3 4B model config.

Return type:

TransformerConfig

classmethod gemma3_12B(vocab_size=262208, **kwargs)[source]

Gemma 3 12B model config.

Return type:

TransformerConfig

classmethod gemma3_27B(vocab_size=262208, **kwargs)[source]

Gemma 3 27B model config.

Return type:

TransformerConfig

classmethod llama_like(*, d_model, vocab_size, n_layers, n_heads, n_kv_heads=None, head_dim=None, gate=None, qk_norm=False, use_head_qk_norm=False, layer_norm_eps=1e-05, layer_norm_name=None, rope_theta=500000, rope_type=None, rope_full_precision=True, no_global_rope=False, hidden_size_multiple_of=256, hidden_size_multiplier=None, fused_ops=False, use_flash=None, attn_backend=None, sliding_window=None, block_name='default', block_mods=None, dtype='float32', rope_scaling=None, feed_forward=None, feed_forward_moe=None, **kwargs)[source]

Create a Llama-like model configuration.

Parameters:
  • hidden_size_multiple_of (int, default: 256) – Ensure the FFN hidden size is a multiple of this value.

  • hidden_size_multiplier (Optional[float], default: None) – Custom multiplier for the FFN hidden size.

  • fused_ops (bool, default: False) – Use fused operations where possible.

  • layer_norm_name (Optional[LayerNormType], default: None) – Override the layer norm implementation. Defaults to LayerNormType.fused_rms when fused_ops=True, otherwise LayerNormType.rms.

  • block_mods (Optional[Dict[int, Callable[[TransformerBlockConfig], TransformerBlockConfig]]], default: None) – A dictionary of block indices to functions that take the base block config and return a modified block config.

  • dtype (DType, default: 'float32') – The default data type to use for all parameters.

Return type:

TransformerConfig
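
To make hidden_size_multiple_of and hidden_size_multiplier concrete, here is a minimal sketch of the rounding they describe. It assumes the base FFN size is 8/3 × d_model, as in the Llama reference code; whether llama_like() computes it exactly this way is an assumption, and ffn_hidden_size is a hypothetical helper name.

```python
def ffn_hidden_size(d_model, multiple_of=256, multiplier=None):
    """Round an FFN hidden size up to a multiple of `multiple_of`.

    Assumption: the base size is 8/3 * d_model, as in the Llama reference code.
    """
    hidden = int(8 * d_model / 3)
    if multiplier is not None:
        hidden = int(multiplier * hidden)
    # Ceiling division, then scale back up to the nearest multiple.
    return multiple_of * ((hidden + multiple_of - 1) // multiple_of)


print(ffn_hidden_size(4096))
print(ffn_hidden_size(4096, multiple_of=1024, multiplier=1.3))
```

Rounding up rather than down keeps the hidden size at least as large as the requested target.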

classmethod ngpt_like(*, d_model, vocab_size, n_layers, n_heads, n_kv_heads=None, qk_norm=True, rope_theta=500000, hidden_size_multiple_of=256, hidden_size_multiplier=None, use_flash=False, dtype='float32', **kwargs)[source]

Create an nGPT-like model configuration.

Return type:

TransformerConfig

classmethod gemma3_like(*, d_model, vocab_size, n_layers, n_heads, n_kv_heads, hidden_size, head_dim=None, gate=None, activation='gelu_tanh', local_window_size=1024, local_rope_theta=10000, global_rope_theta=1000000, global_layer_interval=6, layer_norm_eps=1e-06, fused_ops=False, use_flash=None, attn_backend=None, dtype='float32', **kwargs)[source]

Create a Gemma 3-like model configuration.

Gemma 3 features:
  • Hybrid local/global attention: 5 local layers with sliding window, then 1 global layer

  • Dual RoPE frequencies: local layers use 10K, global layers use 1M

  • QK-norm for attention score stabilization

  • GeGLU activation (GELU with tanh approximation)

Parameters:
  • local_window_size (int, default: 1024) – Sliding window size for local attention layers.

  • local_rope_theta (int, default: 10000) – RoPE base frequency for local attention layers.

  • global_rope_theta (int, default: 1000000) – RoPE base frequency for global attention layers.

  • global_layer_interval (int, default: 6) – Number of layers per pattern cycle (default 6 = 5 local + 1 global).

Return type:

TransformerConfig
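
The local/global layer pattern and the dual RoPE frequencies can be sketched in plain Python. The helper names below are hypothetical, and the sketch assumes the global layer is the last one in each cycle of global_layer_interval layers, which matches the "5 local, then 1 global" description for the default interval of 6.

```python
def gemma3_layer_kinds(n_layers, global_layer_interval=6):
    """Label each layer 'local' or 'global'.

    Assumption: the global layer is the last one in each cycle, matching
    "5 local layers, then 1 global layer" for the default interval of 6.
    """
    return [
        "global" if (idx + 1) % global_layer_interval == 0 else "local"
        for idx in range(n_layers)
    ]


def rope_theta_for(kind, local_rope_theta=10_000, global_rope_theta=1_000_000):
    """Pick the RoPE base frequency for a layer kind."""
    return global_rope_theta if kind == "global" else local_rope_theta


kinds = gemma3_layer_kinds(12)
print(kinds)
print([rope_theta_for(k) for k in kinds])
```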

with_rope_scaling(rope_scaling, full_attn_layers_only=True)[source]

Return a copy of this config with the given RoPE scaling scheme applied.

Return type:

TransformerConfig

class olmo_core.nn.transformer.Transformer(*, d_model, vocab_size, n_layers, block, lm_head, embedding_norm=None, dtype=torch.float32, init_method='normal', init_device='cpu', init_seed=0, init_std=0.02, embedding_init_std=None, block_overrides=None, block_pattern=None, embed_scale=None)[source]

Bases: Module

A typical “Llama-style” transformer implementation.

Parameters:
  • d_model (int) – The model dimensionality.

  • vocab_size (int) – The vocab size.

  • n_layers (int) – The number of transformer layers/blocks.

  • block (TransformerBlockConfig | dict[str, TransformerBlockConfig]) – The block configuration. Can be a single block config or a dict of named blocks.

  • layer_norm – The layer norm config for the final layer norm.

  • bias – Whether to use a bias in the final linear layer.

  • dtype (dtype, default: torch.float32) – The datatype to use for the linear output layer.

  • init_device (str, default: 'cpu') – The device used when initializing parameters.

  • init_seed (int, default: 0) – The seed used when initializing parameters.

  • init_std (float, default: 0.02) – The standard deviation used when initializing parameters.

  • embedding_init_std (Optional[float], default: None) – The standard deviation used when initializing the embeddings.

  • block_overrides (Optional[Dict[int, TransformerBlockConfig]], default: None) – Overrides for specific blocks. Not supported if block is a dict of named blocks.

  • block_pattern (Optional[List[str]], default: None) – The pattern of blocks to use. Required if block is a dict of named blocks.

  • embed_scale (Optional[float], default: None) – The scale factor for the embeddings.
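
As a rough illustration of block_overrides, the sketch below picks one config per block index and substitutes an override where one is given. It assumes an override replaces the base config wholesale at that index. resolve_block_configs is a hypothetical helper, and plain dicts stand in for TransformerBlockConfig objects.

```python
def resolve_block_configs(n_layers, block, block_overrides=None):
    """Return one config per block index, applying per-index overrides."""
    overrides = block_overrides or {}
    return [overrides.get(idx, block) for idx in range(n_layers)]


# Plain dicts stand in for TransformerBlockConfig objects here.
base = {"name": "default"}
special = {"name": "reordered_norm"}
configs = resolve_block_configs(4, base, {2: special})
print([cfg["name"] for cfg in configs])
```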

get_rope_buffers(seq_len, device=None)[source]

Get the RoPE buffers to pass to each layer.

Return type:

Dict[int, Optional[RoPEBuffers]]

init_weights(*, max_seq_len=None, max_local_microbatch_size=None, device=None, world_mesh=None, model_part_idx=0)[source]

Initialize the model weights.

Parameters:
  • max_seq_len (Optional[int], default: None) – The maximum sequence length expected. This is used to warm up the RoPE cache.

  • max_local_microbatch_size (Optional[int], default: None) – The maximum local (rank) micro-batch size (in tokens) expected. This is used to warm up some MoE cache.

  • device (Optional[device], default: None) – The device the local copy of the model will be trained on.

  • model_part_idx (int, default: 0) – The local index of this model part on the current rank. With interleaved pipeline schedules a single rank can own multiple model chunks, and each must receive a distinct seed; otherwise their parameters would be identical.

Return type:

Generator

forward(input_ids, *, labels=None, ignore_index=-100, loss_reduction='mean', z_loss_multiplier=None, loss_div_factor=None, return_logits=None, logits_to_keep=0, **kwargs)[source]

Run the transformer on the token input IDs.

Parameters:
  • input_ids (Tensor) – The token input IDs, shape (batch_size, seq_len).

  • labels (Optional[Tensor], default: None) – The token labels, shape (batch_size, seq_len).

  • ignore_index (int, default: -100) – The index to ignore in the loss computation. Default is -100.

  • loss_reduction (Literal['mean', 'sum', 'none'], default: 'mean') – The reduction method for the loss. Can be “mean”, “sum”, or “none”.

  • z_loss_multiplier (Optional[float], default: None) – Optional multiplier for the z-loss regularization term.

  • loss_div_factor (Union[Tensor, float, None], default: None) – Optional divisor for the loss, can be a scalar or tensor.

  • return_logits (Optional[bool], default: None) – Whether to return logits along with the loss when labels are provided.

  • logits_to_keep (Union[int, Tensor], default: 0) – Number of positions to keep from the end of the sequence (if int), or tensor specifying which positions to keep. Default is 0 (keep all).

Return type:

Union[Tensor, LMOutputWithLoss]

Returns:

The logits if labels is None or the losses if labels is not None.
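
The logits_to_keep semantics described above can be sketched on a plain list standing in for the sequence dimension. keep_positions is a hypothetical helper, and treating the tensor form as a list of explicit indices is an assumption made for illustration.

```python
def keep_positions(positions, logits_to_keep=0):
    """Select sequence positions the way `logits_to_keep` is described.

    `positions` is a plain list standing in for the sequence dimension.
    """
    if isinstance(logits_to_keep, int):
        if logits_to_keep == 0:
            return list(positions)  # 0 means "keep all"
        return list(positions[-logits_to_keep:])  # last N positions
    # Otherwise treat it as explicit indices into the sequence (assumption).
    return [positions[i] for i in logits_to_keep]


print(keep_positions([10, 11, 12, 13]))
print(keep_positions([10, 11, 12, 13], 1))
print(keep_positions([10, 11, 12, 13], [0, 2]))
```

Keeping only the last position is the usual choice during generation, where only the next-token logits are needed.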

apply_fp8(float8_config)[source]

Use an FP8 recipe on most linear layers.

apply_pp(pp_mesh)[source]

Prepare the model for pipeline parallelism after it’s been split into stages.

apply_tp(tp_mesh, float8_enabled=None)[source]

Apply tensor parallelism to the model.

Parameters:
  • loss_parallel – Set to True if parallelizing the loss function as well.

  • float8_enabled (Optional[bool], default: None) – Set this to True if training with float8 linear layers.

apply_cp(cp_mesh, ring=None, uly=None)[source]

Prepare the model for context-parallelism (CP).

apply_activation_checkpointing(mode, block_interval=None, modules=None, activation_memory_budget=None)[source]

Apply activation checkpointing to the model.

Parameters:
  • mode (TransformerActivationCheckpointingMode) – Determines how to apply activation checkpointing.

  • block_interval (Optional[int], default: None) – Required when mode is “selected_blocks”. Determines which blocks are wrapped.

  • modules (Optional[List[str]], default: None) – Required when mode is “selected_modules”. A list of module names to wrap for activation checkpointing. Globs are supported.

  • activation_memory_budget (Optional[float], default: None) – The memory budget for activation checkpointing in the range [0, 1]. 0 corresponds to the memory usage when recomputing all activations, and 1 corresponds to the memory usage when recomputing no activations (which is the default). Requires compilation to be enabled.
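
To show how glob patterns in modules might select submodules, here is a small sketch using the standard library's fnmatch. Whether the library matches names in exactly this way is an assumption, and the module names below are hypothetical.

```python
from fnmatch import fnmatchcase


def select_modules(module_names, patterns):
    """Return the module names matching any glob pattern."""
    return [
        name
        for name in module_names
        if any(fnmatchcase(name, pattern) for pattern in patterns)
    ]


# Hypothetical module names, for illustration only.
names = [
    "blocks.0.attention",
    "blocks.0.feed_forward",
    "blocks.1.attention",
    "lm_head",
]
print(select_modules(names, ["blocks.*.feed_forward", "lm_head"]))
```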

apply_compile()[source]

Apply torch.compile() to each transformer block, which makes compilation efficient due to repeated structure.

Warning

This must be called after apply_activation_checkpointing() but before apply_fsdp() or apply_ddp().
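
The ordering constraint in the warning above can be sketched as a small helper that calls the apply_* methods in sequence. prepare_model and the recording stub are hypothetical. The stub only stands in for a model so that the call order can be inspected without a distributed setup.

```python
def prepare_model(model, *, ac_mode=None, block_interval=None,
                  compile_model=False, dp_mesh=None):
    """Apply optional wrappers in the order the warnings call for."""
    if ac_mode is not None:
        # Activation checkpointing comes first.
        model.apply_activation_checkpointing(ac_mode, block_interval=block_interval)
    if compile_model:
        # Compile after checkpointing, but before FSDP/DDP.
        model.apply_compile()
    # Data-parallel wrapping goes last.
    model.apply_fsdp(dp_mesh=dp_mesh)
    return model


class RecordingModel:
    """Stub that records which methods were called, in order."""

    def __init__(self):
        self.calls = []

    def apply_activation_checkpointing(self, mode, block_interval=None):
        self.calls.append("activation_checkpointing")

    def apply_compile(self):
        self.calls.append("compile")

    def apply_fsdp(self, dp_mesh=None):
        self.calls.append("fsdp")


stub = prepare_model(
    RecordingModel(), ac_mode="selected_blocks", block_interval=2, compile_model=True
)
print(stub.calls)
```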

apply_fsdp(dp_mesh=None, param_dtype=None, reduce_dtype=torch.float32, pp_enabled=False, prefetch_factor=0, wrapping_strategy='full')[source]

Apply FSDP(2) to the model.

Warning

This should generally be called last if using any other parallelism strategies or optimizations like apply_compile().

Parameters:
  • dp_mesh (Optional[DeviceMesh], default: None) – The model data parallel device mesh.

  • param_dtype (Optional[dtype], default: None) – The data type to materialize params in. Defaults to the current param dtype.

  • reduce_dtype (dtype, default: torch.float32) – The data type for gradient reduction.

  • pp_enabled (default: False) – Whether pipeline parallelism is also enabled.

  • prefetch_factor (default: 0) – For tuning the prefetch settings. 0 is the default, and higher values result in more aggressive prefetching.

  • wrapping_strategy (default: 'full') – The wrapping strategy.

apply_ddp(dp_mesh=None, param_dtype=None, compile_enabled=False, autograd_compile_enabled=False)[source]

Apply DDP to the model.

num_flops_per_token(seq_len)[source]

Returns the idealized number of flops per token for the given sequence length. Purposefully does not account for wasted flops due to padding, recomputation, etc.

Return type:

int
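
For intuition about what an idealized per-token estimate looks like, the sketch below uses the widely cited approximation of 6 FLOPs per parameter plus an attention term that grows with sequence length. This is an assumption for illustration. The library's exact accounting may differ, and approx_flops_per_token is a hypothetical helper.

```python
def approx_flops_per_token(num_params, n_layers, d_model, seq_len):
    """Common estimate: 6 FLOPs per parameter plus an attention term.

    Assumption: this is the widely used 6*N + 12*L*d*s approximation; the
    library's exact accounting may differ.
    """
    return 6 * num_params + 12 * n_layers * d_model * seq_len


print(approx_flops_per_token(1_000_000, n_layers=2, d_model=64, seq_len=128))
```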

post_batch(dry_run=False)[source]

Should be called right after the final backward of a complete batch but before the optimizer step.

post_optim_step()[source]

Should be called right after an optimizer step.

class olmo_core.nn.transformer.NormalizedTransformer(*, d_model, vocab_size, n_layers, block, lm_head, dtype=torch.float32, init_method='normalized', init_device='cpu', init_seed=0, init_std=0.02, embedding_init_std=None, block_overrides=None, block_pattern=None)[source]

Bases: Transformer

An nGPT transformer implementation, to be used with the NormalizedTransformerBlock block type.

Warning

This is a beta feature! The API is subject to change even with minor and patch releases. If you choose to use this feature please read the CHANGELOG before upgrading your version of this library.

init_weights(*args, **kwargs)[source]

Initialize the model weights.

Parameters:
  • max_seq_len – The maximum sequence length expected. This is used to warm up the RoPE cache.

  • max_local_microbatch_size – The maximum local (rank) micro-batch size (in tokens) expected. This is used to warm up some MoE cache.

  • device – The device the local copy of the model will be trained on.

  • model_part_idx – The local index of this model part on the current rank. With interleaved pipeline schedules a single rank can own multiple model chunks, and each must receive a distinct seed; otherwise their parameters would be identical.

Return type:

Generator

normalize_matrices()[source]

Normalize the weights in all matrices. This should be called after each optimizer step, which the TransformerTrainModule will handle for you.
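
As a rough picture of what normalizing a weight matrix means in nGPT-style training, the sketch below rescales each row of a matrix to unit L2 norm. Which dimension the library normalizes along is an assumption here, and l2_normalize_rows is a hypothetical helper operating on plain lists.

```python
import math


def l2_normalize_rows(matrix):
    """Scale each row of `matrix` (a list of lists) to unit L2 norm."""
    normalized = []
    for row in matrix:
        norm = math.sqrt(sum(v * v for v in row))
        # Leave all-zero rows untouched to avoid dividing by zero.
        normalized.append([v / norm for v in row] if norm > 0 else list(row))
    return normalized


weights = [[3.0, 4.0], [0.0, 2.0]]
print(l2_normalize_rows(weights))
```

Repeating this after every optimizer step keeps the weights on the unit hypersphere, which is why the call belongs right after the update.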

apply_tp(tp_mesh, float8_enabled=None)[source]

Apply tensor parallelism to the model.

Parameters:
  • loss_parallel – Set to True if parallelizing the loss function as well.

  • float8_enabled (Optional[bool], default: None) – Set this to True if training with float8 linear layers.

apply_compile()[source]

Apply torch.compile() to each transformer block, which makes compilation efficient due to repeated structure.

Warning

This must be called after apply_activation_checkpointing() but before apply_fsdp() or apply_ddp().

post_optim_step()[source]

Should be called right after an optimizer step.

class olmo_core.nn.transformer.MoETransformer(*, d_model, vocab_size, n_layers, block, lm_head, embedding_norm=None, dtype=torch.float32, init_method='normal', init_device='cpu', init_seed=0, init_std=0.02, embedding_init_std=None, block_overrides=None, block_pattern=None, embed_scale=None)[source]

Bases: Transformer

An MoE transformer implementation, to be used with one of the MoETransformerBlock block types.

Warning

This is a beta feature! The API is subject to change even with minor and patch releases. If you choose to use this feature please read the CHANGELOG before upgrading your version of this library.

post_batch(dry_run=False)[source]

Should be called right after the final backward of a complete batch but before the optimizer step.

class olmo_core.nn.transformer.MoEHybridTransformerBlockBase(*, d_model, n_layers, sequence_mixer, layer_norm, feed_forward, init_device='cpu', **kwargs)[source]

Bases: MoETransformerBlock

Warning

This is a beta feature! The API is subject to change even with minor and patch releases. If you choose to use this feature please read the CHANGELOG before upgrading your version of this library.

forward(x, *, loss_div_factor=None, **kwargs)[source]

Run the block on the input x.

Parameters:

x (Tensor) – The input of shape (batch_size, seq_len, d_model).

Return type:

Tensor

class olmo_core.nn.transformer.MoEHybridTransformerBlock(*, d_model, n_layers, sequence_mixer, layer_norm, feed_forward, init_device='cpu', **kwargs)[source]

Bases: MoEHybridTransformerBlockBase

Warning

This is a beta feature! The API is subject to change even with minor and patch releases. If you choose to use this feature please read the CHANGELOG before upgrading your version of this library.

class olmo_core.nn.transformer.MoEHybridReorderedNormTransformerBlock(*, d_model, n_layers, sequence_mixer, layer_norm, feed_forward, init_device='cpu', **kwargs)[source]

Bases: MoEHybridTransformerBlockBase

Warning

This is a beta feature! The API is subject to change even with minor and patch releases. If you choose to use this feature please read the CHANGELOG before upgrading your version of this library.

class olmo_core.nn.transformer.TransformerBlockType(value)[source]

Bases: StrEnum

An enumeration of the different transformer block implementations.

default = 'default'

➡️ TransformerBlock

default_scaled = 'default_scaled'

➡️ LayerNormScaledTransformerBlock (applies LayerNorm Scaling)

reordered_norm = 'reordered_norm'

➡️ ReorderedNormTransformerBlock

peri_norm = 'peri_norm'

➡️ PeriNormTransformerBlock

normalized = 'normalized'

➡️ NormalizedTransformerBlock

moe = 'moe'

➡️ MoETransformerBlock

moe_reordered_norm = 'moe_reordered_norm'

➡️ MoEReorderedNormTransformerBlock

moe_hybrid = 'moe_hybrid'

➡️ MoEHybridTransformerBlock

moe_hybrid_reordered_norm = 'moe_hybrid_reordered_norm'

➡️ MoEHybridReorderedNormTransformerBlock

class olmo_core.nn.transformer.TransformerBlockConfig(sequence_mixer=<UNSET>, attention=None, layer_norm=None, feed_forward=None, feed_forward_moe=None, name='default', dropout=None, attention_residual_alpha=None, feed_forward_residual_alpha=None)[source]

Bases: ModuleConfig

A configuration class for easily building transformer blocks.

sequence_mixer: SequenceMixerConfig = <UNSET>

The sequence mixer config (e.g. attention, recurrent, convolution, etc.).

attention: InitVar = None

Deprecated: Use sequence_mixer instead. This field is only kept for backwards compatibility with old configs that used attention: AttentionConfig.

layer_norm: Optional[LayerNormConfig] = None

The layer norm config.

feed_forward: Optional[FeedForwardConfig] = None

The feed-forward config, required for non-MoE blocks.

feed_forward_moe: Optional[MoEConfig] = None

The config for the MoE feed-forward layer. Required for MoE blocks.

name: TransformerBlockType = 'default'

The block type.

dropout: Optional[float] = None

Dropout probability.

attention_residual_alpha: Optional[float] = None

A scaling factor applied to the attention/recurrent output before adding it to the residual stream.

feed_forward_residual_alpha: Optional[float] = None

A scaling factor applied to the feed-forward (MLP) output before adding it to the residual stream.
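
The effect of the two residual alpha settings can be sketched element-wise on plain lists. add_to_residual is a hypothetical helper, and treating alpha=None as a scale of 1.0 is an assumption.

```python
def add_to_residual(residual, sublayer_output, alpha=None):
    """Return residual + alpha * sublayer_output, element-wise.

    Assumption: alpha=None behaves like a scale of 1.0.
    """
    scale = 1.0 if alpha is None else alpha
    return [r + scale * o for r, o in zip(residual, sublayer_output)]


print(add_to_residual([1.0, 2.0], [2.0, 4.0], alpha=0.5))
print(add_to_residual([1.0, 2.0], [2.0, 4.0]))
```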

build(*, d_model, block_idx, n_layers, init_device='cpu', cache=None)[source]

Build the corresponding module.

Return type:

TransformerBlockBase

class olmo_core.nn.transformer.TransformerBlockBase(*, n_layers)[source]

Bases: Module

Base class for transformer block implementations.

abstract forward(x, *, loss_div_factor=None, **kwargs)[source]

Run the block on the input x.

Parameters:

x (Tensor) – The input of shape (batch_size, seq_len, d_model).

Return type:

Tensor

class olmo_core.nn.transformer.TransformerBlock(*, d_model, block_idx, n_layers, sequence_mixer, feed_forward, layer_norm, dropout=0.0, attention_residual_alpha=1.0, feed_forward_residual_alpha=1.0, init_device='cpu', cache=None)[source]

Bases: TransformerBlockBase

A typical “Llama-style” transformer block implementation.

Parameters:
  • d_model (int) – The model dimensionality.

  • block_idx (int) – The index/position of the block within the model. Ranges from 0 to n_layers - 1.

  • sequence_mixer (SequenceMixerConfig) – The sequence mixer module config (e.g. attention, recurrent, convolution, etc.).

  • feed_forward (FeedForwardConfig) – The feed forward module config.

  • layer_norm (LayerNormConfig) – The layer norm config for both the attention LN and the feed forward LN.

  • dropout (float, default: 0.0) – Dropout probability.

  • init_device (str, default: 'cpu') – The device used when initializing parameters.

forward(x, *, loss_div_factor=None, **kwargs)[source]

Run the block on the input x.

Parameters:

x (Tensor) – The input of shape (batch_size, seq_len, d_model).

Return type:

Tensor

class olmo_core.nn.transformer.ReorderedNormTransformerBlock(*, d_model, block_idx, n_layers, sequence_mixer, feed_forward, layer_norm, dropout=0.0, attention_residual_alpha=1.0, feed_forward_residual_alpha=1.0, init_device='cpu', cache=None)[source]

Bases: TransformerBlock

Like TransformerBlock except that the attention norm is applied on the output of attention instead of the input, and likewise the feed-forward norm is applied on the output of the feed-forward instead of the input.
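
The difference in norm placement can be sketched with toy scalar functions. This is only an illustration of the ordering. The real blocks operate on tensors with learned modules, and the helper names below are hypothetical.

```python
def pre_norm_step(x, sublayer, norm):
    """TransformerBlock-style: normalize the sublayer input."""
    return x + sublayer(norm(x))


def reordered_norm_step(x, sublayer, norm):
    """ReorderedNormTransformerBlock-style: normalize the sublayer output."""
    return x + norm(sublayer(x))


# Toy scalar stand-ins for the real modules.
def double(v):
    return 2.0 * v


def clip(v):
    """Plays the role of a norm by bounding the value to [-1, 1]."""
    return max(-1.0, min(1.0, v))


print(pre_norm_step(3.0, double, clip))
print(reordered_norm_step(3.0, double, clip))
```

With the norm on the output, whatever is added to the residual stream is bounded by the norm regardless of how large the sublayer output grows.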

forward(x, *, loss_div_factor=None, **kwargs)[source]

Run the block on the input x.

Parameters:

x (Tensor) – The input of shape (batch_size, seq_len, d_model).

Return type:

Tensor

class olmo_core.nn.transformer.LayerNormScaledTransformerBlock(*, d_model, block_idx, n_layers, sequence_mixer, feed_forward, layer_norm, dropout=0.0, attention_residual_alpha=1.0, feed_forward_residual_alpha=1.0, init_device='cpu', cache=None)[source]

Bases: TransformerBlock

A variant of TransformerBlock that applies LayerNorm Scaling (LNS).

Each LayerNorm output is multiplied by 1 / sqrt(layer_id) where layer_id is the 1-based position of the block inside the transformer. Keeping this logic in a dedicated subclass ensures that the vanilla TransformerBlock remains simple and easy to reason about.
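
The scaling factor can be computed directly from the block index. The sketch assumes layer_id equals block_idx + 1, which follows from block_idx being 0-based and layer_id being 1-based. lns_scale is a hypothetical helper name.

```python
import math


def lns_scale(block_idx):
    """LayerNorm Scaling factor for a 0-based block index."""
    layer_id = block_idx + 1  # layer_id is the 1-based position
    return 1.0 / math.sqrt(layer_id)


for idx in (0, 3, 15):
    print(idx, lns_scale(idx))
```

The factor shrinks with depth, so deeper blocks contribute progressively smaller normalized activations.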

forward(x, *, loss_div_factor=None, **kwargs)[source]

Run the block on the input x.

Parameters:

x (Tensor) – The input of shape (batch_size, seq_len, d_model).

Return type:

Tensor

class olmo_core.nn.transformer.PeriNormTransformerBlock(*, d_model, block_idx, n_layers, sequence_mixer, feed_forward, layer_norm, dropout=0.0, attention_residual_alpha=1.0, feed_forward_residual_alpha=1.0, init_device='cpu', cache=None)[source]

Bases: TransformerBlock

A transformer block in the style of Peri-LN.

forward(x, *, loss_div_factor=None, **kwargs)[source]

Run the block on the input x.

Parameters:

x (Tensor) – The input of shape (batch_size, seq_len, d_model).

Return type:

Tensor

class olmo_core.nn.transformer.NormalizedTransformerBlock(*, d_model, block_idx, n_layers, sequence_mixer, feed_forward, init_device='cpu', cache=None)[source]

Bases: TransformerBlockBase

An nGPT block implementation to be used with the NormalizedAttention attention type and NormalizedFeedForward feed-forward type.

Warning

This is a beta feature! The API is subject to change even with minor and patch releases. If you choose to use this feature please read the CHANGELOG before upgrading your version of this library.

forward(x, *, loss_div_factor=None, **kwargs)[source]

Run the block on the input x.

Parameters:

x (Tensor) – The input of shape (batch_size, seq_len, d_model).

Return type:

Tensor

normalize_matrices()[source]

Normalize the weights in all matrices. This should be called after each optimizer step, which the TransformerTrainModule will handle for you.

class olmo_core.nn.transformer.MoETransformerBlock(*, d_model, block_idx, n_layers, sequence_mixer, feed_forward_moe, layer_norm, dropout=0.0, init_device='cpu', cache=None)[source]

Bases: TransformerBlockBase

Like TransformerBlock except that the dense FeedForward module is replaced with a mixture-of-experts (MoE).

Warning

This is a beta feature! The API is subject to change even with minor and patch releases. If you choose to use this feature please read the CHANGELOG before upgrading your version of this library.

forward(x, *, loss_div_factor=None, **kwargs)[source]

Run the block on the input x.

Parameters:

x (Tensor) – The input of shape (batch_size, seq_len, d_model).

Return type:

Tensor

class olmo_core.nn.transformer.MoEReorderedNormTransformerBlock(*, d_model, block_idx, n_layers, sequence_mixer, feed_forward_moe, layer_norm, dropout=0.0, init_device='cpu', cache=None)[source]

Bases: MoETransformerBlock

Like MoETransformerBlock except that the attention norm is applied on the output of attention instead of the input, and likewise the feed-forward norm is applied on the output of the feed-forward MoE instead of the input.

Warning

This is a beta feature! The API is subject to change even with minor and patch releases. If you choose to use this feature please read the CHANGELOG before upgrading your version of this library.

forward(x, *, loss_div_factor=None, **kwargs)[source]

Run the block on the input x.

Parameters:

x (Tensor) – The input of shape (batch_size, seq_len, d_model).

Return type:

Tensor

class olmo_core.nn.transformer.InitMethod(value)[source]

Bases: StrEnum

An enumeration of parameter initialization methods.

normal = 'normal'

Every linear and embedding layer is initialized from a truncated normal distribution with standard deviation 0.02.

normalized = 'normalized'

Follow the nGPT initialization scheme.

llama = 'llama'

Like normal, but “output” layers are initialized with a standard deviation that’s dependent on either d_model or the number of layers.

llama_depth = 'llama_depth'

Like normal, but “output” layers are initialized with a standard deviation that’s dependent on either d_model or the layer index.

fan_in = 'fan_in'

Per-layer fan-in initialization where each weight matrix is initialized with std = 1/√d_in where d_in is the fan-in (number of input features) of that specific layer. Embeddings use std = 1.0 with normal distribution. This provides forward-pass variance-preserving initialization adapted to each layer’s specific dimensions, with no depth scaling.
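
The fan-in rule above reduces to a one-line formula per weight matrix. The sketch below computes the standard deviation for a couple of hypothetical layer shapes. The layer names and sizes are made up for illustration.

```python
import math


def fan_in_std(d_in):
    """Standard deviation for a weight matrix with `d_in` input features."""
    return 1.0 / math.sqrt(d_in)


# Hypothetical layer shapes as (d_in, d_out) pairs, for illustration only.
shapes = {"w_up": (1024, 4096), "w_down": (4096, 1024)}
for name, (d_in, d_out) in shapes.items():
    print(name, fan_in_std(d_in))
```

Because the standard deviation depends only on d_in, two layers of the same model can receive different scales, unlike the fixed 0.02 used by normal.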