nn.transformer
- class olmo_core.nn.transformer.TransformerType(value)
Bases: StrEnum
An enumeration of transformer implementations.
- default = 'default'
➡️ Transformer
- normalized = 'normalized'
➡️ NormalizedTransformer (nGPT)
- moe = 'moe'
- class olmo_core.nn.transformer.TransformerConfig(d_model, vocab_size, n_layers, block, lm_head, embedding_norm=None, name='default', dtype='float32', init_method='normal', init_seed=0, init_std=0.02, embedding_init_std=None, freeze_params=None, block_pattern=None, block_overrides=None, embed_scale=None)
Bases: ModelConfig
A config for easily building transformer models.
- Parameters:
name (TransformerType, default: 'default') – The name of the implementation.
See Transformer for a description of the other parameters.
- build(*, init_device='cpu')
Build the model corresponding to this config.
- Parameters:
init_device (str, default: 'cpu') – The device to put the parameters on during initialization. In a distributed setting it usually makes sense to set this to “meta”.
- Return type:
- property num_active_params: int
The total number of active parameters that a model from this config would have.
- property num_active_non_embedding_params: int
The number of active parameters excluding embedding parameters.
- classmethod olmo2_1B(vocab_size, **kwargs)
A 1B OLMo2 model config.
This is different from the OLMo 1B from the old OLMo trainer.
- Return type:
- classmethod olmo2_1B_v2(vocab_size, **kwargs)
A 1B OLMo2 model config.
This matches the OLMo 1B from the old OLMo trainer.
- Return type:
- classmethod llama2_271M(vocab_size, **kwargs)
A 271M Llama2-like model config.
- Return type:
- classmethod llama2_1B(vocab_size, **kwargs)
A 1B Llama2-like model config.
Note: Llama2 doesn’t have a 1B. We made this up.
- Return type:
- classmethod llama3_405B(vocab_size, **kwargs)
A 405B Llama3-like model config.
- Return type:
- classmethod llama_like(*, d_model, vocab_size, n_layers, n_heads, n_kv_heads=None, head_dim=None, gate=None, qk_norm=False, use_head_qk_norm=False, layer_norm_eps=1e-05, layer_norm_name=None, rope_theta=500000, rope_type=None, rope_full_precision=True, no_global_rope=False, hidden_size_multiple_of=256, hidden_size_multiplier=None, fused_ops=False, use_flash=None, attn_backend=None, sliding_window=None, block_name='default', block_mods=None, dtype='float32', rope_scaling=None, feed_forward=None, feed_forward_moe=None, **kwargs)
Create a Llama-like model configuration.
- Parameters:
hidden_size_multiple_of (int, default: 256) – Ensure the FFN hidden size is a multiple of this value.
hidden_size_multiplier (Optional[float], default: None) – Custom multiplier for the FFN hidden size.
fused_ops (bool, default: False) – Use fused operations where possible.
layer_norm_name (Optional[LayerNormType], default: None) – Override the layer norm implementation. Defaults to LayerNormType.fused_rms when fused_ops=True, otherwise LayerNormType.rms.
block_mods (Optional[Dict[int, Callable[[TransformerBlockConfig], TransformerBlockConfig]]], default: None) – A dictionary mapping block indices to functions that take the base block config and return a modified block config.
dtype (DType, default: 'float32') – The default data type to use for all parameters.
- Return type:
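The effect of hidden_size_multiple_of and hidden_size_multiplier on the FFN hidden size can be illustrated with a small standalone sketch. It assumes the Llama reference convention of starting from 8/3 × d_model; that starting point and the helper names `round_up_to_multiple` and `ffn_hidden_size` are hypothetical illustrations, not functions from olmo_core, and the library's exact arithmetic may differ.

```python
from typing import Optional


def round_up_to_multiple(value: int, multiple_of: int) -> int:
    """Round `value` up to the nearest multiple of `multiple_of`."""
    if multiple_of <= 0:
        raise ValueError("multiple_of must be positive")
    return multiple_of * ((value + multiple_of - 1) // multiple_of)


def ffn_hidden_size(
    d_model: int,
    hidden_size_multiple_of: int = 256,
    hidden_size_multiplier: Optional[float] = None,
) -> int:
    # Assumption: base size follows the Llama reference convention (8/3 * d_model).
    hidden = int(8 * d_model / 3)
    # Optional custom scaling, applied before rounding.
    if hidden_size_multiplier is not None:
        hidden = int(hidden_size_multiplier * hidden)
    # Rounding up guarantees the result is a multiple of `hidden_size_multiple_of`.
    return round_up_to_multiple(hidden, hidden_size_multiple_of)


if __name__ == "__main__":
    print(ffn_hidden_size(2048))
    print(ffn_hidden_size(2048, hidden_size_multiplier=1.5))
```

Rounding up rather than to the nearest multiple keeps the hidden size from ever falling below the requested base size.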
- classmethod ngpt_like(*, d_model, vocab_size, n_layers, n_heads, n_kv_heads=None, qk_norm=True, rope_theta=500000, hidden_size_multiple_of=256, hidden_size_multiplier=None, use_flash=False, dtype='float32', **kwargs)
Create an nGPT-like model configuration.
- Return type:
- classmethod gemma3_like(*, d_model, vocab_size, n_layers, n_heads, n_kv_heads, hidden_size, head_dim=None, gate=None, activation='gelu_tanh', local_window_size=1024, local_rope_theta=10000, global_rope_theta=1000000, global_layer_interval=6, layer_norm_eps=1e-06, fused_ops=False, use_flash=None, attn_backend=None, dtype='float32', **kwargs)
Create a Gemma 3-like model configuration.
Gemma 3 features:
- Hybrid local/global attention: 5 local layers with sliding window, then 1 global layer
- Dual RoPE frequencies: local layers use 10K, global layers use 1M
- QK-norm for attention score stabilization
- GeGLU activation (GELU with tanh approximation)
- Parameters:
local_window_size (int, default: 1024) – Sliding window size for local attention layers.
local_rope_theta (int, default: 10000) – RoPE base frequency for local attention layers.
global_rope_theta (int, default: 1000000) – RoPE base frequency for global attention layers.
global_layer_interval (int, default: 6) – Number of layers per pattern cycle (default 6 = 5 local + 1 global).
- Return type:
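The local/global layout described for gemma3_like can be sketched as a per-layer plan. This is a standalone illustration: the function `gemma3_layer_plan` and its dictionary keys are hypothetical, and it assumes the global layer is the last one in each cycle, as "5 local layers, then 1 global layer" suggests.

```python
from typing import Dict, List, Optional, Union

LayerPlan = Dict[str, Union[int, str, None]]


def gemma3_layer_plan(
    n_layers: int,
    global_layer_interval: int = 6,
    local_window_size: int = 1024,
    local_rope_theta: int = 10_000,
    global_rope_theta: int = 1_000_000,
) -> List[LayerPlan]:
    plan: List[LayerPlan] = []
    for idx in range(n_layers):
        # Assumption: the last layer of every cycle is the global one.
        is_global = (idx + 1) % global_layer_interval == 0
        window: Optional[int] = None if is_global else local_window_size
        plan.append(
            {
                "layer": idx,
                "kind": "global" if is_global else "local",
                # Global layers use the larger RoPE base frequency.
                "rope_theta": global_rope_theta if is_global else local_rope_theta,
                # Only local layers use a sliding window.
                "sliding_window": window,
            }
        )
    return plan


if __name__ == "__main__":
    for entry in gemma3_layer_plan(12):
        print(entry)
```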
- class olmo_core.nn.transformer.Transformer(*, d_model, vocab_size, n_layers, block, lm_head, embedding_norm=None, dtype=torch.float32, init_method='normal', init_device='cpu', init_seed=0, init_std=0.02, embedding_init_std=None, block_overrides=None, block_pattern=None, embed_scale=None)
Bases: Module
A typical “Llama-style” transformer implementation.
- Parameters:
d_model (int) – The model dimensionality.
vocab_size (int) – The vocab size.
n_layers (int) – The number of transformer layers/blocks.
block (TransformerBlockConfig | dict[str, TransformerBlockConfig]) – The block configuration. Can be a single block config or a dict of named blocks.
layer_norm – The layer norm config for the final layer norm.
bias – Whether to use a bias in the final linear layer.
dtype (dtype, default: torch.float32) – The datatype to use for the linear output layer.
init_device (str, default: 'cpu') – The device used when initializing parameters.
init_seed (int, default: 0) – The seed used when initializing parameters.
init_std (float, default: 0.02) – The standard deviation used when initializing parameters.
embedding_init_std (Optional[float], default: None) – The standard deviation used when initializing the embeddings.
block_overrides (Optional[Dict[int, TransformerBlockConfig]], default: None) – Overrides for specific blocks. Not supported if block is a dict of named blocks.
block_pattern (Optional[List[str]], default: None) – The pattern of blocks to use. Required if block is a dict of named blocks.
embed_scale (Optional[float], default: None) – The scale factor for the embeddings.
- init_weights(*, max_seq_len=None, max_local_microbatch_size=None, device=None, world_mesh=None, model_part_idx=0)
Initialize the model weights.
- Parameters:
max_seq_len (Optional[int], default: None) – The maximum sequence length expected. This is used to warm up the RoPE cache.
max_local_microbatch_size (Optional[int], default: None) – The maximum local (rank) micro-batch size (in tokens) expected. This is used to warm up some MoE cache.
device (Optional[device], default: None) – The device the local copy of the model will be trained on.
model_part_idx (int, default: 0) – The local index of this model part on the current rank. With interleaved pipeline schedules a single rank can own multiple model chunks, and each must receive a distinct seed; otherwise their parameters would be identical.
- Return type:
- forward(input_ids, *, labels=None, ignore_index=-100, loss_reduction='mean', z_loss_multiplier=None, loss_div_factor=None, return_logits=None, logits_to_keep=0, **kwargs)
Run the transformer on the token input IDs.
- Parameters:
input_ids (Tensor) – The token input IDs, shape (batch_size, seq_len).
labels (Optional[Tensor], default: None) – The token labels, shape (batch_size, seq_len).
ignore_index (int, default: -100) – The index to ignore in the loss computation. Default is -100.
loss_reduction (Literal['mean', 'sum', 'none'], default: 'mean') – The reduction method for the loss. Can be “mean”, “sum”, or “none”.
z_loss_multiplier (Optional[float], default: None) – Optional multiplier for the z-loss regularization term.
loss_div_factor (Union[Tensor, float, None], default: None) – Optional divisor for the loss, can be a scalar or tensor.
return_logits (Optional[bool], default: None) – Whether to return logits along with the loss when labels are provided.
logits_to_keep (Union[int, Tensor], default: 0) – Number of positions to keep from the end of the sequence (if int), or tensor specifying which positions to keep. Default is 0 (keep all).
- Return type:
Union[Tensor, LMOutputWithLoss]
- Returns:
The logits if labels is None or the losses if labels is not None.
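The logits_to_keep semantics (an integer keeps that many trailing positions, 0 keeps everything, and an index collection picks explicit positions) can be sketched with plain Python sequences. The helper `select_positions` is hypothetical and works on lists rather than tensors; it relies on the fact that `seq[-0:]` is the whole sequence, which is why 0 can mean "keep all". It is an illustration of the documented behaviour, not the library's code.

```python
from typing import List, Sequence, TypeVar, Union

T = TypeVar("T")


def select_positions(
    positions: Sequence[T],
    logits_to_keep: Union[int, Sequence[int]] = 0,
) -> List[T]:
    """Pick which sequence positions would get logits computed."""
    if isinstance(logits_to_keep, int):
        if logits_to_keep < 0:
            raise ValueError("logits_to_keep must be non-negative")
        # -0 == 0, so `positions[-0:]` is the full sequence (keep all).
        return list(positions[-logits_to_keep:])
    # Otherwise treat the argument as explicit position indices.
    return [positions[i] for i in logits_to_keep]


if __name__ == "__main__":
    hidden = ["h0", "h1", "h2", "h3"]
    print(select_positions(hidden))          # all positions
    print(select_positions(hidden, 1))       # only the last position
    print(select_positions(hidden, [0, 2]))  # explicit positions
```

Keeping only the last position is the typical choice during generation, where logits for earlier positions are not needed.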
- apply_pp(pp_mesh)
Prepare the model for pipeline parallelism after it’s been split into stages.
- apply_cp(cp_mesh, ring=None, uly=None)
Prepare the model for context-parallelism (CP).
- Parameters:
cp_mesh (DeviceMesh) – The CP device mesh.
ring (Optional[RingContextParallelStyle], default: None) – The ring context parallel style.
uly (Optional[UlyssesContextParallelStyle], default: None) – The Ulysses context parallel style.
- apply_activation_checkpointing(mode, block_interval=None, modules=None, activation_memory_budget=None)
Apply activation checkpointing to the model.
- Parameters:
mode (TransformerActivationCheckpointingMode) – Determines how to apply activation checkpointing.
block_interval (Optional[int], default: None) – Required when mode is “selected_blocks”. Determines which blocks are wrapped.
modules (Optional[List[str]], default: None) – Required when mode is “selected_modules”. A list of module names to wrap for activation checkpointing. Globs are supported.
activation_memory_budget (Optional[float], default: None) – The memory budget for activation checkpointing in the range [0, 1]. 0 corresponds to the memory usage when recomputing all activations, and 1 corresponds to the memory usage when recomputing no activations (which is the default). Requires compilation to be enabled.
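How block_interval and glob-style modules patterns might pick out what gets wrapped can be sketched with the standard library alone. Both helpers are hypothetical: the sketch assumes every block_interval-th block is selected starting from block 0, and the module names are made up for illustration, so the library's actual selection rule and naming may differ.

```python
from fnmatch import fnmatchcase
from typing import List


def select_blocks(n_layers: int, block_interval: int) -> List[int]:
    """Indices of blocks to wrap in a "selected_blocks"-style mode."""
    if block_interval <= 0:
        raise ValueError("block_interval must be positive")
    # Assumption: wrap every `block_interval`-th block, starting at block 0.
    return [idx for idx in range(n_layers) if idx % block_interval == 0]


def select_modules(module_names: List[str], patterns: List[str]) -> List[str]:
    """Names matching at least one glob in a "selected_modules"-style mode."""
    return [
        name
        for name in module_names
        if any(fnmatchcase(name, pattern) for pattern in patterns)
    ]


if __name__ == "__main__":
    # Hypothetical module names, not taken from the library.
    names = [
        "blocks.0.attention",
        "blocks.0.feed_forward",
        "blocks.1.attention",
        "blocks.1.feed_forward",
        "lm_head",
    ]
    print(select_blocks(8, 2))
    print(select_modules(names, ["blocks.*.feed_forward"]))
```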
- apply_compile()
Apply torch.compile() to each transformer block, which makes compilation efficient due to repeated structure.
Warning
This must be called after apply_activation_checkpointing() but before apply_fsdp() or apply_ddp().
- apply_fsdp(dp_mesh=None, param_dtype=None, reduce_dtype=torch.float32, pp_enabled=False, prefetch_factor=0, wrapping_strategy='full')
Apply FSDP(2) to the model.
Warning
This should generally be called last if using any other parallelism strategies or optimizations like apply_compile().
- Parameters:
dp_mesh (Optional[DeviceMesh], default: None) – The model data parallel device mesh.
param_dtype (Optional[dtype], default: None) – The data type to materialize params in. Defaults to the current param dtype.
reduce_dtype (dtype, default: torch.float32) – The data type for gradient reduction.
pp_enabled (default: False) – Whether pipeline parallelism is also enabled.
prefetch_factor (default: 0) – For tuning the prefetch settings. 0 is the default, and higher values result in more aggressive prefetching.
wrapping_strategy (default: 'full') – The wrapping strategy.
- apply_ddp(dp_mesh=None, param_dtype=None, compile_enabled=False, autograd_compile_enabled=False)
Apply DDP to the model.
- num_flops_per_token(seq_len)
Returns the idealized number of FLOPs per token for the given sequence length. Purposefully does not account for wasted FLOPs due to padding, recomputation, etc.
- Return type:
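For intuition about what an "idealized" per-token FLOP count looks like, a widely used estimate for dense decoder-only transformers is 6 × N for the parameter matmuls (forward plus backward) plus 12 × n_layers × d_model × seq_len for attention. The sketch below implements that estimate only. It is an assumption that num_flops_per_token resembles it; the function `approx_flops_per_token` is hypothetical, and `n_params` is taken to mean non-embedding parameters.

```python
def approx_flops_per_token(
    n_params: int,
    n_layers: int,
    d_model: int,
    seq_len: int,
) -> int:
    """Common training-FLOPs-per-token estimate for a dense transformer."""
    # 2 FLOPs per parameter in the forward pass, 4 in the backward pass.
    matmul_flops = 6 * n_params
    # Attention score and value matmuls, forward plus backward, assuming
    # n_heads * head_dim == d_model.
    attention_flops = 12 * n_layers * d_model * seq_len
    return matmul_flops + attention_flops


if __name__ == "__main__":
    short = approx_flops_per_token(1_000_000_000, 16, 2048, 2048)
    long = approx_flops_per_token(1_000_000_000, 16, 2048, 4096)
    # Only the attention term depends on sequence length.
    print(short, long, long - short)
```

As the estimate makes explicit, padding tokens and activation recomputation add real work that is deliberately left out of the count.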
- class olmo_core.nn.transformer.NormalizedTransformer(*, d_model, vocab_size, n_layers, block, lm_head, dtype=torch.float32, init_method='normalized', init_device='cpu', init_seed=0, init_std=0.02, embedding_init_std=None, block_overrides=None, block_pattern=None)
Bases: Transformer
An nGPT transformer implementation, to be used with the NormalizedTransformerBlock block type.
Warning
This is a beta feature! The API is subject to change even with minor and patch releases. If you choose to use this feature please read the CHANGELOG before upgrading your version of this library.
- init_weights(*args, **kwargs)
Initialize the model weights.
- Parameters:
max_seq_len – The maximum sequence length expected. This is used to warm up the RoPE cache.
max_local_microbatch_size – The maximum local (rank) micro-batch size (in tokens) expected. This is used to warm up some MoE cache.
device – The device the local copy of the model will be trained on.
model_part_idx – The local index of this model part on the current rank. With interleaved pipeline schedules a single rank can own multiple model chunks, and each must receive a distinct seed; otherwise their parameters would be identical.
- Return type:
- normalize_matrices()
Normalize the weights in all matrices. This should be called after each optimizer step, which the TransformerTrainModule will handle for you.
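The idea behind normalizing weight matrices after each optimizer step can be sketched as projecting each weight vector back onto the unit sphere. The helper `normalize_rows` below is hypothetical and uses plain nested lists; which axis the library normalizes for each matrix is an assumption here (rows are used purely for illustration).

```python
import math
from typing import List

Matrix = List[List[float]]


def normalize_rows(matrix: Matrix, eps: float = 1e-12) -> Matrix:
    """Rescale every row to unit L2 norm."""
    normalized: Matrix = []
    for row in matrix:
        norm = math.sqrt(sum(value * value for value in row))
        # `eps` guards against division by zero for an all-zero row.
        normalized.append([value / max(norm, eps) for value in row])
    return normalized


if __name__ == "__main__":
    # Pretend an optimizer step pushed these rows off the unit sphere.
    weights = [[3.0, 4.0], [0.5, 0.0]]
    for row in normalize_rows(weights):
        print(row, math.sqrt(sum(v * v for v in row)))
```

Because an optimizer update generally changes the norm of each vector, the projection has to be repeated after every step, which is why the training module takes care of calling it.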
- class olmo_core.nn.transformer.MoETransformer(*, d_model, vocab_size, n_layers, block, lm_head, embedding_norm=None, dtype=torch.float32, init_method='normal', init_device='cpu', init_seed=0, init_std=0.02, embedding_init_std=None, block_overrides=None, block_pattern=None, embed_scale=None)
Bases: Transformer
An MoE transformer implementation, to be used with one of the MoETransformerBlock block types.
Warning
This is a beta feature! The API is subject to change even with minor and patch releases. If you choose to use this feature please read the CHANGELOG before upgrading your version of this library.
- class olmo_core.nn.transformer.MoEHybridTransformerBlockBase(*, d_model, n_layers, sequence_mixer, layer_norm, feed_forward, init_device='cpu', **kwargs)
Bases: MoETransformerBlock
Warning
This is a beta feature! The API is subject to change even with minor and patch releases. If you choose to use this feature please read the CHANGELOG before upgrading your version of this library.
- class olmo_core.nn.transformer.MoEHybridTransformerBlock(*, d_model, n_layers, sequence_mixer, layer_norm, feed_forward, init_device='cpu', **kwargs)
Bases: MoEHybridTransformerBlockBase
Warning
This is a beta feature! The API is subject to change even with minor and patch releases. If you choose to use this feature please read the CHANGELOG before upgrading your version of this library.
- class olmo_core.nn.transformer.MoEHybridReorderedNormTransformerBlock(*, d_model, n_layers, sequence_mixer, layer_norm, feed_forward, init_device='cpu', **kwargs)
Bases: MoEHybridTransformerBlockBase
Warning
This is a beta feature! The API is subject to change even with minor and patch releases. If you choose to use this feature please read the CHANGELOG before upgrading your version of this library.
- class olmo_core.nn.transformer.TransformerBlockType(value)
Bases: StrEnum
An enumeration of the different transformer block implementations.
- default = 'default'
- default_scaled = 'default_scaled'
➡️ LayerNormScaledTransformerBlock (applies LayerNorm Scaling)
- reordered_norm = 'reordered_norm'
- peri_norm = 'peri_norm'
- normalized = 'normalized'
- moe = 'moe'
- moe_reordered_norm = 'moe_reordered_norm'
- moe_hybrid = 'moe_hybrid'
- moe_hybrid_reordered_norm = 'moe_hybrid_reordered_norm'
- class olmo_core.nn.transformer.TransformerBlockConfig(sequence_mixer=<UNSET>, attention=None, layer_norm=None, feed_forward=None, feed_forward_moe=None, name='default', dropout=None, attention_residual_alpha=None, feed_forward_residual_alpha=None)
Bases: ModuleConfig
A configuration class for easily building transformer blocks.
- sequence_mixer: SequenceMixerConfig = <UNSET>
The sequence mixer config (e.g. attention, recurrent, convolution, etc.).
- attention: InitVar = None
Deprecated: Use sequence_mixer instead. This field is only kept for backwards compatibility with old configs that used attention: AttentionConfig.
- layer_norm: Optional[LayerNormConfig] = None
The layer norm config.
- feed_forward: Optional[FeedForwardConfig] = None
The feed-forward config, required for non-MoE blocks.
- feed_forward_moe: Optional[MoEConfig] = None
The config for the MoE feed-forward layer. Required for MoE blocks.
- name: TransformerBlockType = 'default'
The block type.
- attention_residual_alpha: Optional[float] = None
A scaling factor applied to the attention/recurrent output before adding it to the residual stream.
- feed_forward_residual_alpha: Optional[float] = None
A scaling factor applied to the feed-forward (MLP) output before adding it to the residual stream.
- class olmo_core.nn.transformer.TransformerBlockBase(*, n_layers)
Bases: Module
Base class for transformer block implementations.
- class olmo_core.nn.transformer.TransformerBlock(*, d_model, block_idx, n_layers, sequence_mixer, feed_forward, layer_norm, dropout=0.0, attention_residual_alpha=1.0, feed_forward_residual_alpha=1.0, init_device='cpu', cache=None)
Bases: TransformerBlockBase
A typical “Llama-style” transformer block implementation.
- Parameters:
d_model (int) – The model dimensionality.
block_idx (int) – The index/position of the block within the model. Ranges from 0 to n_layers - 1.
sequence_mixer (SequenceMixerConfig) – The sequence mixer module config (e.g. attention, recurrent, convolution, etc.).
feed_forward (FeedForwardConfig) – The feed forward module config.
layer_norm (LayerNormConfig) – The layer norm config for both the attention LN and the feed forward LN.
dropout (float, default: 0.0) – Dropout probability.
init_device (str, default: 'cpu') – The device used when initializing parameters.
- class olmo_core.nn.transformer.ReorderedNormTransformerBlock(*, d_model, block_idx, n_layers, sequence_mixer, feed_forward, layer_norm, dropout=0.0, attention_residual_alpha=1.0, feed_forward_residual_alpha=1.0, init_device='cpu', cache=None)
Bases: TransformerBlock
Like TransformerBlock except that the attention norm is applied on the output of attention instead of the input, and likewise the feed-forward norm is applied on the output of the feed-forward instead of the input.
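The difference between norming a sublayer's input and norming its output can be shown with a toy residual step. Everything here is a hypothetical illustration on plain lists: `rms_norm` is a simplified RMS norm without learned weights, and `sublayer` stands in for either attention or the feed-forward module.

```python
import math
from typing import Callable, List

Vector = List[float]
Sublayer = Callable[[Vector], Vector]


def rms_norm(x: Vector, eps: float = 1e-6) -> Vector:
    """Simplified RMS norm with no learned scale."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def pre_norm_step(x: Vector, sublayer: Sublayer) -> Vector:
    # Norm on the sublayer INPUT: x + f(norm(x)).
    h = sublayer(rms_norm(x))
    return [a + b for a, b in zip(x, h)]


def reordered_norm_step(x: Vector, sublayer: Sublayer) -> Vector:
    # Norm on the sublayer OUTPUT: x + norm(f(x)).
    h = rms_norm(sublayer(x))
    return [a + b for a, b in zip(x, h)]


def double(v: Vector) -> Vector:
    """Toy stand-in for attention or a feed-forward module."""
    return [2.0 * e for e in v]


if __name__ == "__main__":
    x = [3.0, 4.0]
    print(pre_norm_step(x, double))
    print(reordered_norm_step(x, double))
```

With the norm on the output, the update added to the residual stream has a bounded scale regardless of how large the sublayer's raw output is.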
- class olmo_core.nn.transformer.LayerNormScaledTransformerBlock(*, d_model, block_idx, n_layers, sequence_mixer, feed_forward, layer_norm, dropout=0.0, attention_residual_alpha=1.0, feed_forward_residual_alpha=1.0, init_device='cpu', cache=None)
Bases: TransformerBlock
A variant of TransformerBlock that applies LayerNorm Scaling (LNS).
Each LayerNorm output is multiplied by 1 / sqrt(layer_id) where layer_id is the 1-based position of the block inside the transformer. Keeping this logic in a dedicated subclass ensures that the vanilla TransformerBlock remains simple and easy to reason about.
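The LNS factor is simple enough to compute by hand. The sketch below assumes layer_id is block_idx + 1, since block_idx is documented as 0-based and layer_id as 1-based; the helper names `lns_scale` and `scale_norm_output` are hypothetical.

```python
import math
from typing import List


def lns_scale(block_idx: int) -> float:
    """LayerNorm Scaling factor for a block with 0-based index `block_idx`."""
    # Assumption: the 1-based layer_id is the 0-based block index plus one.
    layer_id = block_idx + 1
    return 1.0 / math.sqrt(layer_id)


def scale_norm_output(normed: List[float], block_idx: int) -> List[float]:
    """Multiply an already-normalized vector by the LNS factor."""
    scale = lns_scale(block_idx)
    return [scale * value for value in normed]


if __name__ == "__main__":
    for idx in (0, 3, 15):
        print(idx, lns_scale(idx))
```

The first block is left unscaled, and the factor shrinks with depth, so deeper blocks contribute progressively smaller normalized activations.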
- class olmo_core.nn.transformer.PeriNormTransformerBlock(*, d_model, block_idx, n_layers, sequence_mixer, feed_forward, layer_norm, dropout=0.0, attention_residual_alpha=1.0, feed_forward_residual_alpha=1.0, init_device='cpu', cache=None)
Bases: TransformerBlock
A transformer block in the style of Peri-LN.
- class olmo_core.nn.transformer.NormalizedTransformerBlock(*, d_model, block_idx, n_layers, sequence_mixer, feed_forward, init_device='cpu', cache=None)
Bases: TransformerBlockBase
An nGPT block implementation to be used with the NormalizedAttention attention type and NormalizedFeedForward feed-forward type.
Warning
This is a beta feature! The API is subject to change even with minor and patch releases. If you choose to use this feature please read the CHANGELOG before upgrading your version of this library.
- normalize_matrices()
Normalize the weights in all matrices. This should be called after each optimizer step, which the TransformerTrainModule will handle for you.
- class olmo_core.nn.transformer.MoETransformerBlock(*, d_model, block_idx, n_layers, sequence_mixer, feed_forward_moe, layer_norm, dropout=0.0, init_device='cpu', cache=None)
Bases: TransformerBlockBase
Like TransformerBlock except that the dense FeedForward module is replaced with a mixture-of-experts (MoE).
Warning
This is a beta feature! The API is subject to change even with minor and patch releases. If you choose to use this feature please read the CHANGELOG before upgrading your version of this library.
- class olmo_core.nn.transformer.MoEReorderedNormTransformerBlock(*, d_model, block_idx, n_layers, sequence_mixer, feed_forward_moe, layer_norm, dropout=0.0, init_device='cpu', cache=None)
Bases: MoETransformerBlock
Like MoETransformerBlock except that the attention norm is applied on the output of attention instead of the input, and likewise the feed-forward norm is applied on the output of the feed-forward MoE instead of the input.
Warning
This is a beta feature! The API is subject to change even with minor and patch releases. If you choose to use this feature please read the CHANGELOG before upgrading your version of this library.
- class olmo_core.nn.transformer.InitMethod(value)
Bases: StrEnum
An enumeration.
- normal = 'normal'
Every linear and embedding layer is initialized from a truncated normal distribution with standard deviation 0.02.
- normalized = 'normalized'
Follow the nGPT initialization scheme.
- llama = 'llama'
Like normal, but “output” layers are initialized with a standard deviation that’s dependent on either d_model or the number of layers.
- llama_depth = 'llama_depth'
Like normal, but “output” layers are initialized with a standard deviation that’s dependent on either d_model or the layer index.
- fan_in = 'fan_in'
Per-layer fan-in initialization where each weight matrix is initialized with std = 1/√d_in where d_in is the fan-in (number of input features) of that specific layer. Embeddings use std = 1.0 with a normal distribution. This provides forward-pass variance-preserving initialization adapted to each layer’s specific dimensions, with no depth scaling.
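The fan_in rule can be illustrated with a small standard-library sketch. The helpers `fan_in_std` and `init_weight` are hypothetical, and the sketch draws from a plain (untruncated) Gaussian, which is an assumption; whether the library truncates the distribution for this method is not stated above.

```python
import math
import random
from typing import List


def fan_in_std(d_in: int) -> float:
    """Standard deviation for a layer with `d_in` input features."""
    if d_in <= 0:
        raise ValueError("d_in must be positive")
    return 1.0 / math.sqrt(d_in)


def init_weight(d_out: int, d_in: int, seed: int = 0) -> List[List[float]]:
    """Sample a (d_out, d_in) weight matrix with std = 1 / sqrt(d_in)."""
    rng = random.Random(seed)
    std = fan_in_std(d_in)
    # Assumption: plain Gaussian samples, no truncation.
    return [[rng.gauss(0.0, std) for _ in range(d_in)] for _ in range(d_out)]


if __name__ == "__main__":
    for d_in in (256, 1024, 4096):
        print(d_in, fan_in_std(d_in))
    weight = init_weight(4, 256)
    print(len(weight), len(weight[0]))
```

With unit-variance inputs, a dot product over d_in weights of variance 1/d_in has variance close to 1, which is the variance-preserving property the description refers to.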