`nn.rope`¶

class olmo_core.nn.rope.RoPEType(value)[source]¶

Bases: StrEnum

An enumeration of the different RoPE implementations.

default = 'default'¶: ➡️ RotaryEmbedding

fused = 'fused'¶: ➡️ FusedRotaryEmbedding

complex = 'complex'¶: ➡️ ComplexRotaryEmbedding

class olmo_core.nn.rope.RoPEConfig(name='default', theta=500000, full_precision=True, no_global_rope=False, scaling=None, partial_rotary_factor=1.0)[source]¶

Bases: ModuleConfig

A config for conveniently building any of the different RoPE classes.

See the individual RotaryEmbedding subclasses for a description of the configuration options.

name: RoPEType = 'default'¶: The name of the implementation.

theta: int = 500000¶: The base frequency parameter for the RoPE.

full_precision: bool = True¶: Whether to always apply RoPE in full precision regardless of the input data type.

no_global_rope: bool = False¶: Whether to disable RoPE on global (non-SWA) attention layers.

scaling: Optional[RoPEScalingConfig] = None¶: The scaling config to apply to RoPE.

partial_rotary_factor: float = 1.0¶: Fraction of each head dimension to apply RoPE to. When less than 1.0, only the leading int(head_size * partial_rotary_factor) dimensions are rotated; the rest pass through unchanged. Used by Qwen3.5 (default 0.25).
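The split described for partial_rotary_factor can be sketched in a few lines. This is only an illustration of the int(head_size * partial_rotary_factor) arithmetic stated above; the helper name split_rotary_dims is hypothetical and not part of olmo_core.

```python
def split_rotary_dims(head_size: int, partial_rotary_factor: float = 1.0) -> tuple[int, int]:
    """Return (rotated, passthrough) dimension counts for one attention head.

    Hypothetical helper, for illustration only.
    """
    # Only the leading `rotary_dim` dimensions receive the rotation.
    rotary_dim = int(head_size * partial_rotary_factor)
    return rotary_dim, head_size - rotary_dim


print(split_rotary_dims(128, 0.25))  # (32, 96)
print(split_rotary_dims(128))        # (128, 0)
```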

build(head_size, cache=None)[source]¶

Construct the corresponding RoPE class.

Parameters:: head_size (int) – The size of the attention heads.
Return type:: RotaryEmbeddingBase

class olmo_core.nn.rope.RoPEScalingConfig[source]¶

Bases: Config

Base class for RoPE scaling configs. Defines a strategy for scaling RoPE to longer sequences.

abstract compute_scaled_inv_freq(theta, dim, device)[source]¶

Compute the scaled inverse frequencies for RoPE, and the attention rescaling factor.

Return type:: tuple[torch.Tensor, float]

abstract to_hf_config()[source]¶

Convert to HuggingFace rope_scaling format.

Return type:: dict

class olmo_core.nn.rope.ABFRoPEScalingConfig(attention_rescale_factor=1.0, new_theta=8000000)[source]¶

Bases: RoPEScalingConfig

Adjusted base frequency (ABF) scaling. Simply uses a new base frequency parameter (new_theta) in place of theta.
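To make the effect of a larger base concrete, here is a minimal sketch assuming the standard RoPE inverse-frequency formula theta ** (-2i / dim). The helper rope_inv_freq is hypothetical and written in plain Python; the library's own compute_scaled_inv_freq returns a torch.Tensor and may differ in detail.

```python
import math


def rope_inv_freq(theta: float, dim: int) -> list[float]:
    """Standard RoPE inverse frequencies for a head of size `dim` (hypothetical helper)."""
    return [theta ** (-(2 * i) / dim) for i in range(dim // 2)]


base = rope_inv_freq(theta=500_000, dim=128)
abf = rope_inv_freq(theta=8_000_000, dim=128)

# The fastest component is unaffected, while the slowest one becomes slower,
# i.e. its wavelength (2 * pi / inv_freq) grows with the larger base.
longest_base = 2 * math.pi / base[-1]
longest_abf = 2 * math.pi / abf[-1]
```

With the larger base, every component except the first rotates more slowly, so positions further apart remain distinguishable.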

attention_rescale_factor: float = 1.0¶: Factor to rescale attention scores by when using scaled RoPE. Can be used to compensate for the larger effective context. 1.0 means no rescaling.

compute_scaled_inv_freq(theta, dim, device)[source]¶

Compute the scaled inverse frequencies for RoPE, and the attention rescaling factor.

Return type:: tuple[torch.Tensor, float]

to_hf_config()[source]¶

ABF scaling doesn’t have a direct HF equivalent (just modify the config’s base frequency).

Return type:: dict

class olmo_core.nn.rope.PIRoPEScalingConfig(attention_rescale_factor=1.0, factor=2.0)[source]¶

Bases: RoPEScalingConfig

Position-Interpolation (PI) RoPE scaling from Chen et al. (https://arxiv.org/pdf/2306.15595).

Interpolates the rotary angles instead of extrapolating them when the context window at inference time exceeds the window used during training. In practice, this amounts to linearly compressing the original position indices by the constant multiplier factor.
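Compressing positions by factor is equivalent to dividing every inverse frequency by factor, since the rotary angle is the product of the two. A minimal sketch of that equivalence, assuming the standard inverse-frequency formula (the helper names are hypothetical and not the library's API):

```python
def rope_inv_freq(theta: float, dim: int) -> list[float]:
    """Standard RoPE inverse frequencies (hypothetical helper)."""
    return [theta ** (-(2 * i) / dim) for i in range(dim // 2)]


def pi_scaled_inv_freq(theta: float, dim: int, factor: float = 2.0) -> list[float]:
    """Position interpolation expressed as a uniform division of the inverse frequencies."""
    return [f / factor for f in rope_inv_freq(theta, dim)]


base = rope_inv_freq(500_000, 64)
scaled = pi_scaled_inv_freq(500_000, 64, factor=2.0)

# Position 4096 under PI with factor 2 sees the same angles as position 2048 without scaling.
angles_scaled = [4096 * f for f in scaled]
angles_base = [2048 * f for f in base]
```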

attention_rescale_factor: float = 1.0¶: Factor to rescale attention scores by when using scaled RoPE. Can be used to compensate for the larger effective context. 1.0 means no rescaling.

factor: float = 2.0¶: Context expansion multiplier. If factor = 1, reduces to vanilla RoPE.

compute_scaled_inv_freq(theta, dim, device)[source]¶

Compute the scaled inverse frequencies for RoPE, and the attention rescaling factor.

Return type:: tuple[torch.Tensor, float]

to_hf_config()[source]¶

PI scaling corresponds to HF’s linear scaling.

Return type:: dict

class olmo_core.nn.rope.StepwiseRoPEScalingConfig(attention_rescale_factor=1.0, factor=32.0, low_freq_proportion=0.0, high_freq_proportion=0.25, old_context_len=8192)[source]¶

Bases: RoPEScalingConfig

Step-wise RoPE scaling (aka “Per-frequency” scaling or Llama-3.1 scaling).

Reference: Llama-3.1-8B README

Scales RoPE to longer sequence lengths by interpolating between high- and low-frequency components.

High-frequency band (short wavelengths) – keeps the original frequencies unchanged. These correspond to the very first dimensions of the rotary embedding and already encode short-range ordering well.
Low-frequency band (long wavelengths) – divides the original inverse frequency by factor (equivalently, multiplies the wavelength by factor). This has the effect of spreading the very low frequencies across a longer context window (similar to PI scaling).
Medium-frequency band – linearly interpolates (in inverse-frequency space) between the unscaled and the fully-scaled value so that the full spectrum changes smoothly.
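The three bands above can be sketched as follows. This is an illustration under stated assumptions: the band boundaries are passed directly as wavelengths (low_freq_wavelen and high_freq_wavelen are hypothetical parameters), because how the library derives them from low_freq_proportion and high_freq_proportion is not described here. The threshold values in the example are arbitrary.

```python
import math


def stepwise_scaled_inv_freq(inv_freq, factor, low_freq_wavelen, high_freq_wavelen):
    """Per-frequency scaling sketch; requires high_freq_wavelen < low_freq_wavelen."""
    scaled = []
    for f in inv_freq:
        wavelen = 2 * math.pi / f
        if wavelen < high_freq_wavelen:
            # High-frequency band: keep the original frequency.
            scaled.append(f)
        elif wavelen > low_freq_wavelen:
            # Low-frequency band: fully scaled.
            scaled.append(f / factor)
        else:
            # Medium band: `smooth` is 1 at the high-frequency boundary and 0 at
            # the low-frequency boundary, linear in frequency in between.
            smooth = (1 / wavelen - 1 / low_freq_wavelen) / (
                1 / high_freq_wavelen - 1 / low_freq_wavelen
            )
            scaled.append((1 - smooth) * f / factor + smooth * f)
    return scaled


dim, theta = 128, 500_000
inv_freq = [theta ** (-(2 * i) / dim) for i in range(dim // 2)]
scaled = stepwise_scaled_inv_freq(
    inv_freq, factor=32.0, low_freq_wavelen=8192, high_freq_wavelen=2048
)
```

The first entries of scaled match inv_freq exactly, the last entries equal inv_freq divided by 32, and the entries in between move smoothly from one regime to the other.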

attention_rescale_factor: float = 1.0¶: Factor to rescale attention scores by when using scaled RoPE. Can be used to compensate for the larger effective context. 1.0 means no rescaling.

factor: float = 32.0¶: Context expansion multiplier applied to the long-wavelength part of the spectrum.

low_freq_proportion: float = 0.0¶: Proportion of the spectrum that is considered low-frequency. It is translated into a concrete wavelength that represents the upper bound of the low-frequency band.

high_freq_proportion: float = 0.25¶: Proportion of the spectrum that is considered high-frequency. It is translated into a concrete wavelength that represents the lower bound of the high-frequency band.

old_context_len: int = 8192¶: Maximum sequence length the base model was originally trained with.

compute_scaled_inv_freq(theta, dim, device)[source]¶

Compute the scaled inverse frequencies for RoPE, and the attention rescaling factor.

Return type:: tuple[torch.Tensor, float]

to_hf_config()[source]¶

Stepwise scaling corresponds to HF’s llama3 scaling.

Return type:: dict

class olmo_core.nn.rope.YaRNRoPEScalingConfig(factor=8.0, beta_fast=32, beta_slow=1, old_context_len=8192)[source]¶

Bases: RoPEScalingConfig

Yet another RoPE extensioN (YaRN) scaling.

Reference: https://arxiv.org/abs/2309.00071

Extends a model’s context window by blending two sets of inverse frequencies:

Interpolation frequencies – the original RoPE frequencies divided by factor. These allow the model to compress positions and hence attend across a longer sequence.
Extrapolation frequencies – the unmodified RoPE frequencies the model was trained with.

A linear ramp (controlled by beta_fast / beta_slow) determines which of the two spectra dominates for each dimension so that high-frequency bands remain intact while very low frequencies are fully scaled.

Besides re-mapping the rotary angles, YaRN rescales the attention logits by attention_factor (computed via m-scale) to compensate for the larger effective context.
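The blend of the two spectra can be sketched in plain Python. The ramp boundaries below follow the approach used in common YaRN implementations, where beta_fast and beta_slow are read as numbers of full rotations over old_context_len. Whether olmo_core computes them in exactly this way is an assumption, and the helper names are hypothetical.

```python
import math


def correction_dim(num_rotations, dim, theta, old_context_len):
    """Index of the frequency pair that completes `num_rotations` turns over the old context."""
    return dim * math.log(old_context_len / (num_rotations * 2 * math.pi)) / (
        2 * math.log(theta)
    )


def yarn_inv_freq(theta, dim, factor=8.0, beta_fast=32, beta_slow=1, old_context_len=8192):
    half = dim // 2
    extrapolation = [theta ** (-(2 * i) / dim) for i in range(half)]  # unmodified
    interpolation = [f / factor for f in extrapolation]               # fully scaled
    low = max(math.floor(correction_dim(beta_fast, dim, theta, old_context_len)), 0)
    high = min(math.ceil(correction_dim(beta_slow, dim, theta, old_context_len)), dim - 1)
    if low == high:
        high += 0.001  # avoid a zero-width ramp
    blended = []
    for i in range(half):
        # ramp is 0 below `low` (keep original) and 1 above `high` (fully scaled).
        ramp = min(max((i - low) / (high - low), 0.0), 1.0)
        blended.append(interpolation[i] * ramp + extrapolation[i] * (1.0 - ramp))
    return blended


inv_freq = yarn_inv_freq(theta=500_000, dim=128)
```

With these defaults the highest-frequency entries are left untouched and the lowest-frequency entries are divided by factor, with a linear transition between the two ramp boundaries.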

factor: float = 8.0¶: Context expansion multiplier (e.g. a factor of 8 gives a roughly 8× longer context length).

beta_fast: int = 32¶: Dimensional cut-off that delimits the start (high-freq) of the ramp region.

beta_slow: int = 1¶: Dimensional cut-off that delimits the end (low-freq) of the ramp region.

old_context_len: int = 8192¶: Maximum sequence length that the base model was originally trained with.

compute_scaled_inv_freq(theta, dim, device)[source]¶

Compute the scaled inverse frequencies for RoPE, and the attention rescaling factor.

Return type:: tuple[torch.Tensor, float]

get_attention_rescale_factor()[source]¶

Compute the attention rescale factor based on Section 3.4 of the YaRN paper.

Return type:: float
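For reference, Section 3.4 of the YaRN paper recommends a rescale of 0.1 * ln(factor) + 1. A minimal sketch of that formula follows; the function name is hypothetical, and the early return for factor <= 1 is an assumption borrowed from common implementations rather than something stated on this page.

```python
import math


def yarn_attention_factor(factor: float) -> float:
    """Attention rescale from the YaRN paper: 0.1 * ln(factor) + 1 (hypothetical helper)."""
    if factor <= 1.0:
        # Assumption: no context expansion means no rescaling.
        return 1.0
    return 0.1 * math.log(factor) + 1.0


rescale = yarn_attention_factor(8.0)  # approximately 1.208
```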

to_hf_config()[source]¶

YaRN scaling corresponds to HF’s yarn scaling.

Return type:: dict

class olmo_core.nn.rope.RotaryEmbeddingBase(*, head_size, theta=500000, full_precision=True, cache=None, scaling=None, partial_rotary_factor=1.0)[source]¶

Bases: Module

Base class for RoPE implementations.

abstract warmup_cache(max_seq_len, device)[source]¶: Warmup the buffer cache.

abstract get_buffers(max_seq_len, device)[source]¶

Get the cached buffers.

Return type:: RoPEBuffers

class olmo_core.nn.rope.RotaryEmbedding(*, head_size, theta=500000, full_precision=True, cache=None, scaling=None, partial_rotary_factor=1.0)[source]¶

Bases: RotaryEmbeddingBase

Rotary positional embeddings (RoPE).
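As a reminder of what the rotation does, here is a minimal plain-Python sketch of RoPE applied to a single head vector. It pairs dimension i with dimension i + dim/2, which is one common convention; the layout used by RotaryEmbedding itself is not stated here, so treat the pairing and the helper names as assumptions.

```python
import math


def apply_rope(x, position, theta=500_000):
    """Rotate one head vector `x` (even length) to encode `position`."""
    dim = len(x)
    half = dim // 2
    out = [0.0] * dim
    for i in range(half):
        angle = position * theta ** (-(2 * i) / dim)
        cos, sin = math.cos(angle), math.sin(angle)
        a, b = x[i], x[i + half]
        # Standard 2D rotation of the pair (a, b) by `angle`.
        out[i] = a * cos - b * sin
        out[i + half] = a * sin + b * cos
    return out


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]

# The score depends only on the relative offset (here 2), not on absolute positions.
score_near = dot(apply_rope(q, 5), apply_rope(k, 3))
score_far = dot(apply_rope(q, 12), apply_rope(k, 10))
```

Both scores agree up to floating-point error, which is the relative-position property that makes RoPE useful in attention.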
