nn.rope

class olmo_core.nn.rope.RoPEType(value)[source]

Bases: StrEnum

An enumeration of the different RoPE implementations.

default = 'default'

➡️ RotaryEmbedding

fused = 'fused'

➡️ FusedRotaryEmbedding

complex = 'complex'

➡️ ComplexRotaryEmbedding

class olmo_core.nn.rope.RoPEConfig(name='default', theta=500000, full_precision=True, no_global_rope=False, scaling=None)[source]

Bases: ModuleConfig

A config for conveniently building any of the different RoPE classes.

See the individual RotaryEmbedding subclasses for a description of the configuration options.

name: RoPEType = 'default'

The name of the implementation.

theta: int = 500000

The base frequency parameter for the RoPE.

full_precision: bool = True

Whether to always apply RoPE in full precision regardless of the input data type.

no_global_rope: bool = False

Whether to disable RoPE on global attention layers, i.e. layers that do not use sliding-window attention (SWA).

scaling: Optional[RoPEScalingConfig] = None

The scaling config to apply to RoPE.

build(head_size, cache=None)[source]

Construct the corresponding RoPE class.

Parameters:

head_size (int) – The size of the attention heads.

Return type:

RotaryEmbeddingBase
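To make the role of theta and head_size concrete, here is a minimal sketch of the standard RoPE frequency schedule in plain Python. It is an illustration only, not the code that olmo_core runs, and the helper name rope_inv_freq is hypothetical.

```python
def rope_inv_freq(theta: float, dim: int) -> list[float]:
    """Standard RoPE inverse frequencies: theta ** (-2j / dim) for each pair index j."""
    if dim % 2 != 0:
        raise ValueError("dim must be even, since RoPE rotates dimensions in pairs")
    # i steps over 0, 2, 4, ..., so i / dim equals 2j / dim for pair index j.
    return [theta ** (-i / dim) for i in range(0, dim, 2)]


inv_freq = rope_inv_freq(theta=500_000, dim=128)
print(len(inv_freq))  # 64, one frequency per pair of head dimensions
print(inv_freq[0])    # 1.0, the fastest-rotating pair
```

A larger theta stretches the slowest wavelengths, which is why the scaling configs below either replace theta or rescale these inverse frequencies.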

class olmo_core.nn.rope.RoPEScalingConfig[source]

Bases: Config

Base class for RoPE scaling configs. Defines a strategy for scaling RoPE to longer sequences.

abstract compute_scaled_inv_freq(theta, dim, device)[source]

Compute the scaled inverse frequencies for RoPE, and the attention rescaling factor.

Return type:

tuple[torch.Tensor, float]

abstract to_hf_config()[source]

Convert to HuggingFace rope_scaling format.

Return type:

dict

class olmo_core.nn.rope.ABFRoPEScalingConfig(attention_rescale_factor=1.0, new_theta=8000000)[source]

Bases: RoPEScalingConfig

Absolute base frequency scaling (ABF). Simply uses a new base frequency parameter.

attention_rescale_factor: float = 1.0

Factor to rescale attention scores by when using scaled RoPE. Can be used to compensate for the larger effective context. 1.0 means no rescaling.

compute_scaled_inv_freq(theta, dim, device)[source]

Compute the scaled inverse frequencies for RoPE, and the attention rescaling factor.

Return type:

tuple[torch.Tensor, float]

to_hf_config()[source]

ABF scaling doesn’t have a direct HF equivalent (just modify the config’s base frequency).

Return type:

dict
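As a rough illustration of what "uses a new base frequency" means, the sketch below rebuilds the frequency schedule from new_theta and passes the rescale factor through unchanged. The function abf_inv_freq is a hypothetical stand-in written in plain Python; the real compute_scaled_inv_freq returns a torch.Tensor and may differ in detail.

```python
def abf_inv_freq(new_theta: float, dim: int, attention_rescale_factor: float = 1.0):
    """Sketch of ABF: same formula as plain RoPE, but with new_theta as the base."""
    inv_freq = [new_theta ** (-i / dim) for i in range(0, dim, 2)]
    return inv_freq, attention_rescale_factor


# Compare against the unscaled schedule for the default theta.
base = [500_000 ** (-i / 128) for i in range(0, 128, 2)]
scaled, rescale = abf_inv_freq(new_theta=8_000_000, dim=128)

# Every pair except the first rotates more slowly under the larger base.
slower = all(s < b for s, b in zip(scaled[1:], base[1:]))
print(slower)  # True
```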

class olmo_core.nn.rope.PIRoPEScalingConfig(attention_rescale_factor=1.0, factor=2.0)[source]

Bases: RoPEScalingConfig

Position-Interpolation (PI) RoPE scaling from Chen et al. (https://arxiv.org/pdf/2306.15595).

Interpolate the rotary angles instead of extrapolating them when the context window at inference time exceeds the window used during training. In practice, this amounts to linearly compressing the original position indices, dividing them by the constant factor.

attention_rescale_factor: float = 1.0

Factor to rescale attention scores by when using scaled RoPE. Can be used to compensate for the larger effective context. 1.0 means no rescaling.

factor: float = 2.0

Context expansion multiplier. If factor = 1, reduces to vanilla RoPE.

compute_scaled_inv_freq(theta, dim, device)[source]

Compute the scaled inverse frequencies for RoPE, and the attention rescaling factor.

Return type:

tuple[torch.Tensor, float]

to_hf_config()[source]

PI scaling corresponds to HF’s linear scaling.

Return type:

dict
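The equivalence between compressing positions and scaling frequencies can be checked with a small plain-Python sketch. The helper pi_inv_freq is hypothetical and only mirrors the description above; it is not taken from the olmo_core source.

```python
import math


def pi_inv_freq(theta: float, dim: int, factor: float = 2.0) -> list[float]:
    """Sketch of PI: divide every standard RoPE inverse frequency by factor."""
    return [theta ** (-i / dim) / factor for i in range(0, dim, 2)]


base = [500_000 ** (-i / 64) for i in range(0, 64, 2)]
scaled = pi_inv_freq(theta=500_000, dim=64, factor=2.0)

position = 4096
# The angle at `position` with scaled frequencies matches the unscaled
# angle at the compressed position `position / factor`.
same_angle = all(
    math.isclose(position * s, (position / 2.0) * b) for s, b in zip(scaled, base)
)
print(same_angle)  # True
```

With factor=1.0 the helper returns the unscaled schedule, matching the note that factor = 1 reduces to vanilla RoPE.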

class olmo_core.nn.rope.StepwiseRoPEScalingConfig(attention_rescale_factor=1.0, factor=32.0, low_freq_proportion=0.0, high_freq_proportion=0.25, old_context_len=8192)[source]

Bases: RoPEScalingConfig

Step-wise RoPE scaling (aka “Per-frequency” scaling or Llama-3.1 scaling).

Reference: Llama-3.1-8B README

Scales RoPE to longer sequence lengths by interpolating between high- and low-frequency components.

  1. High-frequency band (short wavelengths) – keeps the original frequencies unchanged. These correspond to the very first dimensions of the rotary embedding and already encode short-range ordering well.

  2. Low-frequency band (long wavelengths) – divides the original inverse frequency by factor (equivalently, multiplies the wavelength by factor). This has the effect of spreading the very low frequencies across a longer context window (similar to PI scaling).

  3. Medium-frequency band – linearly interpolates (in inverse-frequency space) between the unscaled and the fully-scaled value so that the full spectrum changes smoothly.
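The three bands above can be sketched in plain Python as follows. This is an illustration under stated assumptions: the band boundaries are passed directly as wavelengths (high_freq_wavelen and low_freq_wavelen are hypothetical parameters), because how low_freq_proportion and high_freq_proportion are translated into wavelengths is not specified here.

```python
import math


def stepwise_inv_freq(theta, dim, factor, high_freq_wavelen, low_freq_wavelen):
    """Sketch of three-band scaling; thresholds are given as wavelengths (assumption)."""
    if not high_freq_wavelen < low_freq_wavelen:
        raise ValueError("high_freq_wavelen must be smaller than low_freq_wavelen")
    scaled = []
    for i in range(0, dim, 2):
        f = theta ** (-i / dim)          # unscaled inverse frequency
        wavelen = 2 * math.pi / f
        if wavelen < high_freq_wavelen:  # band 1: keep unchanged
            scaled.append(f)
        elif wavelen > low_freq_wavelen:  # band 2: fully scaled
            scaled.append(f / factor)
        else:                            # band 3: linear blend in inverse-frequency space
            smooth = (1 / wavelen - 1 / low_freq_wavelen) / (
                1 / high_freq_wavelen - 1 / low_freq_wavelen
            )
            scaled.append((1 - smooth) * f / factor + smooth * f)
    return scaled


base = [500_000 ** (-i / 128) for i in range(0, 128, 2)]
scaled = stepwise_inv_freq(500_000, 128, factor=32.0,
                           high_freq_wavelen=2048, low_freq_wavelen=8192)
```

The blend weight smooth is 1 at the high-frequency boundary and 0 at the low-frequency boundary, so the scaled spectrum is continuous across all three bands.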

attention_rescale_factor: float = 1.0

Factor to rescale attention scores by when using scaled RoPE. Can be used to compensate for the larger effective context. 1.0 means no rescaling.

factor: float = 32.0

Context expansion multiplier applied to the long-wavelength part of the spectrum.

low_freq_proportion: float = 0.0

Proportion of the spectrum that is considered low-frequency. It is translated into a concrete wavelength that represents the upper bound of the low-frequency band.

high_freq_proportion: float = 0.25

Proportion of the spectrum that is considered high-frequency. It is translated into a concrete wavelength that represents the lower bound of the high-frequency band.

old_context_len: int = 8192

Maximum sequence length the base model was originally trained with.

compute_scaled_inv_freq(theta, dim, device)[source]

Compute the scaled inverse frequencies for RoPE, and the attention rescaling factor.

Return type:

tuple[torch.Tensor, float]

to_hf_config()[source]

Stepwise scaling corresponds to HF’s llama3 scaling.

Return type:

dict

class olmo_core.nn.rope.YaRNRoPEScalingConfig(factor=8.0, beta_fast=32, beta_slow=1, old_context_len=8192)[source]

Bases: RoPEScalingConfig

Yet-another RoPE interpolatioN (YaRN) scaling.

Reference: https://arxiv.org/abs/2309.00071

Extends a model’s context window by blending two sets of inverse frequencies:

  1. Interpolation frequencies – the original RoPE frequencies divided by factor. These allow the model to compress positions and hence attend across a longer sequence.

  2. Extrapolation frequencies – the unmodified RoPE frequencies the model was trained with.

A linear ramp (controlled by beta_fast / beta_slow) determines which of the two spectra dominates for each dimension so that high-frequency bands remain intact while very low frequencies are fully scaled.

Besides re-mapping the rotary angles, YaRN rescales the attention logits by attention_factor (computed via m-scale) to compensate for the larger effective context.
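The blending described above can be sketched in plain Python. The sketch follows the formulation in the YaRN paper as it is commonly implemented; whether olmo_core rounds and clamps the ramp boundaries in exactly this way is an assumption, and the helper name yarn_inv_freq is hypothetical.

```python
import math


def yarn_inv_freq(theta, dim, factor=8.0, beta_fast=32, beta_slow=1,
                  old_context_len=8192):
    """Sketch of YaRN: blend interpolated and extrapolated frequencies with a linear ramp."""

    def correction_dim(num_rotations):
        # Pair index whose frequency completes num_rotations turns over old_context_len.
        return (dim * math.log(old_context_len / (num_rotations * 2 * math.pi))
                / (2 * math.log(theta)))

    low = max(math.floor(correction_dim(beta_fast)), 0)
    high = min(math.ceil(correction_dim(beta_slow)), dim - 1)
    if low == high:
        high += 0.001  # avoid a zero-width ramp

    inv_freq = []
    for j in range(dim // 2):
        extrapolation = theta ** (-2 * j / dim)  # frequency the model was trained with
        interpolation = extrapolation / factor   # PI-style compressed frequency
        # ramp is 0 for high-frequency pairs and 1 for low-frequency pairs.
        ramp = min(max((j - low) / (high - low), 0.0), 1.0)
        inv_freq.append(interpolation * ramp + extrapolation * (1.0 - ramp))
    return inv_freq


base = [500_000 ** (-2 * j / 128) for j in range(64)]
scaled = yarn_inv_freq(theta=500_000, dim=128)
```

In this sketch the fastest pairs keep their original frequency, the slowest pairs are divided by factor, and the pairs inside the ramp fall between the two.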

factor: float = 8.0

Context expansion multiplier (e.g. a factor of 8 gives a roughly 8-times longer context length).

beta_fast: int = 32

Dimensional cut-off that delimits the start (high-freq) of the ramp region.

beta_slow: int = 1

Dimensional cut-off that delimits the end (low-freq) of the ramp region.

old_context_len: int = 8192

Maximum sequence length that the base model was originally trained with.

compute_scaled_inv_freq(theta, dim, device)[source]

Compute the scaled inverse frequencies for RoPE, and the attention rescaling factor.

Return type:

tuple[torch.Tensor, float]

get_attention_rescale_factor()[source]

Compute the attention rescale factor based on section 3.4 of the YaRN paper.

Return type:

float
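For reference, the YaRN paper recommends a logarithmic rule for this factor. The sketch below reproduces that rule in plain Python; the function name is hypothetical, and treating factor <= 1 as "no rescaling" is an assumption borrowed from common implementations rather than something stated here.

```python
import math


def yarn_attention_factor(factor: float) -> float:
    """Rule from the YaRN paper: 0.1 * ln(factor) + 1."""
    if factor <= 1.0:
        return 1.0  # assumption: no context extension means no rescaling
    return 0.1 * math.log(factor) + 1.0


print(round(yarn_attention_factor(8.0), 3))  # 1.208
```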

to_hf_config()[source]

YaRN scaling corresponds to HF’s yarn scaling.

Return type:

dict

class olmo_core.nn.rope.RotaryEmbeddingBase(*, head_size, theta=500000, full_precision=True, cache=None, scaling=None)[source]

Bases: Module

Base class for RoPE implementations.

abstract warmup_cache(max_seq_len, device)[source]

Warm up the buffer cache.

abstract get_buffers(max_seq_len, device)[source]

Get the cached buffers.

Return type:

RoPEBuffers

class olmo_core.nn.rope.RotaryEmbedding(*, head_size, theta=500000, full_precision=True, cache=None, scaling=None)[source]

Bases: RotaryEmbeddingBase

Rotary positional embeddings (RoPE).

Parameters:
  • head_size (int) – The size of the attention heads.

  • theta (int, default: 500000) – The theta base value to use.

  • full_precision (bool, default: True) – Always apply RoPE in full precision regardless of the input data type.

  • scaling (Optional[RoPEScalingConfig], default: None) – The scaling config.

warmup_cache(max_seq_len, device)[source]

Warm up the buffer cache.

get_buffers(max_seq_len, device)[source]

Get the cached buffers.

Return type:

RoPEBuffers
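The contents of RoPEBuffers are not described here. As a rough mental model only, a RoPE cache typically stores one sine and one cosine value per position and frequency, computed once for max_seq_len and reused on later calls. The sketch below builds such tables in plain Python; the function name and the table layout are assumptions.

```python
import math


def build_sin_cos_tables(max_seq_len: int, inv_freq: list[float]):
    """Return (pos_sin, pos_cos), each with max_seq_len rows and len(inv_freq) columns."""
    pos_sin = [[math.sin(m * f) for f in inv_freq] for m in range(max_seq_len)]
    pos_cos = [[math.cos(m * f) for f in inv_freq] for m in range(max_seq_len)]
    return pos_sin, pos_cos


inv_freq = [500_000 ** (-i / 8) for i in range(0, 8, 2)]  # toy head_size of 8
pos_sin, pos_cos = build_sin_cos_tables(max_seq_len=16, inv_freq=inv_freq)

# Position 0 is the identity rotation: sin = 0 and cos = 1 for every frequency.
print(pos_sin[0], pos_cos[0])
```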

forward(q, k, head_first=True, start_pos=None, pos_sin=None, pos_cos=None, freqs_cis=None, cu_doc_lens=None)[source]

Apply RoPE to query (q) and key (k) matrices.

Parameters:
  • q (Tensor) – The query matrix of shape (batch_size, num_heads, seq_len, head_size) if head_first (the default) otherwise (batch_size, seq_len, num_heads, head_size).

  • k (Tensor) – The key matrix of shape (batch_size, num_kv_heads, seq_len, head_size) if head_first (the default) otherwise (batch_size, seq_len, num_kv_heads, head_size).

  • head_first (bool, default: True) – If the head dim comes before the sequence dim.

  • start_pos (Optional[int], default: None) – The absolute position of the first query token (e.g. during decoding, where the first query token is just the most recently decoded token).

  • cu_doc_lens (Optional[Tensor], default: None) – Cumulative document lengths for intra-document RoPE in packed inputs. When supplied, each document’s tokens receive positions starting from 0 (matching per-document forwards). Mutually exclusive with start_pos.

Return type:

Tuple[Tensor, Tensor]

Returns:

The query and key matrices after RoPE has been applied.
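The effect of start_pos and cu_doc_lens on the position indices can be sketched in plain Python. The helper rope_positions is hypothetical, and the sketch assumes that cu_doc_lens starts at 0 and ends at the total number of tokens, which this page does not state explicitly.

```python
def rope_positions(seq_len, start_pos=None, cu_doc_lens=None):
    """Sketch of the position index assigned to each token before rotation."""
    if start_pos is not None and cu_doc_lens is not None:
        raise ValueError("start_pos and cu_doc_lens are mutually exclusive")
    if cu_doc_lens is not None:
        positions = []
        # Each consecutive pair of boundaries delimits one packed document.
        for doc_start, doc_end in zip(cu_doc_lens[:-1], cu_doc_lens[1:]):
            positions.extend(range(doc_end - doc_start))  # restart at 0 per document
        return positions
    offset = 0 if start_pos is None else start_pos
    return list(range(offset, offset + seq_len))


print(rope_positions(5))                         # [0, 1, 2, 3, 4]
print(rope_positions(1, start_pos=7))            # [7]
print(rope_positions(5, cu_doc_lens=[0, 3, 5]))  # [0, 1, 2, 0, 1]
```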

class olmo_core.nn.rope.FusedRotaryEmbedding(*, head_size, theta=500000, full_precision=True, cache=None, scaling=None)[source]

Bases: RotaryEmbeddingBase

A “fused” Triton-based implementation of RotaryEmbedding.

Warning

This requires flash-attn to be installed.

Parameters:
  • head_size (int) – The size of the attention heads.

  • theta (int, default: 500000) – The theta base value to use.

  • full_precision (bool, default: True) – Always apply RoPE in full precision regardless of the input data type.

  • scaling (Optional[RoPEScalingConfig], default: None) – The scaling config.

warmup_cache(max_seq_len, device)[source]

Warm up the buffer cache.

get_buffers(max_seq_len, device)[source]

Get the cached buffers.

Return type:

RoPEBuffers

forward(qkv, start_pos=None, pos_sin=None, pos_cos=None, freqs_cis=None)[source]

Apply RoPE to qkv.

Warning

This operates on qkv in place unless full_precision=True and qkv is not in full precision.

Parameters:
  • qkv (Tensor) – The query, key, and value matrix of shape (batch_size, seq_len, 3, n_heads, head_size).

  • start_pos (Optional[int], default: None) – The absolute position of the first query token (e.g. during decoding, where the first query token is just the most recently decoded token).

Return type:

Tensor

Returns:

The qkv tensor after applying RoPE, of the same shape and dtype as the input.

class olmo_core.nn.rope.ComplexRotaryEmbedding(*, head_size, theta=500000, full_precision=True, cache=None, scaling=None)[source]

Bases: RotaryEmbeddingBase

An implementation of RoPE as a rotation in complex space.

Parameters:
  • head_size (int) – The dimensionality of the attention heads.

  • theta (int, default: 500000) – The theta base value to use.

  • full_precision (bool, default: True) – Always apply RoPE in full precision regardless of the input data type.

warmup_cache(max_seq_len, device)[source]

Warm up the buffer cache.

get_buffers(max_seq_len, device)[source]

Get the cached buffers.

Return type:

RoPEBuffers

forward(q, k, head_first=True, start_pos=None, pos_sin=None, pos_cos=None, freqs_cis=None)[source]

Apply RoPE to query (q) and key (k) matrices.

Parameters:
  • q (Tensor) – The query matrix of shape (batch_size, num_heads, seq_len, head_size) if head_first (the default) otherwise (batch_size, seq_len, num_heads, head_size).

  • k (Tensor) – The key matrix of shape (batch_size, num_kv_heads, seq_len, head_size) if head_first (the default) otherwise (batch_size, seq_len, num_kv_heads, head_size).

  • head_first (bool, default: True) – If the head dim comes before the sequence dim.

  • start_pos (Optional[int], default: None) – The absolute position of the first query token (e.g. during decoding, where the first query token is just the most recently decoded token).

Return type:

Tuple[Tensor, Tensor]

Returns:

The query and key matrices after RoPE has been applied.
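The idea of RoPE as a rotation in complex space can be illustrated with Python's built-in complex numbers. This is a toy sketch, not the class's implementation: it assumes that adjacent dimensions (2j, 2j+1) form each complex pair, and the helper names are hypothetical.

```python
import cmath


def rotate_complex(x, position, inv_freq):
    """Rotate each pair (x[2j], x[2j+1]) by the angle position * inv_freq[j]."""
    out = []
    for j, f in enumerate(inv_freq):
        z = complex(x[2 * j], x[2 * j + 1]) * cmath.exp(1j * position * f)
        out.extend([z.real, z.imag])
    return out


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


inv_freq = [500_000 ** (-i / 4) for i in range(0, 4, 2)]  # toy head_size of 4
q, k = [1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0, 0.0]

# The attention score depends only on the relative offset (3 in both cases),
# not on the absolute positions of the query and key.
score_near = dot(rotate_complex(q, 5, inv_freq), rotate_complex(k, 2, inv_freq))
score_far = dot(rotate_complex(q, 105, inv_freq), rotate_complex(k, 102, inv_freq))
print(abs(score_near - score_far) < 1e-9)  # True
```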