nn.rope
- class olmo_core.nn.rope.RoPEType(value)
Bases: StrEnum
An enumeration of the different RoPE implementations.
- default = 'default'
- fused = 'fused'
- complex = 'complex'
- class olmo_core.nn.rope.RoPEConfig(name='default', theta=500000, full_precision=True, no_global_rope=False, scaling=None)
Bases: ModuleConfig
A config for conveniently building any of the different RoPE classes.
See the individual RotaryEmbedding subclasses for a description of the configuration options.
- full_precision: bool = True
Whether to always apply RoPE in full precision regardless of the input data type.
- scaling: Optional[RoPEScalingConfig] = None
The scaling config to apply to RoPE.
- class olmo_core.nn.rope.RoPEScalingConfig
Bases: Config
Base class for RoPE scaling configs. Defines a strategy for scaling RoPE to longer sequences.
- class olmo_core.nn.rope.ABFRoPEScalingConfig(attention_rescale_factor=1.0, new_theta=8000000)
Bases: RoPEScalingConfig
Absolute base frequency scaling (ABF). Simply uses a new base frequency parameter.
- attention_rescale_factor: float = 1.0
Factor to rescale attention scores by when using scaled RoPE. Can be used to compensate for the larger effective context. 1.0 means no rescaling.
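To illustrate what swapping the base does, the sketch below computes the standard RoPE inverse frequencies `theta ** (-2i / head_size)` in plain Python and compares the default `theta=500000` with the default `new_theta=8000000`. The helper name `inv_freqs` and the value `head_size=128` are illustrative assumptions, not part of olmo_core.

```python
import math

def inv_freqs(theta: float, head_size: int) -> list[float]:
    """Standard RoPE inverse frequencies: theta ** (-2i / head_size)."""
    return [theta ** (-2 * i / head_size) for i in range(head_size // 2)]

head_size = 128  # illustrative value, not taken from the library
base = inv_freqs(500_000, head_size)      # default theta in the signatures above
scaled = inv_freqs(8_000_000, head_size)  # default new_theta

# The highest-frequency pair (i = 0) is identical under both bases ...
assert base[0] == scaled[0] == 1.0
# ... while every other pair rotates more slowly with the larger base.
assert all(s < b for b, s in zip(base[1:], scaled[1:]))

longest_base = 2 * math.pi / base[-1]
longest_scaled = 2 * math.pi / scaled[-1]
print(f"longest wavelength: {longest_base:.0f} -> {longest_scaled:.0f} positions")
```

A larger base stretches the long wavelengths the most, which is why changing only this one parameter can extend the usable context.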
- class olmo_core.nn.rope.PIRoPEScalingConfig(attention_rescale_factor=1.0, factor=2.0)
Bases: RoPEScalingConfig
Position-Interpolation (PI) RoPE scaling from Chen et al. (https://arxiv.org/pdf/2306.15595).
Interpolate the rotary angles instead of extrapolating them when the context window at inference time exceeds the window used during training. In practice, this amounts to linearly compressing the original position indices by a constant factor, factor.
- attention_rescale_factor: float = 1.0
Factor to rescale attention scores by when using scaled RoPE. Can be used to compensate for the larger effective context. 1.0 means no rescaling.
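The compression described above can be sketched in a few lines of plain Python. The helper `rope_angle`, the chosen rotary pair, and the 8192-token training window are illustrative assumptions; only `factor=2.0` and `theta=500000` come from the signatures on this page.

```python
import math

def rope_angle(position: float, inv_freq: float) -> float:
    """Rotation angle applied to one rotary pair at a given position."""
    return position * inv_freq

factor = 2.0                           # default of PIRoPEScalingConfig
inv_freq = 500_000 ** (-2 * 10 / 128)  # one arbitrary rotary pair (i=10, head_size=128)

position = 12_000               # beyond a hypothetical 8192-token training window
compressed = position / factor  # PI maps it back inside the trained range

# Compressing the position is the same as dividing the inverse frequency.
assert math.isclose(
    rope_angle(compressed, inv_freq),
    rope_angle(position, inv_freq / factor),
)
print(compressed)  # 6000.0
```

Because the two views are equivalent, an implementation can apply PI once to the inverse frequencies instead of rescaling positions on every call.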
- class olmo_core.nn.rope.StepwiseRoPEScalingConfig(attention_rescale_factor=1.0, factor=32.0, low_freq_proportion=0.0, high_freq_proportion=0.25, old_context_len=8192)
Bases: RoPEScalingConfig
Step-wise RoPE scaling (aka “Per-frequency” scaling or Llama-3.1 scaling).
Reference: Llama-3.1-8B README
Scales RoPE to longer sequence lengths by interpolating between high- and low-frequency components.
- High-frequency band (short wavelengths) – keeps the original frequencies unchanged. These correspond to the very first dimensions of the rotary embedding and already encode short-range ordering well.
- Low-frequency band (long wavelengths) – divides the original inverse frequency by factor (equivalently, multiplies the wavelength by factor). This has the effect of spreading the very low frequencies across a longer context window (similar to PI scaling).
- Medium-frequency band – linearly interpolates (in inverse-frequency space) between the unscaled and the fully-scaled value so that the full spectrum changes smoothly.
- attention_rescale_factor: float = 1.0
Factor to rescale attention scores by when using scaled RoPE. Can be used to compensate for the larger effective context. 1.0 means no rescaling.
- factor: float = 32.0
Context expansion multiplier applied to the long-wavelength part of the spectrum.
- low_freq_proportion: float = 0.0
Proportion of the spectrum that is considered low-frequency. This is translated into a concrete wavelength that represents the upper bound of the low-frequency band.
- high_freq_proportion: float = 0.25
Proportion of the spectrum that is considered high-frequency. This is translated into a concrete wavelength that represents the lower bound of the high-frequency band.
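The three bands can be sketched as follows. This is a minimal plain-Python illustration of the per-frequency recipe, not olmo_core's implementation: the function `stepwise_scale` is hypothetical, and it takes the two band boundaries directly as wavelengths (`low_freq_wavelen`, `high_freq_wavelen`) because how the library derives them from `low_freq_proportion` and `high_freq_proportion` is not described here. The thresholds 8192 and 2048 are illustrative.

```python
import math

def stepwise_scale(inv_freq, factor, low_freq_wavelen, high_freq_wavelen):
    """Scale inverse frequencies band by band.

    Assumes high_freq_wavelen < low_freq_wavelen (both in positions).
    """
    scaled = []
    for f in inv_freq:
        wavelen = 2 * math.pi / f
        if wavelen < high_freq_wavelen:
            # High-frequency band: keep the original frequency.
            scaled.append(f)
        elif wavelen > low_freq_wavelen:
            # Low-frequency band: fully scaled.
            scaled.append(f / factor)
        else:
            # Medium band: linear blend in inverse-frequency space.
            # smooth is 1 at the high-frequency edge, 0 at the low-frequency edge.
            smooth = (1 / wavelen - 1 / low_freq_wavelen) / (
                1 / high_freq_wavelen - 1 / low_freq_wavelen
            )
            scaled.append((1 - smooth) * f / factor + smooth * f)
    return scaled

head_size, theta, factor = 128, 500_000, 32.0
inv_freq = [theta ** (-2 * i / head_size) for i in range(head_size // 2)]
scaled = stepwise_scale(inv_freq, factor, low_freq_wavelen=8192, high_freq_wavelen=2048)

assert scaled[0] == inv_freq[0]             # shortest wavelength is untouched
assert scaled[-1] == inv_freq[-1] / factor  # longest wavelength is fully scaled
```

Pairs whose wavelength falls between the two thresholds end up strictly between the unscaled and fully scaled values, which is what keeps the spectrum continuous.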
- class olmo_core.nn.rope.YaRNRoPEScalingConfig(factor=8.0, beta_fast=32, beta_slow=1, old_context_len=8192)
Bases: RoPEScalingConfig
Yet another RoPE extensioN (YaRN) scaling.
Reference: https://arxiv.org/abs/2309.00071
Extends a model’s context window by blending two sets of inverse frequencies:
- Interpolation frequencies – the original RoPE frequencies divided by factor. These allow the model to compress positions and hence attend across a longer sequence.
- Extrapolation frequencies – the unmodified RoPE frequencies the model was trained with.
A linear ramp (controlled by beta_fast/beta_slow) determines which of the two spectra dominates for each dimension so that high-frequency bands remain intact while very low frequencies are fully scaled.
Besides re-mapping the rotary angles, YaRN rescales the attention logits by attention_factor (computed via m-scale) to compensate for the larger effective context.
- old_context_len: int = 8192
Maximum sequence length that the base model was originally trained with.
- compute_scaled_inv_freq(theta, dim, device)
Compute the scaled inverse frequencies for RoPE, and the attention rescaling factor.
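The blend described above can be sketched in plain Python. This follows the recipe from the YaRN paper as commonly implemented; it is an assumption that olmo_core's `compute_scaled_inv_freq` matches it in every detail (rounding of the ramp bounds, for instance, may differ). The function `yarn_inv_freq` and its internals are hypothetical names, and the `device` argument is omitted because no tensors are involved.

```python
import math

def yarn_inv_freq(theta, dim, factor=8.0, beta_fast=32, beta_slow=1, old_context_len=8192):
    """Return (scaled inverse frequencies, attention factor), YaRN-style."""
    half = dim // 2
    extrapolation = [theta ** (-2 * i / dim) for i in range(half)]
    interpolation = [f / factor for f in extrapolation]

    def correction_dim(num_rotations):
        # Index of the rotary pair that completes `num_rotations` full turns
        # over the original context length.
        return dim * math.log(old_context_len / (num_rotations * 2 * math.pi)) / (
            2 * math.log(theta)
        )

    low = max(math.floor(correction_dim(beta_fast)), 0)
    high = min(math.ceil(correction_dim(beta_slow)), dim - 1)
    if low == high:
        high += 0.001  # avoid division by zero in the ramp

    scaled = []
    for i in range(half):
        # ramp = 0 keeps the trained frequency, ramp = 1 uses the interpolated one.
        ramp = min(max((i - low) / (high - low), 0.0), 1.0)
        scaled.append(extrapolation[i] * (1 - ramp) + interpolation[i] * ramp)

    attention_factor = 0.1 * math.log(factor) + 1.0  # "m-scale" from the YaRN paper
    return scaled, attention_factor

inv_freq, attention_factor = yarn_inv_freq(theta=500_000, dim=128)
assert inv_freq[0] == 1.0  # fastest-rotating pair is left intact
assert math.isclose(inv_freq[-1], 500_000 ** (-126 / 128) / 8.0)  # slowest pair fully scaled
```

With the defaults shown, pairs that rotate more than `beta_fast` times over the old context keep their trained frequency, pairs that rotate fewer than `beta_slow` times are divided by `factor`, and the ramp covers the indices in between.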
- class olmo_core.nn.rope.RotaryEmbeddingBase(*, head_size, theta=500000, full_precision=True, cache=None, scaling=None)
Bases: Module
Base class for RoPE implementations.
- class olmo_core.nn.rope.RotaryEmbedding(*, head_size, theta=500000, full_precision=True, cache=None, scaling=None)
Bases: RotaryEmbeddingBase
Rotary positional embeddings (RoPE).
- Parameters:
head_size (int) – The size of the attention heads.
theta (int, default: 500000) – The theta base value to use.
full_precision (bool, default: True) – Always apply RoPE in full precision regardless of the input data type.
scaling (Optional[RoPEScalingConfig], default: None) – The scaling config.
- forward(q, k, head_first=True, start_pos=None, pos_sin=None, pos_cos=None, freqs_cis=None, cu_doc_lens=None)
Apply RoPE to query (q) and key (k) matrices.
- Parameters:
q (Tensor) – The query matrix of shape (batch_size, num_heads, seq_len, head_size) if head_first (the default), otherwise (batch_size, seq_len, num_heads, head_size).
k (Tensor) – The key matrix of shape (batch_size, num_kv_heads, seq_len, head_size) if head_first (the default), otherwise (batch_size, seq_len, num_kv_heads, head_size).
head_first (bool, default: True) – Whether the head dimension comes before the sequence dimension.
start_pos (Optional[int], default: None) – The absolute position of the first query token (e.g., for decoding, where the first query token is just the most recently decoded token).
cu_doc_lens (Optional[Tensor], default: None) – Cumulative document lengths for intra-document RoPE in packed inputs. When supplied, each document’s tokens receive positions starting from 0 (matching per-document forwards). Mutually exclusive with start_pos.
- Return type:
- Returns:
The query and key matrices after RoPE has been applied.
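The difference between `start_pos` and `cu_doc_lens` comes down to which position index each token receives. The sketch below shows both mappings in plain Python. The helper names are hypothetical, and it is an assumption that `cu_doc_lens` starts at 0 and ends at the total token count (the usual cumulative-sequence-length convention); check the library source for the exact format.

```python
def positions_from_cu_doc_lens(cu_doc_lens: list[int]) -> list[int]:
    """Per-token positions that restart at 0 for every packed document."""
    positions = []
    for start, end in zip(cu_doc_lens[:-1], cu_doc_lens[1:]):
        positions.extend(range(end - start))
    return positions

def positions_from_start_pos(start_pos: int, seq_len: int) -> list[int]:
    """Contiguous positions beginning at an absolute offset (e.g. decoding)."""
    return list(range(start_pos, start_pos + seq_len))

# Three packed documents of lengths 3, 2 and 4.
print(positions_from_cu_doc_lens([0, 3, 5, 9]))
# [0, 1, 2, 0, 1, 0, 1, 2, 3]

# Decoding a single new token after 7 cached tokens.
print(positions_from_start_pos(7, 1))
# [7]
```

The two options describe incompatible position layouts for the same sequence, which is consistent with them being mutually exclusive.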
- class olmo_core.nn.rope.FusedRotaryEmbedding(*, head_size, theta=500000, full_precision=True, cache=None, scaling=None)
Bases: RotaryEmbeddingBase
A “fused” triton-based implementation of RotaryEmbedding.
Warning
This requires flash-attn to be installed.
- Parameters:
head_size (int) – The size of the attention heads.
theta (int, default: 500000) – The theta base value to use.
full_precision (bool, default: True) – Always apply RoPE in full precision regardless of the input data type.
scaling (Optional[RoPEScalingConfig], default: None) – The scaling config.
- forward(qkv, start_pos=None, pos_sin=None, pos_cos=None, freqs_cis=None)
Apply RoPE to qkv.
Warning
This operates on qkv in place unless full_precision=True and qkv is not in full precision.
- Parameters:
- Return type:
- Returns:
The qkv tensor after applying RoPE, of the same shape and dtype as the input.
- class olmo_core.nn.rope.ComplexRotaryEmbedding(*, head_size, theta=500000, full_precision=True, cache=None, scaling=None)
Bases: RotaryEmbeddingBase
An implementation of RoPE as a rotation in complex space.
- Parameters:
- forward(q, k, head_first=True, start_pos=None, pos_sin=None, pos_cos=None, freqs_cis=None)
Apply RoPE to query (q) and key (k) matrices.
- Parameters:
q (Tensor) – The query matrix of shape (batch_size, num_heads, seq_len, head_size) if head_first (the default), otherwise (batch_size, seq_len, num_heads, head_size).
k (Tensor) – The key matrix of shape (batch_size, num_kv_heads, seq_len, head_size) if head_first (the default), otherwise (batch_size, seq_len, num_kv_heads, head_size).
head_first (bool, default: True) – Whether the head dimension comes before the sequence dimension.
start_pos (Optional[int], default: None) – The absolute position of the first query token (e.g., for decoding, where the first query token is just the most recently decoded token).
- Return type:
- Returns:
The query and key matrices after RoPE has been applied.
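To make "rotation in complex space" concrete, the sketch below applies RoPE to a single head vector with Python's built-in complex numbers. It is an illustration under stated assumptions rather than the library's code: the function `rope_complex` is hypothetical, and pairing consecutive elements `(x[2i], x[2i+1])` is one common convention that may not match olmo_core's layout.

```python
import cmath

def rope_complex(x: list[float], position: int, theta: float = 500_000) -> list[float]:
    """Rotate consecutive pairs (x[2i], x[2i+1]) by position * theta**(-2i/d)."""
    d = len(x)
    out = []
    for i in range(d // 2):
        z = complex(x[2 * i], x[2 * i + 1])
        rotated = z * cmath.exp(1j * position * theta ** (-2 * i / d))
        out.extend([rotated.real, rotated.imag])
    return out

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

q = [0.5, -1.0, 2.0, 0.25]
k = [1.5, 0.5, -0.75, 1.0]

# Attention scores depend only on the relative offset between positions:
# (3, 1) and (103, 101) are both two positions apart.
near = dot(rope_complex(q, 3), rope_complex(k, 1))
far = dot(rope_complex(q, 103), rope_complex(k, 101))
assert abs(near - far) < 1e-9
```

Multiplying by a unit-modulus complex number leaves each pair's length unchanged, so the rotation encodes position without altering vector norms.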