nn.functional

Common nn function implementations.

olmo_core.nn.functional.cross_entropy_loss(logits, labels, *, ignore_index=-100, reduction='mean', compute_z_loss=False, z_loss_multiplier=0.0001)

Cross entropy loss that optionally computes the softmax auxiliary loss (z-loss) as well.

Parameters:
  • logits (Tensor) – Predicted unnormalized logits with shape (N, vocab_size).

  • labels (Tensor) – Ground truth class indices with shape (N,).

  • ignore_index (int, default: -100) – Specifies a target value that is ignored and does not contribute to the input gradient.

  • reduction (Literal['mean', 'sum', 'none'], default: 'mean') – Specifies the reduction to apply to the output. Can be “none”, “mean”, or “sum”.

  • compute_z_loss (bool, default: False) – Compute the softmax auxiliary loss as well.

  • z_loss_multiplier (float, default: 0.0001) – The multiplier to apply to the z-loss.

Return type:

Tuple[Tensor, Optional[Tensor]]

Returns:

The cross entropy loss and, if compute_z_loss is True, the z-loss (otherwise None).
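To make the two returned values concrete, here is a minimal dependency-free sketch of what they are commonly taken to mean. It assumes the z-loss follows the usual definition (the multiplier times the squared log-sum-exp of each row of logits) and that ignored positions are excluded from the mean. The helper name `reference_ce_and_z_loss` is hypothetical, and this is not the library's implementation.

```python
import math
from typing import List, Optional, Tuple


def reference_ce_and_z_loss(
    logits: List[List[float]],
    labels: List[int],
    ignore_index: int = -100,
    compute_z_loss: bool = False,
    z_loss_multiplier: float = 1e-4,
) -> Tuple[float, Optional[float]]:
    """Mean-reduced cross entropy and (optionally) z-loss over non-ignored rows.

    Assumed definitions, for illustration only.
    """
    ce_terms, z_terms = [], []
    for row, label in zip(logits, labels):
        if label == ignore_index:
            continue  # ignored targets contribute nothing
        # Numerically stable log-sum-exp: log Z = max + log(sum(exp(x - max))).
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        ce_terms.append(log_z - row[label])  # -log softmax(row)[label]
        z_terms.append(log_z ** 2)           # grows as log Z moves away from 0
    if not ce_terms:
        raise ValueError("all labels are ignored")
    ce = sum(ce_terms) / len(ce_terms)
    if not compute_z_loss:
        return ce, None
    return ce, z_loss_multiplier * sum(z_terms) / len(z_terms)


logits = [[2.0, 0.0, -1.0], [0.0, 0.0, 0.0], [5.0, 1.0, 1.0]]
labels = [0, 2, -100]  # the last row is ignored
ce, z = reference_ce_and_z_loss(logits, labels, compute_z_loss=True)
```

With compute_z_loss left at False, the second element of the returned tuple is None, mirroring the Tuple[Tensor, Optional[Tensor]] return type above.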

olmo_core.nn.functional.fused_linear_cross_entropy_loss(_input, weight, labels, *, bias=None, ignore_index=-100, reduction='mean', compute_z_loss=False, z_loss_multiplier=0.0001, ce_weight=None, label_smoothing=0.0, softcap=None, accum_dtype=None)

Cross entropy loss fused with the linear layer that computes the logits, which avoids materializing the large logits tensor. In addition, this function computes gradients during the forward pass (which is valid when CrossEntropyLoss comes last), so _input and labels do not need to be stored for the backward pass.

Parameters:
  • _input (Tensor) – The inputs to pass through the linear layer to produce the logits, with shape (N, D).

  • weight (Tensor) – The weight of the linear layer.

  • labels (Tensor) – Ground truth class indices with shape (N,).

  • bias (Optional[Tensor], default: None) – Optional bias for the linear layer.

  • ignore_index (int, default: -100) – Specifies a target value that is ignored and does not contribute to the input gradient.

  • reduction (Literal['mean', 'sum', 'none'], default: 'mean') – Specifies the reduction to apply to the output. Can be “none”, “mean”, or “sum”.

  • compute_z_loss (bool, default: False) – Compute the softmax auxiliary loss as well.

  • z_loss_multiplier (float, default: 0.0001) – The multiplier to apply to the z-loss.

  • accum_dtype (Optional[dtype], default: None) – The dtype of the intermediate result buffers used to accumulate the weight and bias gradients. Setting accum_dtype to a higher precision, e.g. torch.float32, is recommended if training is unstable in the original dtype. Defaults to accumulating in the original dtype.

Return type:

Tuple[Tensor, Optional[Tensor]]

Returns:

The cross entropy loss and, if compute_z_loss is True, the z-loss (otherwise None).
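The memory saving described for the fused variant comes from never holding all N × vocab_size logits at once. The following dependency-free sketch illustrates that idea with a simple row-chunked loop. The helper name `chunked_linear_ce`, the `chunk_size` parameter, and the (vocab_size, D) weight layout are assumptions made for illustration. The real function also computes gradients during the forward pass, which is not shown here.

```python
import math
from typing import List, Optional


def chunked_linear_ce(
    inputs: List[List[float]],   # (N, D)
    weight: List[List[float]],   # (V, D): one row per vocabulary entry (assumed layout)
    labels: List[int],           # (N,)
    bias: Optional[List[float]] = None,
    ignore_index: int = -100,
    chunk_size: int = 2,         # hypothetical knob, not a parameter of the real function
) -> float:
    """Mean cross entropy, building logits for at most `chunk_size` rows at a time."""
    total, count = 0.0, 0
    for start in range(0, len(inputs), chunk_size):
        chunk = inputs[start:start + chunk_size]
        chunk_labels = labels[start:start + chunk_size]
        # Logits exist only for this chunk: at most (chunk_size, V) values.
        chunk_logits = [
            [
                sum(x * w for x, w in zip(row, w_row))
                + (bias[v] if bias is not None else 0.0)
                for v, w_row in enumerate(weight)
            ]
            for row in chunk
        ]
        for row_logits, label in zip(chunk_logits, chunk_labels):
            if label == ignore_index:
                continue  # ignored targets contribute nothing
            m = max(row_logits)
            log_z = m + math.log(sum(math.exp(x - m) for x in row_logits))
            total += log_z - row_logits[label]
            count += 1
    return total / count if count else float("nan")


inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # N=3, D=2
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # V=3, D=2
labels = [0, 1, -100]                          # the last row is ignored
loss_small = chunked_linear_ce(inputs, weight, labels, chunk_size=1)
loss_full = chunked_linear_ce(inputs, weight, labels, chunk_size=3)
```

The chunk size changes only how many logits are alive at any moment, not the result, so `loss_small` and `loss_full` agree.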