data.collator

class olmo_core.data.collator.DataCollator(pad_token_id, pad_direction='right', label_ignore_index=-100, vocab_size=None)

Bases: object

The default data collator used by TextDataLoaderBase subclasses.

Parameters:
  • pad_token_id (int) – The token ID to use for padding.

  • pad_direction (PaddingDirection, default: 'right') – The direction in which to pad instances.

  • label_ignore_index (int, default: -100) – The index to use for ignored labels.

  • vocab_size (Optional[int], default: None) – If set, validate that all token IDs in the collated batch are in [0, vocab_size). This catches out-of-range IDs early with a clear error message, which is especially useful with torch.compile, where the resulting CUDA error would otherwise be opaque.
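
The range check described for vocab_size can be pictured with a small, self-contained sketch. This is not the library's implementation: the helper name check_token_ids is hypothetical, it works on plain Python lists rather than tensors, and the error type and message are assumptions.

```python
from typing import Optional, Sequence


def check_token_ids(batch: Sequence[Sequence[int]], vocab_size: Optional[int]) -> None:
    """Raise ValueError if any token ID falls outside [0, vocab_size).

    Hypothetical helper for illustration only; the real collator may
    perform this check differently (for example, on tensors).
    """
    if vocab_size is None:
        return  # validation is skipped, matching the documented default of None
    for row, ids in enumerate(batch):
        for col, token_id in enumerate(ids):
            # Valid IDs satisfy 0 <= token_id < vocab_size (half-open interval).
            if not 0 <= token_id < vocab_size:
                raise ValueError(
                    f"token ID {token_id} at position ({row}, {col}) "
                    f"is outside [0, {vocab_size})"
                )
```

Failing on the CPU at collation time gives an error that names the offending ID, instead of an out-of-bounds embedding lookup surfacing later on the device.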

__call__(items)

Create a batch from a sequence of instances.

Return type:

Dict[str, Any]
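
To make the roles of pad_token_id, pad_direction, and label_ignore_index concrete, here is a toy collator over plain Python lists. It is a sketch under stated assumptions, not the behavior of DataCollator itself: the function name toy_collate, the output keys "input_ids" and "labels", the unshifted labels, and the use of 'left' as the alternative padding direction are all illustrative choices.

```python
from typing import Any, Dict, List, Sequence


def toy_collate(
    items: Sequence[Sequence[int]],
    pad_token_id: int,
    pad_direction: str = "right",
    label_ignore_index: int = -100,
) -> Dict[str, Any]:
    """Pad variable-length token sequences to a common length.

    Hypothetical stand-in for illustration; the keys and label handling
    are assumptions, not the library's documented output.
    """
    max_len = max(len(ids) for ids in items)
    input_ids: List[List[int]] = []
    labels: List[List[int]] = []
    for ids in items:
        n_pad = max_len - len(ids)
        pad = [pad_token_id] * n_pad            # filler for the inputs
        ignore = [label_ignore_index] * n_pad   # filler the loss should skip
        if pad_direction == "right":
            input_ids.append(list(ids) + pad)
            labels.append(list(ids) + ignore)
        elif pad_direction == "left":
            input_ids.append(pad + list(ids))
            labels.append(ignore + list(ids))
        else:
            raise ValueError(f"unknown pad_direction: {pad_direction!r}")
    return {"input_ids": input_ids, "labels": labels}


batch = toy_collate([[5, 6, 7], [8]], pad_token_id=0)
```

With the default right padding, batch["input_ids"] is [[5, 6, 7], [8, 0, 0]] and batch["labels"] is [[5, 6, 7], [8, -100, -100]]. The value -100 matches the default ignore_index of PyTorch's cross-entropy loss, so padded positions contribute nothing to the loss.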