data.collator
- class olmo_core.data.collator.DataCollator(pad_token_id, pad_direction='right', label_ignore_index=-100, vocab_size=None)
  Bases: object

  The default data collator used by TextDataLoaderBase subclasses.

  - Parameters:
    - pad_token_id (int) – The token ID to use for padding.
    - pad_direction (PaddingDirection, default: 'right') – The direction to pad instances.
    - label_ignore_index (int, default: -100) – The index to use for ignored labels.
    - vocab_size (Optional[int], default: None) – If set, validate that all token IDs in the collated batch are in [0, vocab_size). This catches out-of-range IDs early with a clear error message, which is especially useful when using torch.compile, where the resulting CUDA error would otherwise be opaque.
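
To make the parameters concrete, the following is a minimal pure-Python sketch of how they might interact. It is not the olmo_core implementation: the function name collate_sketch, the plain-list batch format, the string values for pad_direction, and the choice to emit labels as an unshifted copy of the tokens with padding masked are all assumptions made for illustration.

```python
from typing import Dict, List, Optional


def collate_sketch(
    instances: List[List[int]],
    pad_token_id: int,
    pad_direction: str = "right",
    label_ignore_index: int = -100,
    vocab_size: Optional[int] = None,
) -> Dict[str, List[List[int]]]:
    """Hypothetical stand-in for the documented behavior; not olmo_core code."""
    if pad_direction not in ("left", "right"):
        raise ValueError(f"unknown pad_direction: {pad_direction!r}")

    max_len = max(len(row) for row in instances)
    input_ids: List[List[int]] = []
    labels: List[List[int]] = []
    for row in instances:
        n_pad = max_len - len(row)
        pad = [pad_token_id] * n_pad
        # Padded positions get label_ignore_index so a loss can skip them.
        ignore = [label_ignore_index] * n_pad
        if pad_direction == "right":
            input_ids.append(row + pad)
            labels.append(row + ignore)
        else:
            input_ids.append(pad + row)
            labels.append(ignore + row)

    if vocab_size is not None:
        # Check the collated token IDs (padding included) against [0, vocab_size).
        for i, row in enumerate(input_ids):
            for tok in row:
                if not 0 <= tok < vocab_size:
                    raise ValueError(
                        f"token ID {tok} in row {i} is outside [0, {vocab_size})"
                    )

    return {"input_ids": input_ids, "labels": labels}


batch = collate_sketch([[5, 6, 7], [8]], pad_token_id=0)
left_batch = collate_sketch([[5, 6, 7], [8]], pad_token_id=0, pad_direction="left")
```

In this sketch the range check runs on the host and raises an ordinary Python exception that names the offending ID, which is the kind of early, readable failure the vocab_size description refers to.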