data.types

class olmo_core.data.types.LongDocStrategy(value)

Bases: StrEnum

Specifies how to handle documents that are longer than the max sequence length when packing.

truncate = 'truncate'

Long docs are truncated and the excess tokens are discarded.

fragment = 'fragment'

Long docs are split into smaller docs so that no tokens are discarded, at the cost of fragmenting the original documents.
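The difference between the two strategies can be sketched in plain Python. This is only an illustration of the behavior described above, not the library's packing code: `LongDocStrategySketch`, `handle_long_doc`, and `max_len` are hypothetical names, and cutting at fixed `max_len` boundaries is an assumption.

```python
from enum import Enum
from typing import List


class LongDocStrategySketch(str, Enum):
    """Hypothetical stand-in that mirrors the two documented values."""

    truncate = "truncate"
    fragment = "fragment"


def handle_long_doc(
    tokens: List[int], max_len: int, strategy: LongDocStrategySketch
) -> List[List[int]]:
    """Return the document(s) produced from one input document."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if len(tokens) <= max_len:
        # Short documents are unaffected by either strategy.
        return [tokens]
    if strategy == LongDocStrategySketch.truncate:
        # Keep the first max_len tokens and discard the rest.
        return [tokens[:max_len]]
    # fragment: split into consecutive chunks so that no token is lost.
    # Assumption: chunks are cut at fixed max_len boundaries.
    return [tokens[i : i + max_len] for i in range(0, len(tokens), max_len)]


doc = list(range(10))
truncated = handle_long_doc(doc, 4, LongDocStrategySketch.truncate)
fragmented = handle_long_doc(doc, 4, LongDocStrategySketch.fragment)
```

Here `truncated` holds a single 4-token document and drops 6 tokens, while `fragmented` holds three documents that together contain all 10 tokens.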

class olmo_core.data.types.NumpyDatasetDType(value)

Bases: StrEnum

Supported numpy unsigned integer data types for datasets.

as_np_dtype()

Convert the enum value to its corresponding numpy dtype.

Return type:

Union[Type[uint8], Type[uint16], Type[uint32], Type[uint64]]

Returns:

The numpy unsigned integer dtype corresponding to this enum value.
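As a rough sketch of what such a conversion might look like, the stand-in enum below maps string values to numpy dtypes. `DTypeSketch` is a hypothetical class, not the library's `NumpyDatasetDType`. Its member names are assumed from the return type listed above, and the `getattr` lookup is one possible implementation rather than the actual one.

```python
from enum import Enum

import numpy as np


class DTypeSketch(str, Enum):
    """Hypothetical stand-in; member names are assumed from the return type."""

    uint8 = "uint8"
    uint16 = "uint16"
    uint32 = "uint32"
    uint64 = "uint64"

    def as_np_dtype(self):
        # Assumption: each enum value matches the numpy attribute name.
        return getattr(np, self.value)


# A string (for example from a config file) is turned into a numpy dtype.
dtype = DTypeSketch("uint16").as_np_dtype()
token_ids = np.array([0, 1, 65535], dtype=dtype)
max_token_id = int(np.iinfo(dtype).max)
```

The dtype bounds the largest token ID a dataset can store, so a vocabulary larger than `max_token_id + 1` would need a wider type.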