data.numpy_dataset

class olmo_core.data.numpy_dataset.NumpyDatasetBase(*paths, pad_token_id, eos_token_id, vocab_size, dtype=<class 'numpy.uint16'>, bos_token_id=None)[source]

Bases: ABC

An abstract base class for datasets backed by on-disk numpy arrays of token IDs.

In general the instances that these datasets produce are sequences of token IDs from one or more numpy arrays, sometimes with additional metadata attached. The way those instances are formed depends on the implementation details of the subclass.

Warning

When using NumpyDatasetBase implementations in a distributed setting, be sure that work_dir is shared among all local ranks and that fs_local_rank is set accordingly. Once those fields are set, call prepare() in the main process before doing anything else.

Tip

Use the dataset config helpers (e.g. NumpyFSLDatasetConfig) to configure and construct datasets instead of constructing them directly.

abstract property max_sequence_length: int

The maximum sequence length of any instances generated by this dataset.

property paths: Tuple[Path | PathLike | str, ...]

Paths and/or URLs to the numpy arrays.

property file_sizes: Tuple[int, ...]

The size, in bytes, of each numpy array.

property dtype: Type[uint8] | Type[uint16] | Type[uint32] | Type[uint64]

The numpy datatype of the arrays.

property fingerprint_version: str

The version of the fingerprint.

property fingerprint_fields: Tuple[str, ...]

Extra values to include when calculating the data contents fingerprint.

property fingerprint: str

Used to compare the contents of a dataset.
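This page does not describe how the fingerprint is computed. As a rough illustration of the idea only, the sketch below hashes file sizes together with a few extra field values, so that two datasets with the same files but different settings compare as different. The function name, the hashed fields, and the use of SHA-256 are all assumptions, not the library's implementation.

```python
import hashlib
from typing import Sequence


def content_fingerprint(
    file_sizes: Sequence[int], extra_fields: Sequence[str], version: str = "v1"
) -> str:
    """Hash a version tag, file sizes, and extra field values into a hex digest."""
    sha = hashlib.sha256()
    sha.update(f"version={version};".encode())
    for size in file_sizes:
        sha.update(f"size={size};".encode())
    for field in extra_fields:
        # Hypothetical extra values, e.g. settings that change how instances are formed.
        sha.update(f"field={field};".encode())
    return sha.hexdigest()


fp_a = content_fingerprint([1024, 2048], ["sequence_length=512"])
fp_b = content_fingerprint([1024, 2048], ["sequence_length=1024"])
```

Here `fp_a` and `fp_b` differ even though the file sizes match, which is the property a contents fingerprint needs in order to detect a changed dataset configuration.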

property work_dir_set: bool

Check if the working directory was explicitly set.

property num_tokens: int

Get the total number of tokens in the dataset.

map(func, *, max_workers=None, method='threads', _paths=None)[source]

Call a function on each path in the dataset, returning a list of the results, in order.

Parameters:
  • func (Callable[[Union[Path, PathLike, str], int], TypeVar(T)]) – The function to map to the paths and their indices.

  • max_workers (Optional[int], default: None) – The number of worker threads/processes. Set to 0 to execute synchronously in the main thread/process.

  • method (Literal['threads', 'processes'], default: 'threads') – Whether to use multi-threading or multi-processing.

Return type:

List[TypeVar(T)]

Returns:

The results, in the same order as paths.
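The ordering guarantee can be sketched with the standard library. This is a minimal stand-in for the threaded case, assuming a hypothetical helper `map_paths`; it is not the method's actual implementation.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional, Sequence, TypeVar

T = TypeVar("T")


def map_paths(
    paths: Sequence[str],
    func: Callable[[str, int], T],
    max_workers: Optional[int] = None,
) -> List[T]:
    """Apply func(path, index) to every path, preserving input order."""
    if max_workers == 0:
        # Synchronous execution in the calling thread.
        return [func(path, idx) for idx, path in enumerate(paths)]
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Executor.map yields results in submission order, not completion order.
        return list(executor.map(func, paths, range(len(paths))))


# Example: tag each path with its index.
results = map_paths(["a.npy", "b.npy", "c.npy"], lambda path, idx: f"{idx}:{path}")
```

Because `Executor.map` returns results in submission order, the output lines up with the input paths no matter which worker finishes first.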

prepare()[source]

Perform any necessary preparation.

Warning

Be sure to set work_dir properly before calling this and only call this from the main process (not a worker process).

abstract __len__()[source]

Get the number of instances in the dataset.

Return type:

int

abstract __getitem__(index)[source]

Get an instance from the dataset. At a minimum this will contain the field “input_ids”, an integer tensor of token IDs.

Return type:

Dict[str, Any]

class olmo_core.data.numpy_dataset.NumpyFSLDatasetBase(*paths, sequence_length, pad_token_id, eos_token_id, vocab_size, dtype=<class 'numpy.uint16'>, metadata=None, include_instance_metadata=None, generate_doc_lengths=False, bos_token_id=None, instance_filter_config=None, label_mask_paths=None)[source]

Bases: NumpyDatasetBase, Dataset[Dict[str, Any]]

A base class for fixed sequence length (FSL) numpy array-backed datasets.

property max_sequence_length: int

The maximum sequence length of any instances generated by this dataset.

class olmo_core.data.numpy_dataset.NumpyFSLDataset(*paths, sequence_length, pad_token_id, eos_token_id, vocab_size, dtype=<class 'numpy.uint16'>, metadata=None, include_instance_metadata=None, generate_doc_lengths=False, bos_token_id=None, max_target_sequence_length=None, instance_filter_config=None, label_mask_paths=None)[source]

Bases: NumpyFSLDatasetBase

A fixed sequence length (FSL) numpy array-backed dataset.

In this implementation the token IDs from all arrays are concatenated together and then chunked into contiguous blocks of sequence_length tokens to create instances. Therefore documents may be split over multiple instances.

Important

If the length of an array is not a multiple of sequence_length (or of max_target_sequence_length, when set), the remaining tokens are ignored.

Important

No special tokens are added to the input IDs so it’s assumed that if you want EOS tokens between documents, for example, those will already be in the array.
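The chunking behavior described above can be sketched in plain Python. Lists stand in for numpy arrays, and `chunk_into_instances` is a hypothetical helper rather than part of the library; it assumes chunking is done per array, which is what the remainder rule implies.

```python
from typing import List, Sequence


def chunk_into_instances(
    arrays: Sequence[Sequence[int]], sequence_length: int
) -> List[List[int]]:
    """Split each token array into contiguous blocks of sequence_length tokens."""
    instances: List[List[int]] = []
    for tokens in arrays:
        num_instances = len(tokens) // sequence_length  # remainder is dropped
        for i in range(num_instances):
            start = i * sequence_length
            instances.append(list(tokens[start : start + sequence_length]))
    return instances


EOS = 0  # hypothetical EOS token ID, already present in the data
array_a = [5, 6, 7, EOS, 8, 9, 10, 11, EOS, 12]  # 10 tokens
array_b = [13, 14, EOS, 15, 16]  # 5 tokens
instances = chunk_into_instances([array_a, array_b], sequence_length=4)
```

In this toy input the second document of `array_a` is cut off before its EOS token, and the last two tokens of `array_a` and the last token of `array_b` are discarded as remainders.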

Parameters:
  • paths (Union[Path, PathLike, str]) – Paths or URLs to numpy token ID arrays.

  • sequence_length (int) – The number of tokens to chunk together into a single instance. Generally this should correspond to your model’s maximum input length.

  • pad_token_id (int) – The ID of the padding token.

  • eos_token_id (int) – The ID of the EOS token.

  • dtype (Union[Type[uint8], Type[uint16], Type[uint32], Type[uint64]], default: <class 'numpy.uint16'>) – The numpy datatype of the arrays.

  • metadata (Union[List[Dict[str, Any]], Dict[str, Any], None], default: None) – Metadata to add to each item. This should be a dictionary or a list of dictionaries with the same number of items as there are paths.

  • include_instance_metadata (Optional[bool], default: None) – If True (the default), each instance returned from __getitem__() will include the metadata from its source.

  • max_target_sequence_length (Optional[int], default: None) – Optional upper bound used when precomputing cached offsets. If you’re planning a sequence-length warm-up, set this to the final chunk size so future datasets with larger sequence_length values can reuse the exact same document ordering. The current dataset still returns sequence_length-token windows; this hint simply keeps token boundaries and cache files deterministic across warm-up stages. Leave None if you won’t rebuild at a larger length.
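To see why max_target_sequence_length keeps token boundaries stable, consider the arithmetic below. It assumes each array is first cut into chunks of the target length and each chunk is then split evenly, which also requires the target to be a multiple of sequence_length. Both points are assumptions made for illustration, and `num_instances` is a hypothetical helper.

```python
from typing import Optional


def num_instances(
    num_tokens: int, sequence_length: int, max_target_sequence_length: Optional[int] = None
) -> int:
    """Number of sequence_length-token instances taken from one array."""
    if max_target_sequence_length is None:
        return num_tokens // sequence_length
    if max_target_sequence_length % sequence_length != 0:
        raise ValueError("max_target_sequence_length must be a multiple of sequence_length")
    # Chunk at the target length first, then split each chunk evenly.
    num_target_chunks = num_tokens // max_target_sequence_length
    return num_target_chunks * (max_target_sequence_length // sequence_length)


# Tokens actually used from a 10,000-token array at each warm-up stage.
without_hint = [num_instances(10_000, s) * s for s in (1024, 2048, 4096)]
with_hint = [num_instances(10_000, s, 4096) * s for s in (1024, 2048, 4096)]
```

Without the hint, the 1024-token stage consumes 9216 tokens while the later stages consume 8192, so the stages do not see the same data. With the hint set to 4096, every stage consumes the same 8192 tokens.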

property fingerprint_fields: Tuple[str, ...]

Extra values to include when calculating the data contents fingerprint.

property num_tokens: int

Get the total number of tokens in the dataset.

property max_sequence_length: int

The maximum sequence length of any instances generated by this dataset.

property file_sizes: Tuple[int, ...]

The size, in bytes, of each numpy array.

property offsets: Tuple[Tuple[int, int], ...]

Gives the global start and end instance indices for each data file in the dataset.
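A minimal sketch of how such offsets relate to per-file instance counts, assuming half-open ranges where the end index is exclusive. That convention and both helper names are assumptions.

```python
from typing import List, Sequence, Tuple


def instance_offsets(instances_per_file: Sequence[int]) -> Tuple[Tuple[int, int], ...]:
    """Turn per-file instance counts into global (start, end) index ranges."""
    offsets: List[Tuple[int, int]] = []
    start = 0
    for count in instances_per_file:
        offsets.append((start, start + count))  # end is exclusive in this sketch
        start += count
    return tuple(offsets)


def locate(index: int, offsets: Sequence[Tuple[int, int]]) -> Tuple[int, int]:
    """Map a global instance index to (file index, local instance index)."""
    for file_idx, (start, end) in enumerate(offsets):
        if start <= index < end:
            return file_idx, index - start
    raise IndexError(f"{index} is out of bounds")


offsets = instance_offsets([3, 0, 5])
```

With counts of 3, 0, and 5, global index 4 falls in the third file at local position 1. The empty second file gets a zero-width range and is never selected.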

prepare()[source]

Perform any necessary preparation.

Warning

Be sure to set work_dir properly before calling this and only call this from the main process (not a worker process).

__len__()[source]

Get the number of instances in the dataset.

Return type:

int

__getitem__(index)[source]

Get an instance from the dataset. At a minimum this will contain the field “input_ids”, an integer tensor of token IDs.

Return type:

Dict[str, Any]

class olmo_core.data.numpy_dataset.NumpyFSLDatasetMixture(*paths, path_offset_index, seed, sequence_length, pad_token_id, eos_token_id, vocab_size, dtype=<class 'numpy.uint16'>, metadata=None, include_instance_metadata=None, generate_doc_lengths=False, bos_token_id=None, max_target_sequence_length=None, instance_filter_config=None)[source]

Bases: NumpyFSLDataset

A version of NumpyFSLDataset built from a mixture of sources and their expected token ratios relative to each other. A path_offset_index is used to determine the number of instances to retain from a path when constructing the local indices.

prepare()[source]

Perform any necessary preparation.

Warning

Be sure to set work_dir properly before calling this and only call this from the main process (not a worker process).

class olmo_core.data.numpy_dataset.NumpyPaddedFSLDataset(*paths, sequence_length, pad_token_id, eos_token_id, vocab_size, dtype=<class 'numpy.uint16'>, bos_token_id=None, metadata=None, include_instance_metadata=None, instance_filter_config=None, label_mask_paths=None)[source]

Bases: NumpyFSLDataset

An FSL dataset that creates a single instance from each document. The resulting instances will all have exactly sequence_length tokens, using padding if needed.
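The one-document-per-instance idea can be sketched as follows. The sketch assumes documents are delimited by EOS tokens, that documents longer than sequence_length are truncated, and that padding goes on the right. The helper names and the "label_mask" field are illustrative assumptions.

```python
from typing import Dict, List, Sequence


def split_documents(tokens: Sequence[int], eos_token_id: int) -> List[List[int]]:
    """Split a flat token array into documents, each ending with its EOS token."""
    docs: List[List[int]] = []
    current: List[int] = []
    for token in tokens:
        current.append(token)
        if token == eos_token_id:
            docs.append(current)
            current = []
    if current:  # trailing tokens without a closing EOS
        docs.append(current)
    return docs


def pad_document(doc: Sequence[int], sequence_length: int, pad_token_id: int) -> Dict[str, list]:
    """Truncate or right-pad one document to exactly sequence_length tokens."""
    kept = list(doc[:sequence_length])
    num_pad = sequence_length - len(kept)
    return {
        "input_ids": kept + [pad_token_id] * num_pad,
        "label_mask": [True] * len(kept) + [False] * num_pad,  # False marks padding
    }


EOS, PAD = 0, 1  # hypothetical special token IDs
docs = split_documents([5, 6, EOS, 7, 8, 9, 10, 11, EOS], eos_token_id=EOS)
instances = [pad_document(d, sequence_length=4, pad_token_id=PAD) for d in docs]
```

The short first document gains one padding token, while the longer second document loses its tail, including its EOS token.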

property fingerprint_fields: Tuple[str, ...]

Extra values to include when calculating the data contents fingerprint.

property offsets: Tuple[Tuple[int, int], ...]

Gives the global start and end instance indices for each data file in the dataset.

prepare()[source]

Perform any necessary preparation.

Warning

Be sure to set work_dir properly before calling this and only call this from the main process (not a worker process).

__getitem__(index)[source]

Get an instance from the dataset. At a minimum this will contain the field “input_ids”, an integer tensor of token IDs.

Return type:

Dict[str, Any]

class olmo_core.data.numpy_dataset.NumpyPackedFSLDataset(*paths, sequence_length, pad_token_id, eos_token_id, vocab_size, dtype=<class 'numpy.uint16'>, metadata=None, include_instance_metadata=None, generate_doc_lengths=False, bos_token_id=None, instance_filter_config=None, label_mask_paths=None, long_doc_strategy='truncate', source_group_size=1)[source]

Bases: NumpyFSLDatasetBase

An FSL dataset that packs documents into instances using the Optimized Best-Fit Decreasing (OBFD) algorithm described in Fewer Truncations Improve Language Modeling. The resulting instances will all have exactly sequence_length tokens, using padding if needed.

Note

By default OBFD is applied to each source file separately since source files from the Dolma toolkit are usually large enough for OBFD to achieve very good compactness (minimal padding tokens) and so that we can parallelize the packing. However, you can pack instances from multiple consecutive source files together by setting source_group_size to a value greater than 1.
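For intuition, here is a plain best-fit decreasing packer. It is not the optimized variant: it scans every open bin for each document, which is quadratic in the worst case. The function name is hypothetical, and capping long documents at sequence_length is a simplified stand-in for the 'truncate' strategy.

```python
from typing import List, Sequence


def best_fit_decreasing(doc_lengths: Sequence[int], sequence_length: int) -> List[List[int]]:
    """Pack document indices into bins of capacity sequence_length.

    Documents are visited longest first; each goes into the open bin with the
    least remaining space that still fits it, or into a new bin otherwise.
    """
    order = sorted(range(len(doc_lengths)), key=lambda i: doc_lengths[i], reverse=True)
    bins: List[List[int]] = []
    remaining: List[int] = []  # free space left in each bin
    for doc_idx in order:
        length = min(doc_lengths[doc_idx], sequence_length)  # cap over-long documents
        best = None
        for bin_idx, space in enumerate(remaining):
            if length <= space and (best is None or space < remaining[best]):
                best = bin_idx
        if best is None:
            bins.append([doc_idx])
            remaining.append(sequence_length - length)
        else:
            bins[best].append(doc_idx)
            remaining[best] -= length
    return bins


lengths = [5, 3, 7, 2, 8, 1]
packed = best_fit_decreasing(lengths, sequence_length=8)
padding = [8 - sum(min(lengths[i], 8) for i in b) for b in packed]
```

In this example three of the four instances are filled exactly and only the last needs padding. No document is split across instances, which is the point of packing over naive chunking.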

Tip

Although this shares much of its option plumbing with NumpyFSLDataset, it bypasses that subclass and derives from NumpyFSLDatasetBase so it can provide its own packing caches, offsets, and item materialisation logic. Subclassing NumpyFSLDataset would require overriding nearly every behavior defined there.

property fingerprint_fields: Tuple[str, ...]

Extra values to include when calculating the data contents fingerprint.

prepare()[source]

Perform any necessary preparation.

Warning

Be sure to set work_dir properly before calling this and only call this from the main process (not a worker process).

__len__()[source]

Get the number of instances in the dataset.

Return type:

int

__getitem__(index)[source]

Get an instance from the dataset. At a minimum this will contain the field “input_ids”, an integer tensor of token IDs.

Return type:

Dict[str, Any]

class olmo_core.data.numpy_dataset.NumpyInterleavedFSLDataset(*paths, sequence_length, pad_token_id, eos_token_id, vocab_size, seed, docs_per_instance, chunks_per_doc, dtype=<class 'numpy.uint16'>, metadata=None, include_instance_metadata=None, instance_filter_config=None, label_mask_paths=None, bos_token_id=None, interleaving_exempt_paths=None)[source]

Bases: NumpyPaddedFSLDataset

A version of NumpyPaddedFSLDataset that creates a single instance by chunking documents and interleaving these chunks. The resulting instances may be padded out to sequence_length.
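This page does not spell out the interleaving order. One plausible reading is sketched below: each document is split into nearly equal contiguous chunks, and the chunks are merged round-robin before padding. The chunk sizing, the round-robin order, and both helper names are assumptions.

```python
from typing import List, Sequence


def split_into_chunks(doc: Sequence[int], num_chunks: int) -> List[List[int]]:
    """Split a document into num_chunks contiguous, nearly equal pieces."""
    base, extra = divmod(len(doc), num_chunks)
    chunks: List[List[int]] = []
    start = 0
    for i in range(num_chunks):
        size = base + (1 if i < extra else 0)
        chunks.append(list(doc[start : start + size]))
        start += size
    return chunks


def interleave(
    docs: Sequence[Sequence[int]], chunks_per_doc: int, sequence_length: int, pad_token_id: int
) -> List[int]:
    """Round-robin the chunks of several documents into one padded instance."""
    per_doc = [split_into_chunks(doc, chunks_per_doc) for doc in docs]
    tokens: List[int] = []
    for chunk_idx in range(chunks_per_doc):
        for doc_chunks in per_doc:
            tokens.extend(doc_chunks[chunk_idx])
    tokens = tokens[:sequence_length]
    return tokens + [pad_token_id] * (sequence_length - len(tokens))


instance = interleave(
    [[1, 2, 3, 4], [10, 20, 30, 40]], chunks_per_doc=2, sequence_length=10, pad_token_id=0
)
```

Here two documents (docs_per_instance of 2) contribute two chunks each, and the eight resulting tokens are padded out to the requested length of ten.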

property fingerprint_fields: Tuple[str, ...]

Extra values to include when calculating the data contents fingerprint.

__len__()[source]

Get the number of instances in the dataset.

Return type:

int

prepare()[source]

Perform any necessary preparation.

Warning

Be sure to set work_dir properly before calling this and only call this from the main process (not a worker process).

__getitem__(index)[source]

Get an instance from the dataset. At a minimum this will contain the field “input_ids”, an integer tensor of token IDs.

Return type:

Dict[str, Any]

class olmo_core.data.numpy_dataset.VSLCurriculum[source]

Bases: object

Base class for variable sequence length curriculums. These determine the sampling probability of batches from each bucket throughout training with a NumpyVSLDataset.

abstract property short_str: str

Return a unique human-readable identifier for the instance.

class olmo_core.data.numpy_dataset.VSLNaturalCurriculum[source]

Bases: VSLCurriculum

Implements the natural curriculum from Dataset Decomposition: Faster LLM Training with Variable Sequence Length Curriculum.

property short_str: str

Return a unique human-readable identifier for the instance.

class olmo_core.data.numpy_dataset.VSLGrowthCurriculum(num_cycles=8, balanced=False)[source]

Bases: VSLCurriculum

A base class for growth curriculums, like VSLGrowP2Curriculum and VSLGrowLinearCurriculum.

num_cycles: int = 8

The number of cycles in the curriculum.

balanced: bool = False

Whether or not to balance the number of batches in each bucket.

Note

Balancing the number of batches requires dropping more data.

class olmo_core.data.numpy_dataset.VSLGrowP2Curriculum(num_cycles=8, balanced=False)[source]

Bases: VSLGrowthCurriculum

Implements the “Grow-P2” curriculum from Dataset Decomposition: Faster LLM Training with Variable Sequence Length Curriculum.

property short_str: str

Return a unique human-readable identifier for the instance.

class olmo_core.data.numpy_dataset.VSLGrowLinearCurriculum(num_cycles=8, balanced=False)[source]

Bases: VSLGrowthCurriculum

Implements the “Grow-Linear” curriculum from Dataset Decomposition: Faster LLM Training with Variable Sequence Length Curriculum.

property short_str: str

Return a unique human-readable identifier for the instance.

class olmo_core.data.numpy_dataset.NumpyVSLDataset(*paths, pad_token_id, eos_token_id, vocab_size, max_sequence_length, min_sequence_length=256, curriculum=None, dtype=<class 'numpy.uint16'>, metadata=None, include_instance_metadata=None, instance_filter_config=None)[source]

Bases: NumpyDatasetBase, Dataset[Dict[str, Any]]

A variable sequence length (VSL) numpy array-backed dataset. This is used to inject a sequence length-based curriculum during training as introduced in Dataset Decomposition: Faster LLM Training with Variable Sequence Length Curriculum.

This dataset creates instances of token IDs with lengths that are powers of 2 between min_sequence_length (which must be a power of 2) and max_sequence_length (also a power of 2). Some tokens will be discarded unless min_sequence_length is 1.
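The power-of-2 decomposition can be illustrated on lengths alone. The greedy, largest-piece-first order used here is an assumption, and `decompose_length` is a hypothetical helper. Whatever is left below min_sequence_length is what gets discarded.

```python
from typing import List


def decompose_length(doc_length: int, min_len: int, max_len: int) -> List[int]:
    """Greedily split a document length into power-of-2 piece lengths."""
    if min_len < 1 or max_len < min_len:
        raise ValueError("expected 1 <= min_len <= max_len")
    pieces: List[int] = []
    remaining = doc_length
    while remaining >= min_len:
        # Largest power of 2 that fits in what is left, capped at max_len.
        piece = min(1 << (remaining.bit_length() - 1), max_len)
        pieces.append(piece)
        remaining -= piece
    return pieces


pieces = decompose_length(1000, min_len=256, max_len=4096)
discarded = 1000 - sum(pieces)
```

A 1000-token document yields pieces of 512 and 256 tokens and loses the remaining 232. With min_len set to 1, every length decomposes exactly and nothing is lost.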

Important

No special tokens are added to the input IDs so it’s assumed that if you want EOS tokens between documents, for example, those will already be in the array.

Parameters:
  • paths (Union[Path, PathLike, str]) – Paths or URLs to numpy token ID arrays.

  • pad_token_id (int) – The ID of the padding token.

  • eos_token_id (int) – The ID of the EOS token.

  • max_sequence_length (int) – The maximum allowed sequence length. A power of 2, e.g. ‘4096’.

  • min_sequence_length (int, default: 256) – The minimum allowed sequence length. A power of 2, e.g. ‘256’.

  • curriculum (Optional[VSLCurriculum], default: None) – The variable sequence length curriculum. Determines the sampling probability of batches from each bucket throughout training.

  • dtype (Union[Type[uint8], Type[uint16], Type[uint32], Type[uint64]], default: <class 'numpy.uint16'>) – The numpy datatype of the arrays.

  • metadata (Union[List[Dict[str, Any]], Dict[str, Any], None], default: None) – Metadata to add to each item. This should be a dictionary or a list of dictionaries with the same number of items as there are paths.

  • include_instance_metadata (Optional[bool], default: None) – If True (the default), each instance returned from __getitem__() will include the metadata from its source.

property fingerprint_fields: Tuple[str, ...]

Extra values to include when calculating the data contents fingerprint.

property max_sequence_length: int

The maximum sequence length of any instances generated by this dataset.

property offsets: Tuple[Tuple[int, int], ...]

Gives the global start and end instance indices for each data file in the dataset.

prepare()[source]

Perform any necessary preparation.

Warning

Be sure to set work_dir properly before calling this and only call this from the main process (not a worker process).

__len__()[source]

Get the number of instances in the dataset.

__getitem__(index)[source]

Get an instance from the dataset. At a minimum this will contain the field “input_ids”, an integer tensor of token IDs.

Return type:

Dict[str, Any]

get_instance_lengths()[source]

Get a numpy memory-mapped array with the length of every instance in the dataset.

Return type:

ndarray

get_instance_bucket(seq_len)[source]

Get the instance indices in a bucket.

Return type:

ndarray

get_instance_buckets()[source]

Get the buckets of instance indices that all have the same length. The buckets will be sorted from smallest sequence length to longest.

Return type:

List[Tuple[int, ndarray]]
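Grouping instance indices by length can be sketched with a dictionary. Plain lists stand in for the arrays returned by the real methods, and `instance_buckets` is a hypothetical free function.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def instance_buckets(instance_lengths: Sequence[int]) -> List[Tuple[int, List[int]]]:
    """Group instance indices by length, sorted from shortest to longest length."""
    buckets: Dict[int, List[int]] = defaultdict(list)
    for idx, length in enumerate(instance_lengths):
        buckets[length].append(idx)
    return sorted(buckets.items())


buckets = instance_buckets([512, 256, 256, 1024, 512, 256])
instances_per_bucket = tuple((seq_len, len(indices)) for seq_len, indices in buckets)
```

Counting the indices in each bucket gives pairs of (sequence length, number of instances).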

property instances_per_bucket: Tuple[Tuple[int, int], ...]

The number of instances in each bucket.

class olmo_core.data.numpy_dataset.NumpyDatasetConfig(*, tokenizer, paths=None, mix=None, mix_base_dir=None, expand_glob=False, dtype=None, metadata=None, include_instance_metadata=True, instance_filter_config=None, source_permutation_seed=None, work_dir=None, ignore_fingerprint_mismatch=False)[source]

Bases: Config, ABC

Abstract base configuration class for numpy-based datasets.

This abstract base class provides common configuration options and utilities for creating NumpyDatasetBase datasets.

tokenizer: TokenizerConfig

The tokenizer config.

paths: Optional[List[str]] = None

The paths/URLs to the numpy token ID arrays.

mix: Optional[Union[str, DataMixBase]] = None

The name of a data mix (e.g. "dolma17").

mix_base_dir: Optional[str] = None

The base directory of the data mix.

expand_glob: bool = False

If True, treat the paths as globs.

dtype: Optional[NumpyDatasetDType] = None

The numpy datatype of the token ID arrays.

metadata: Optional[List[Dict[str, Any]]] = None

Metadata for the numpy arrays.

include_instance_metadata: bool = True

Whether or not to include the metadata in the instances returned from NumpyDatasetBase.__getitem__().

instance_filter_config: Optional[InstanceFilterConfig] = None

The instance filter config (aka the “ngram filter”) that will be applied to the dataset. This can be used to filter out instances with too many repeated token ngrams.
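The filter's exact criteria are not given on this page. As one way to detect repeated token n-grams, the sketch below counts back-to-back copies of any n-gram of a given length (the "period") and rejects an instance when the count exceeds a threshold. The function names, parameters, and default values are hypothetical.

```python
from typing import Sequence


def max_periodic_repeats(tokens: Sequence[int], period: int) -> int:
    """Largest number of back-to-back copies of any n-gram of length `period`."""
    if len(tokens) < period:
        return 0
    best = 0
    matched = 0  # current run length where tokens[i] == tokens[i - period]
    for i in range(period, len(tokens)):
        matched = matched + 1 if tokens[i] == tokens[i - period] else 0
        best = max(best, matched)
    # A run of `best` matches spans best + period tokens of periodic content.
    return (best + period) // period


def passes_filter(
    tokens: Sequence[int], min_period: int = 1, max_period: int = 4, max_count: int = 3
) -> bool:
    """True if no n-gram with a period in range repeats more than max_count times."""
    return all(
        max_periodic_repeats(tokens, period) <= max_count
        for period in range(min_period, max_period + 1)
    )


repeats = max_periodic_repeats([1, 2, 1, 2, 1, 2, 3], period=2)
```

Here the bigram (1, 2) appears three times in a row, so `repeats` is 3. An instance made of a single repeated token would fail the filter at period 1.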

source_permutation_seed: Optional[int] = None

Used to shuffle the source files before handing off to the dataset class.

work_dir: Optional[str] = None

The dataset working directory. This is used to cache working files like shuffled indices, instance buckets, etc.

Tip

You can save a lot of time and disk space by setting this to a common directory across all of your runs.

ignore_fingerprint_mismatch: bool = False

If True, ignore dataset fingerprint mismatches when loading from a checkpoint. This is used when intentionally switching to a different dataset mix.

abstract build()[source]

Build and return a NumpyDatasetBase instance from this configuration.

Return type:

NumpyDatasetBase

Returns:

The constructed dataset instance.

classmethod glob(*glob_paths, **kwargs)[source]

Initialize a dataset config with glob paths.

Note

Globs are not expanded until build() is called. If any of the globs don’t expand to any matches, a FileNotFoundError is raised.
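The fail-on-empty behavior can be mimicked for local files with the standard library. `expand_globs` is a hypothetical helper; it does not handle URLs and is not the library's implementation.

```python
import glob
import os
import tempfile
from typing import List, Sequence


def expand_globs(patterns: Sequence[str]) -> List[str]:
    """Expand each pattern in order, failing loudly on patterns with no matches."""
    paths: List[str] = []
    for pattern in patterns:
        matches = sorted(glob.glob(pattern))
        if not matches:
            raise FileNotFoundError(f"no files match {pattern!r}")
        paths.extend(matches)
    return paths


with tempfile.TemporaryDirectory() as tmp:
    # Create two empty placeholder files to match against.
    for name in ("part-1.npy", "part-0.npy"):
        open(os.path.join(tmp, name), "wb").close()
    found = [os.path.basename(p) for p in expand_globs([os.path.join(tmp, "*.npy")])]
```

Sorting the matches keeps the file order deterministic across runs, which matters when later steps derive instance indices from file order.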

Parameters:

glob_paths (str) – The glob patterns.

Return type:

TypeVar(NumpyDatasetConfigT, bound= NumpyDatasetConfig)

Returns:

A new dataset config.

classmethod from_data_mix(mix, *, tokenizer, **kwargs)[source]

Initialize a dataset config from an official data mix.

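Parameters:
  • mix (Union[str, DataMixBase]) – The name of a data mix (e.g. "dolma17").

  • tokenizer (TokenizerConfig) – The tokenizer config.

Return type: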

TypeVar(NumpyDatasetConfigT, bound= NumpyDatasetConfig)

Returns:

A new dataset config.

class olmo_core.data.numpy_dataset.NumpyFSLDatasetConfig(sequence_length, max_target_sequence_length=None, generate_doc_lengths=False, label_mask_paths=None, source_mixture_config=None, *, tokenizer, paths=None, mix=None, mix_base_dir=None, expand_glob=False, dtype=None, metadata=None, include_instance_metadata=True, instance_filter_config=None, source_permutation_seed=None, work_dir=None, ignore_fingerprint_mismatch=False)[source]

Bases: NumpyDatasetConfig

sequence_length: int

The length of a single instance. Generally this should correspond to your model’s maximum input length.

max_target_sequence_length: Optional[int] = None

Optional upper bound used when precomputing cached offsets.

If you’re planning a sequence-length warm-up, set this to the final chunk size so future datasets with larger sequence_length values can reuse the exact same document ordering. The current dataset still returns sequence_length-token windows; this hint simply keeps token boundaries and cache files deterministic across warm-up stages. Leave None if you won’t rebuild at a larger length.

generate_doc_lengths: bool = False

Include individual document lengths in the instances returned from NumpyDatasetBase.__getitem__().

label_mask_paths: Optional[List[str]] = None

The paths/URLs to numpy bool files indicating which tokens should be masked.

source_mixture_config: Optional[SourceMixtureDatasetConfig] = None

A source mixture dataset config. If set, the dataset will be built from a mixture of sources.

classmethod from_src_mix(src_mix, **kwargs)[source]

Initialize a dataset config from a custom fine-grained data mix.

Parameters:

src_mix (SourceMixtureDatasetConfig) – The fine-grained SourceMixtureDatasetConfig.

Return type:

NumpyFSLDatasetConfig

Returns:

A new dataset config.

validate()[source]

Validate fields in self. This may modify self in-place.

build()[source]

Build and return a NumpyDatasetBase instance from this configuration.

Return type:

NumpyDatasetBase

Returns:

The constructed dataset instance.

class olmo_core.data.numpy_dataset.NumpyPaddedFSLDatasetConfig(*, tokenizer, paths=None, mix=None, mix_base_dir=None, expand_glob=False, dtype=None, metadata=None, include_instance_metadata=True, instance_filter_config=None, source_permutation_seed=None, work_dir=None, ignore_fingerprint_mismatch=False, sequence_length, label_mask_paths=None)[source]

Bases: NumpyDatasetConfig

sequence_length: int

The length of a single instance. Generally this should correspond to your model’s maximum input length.

label_mask_paths: Optional[List[str]] = None

The paths/URLs to numpy bool files indicating which tokens should be masked.

validate()[source]

Validate fields in self. This may modify self in-place.

build()[source]

Build and return a NumpyDatasetBase instance from this configuration.

Return type:

NumpyDatasetBase

Returns:

The constructed dataset instance.

class olmo_core.data.numpy_dataset.NumpyPackedFSLDatasetConfig(*, tokenizer, paths=None, mix=None, mix_base_dir=None, expand_glob=False, dtype=None, metadata=None, include_instance_metadata=True, instance_filter_config=None, source_permutation_seed=None, work_dir=None, ignore_fingerprint_mismatch=False, sequence_length, generate_doc_lengths=False, label_mask_paths=None, long_doc_strategy='truncate', source_group_size=1)[source]

Bases: NumpyDatasetConfig

sequence_length: int

The length of a single instance. Generally this should correspond to your model’s maximum input length.

generate_doc_lengths: bool = False

Include individual document lengths in the instances returned from NumpyDatasetBase.__getitem__().

label_mask_paths: Optional[List[str]] = None

The paths/URLs to numpy bool files indicating which tokens should be masked.

long_doc_strategy: LongDocStrategy = 'truncate'

The strategy to use for handling long documents.

source_group_size: int = 1

The number of source npy files to process together when packing.

validate()[source]

Validate fields in self. This may modify self in-place.

build()[source]

Build and return a NumpyDatasetBase instance from this configuration.

Return type:

NumpyDatasetBase

Returns:

The constructed dataset instance.

class olmo_core.data.numpy_dataset.NumpyInterleavedFSLDatasetConfig(*, tokenizer, paths=None, mix=None, mix_base_dir=None, expand_glob=False, dtype=None, metadata=None, include_instance_metadata=True, instance_filter_config=None, source_permutation_seed=None, work_dir=None, ignore_fingerprint_mismatch=False, sequence_length, docs_per_instance, chunks_per_doc, seed, label_mask_paths=None, interleaving_exempt_paths=None)[source]

Bases: NumpyDatasetConfig

sequence_length: int

The length of a single instance. Generally this should correspond to your model’s maximum input length.

docs_per_instance: int

The number of documents to include in each instance.

chunks_per_doc: int

The number of chunks to include in each document.

seed: int

The seed to use for the random number generator.

label_mask_paths: Optional[List[str]] = None

The paths/URLs to numpy bool files indicating which tokens should be masked.

interleaving_exempt_paths: Optional[List[str]] = None

The paths/URLs to numpy bool files indicating which tokens should be exempt from interleaving.

validate()[source]

Validate fields in self. This may modify self in-place.

build()[source]

Build and return a NumpyDatasetBase instance from this configuration.

Return type:

NumpyDatasetBase

Returns:

The constructed dataset instance.

class olmo_core.data.numpy_dataset.NumpyVSLDatasetConfig(*, tokenizer, paths=None, mix=None, mix_base_dir=None, expand_glob=False, dtype=None, metadata=None, include_instance_metadata=True, instance_filter_config=None, source_permutation_seed=None, work_dir=None, ignore_fingerprint_mismatch=False, max_sequence_length, min_sequence_length, vsl_curriculum=None)[source]

Bases: NumpyDatasetConfig

max_sequence_length: int

The maximum sequence length. Generally this should correspond to your model’s maximum input length.

min_sequence_length: int

The minimum sequence length.

vsl_curriculum: Optional[VSLCurriculumConfig] = None

The VSL curriculum config.

validate()[source]

Validate fields in self. This may modify self in-place.

build()[source]

Build and return a NumpyDatasetBase instance from this configuration.

Return type:

NumpyDatasetBase

Returns:

The constructed dataset instance.

class olmo_core.data.numpy_dataset.VSLCurriculumType(value)[source]

Bases: StrEnum

An enumeration of the different VSL curriculum implementations.

natural = 'natural'

The natural curriculum ➡️ VSLNaturalCurriculum.

grow_p2 = 'grow_p2'

The “Grow-P2” curriculum ➡️ VSLGrowP2Curriculum.

grow_linear = 'grow_linear'

The “Grow-Linear” curriculum ➡️ VSLGrowLinearCurriculum.

class olmo_core.data.numpy_dataset.VSLCurriculumConfig(name='natural', num_cycles=None, balanced=None)[source]

Bases: Config

validate()[source]

Validate fields in self. This may modify self in-place.

build()[source]

Build the VSL curriculum.

Return type:

VSLCurriculum