data.numpy_dataset¶
- class olmo_core.data.numpy_dataset.NumpyDatasetBase(*paths, pad_token_id, eos_token_id, vocab_size, dtype=<class 'numpy.uint16'>, bos_token_id=None)[source]¶
Bases:
ABC
An abstract base class for datasets backed by numpy arrays of token IDs on disk.
In general the instances that these datasets produce are sequences of token IDs from one or more numpy arrays, sometimes with additional metadata attached. The way those instances are formed depends on the implementation details of the subclass.
Warning
When using NumpyDatasetBase implementations in a distributed setting, be sure that the work_dir is shared among all local ranks and that fs_local_rank is set accordingly. Once those fields are set, call prepare() in the main process before doing anything else.
Tip
Use the dataset config helpers (e.g. NumpyFSLDatasetConfig) to configure and construct datasets instead of constructing them directly.
- abstract property max_sequence_length: int¶
The maximum sequence length of any instances generated by this dataset.
- property dtype: Type[uint8] | Type[uint16] | Type[uint32] | Type[uint64]¶
The numpy datatype of the arrays.
- property fingerprint_version: str¶
The version of the
fingerprint.
- property fingerprint_fields: Tuple[str, ...]¶
Extra values to include when calculating the data contents
fingerprint.
- map(func, *, max_workers=None, method='threads', _paths=None)[source]¶
Call a function on each path in the dataset, returning a list of the results, in order.
- Parameters:
func (Callable[[Union[Path,PathLike,str],int],TypeVar(T)]) – The function to map to the paths and their indices.
max_workers (Optional[int], default:None) – The number of worker threads/processes. Set to 0 to execute synchronously in the main thread/process.
method (Literal['threads','processes'], default:'threads') – Whether to use multi-threading or multi-processing.
- Return type:
- Returns:
The results, in the same order as
paths.
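The ordering guarantee described for map() can be illustrated with a minimal sketch. This is not the library's implementation: map_paths is a hypothetical stand-alone helper that only covers the 'threads' method, and it assumes that max_workers=0 means "run in the calling thread", as the parameter description states.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional, Sequence, TypeVar

T = TypeVar("T")


def map_paths(
    paths: Sequence[str],
    func: Callable[[str, int], T],
    max_workers: Optional[int] = None,
) -> List[T]:
    """Apply ``func(path, index)`` to every path, keeping results in path order."""
    if max_workers == 0:
        # Synchronous fallback: run everything in the calling thread.
        return [func(path, idx) for idx, path in enumerate(paths)]
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(func, path, idx) for idx, path in enumerate(paths)]
        # Collect in submission order, not completion order, so the output
        # lines up with ``paths`` even if later tasks finish first.
        return [future.result() for future in futures]


# Hypothetical file names, used only to show the call shape.
results = map_paths(["a.npy", "b.npy", "c.npy"], lambda path, idx: f"{idx}:{path}")
```

Collecting futures in submission order is what keeps the results aligned with the paths regardless of which worker finishes first.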
- class olmo_core.data.numpy_dataset.NumpyFSLDatasetBase(*paths, sequence_length, pad_token_id, eos_token_id, vocab_size, dtype=<class 'numpy.uint16'>, metadata=None, include_instance_metadata=None, generate_doc_lengths=False, bos_token_id=None, instance_filter_config=None, label_mask_paths=None)[source]¶
Bases:
NumpyDatasetBase, Dataset[Dict[str,Any]]
A base class for fixed sequence length (FSL) numpy array-backed datasets.
- class olmo_core.data.numpy_dataset.NumpyFSLDataset(*paths, sequence_length, pad_token_id, eos_token_id, vocab_size, dtype=<class 'numpy.uint16'>, metadata=None, include_instance_metadata=None, generate_doc_lengths=False, bos_token_id=None, max_target_sequence_length=None, instance_filter_config=None, label_mask_paths=None)[source]¶
Bases:
NumpyFSLDatasetBase
A fixed sequence length (FSL) numpy array-backed dataset.
In this implementation the token IDs from all arrays are concatenated together and then chunked into contiguous blocks of sequence_length tokens to create instances. Therefore documents may be split over multiple instances.
Important
If the length of an array is not a multiple of sequence_length or max_target_sequence_length, the remaining tokens will be ignored.
Important
No special tokens are added to the input IDs, so it’s assumed that if you want EOS tokens between documents, for example, those will already be in the array.
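The chunking behavior described above can be sketched as follows. This is an illustration only: plain Python lists stand in for the memory-mapped numpy arrays, chunk_fixed_length is a hypothetical helper, and the per-array remainder handling follows the first "Important" note.

```python
from typing import List, Sequence


def chunk_fixed_length(
    arrays: Sequence[Sequence[int]], sequence_length: int
) -> List[List[int]]:
    """Cut each token array into contiguous windows of ``sequence_length`` tokens."""
    instances: List[List[int]] = []
    for tokens in arrays:
        # Integer division drops the trailing remainder of each array.
        num_instances = len(tokens) // sequence_length
        for i in range(num_instances):
            start = i * sequence_length
            instances.append(list(tokens[start : start + sequence_length]))
    return instances


EOS = 0  # assumed EOS token ID, already present in the arrays
file_a = [5, 6, EOS, 7, 8, 9, EOS]  # two documents, 7 tokens
file_b = [1, 2, 3, EOS]             # one document, 4 tokens
instances = chunk_fixed_length([file_a, file_b], sequence_length=3)
# instances == [[5, 6, 0], [7, 8, 9], [1, 2, 3]]
```

Note that the second document of file_a is split from its EOS token, and the trailing token of each file is dropped, which is exactly the trade-off the notes above warn about.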
- Parameters:
paths (Union[Path,PathLike,str]) – Paths or URLs to numpy token ID arrays.
sequence_length (int) – The number of tokens to chunk together into a single instance. Generally this should correspond to your model’s maximum input length.
pad_token_id (int) – The ID of the padding token.
eos_token_id (int) – The ID of the EOS token.
dtype (Union[Type[uint8],Type[uint16],Type[uint32],Type[uint64]], default:<class 'numpy.uint16'>) – The numpy datatype of the arrays.
metadata (Union[List[Dict[str,Any]],Dict[str,Any],None], default:None) – Metadata to add to each item. This should be a dictionary or a list of dictionaries with the same number of items as there are paths.
include_instance_metadata (Optional[bool], default:None) – If True (the default), each instance returned from __getitem__() will include the metadata from its source.
max_target_sequence_length (Optional[int], default:None) – Optional upper bound used when precomputing cached offsets. If you’re planning a sequence-length warm-up, set this to the final chunk size so future datasets with larger sequence_length values can reuse the exact same document ordering. The current dataset still returns sequence_length-token windows; this hint simply keeps token boundaries and cache files deterministic across warm-up stages. Leave None if you won’t rebuild at a larger length.
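One way to picture the effect of max_target_sequence_length is the sketch below. It rests on assumptions not confirmed by this page: that each array is truncated to a multiple of max_target_sequence_length when the hint is set, and that sequence_length divides max_target_sequence_length evenly. The helpers usable_tokens and windows are hypothetical.

```python
from typing import List, Optional, Sequence


def usable_tokens(
    num_tokens: int,
    sequence_length: int,
    max_target_sequence_length: Optional[int] = None,
) -> int:
    """Number of leading tokens kept from an array (assumed truncation rule)."""
    unit = max_target_sequence_length or sequence_length
    return (num_tokens // unit) * unit


def windows(
    tokens: Sequence[int],
    sequence_length: int,
    max_target_sequence_length: Optional[int] = None,
) -> List[List[int]]:
    """Cut the kept tokens into ``sequence_length``-token windows."""
    kept = usable_tokens(len(tokens), sequence_length, max_target_sequence_length)
    return [list(tokens[i : i + sequence_length]) for i in range(0, kept, sequence_length)]


tokens = list(range(11))
warmup = windows(tokens, sequence_length=2, max_target_sequence_length=4)
final = windows(tokens, sequence_length=4, max_target_sequence_length=4)
naive = windows(tokens, sequence_length=2)  # no hint: keeps 10 tokens instead of 8

# With the hint, both stages cover exactly the same tokens.
flat_warmup = [t for w in warmup for t in w]
flat_final = [t for w in final for t in w]
```

Under these assumptions the warm-up stage and the final stage see the same eight tokens, while the unhinted dataset would keep two extra tokens that the final stage never sees.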
- property fingerprint_fields: Tuple[str, ...]¶
Extra values to include when calculating the data contents
fingerprint.
- property max_sequence_length: int¶
The maximum sequence length of any instances generated by this dataset.
- property offsets: Tuple[Tuple[int, int], ...]¶
Gives the global start and end instance indices for each data file in the dataset.
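As a rough illustration of what offsets contains, the sketch below derives per-file instance ranges from array lengths. It assumes the end index is exclusive and that each file contributes len // sequence_length instances; instance_offsets is a hypothetical helper, not part of the library.

```python
from typing import List, Sequence, Tuple


def instance_offsets(
    array_lengths: Sequence[int], sequence_length: int
) -> Tuple[Tuple[int, int], ...]:
    """Return (start, end) global instance indices for each file, end exclusive."""
    offsets: List[Tuple[int, int]] = []
    start = 0
    for num_tokens in array_lengths:
        end = start + num_tokens // sequence_length
        offsets.append((start, end))
        start = end  # the next file begins where this one stopped
    return tuple(offsets)


# Three files holding 10, 4, and 7 tokens, chunked into 4-token instances.
offsets = instance_offsets([10, 4, 7], sequence_length=4)
# offsets == ((0, 2), (2, 3), (3, 4))
```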
- class olmo_core.data.numpy_dataset.NumpyFSLDatasetMixture(*paths, path_offset_index, seed, sequence_length, pad_token_id, eos_token_id, vocab_size, dtype=<class 'numpy.uint16'>, metadata=None, include_instance_metadata=None, generate_doc_lengths=False, bos_token_id=None, max_target_sequence_length=None, instance_filter_config=None)[source]¶
Bases:
NumpyFSLDataset
A version of NumpyFSLDataset built from a mixture of sources and their expected token ratios relative to each other. A path_offset_index is used to determine the number of instances to retain from a path when constructing the local indices.
- class olmo_core.data.numpy_dataset.NumpyPaddedFSLDataset(*paths, sequence_length, pad_token_id, eos_token_id, vocab_size, dtype=<class 'numpy.uint16'>, bos_token_id=None, metadata=None, include_instance_metadata=None, instance_filter_config=None, label_mask_paths=None)[source]¶
Bases:
NumpyFSLDataset
An FSL dataset that creates a single instance from each document. The resulting instances will all have exactly sequence_length tokens, using padding if needed.
- property fingerprint_fields: Tuple[str, ...]¶
Extra values to include when calculating the data contents
fingerprint.
- property offsets: Tuple[Tuple[int, int], ...]¶
Gives the global start and end instance indices for each data file in the dataset.
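The one-instance-per-document behavior of NumpyPaddedFSLDataset can be sketched as below. Two points are assumptions rather than documented behavior: documents are taken to end at each EOS token, and documents longer than sequence_length are simply truncated. Both helpers are hypothetical.

```python
from typing import List, Sequence


def split_documents(tokens: Sequence[int], eos_token_id: int) -> List[List[int]]:
    """Split a token stream into documents, each ending with its EOS token."""
    docs: List[List[int]] = []
    current: List[int] = []
    for token in tokens:
        current.append(token)
        if token == eos_token_id:
            docs.append(current)
            current = []
    if current:  # trailing tokens without a closing EOS
        docs.append(current)
    return docs


def pad_documents(
    tokens: Sequence[int], sequence_length: int, pad_token_id: int, eos_token_id: int
) -> List[List[int]]:
    """Build one fixed-length instance per document, padding on the right."""
    instances: List[List[int]] = []
    for doc in split_documents(tokens, eos_token_id):
        doc = doc[:sequence_length]  # assumption: over-long documents are truncated
        instances.append(doc + [pad_token_id] * (sequence_length - len(doc)))
    return instances


padded = pad_documents(
    [5, 6, 0, 7, 8, 9, 10, 11, 0, 3, 0],
    sequence_length=4,
    pad_token_id=99,
    eos_token_id=0,
)
# padded == [[5, 6, 0, 99], [7, 8, 9, 10], [3, 0, 99, 99]]
```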
- class olmo_core.data.numpy_dataset.NumpyPackedFSLDataset(*paths, sequence_length, pad_token_id, eos_token_id, vocab_size, dtype=<class 'numpy.uint16'>, metadata=None, include_instance_metadata=None, generate_doc_lengths=False, bos_token_id=None, instance_filter_config=None, label_mask_paths=None, long_doc_strategy='truncate', source_group_size=1)[source]¶
Bases:
NumpyFSLDatasetBase
An FSL dataset that packs documents into instances using the Optimized Best-Fit Decreasing (OBFD) algorithm described in Fewer Truncations Improve Language Modeling. The resulting instances will all have exactly sequence_length tokens, using padding if needed.
Note
By default OBFD is applied to each source file separately, both because source files from the Dolma toolkit are usually large enough for OBFD to achieve very good compactness (minimal padding tokens) and because this lets the packing be parallelized. However, you can pack instances from multiple consecutive source files together by setting source_group_size to a value greater than 1.
Tip
Although this shares much of its option plumbing with NumpyFSLDataset, it bypasses that subclass and derives from NumpyFSLDatasetBase so it can provide its own packing caches, offsets, and item materialisation logic. Subclassing NumpyFSLDataset would require overriding nearly every behavior defined there.
- property fingerprint_fields: Tuple[str, ...]¶
Extra values to include when calculating the data contents
fingerprint.
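To give a feel for the packing done by NumpyPackedFSLDataset, here is a minimal sketch of plain best-fit decreasing bin packing. It is deliberately the simple quadratic version, not the optimized variant the class refers to, and it assumes over-long documents have already been handled (for example by the long_doc_strategy option). The function best_fit_decreasing is hypothetical.

```python
from typing import List, Optional, Sequence


def best_fit_decreasing(doc_lengths: Sequence[int], sequence_length: int) -> List[List[int]]:
    """Group document indices into bins whose total length fits in ``sequence_length``."""
    bins: List[List[int]] = []   # document indices per instance
    remaining: List[int] = []    # free space left in each instance
    # Visit documents from longest to shortest (ties keep their original order).
    order = sorted(range(len(doc_lengths)), key=lambda i: doc_lengths[i], reverse=True)
    for doc_idx in order:
        length = doc_lengths[doc_idx]
        if length > sequence_length:
            raise ValueError(f"document {doc_idx} is longer than sequence_length")
        # Best fit: the bin that would be left with the least free space.
        best_bin: Optional[int] = None
        for bin_idx, space in enumerate(remaining):
            if length <= space and (best_bin is None or space < remaining[best_bin]):
                best_bin = bin_idx
        if best_bin is None:
            bins.append([doc_idx])
            remaining.append(sequence_length - length)
        else:
            bins[best_bin].append(doc_idx)
            remaining[best_bin] -= length
    return bins


bins = best_fit_decreasing([5, 3, 4, 2, 2], sequence_length=8)
# bins == [[0, 1], [2, 3, 4]]: 16 tokens packed into two instances, no padding
```

Any free space left in a bin after packing corresponds to the padding tokens mentioned above.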
- class olmo_core.data.numpy_dataset.NumpyInterleavedFSLDataset(*paths, sequence_length, pad_token_id, eos_token_id, vocab_size, seed, docs_per_instance, chunks_per_doc, dtype=<class 'numpy.uint16'>, metadata=None, include_instance_metadata=None, instance_filter_config=None, label_mask_paths=None, bos_token_id=None, interleaving_exempt_paths=None)[source]¶
Bases:
NumpyPaddedFSLDataset
A version of NumpyPaddedFSLDataset that creates a single instance by chunking documents and interleaving these chunks. The resulting instances may be padded out to sequence_length.
- property fingerprint_fields: Tuple[str, ...]¶
Extra values to include when calculating the data contents
fingerprint.
- class olmo_core.data.numpy_dataset.VSLCurriculum[source]¶
Bases:
object
Base class for variable sequence length curriculums. These determine the sampling probability of batches from each bucket throughout training with a NumpyVSLDataset.
- class olmo_core.data.numpy_dataset.VSLNaturalCurriculum[source]¶
Bases:
VSLCurriculum
Implements the natural curriculum from Dataset Decomposition: Faster LLM Training with Variable Sequence Length Curriculum.
- class olmo_core.data.numpy_dataset.VSLGrowthCurriculum(num_cycles=8, balanced=False)[source]¶
Bases:
VSLCurriculum
A base class for growth curriculums, like VSLGrowP2Curriculum and VSLGrowLinearCurriculum.
- class olmo_core.data.numpy_dataset.VSLGrowP2Curriculum(num_cycles=8, balanced=False)[source]¶
Bases:
VSLGrowthCurriculum
Implements the “Grow-P2” curriculum from Dataset Decomposition: Faster LLM Training with Variable Sequence Length Curriculum.
- class olmo_core.data.numpy_dataset.VSLGrowLinearCurriculum(num_cycles=8, balanced=False)[source]¶
Bases:
VSLGrowthCurriculum
Implements the “Grow-Linear” curriculum from Dataset Decomposition: Faster LLM Training with Variable Sequence Length Curriculum.
- class olmo_core.data.numpy_dataset.NumpyVSLDataset(*paths, pad_token_id, eos_token_id, vocab_size, max_sequence_length, min_sequence_length=256, curriculum=None, dtype=<class 'numpy.uint16'>, metadata=None, include_instance_metadata=None, instance_filter_config=None)[source]¶
Bases:
NumpyDatasetBase, Dataset[Dict[str,Any]]
A variable sequence length (VSL) numpy array-backed dataset. This is used to inject a sequence length-based curriculum during training as introduced in Dataset Decomposition: Faster LLM Training with Variable Sequence Length Curriculum.
This dataset creates instances of token IDs with lengths that are powers of 2 between min_sequence_length (which must be a power of 2) and max_sequence_length (also a power of 2). Some tokens will be discarded unless min_sequence_length is 1.
Important
No special tokens are added to the input IDs, so it’s assumed that if you want EOS tokens between documents, for example, those will already be in the array.
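The power-of-2 bucketing, and why tokens are discarded unless min_sequence_length is 1, can be illustrated with a small sketch. It assumes a greedy largest-first decomposition of each document's length, in the spirit of the binary decomposition from the Dataset Decomposition paper; the dataset's actual bucketing may differ in detail, and decompose_length is a hypothetical helper.

```python
from typing import List, Tuple


def is_power_of_two(n: int) -> bool:
    return n > 0 and (n & (n - 1)) == 0


def decompose_length(
    doc_length: int, min_sequence_length: int, max_sequence_length: int
) -> Tuple[List[int], int]:
    """Split a document length into power-of-2 chunk lengths.

    Returns the chunk lengths and the number of leftover (discarded) tokens.
    """
    if not (is_power_of_two(min_sequence_length) and is_power_of_two(max_sequence_length)):
        raise ValueError("sequence length bounds must be powers of 2")
    lengths: List[int] = []
    remaining = doc_length
    size = max_sequence_length
    while size >= min_sequence_length:
        count, remaining = divmod(remaining, size)
        lengths.extend([size] * count)
        size //= 2  # drops to 0 after size 1, which ends the loop
    return lengths, remaining


chunks, discarded = decompose_length(1000, min_sequence_length=256, max_sequence_length=512)
# chunks == [512, 256], discarded == 232

chunks_all, discarded_all = decompose_length(1000, min_sequence_length=1, max_sequence_length=512)
# chunks_all == [512, 256, 128, 64, 32, 8], discarded_all == 0
```

With min_sequence_length=1 every token lands in some bucket, since any length has a binary expansion; with a larger minimum, the low-order part of that expansion is what gets dropped.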
- Parameters:
paths (Union[Path,PathLike,str]) – Paths or URLs to numpy token ID arrays.
pad_token_id (int) – The ID of the padding token.
eos_token_id (int) – The ID of the EOS token.
max_sequence_length (int) – The maximum allowed sequence length. A power of 2, e.g. ‘4096’.
min_sequence_length (int, default:256) – The minimum allowed sequence length. A power of 2, e.g. ‘256’.
curriculum (Optional[VSLCurriculum], default:None) – The variable sequence length curriculum. Determines the sampling probability of batches from each bucket throughout training.
dtype (Union[Type[uint8],Type[uint16],Type[uint32],Type[uint64]], default:<class 'numpy.uint16'>) – The numpy datatype of the arrays.
metadata (Union[List[Dict[str,Any]],Dict[str,Any],None], default:None) – Metadata to add to each item. This should be a dictionary or a list of dictionaries with the same number of items as there are paths.
include_instance_metadata (Optional[bool], default:None) – If True (the default), each instance returned from __getitem__() will include the metadata from its source.
- property fingerprint_fields: Tuple[str, ...]¶
Extra values to include when calculating the data contents
fingerprint.
- property max_sequence_length: int¶
The maximum sequence length of any instances generated by this dataset.
- property offsets: Tuple[Tuple[int, int], ...]¶
Gives the global start and end instance indices for each data file in the dataset.
- prepare()[source]¶
Perform any necessary preparation.
Warning
Be sure to set work_dir properly before calling this, and only call this from the main process (not a worker process).
- __getitem__(index)[source]¶
Get an instance from the dataset. At a minimum this will contain the field “input_ids”, an integer tensor of token IDs.
- get_instance_lengths()[source]¶
Get a numpy memory-mapped array with the length of every instance in the dataset.
- Return type:
ndarray
- class olmo_core.data.numpy_dataset.NumpyDatasetConfig(*, tokenizer, paths=None, mix=None, mix_base_dir=None, expand_glob=False, dtype=None, metadata=None, include_instance_metadata=True, instance_filter_config=None, source_permutation_seed=None, work_dir=None, ignore_fingerprint_mismatch=False)[source]¶
Abstract base configuration class for numpy-based datasets.
This abstract base class provides common configuration options and utilities for creating NumpyDatasetBase datasets.
- tokenizer: TokenizerConfig¶
The tokenizer config.
- paths: Optional[List[str]] = None¶
The paths/URLs to the numpy token ID arrays.
- mix: Optional[Union[str, DataMixBase]] = None¶
The name of a data mix (e.g.
"dolma17").
- mix_base_dir: Optional[str] = None¶
The base directory of the data mix.
- dtype: Optional[NumpyDatasetDType] = None¶
The numpy datatype of the token ID arrays.
- metadata: Optional[List[Dict[str, Any]]] = None¶
Metadata for the numpy arrays.
- include_instance_metadata: bool = True¶
Whether or not to include the metadata in the instances returned from NumpyDatasetBase.__getitem__().
- instance_filter_config: Optional[InstanceFilterConfig] = None¶
The instance filter config (aka the “ngram filter”) that will be applied to the dataset. This can be used to filter out instances with too many repeated token ngrams.
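To make the idea of an n-gram repetition filter concrete, here is a hedged sketch. It is not the logic behind InstanceFilterConfig, whose exact criteria are not described on this page; the functions max_ngram_count and passes_filter and the thresholds n and max_count are hypothetical.

```python
from collections import Counter
from typing import Sequence


def max_ngram_count(tokens: Sequence[int], n: int) -> int:
    """Return how often the most frequent n-gram occurs in ``tokens``."""
    if len(tokens) < n:
        return 0
    counts = Counter(tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1))
    return max(counts.values())


def passes_filter(tokens: Sequence[int], n: int = 3, max_count: int = 4) -> bool:
    """Keep an instance only if no n-gram repeats more than ``max_count`` times."""
    return max_ngram_count(tokens, n) <= max_count


clean = list(range(20))        # every trigram is unique
repetitive = [1, 2, 3] * 10    # the trigram (1, 2, 3) occurs 10 times

keep_clean = passes_filter(clean)
keep_repetitive = passes_filter(repetitive)
```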
- source_permutation_seed: Optional[int] = None¶
Used to shuffle the source files before handing off to the dataset class.
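A seeded source shuffle might look like the sketch below. The use of Python's random module and the permutation helper are assumptions made for illustration; the library may use a different random number generator. The sketch permutes indices rather than paths so that per-path metadata stays aligned with its file.

```python
import random
from typing import List


def permutation(num_sources: int, seed: int) -> List[int]:
    """Return a reproducible ordering of source indices for the given seed."""
    order = list(range(num_sources))
    random.Random(seed).shuffle(order)  # private RNG, global state untouched
    return order


# Hypothetical paths and metadata, one metadata entry per path.
paths = ["part-0.npy", "part-1.npy", "part-2.npy"]
metadata = [{"source": "part-0"}, {"source": "part-1"}, {"source": "part-2"}]

order = permutation(len(paths), seed=42)
shuffled_paths = [paths[i] for i in order]
shuffled_metadata = [metadata[i] for i in order]
```

Reusing the same seed reproduces the same ordering, which matters if cached working files depend on the order of the sources.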
- work_dir: Optional[str] = None¶
The dataset working directory. This is used to cache working files like shuffled indices, instance buckets, etc.
Tip
You can save a lot of time and disk space by setting this to a common directory across all of your runs.
- ignore_fingerprint_mismatch: bool = False¶
If True, ignore dataset fingerprint mismatches when loading from a checkpoint. This is used when intentionally switching to a different dataset mix.
- abstract build()[source]¶
Build and return a NumpyDatasetBase instance from this configuration.
- Return type:
- Returns:
The constructed dataset instance.
- classmethod glob(*glob_paths, **kwargs)[source]¶
Initialize a dataset config with glob paths.
Note
Globs are not expanded until build() is called. If any of the globs don’t expand to any matches, a FileNotFoundError is raised.
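The deferred expansion rule in the note can be sketched for local files as follows. This is an illustration with the standard glob module only; expand_globs is a hypothetical helper, and remote URLs, which the real configs also accept, are out of scope here.

```python
import glob
import os
import tempfile
from typing import List, Sequence


def expand_globs(patterns: Sequence[str]) -> List[str]:
    """Expand each pattern in order, failing loudly if one matches nothing."""
    paths: List[str] = []
    for pattern in patterns:
        matches = sorted(glob.glob(pattern))
        if not matches:
            raise FileNotFoundError(f"no files match {pattern!r}")
        paths.extend(matches)
    return paths


with tempfile.TemporaryDirectory() as tmp_dir:
    # Create two empty placeholder files to match against.
    for name in ("part-0.npy", "part-1.npy"):
        open(os.path.join(tmp_dir, name), "wb").close()

    expanded = [
        os.path.basename(p) for p in expand_globs([os.path.join(tmp_dir, "*.npy")])
    ]

    try:
        expand_globs([os.path.join(tmp_dir, "*.bin")])
        missing_raised = False
    except FileNotFoundError:
        missing_raised = True
```

Raising on an empty match, rather than silently returning fewer paths, surfaces a mistyped pattern before any training data is read.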
- classmethod from_data_mix(mix, *, tokenizer, **kwargs)[source]¶
Initialize a dataset config from an official data mix.
- Parameters:
mix (Union[str,DataMixBase]) – The data mix.
tokenizer (TokenizerConfig) – The tokenizer config.
- Return type:
TypeVar(NumpyDatasetConfigT, bound= NumpyDatasetConfig)
- Returns:
A new dataset config.
- class olmo_core.data.numpy_dataset.NumpyFSLDatasetConfig(sequence_length, max_target_sequence_length=None, generate_doc_lengths=False, label_mask_paths=None, source_mixture_config=None, *, tokenizer, paths=None, mix=None, mix_base_dir=None, expand_glob=False, dtype=None, metadata=None, include_instance_metadata=True, instance_filter_config=None, source_permutation_seed=None, work_dir=None, ignore_fingerprint_mismatch=False)[source]¶
Bases:
NumpyDatasetConfig
- sequence_length: int¶
The length of a single instance. Generally this should correspond to your model’s maximum input length.
- max_target_sequence_length: Optional[int] = None¶
Optional upper bound used when precomputing cached offsets.
If you’re planning a sequence-length warm-up, set this to the final chunk size so future datasets with larger sequence_length values can reuse the exact same document ordering. The current dataset still returns sequence_length-token windows; this hint simply keeps token boundaries and cache files deterministic across warm-up stages. Leave None if you won’t rebuild at a larger length.
- generate_doc_lengths: bool = False¶
Include individual document lengths in the instances returned from
NumpyDatasetBase.__getitem__().
- label_mask_paths: Optional[List[str]] = None¶
The paths/URLs to numpy bool files indicating which tokens should be masked.
- source_mixture_config: Optional[SourceMixtureDatasetConfig] = None¶
A source mixture dataset config. If set, the dataset will be built from a mixture of sources.
- classmethod from_src_mix(src_mix, **kwargs)[source]¶
Initialize a dataset config from a custom fine-grained data mix.
- Parameters:
src_mix (SourceMixtureDatasetConfig) – The fine-grained SourceMixtureDatasetConfig.
- Return type:
- Returns:
A new dataset config.
- class olmo_core.data.numpy_dataset.NumpyPaddedFSLDatasetConfig(*, tokenizer, paths=None, mix=None, mix_base_dir=None, expand_glob=False, dtype=None, metadata=None, include_instance_metadata=True, instance_filter_config=None, source_permutation_seed=None, work_dir=None, ignore_fingerprint_mismatch=False, sequence_length, label_mask_paths=None)[source]¶
Bases:
NumpyDatasetConfig
- sequence_length: int¶
The length of a single instance. Generally this should correspond to your model’s maximum input length.
- label_mask_paths: Optional[List[str]] = None¶
The paths/URLs to numpy bool files indicating which tokens should be masked.
- class olmo_core.data.numpy_dataset.NumpyPackedFSLDatasetConfig(*, tokenizer, paths=None, mix=None, mix_base_dir=None, expand_glob=False, dtype=None, metadata=None, include_instance_metadata=True, instance_filter_config=None, source_permutation_seed=None, work_dir=None, ignore_fingerprint_mismatch=False, sequence_length, generate_doc_lengths=False, label_mask_paths=None, long_doc_strategy='truncate', source_group_size=1)[source]¶
Bases:
NumpyDatasetConfig
- sequence_length: int¶
The length of a single instance. Generally this should correspond to your model’s maximum input length.
- generate_doc_lengths: bool = False¶
Include individual document lengths in the instances returned from
NumpyDatasetBase.__getitem__().
- label_mask_paths: Optional[List[str]] = None¶
The paths/URLs to numpy bool files indicating which tokens should be masked.
- long_doc_strategy: LongDocStrategy = 'truncate'¶
The strategy to use for handling long documents.
- source_group_size: int = 1¶
The number of source npy files to process together when packing.
- class olmo_core.data.numpy_dataset.NumpyInterleavedFSLDatasetConfig(*, tokenizer, paths=None, mix=None, mix_base_dir=None, expand_glob=False, dtype=None, metadata=None, include_instance_metadata=True, instance_filter_config=None, source_permutation_seed=None, work_dir=None, ignore_fingerprint_mismatch=False, sequence_length, docs_per_instance, chunks_per_doc, seed, label_mask_paths=None, interleaving_exempt_paths=None)[source]¶
Bases:
NumpyDatasetConfig
- sequence_length: int¶
The length of a single instance. Generally this should correspond to your model’s maximum input length.
- docs_per_instance: int¶
The number of documents to include in each instance.
- chunks_per_doc: int¶
The number of chunks to include in each document.
- seed: int¶
The seed to use for the random number generator.
- label_mask_paths: Optional[List[str]] = None¶
The paths/URLs to numpy bool files indicating which tokens should be masked.
- interleaving_exempt_paths: Optional[List[str]] = None¶
The paths/URLs to numpy bool files indicating which tokens should be exempt from interleaving.
- class olmo_core.data.numpy_dataset.NumpyVSLDatasetConfig(*, tokenizer, paths=None, mix=None, mix_base_dir=None, expand_glob=False, dtype=None, metadata=None, include_instance_metadata=True, instance_filter_config=None, source_permutation_seed=None, work_dir=None, ignore_fingerprint_mismatch=False, max_sequence_length, min_sequence_length, vsl_curriculum=None)[source]¶
Bases:
NumpyDatasetConfig
- max_sequence_length: int¶
The maximum sequence length. Generally this should correspond to your model’s maximum input length.
- min_sequence_length: int¶
The minimum sequence length.
- vsl_curriculum: Optional[VSLCurriculumConfig] = None¶
The VSL curriculum config.
- class olmo_core.data.numpy_dataset.VSLCurriculumType(value)[source]¶
Bases:
StrEnum
An enumeration of the different VSL curriculum implementations.
- natural = 'natural'¶
The natural curriculum ➡️
VSLNaturalCurriculum.
- grow_p2 = 'grow_p2'¶
The “Grow-P2” curriculum ➡️
VSLGrowP2Curriculum.
- grow_linear = 'grow_linear'¶
The “Grow-Linear” curriculum ➡️
VSLGrowLinearCurriculum.