`data.numpy_dataset`¶

class olmo_core.data.numpy_dataset.NumpyDatasetBase(*paths, pad_token_id, eos_token_id, vocab_size, dtype=<class 'numpy.uint16'>, bos_token_id=None)[source]¶

Bases: ABC

An abstract base class for datasets backed by on-disk numpy arrays of token IDs.

In general, the instances these datasets produce are sequences of token IDs from one or more numpy arrays, sometimes with additional metadata attached. How those instances are formed depends on the subclass.

Warning

When using NumpyDatasetBase implementations in a distributed setting, be sure that work_dir is shared among all local ranks and that fs_local_rank is set accordingly. Once those fields are set, call prepare() in the main process before doing anything else.

Tip

Use the dataset config helpers (e.g. NumpyFSLDatasetConfig) to configure and construct datasets instead of constructing them directly.

abstract property max_sequence_length: int¶: The maximum sequence length of any instances generated by this dataset.

property paths: Tuple[Path | PathLike | str, ...]¶: Paths and/or URLs to the numpy arrays.

property file_sizes: Tuple[int, ...]¶: The size, in bytes, of each numpy array.

property dtype: Type[uint8] | Type[uint16] | Type[uint32] | Type[uint64]¶: The numpy datatype of the arrays.

property fingerprint_version: str¶: The version of the fingerprint.

property fingerprint_fields: Tuple[str, ...]¶: Extra values to include when calculating the data contents fingerprint.

property fingerprint: str¶: Used to compare the contents of a dataset.

property work_dir_set: bool¶: Check if the working directory was explicitly set.

property num_tokens: int¶: Get the total number of tokens in the dataset.
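As a rough illustration of how file_sizes, dtype, and num_tokens relate, the sketch below derives a token count from byte sizes. It assumes each file is a raw, header-less array of token IDs, so that a file's token count is its byte size divided by the dtype's item size. The helper name tokens_in_file and the example sizes are hypothetical and are not part of the library.

```python
import numpy as np


def tokens_in_file(size_in_bytes: int, dtype=np.uint16) -> int:
    """Hypothetical helper: number of token IDs in a raw array of the given byte size."""
    # Assumption: the file holds nothing but fixed-width token IDs (no header).
    return size_in_bytes // np.dtype(dtype).itemsize


# Made-up file sizes, in bytes, standing in for the file_sizes property.
file_sizes = (2048, 4096)

# With uint16 (2 bytes per token), the total is the sum over all files.
num_tokens = sum(tokens_in_file(size) for size in file_sizes)
```

A wider dtype lowers the token count for the same byte size, which is why dtype has to match how the arrays were written.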

map(func, *, max_workers=None, method='threads', _paths=None)[source]¶

Call a function on each path in the dataset, returning a list of the results, in order.

Parameters:

func (Callable[[Union[Path, PathLike, str], int], TypeVar(T)]) – The function to call on each path and its index.
max_workers (Optional[int], default: None) – The number of worker threads/processes. Set to 0 to execute synchronously in the main thread/process.
method (Literal['threads', 'processes'], default: 'threads') – Whether to use multi-threading or multi-processing.

Return type:

List[TypeVar(T)]

Returns:

The results, in the same order as paths.
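The behavior described above (call a function on each path together with its index, and return results in path order) can be sketched with the standard library alone. This is a standalone illustration under stated assumptions, not the library's implementation: ordered_map and describe are hypothetical names, and only the threaded and synchronous cases are shown.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, List, Optional, Sequence, TypeVar, Union

T = TypeVar("T")
PathOrStr = Union[Path, str]


def ordered_map(
    paths: Sequence[PathOrStr],
    func: Callable[[PathOrStr, int], T],
    max_workers: Optional[int] = None,
) -> List[T]:
    """Hypothetical helper: call func(path, index) for every path, keeping order."""
    if max_workers == 0:
        # Run synchronously in the calling thread.
        return [func(path, idx) for idx, path in enumerate(paths)]
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Executor.map yields results in submission order, not completion order.
        return list(executor.map(func, paths, range(len(paths))))


def describe(path: PathOrStr, idx: int) -> str:
    """Example callback that takes the path and its index."""
    return f"{idx}:{Path(path).name}"


results = ordered_map(["a/part-0.npy", "b/part-1.npy"], describe)
```

The callback receives both the path and its position, so per-file results can be matched back to paths without relying on completion order.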

prepare()[source]¶: Perform any necessary preparation.

Warning

Be sure to set work_dir properly before calling this, and only call it from the main process (not a worker process).

abstract __len__()[source]¶

Get the number of instances in the dataset.

Return type: int

abstract __getitem__(index)[source]¶

Get an instance from the dataset. At a minimum this will contain the field “input_ids”, an integer tensor of token IDs.

Return type: Dict[str, Any]

class olmo_core.data.numpy_dataset.NumpyFSLDatasetBase(*paths, sequence_length, pad_token_id, eos_token_id, vocab_size, dtype=<class 'numpy.uint16'>, metadata=None, include_instance_metadata=None, generate_doc_lengths=False, bos_token_id=None, instance_filter_config=None, label_mask_paths=None)[source]¶

Bases: NumpyDatasetBase, Dataset[Dict[str, Any]]

A base class for fixed sequence length (FSL) numpy array-backed datasets.

property max_sequence_length: int¶: The maximum sequence length of any instances generated by this dataset.

class olmo_core.data.numpy_dataset.NumpyFSLDataset(*paths, sequence_length, pad_token_id, eos_token_id, vocab_size, dtype=<class 'numpy.uint16'>, metadata=None, include_instance_metadata=None, generate_doc_lengths=False, bos_token_id=None, max_target_sequence_length=None, instance_filter_config=None, label_mask_paths=None)[source]¶

Bases: NumpyFSLDatasetBase

A fixed sequence length (FSL) numpy array-backed dataset.

In this implementation, the token IDs from all arrays are concatenated and then chunked into contiguous blocks of sequence_length tokens to create instances. As a result, a document may be split across multiple instances.
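The concatenate-then-chunk scheme can be illustrated with a small in-memory sketch. It is not the library's code: chunk_tokens is a hypothetical helper, the token values and the EOS ID of 0 are made up, and dropping a trailing partial block is an assumption made here for simplicity.

```python
import numpy as np


def chunk_tokens(arrays, sequence_length: int) -> np.ndarray:
    """Hypothetical helper: concatenate token arrays and cut fixed-length blocks."""
    tokens = np.concatenate(arrays)
    num_instances = len(tokens) // sequence_length
    # Assumption: tokens that do not fill a whole block are dropped.
    usable = tokens[: num_instances * sequence_length]
    return usable.reshape(num_instances, sequence_length)


EOS = 0  # made-up EOS token ID for this example
doc_a = np.array([5, 6, 7, EOS], dtype=np.uint16)
doc_b = np.array([8, 9, 10, 11, 12, EOS], dtype=np.uint16)

instances = chunk_tokens([doc_a, doc_b], sequence_length=3)
# Rows: [5, 6, 7], [0, 8, 9], [10, 11, 12]
```

The second row starts with the EOS of the first document and continues into the second document, showing how block boundaries ignore document boundaries.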
