data

Datasets, data loaders, and config builders for use with the Trainer.

Overview

For text-based data, first prepare your data by writing token IDs to numpy arrays on disk, for example with the Dolma toolkit. Then configure and build your dataset and data loader using either the olmo_core.data.composable API or one of the NumpyDatasetConfigBase builders (e.g. NumpyFSLDatasetConfig) together with the NumpyDataLoaderConfig builder. The resulting data loader can be passed to TrainerConfig.build().
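The preparation step can be illustrated with plain numpy. The following is only a minimal sketch, not the Dolma toolkit's or this package's implementation: it assumes token IDs are stored as a flat, headerless binary array of uint16 values, and the helper names (write_token_ids, read_token_ids), the file name, and the example IDs are all hypothetical. Check the dataset config you use for the dtype and layout it actually expects.

```python
import os
import tempfile

import numpy as np


def write_token_ids(path, token_ids, dtype=np.uint16):
    """Write a sequence of token IDs to `path` as a flat binary array.

    Assumption: a raw, headerless array of a single integer dtype.
    """
    arr = np.asarray(token_ids, dtype=np.int64)
    if arr.ndim != 1 or arr.size == 0:
        raise ValueError("expected a non-empty 1-D sequence of token IDs")
    # Refuse IDs that would silently wrap around in the narrower dtype.
    if arr.min() < 0 or arr.max() > np.iinfo(dtype).max:
        raise ValueError(f"token IDs do not fit in {np.dtype(dtype).name}")
    arr.astype(dtype).tofile(path)


def read_token_ids(path, dtype=np.uint16):
    """Memory-map the file so tokens are read lazily from disk."""
    return np.memmap(path, dtype=dtype, mode="r")


# Hypothetical token IDs standing in for real tokenizer output.
example_ids = [101, 2023, 2003, 1037, 3231, 102]
example_path = os.path.join(tempfile.mkdtemp(), "tokens.bin")
write_token_ids(example_path, example_ids)
assert read_token_ids(example_path).tolist() == example_ids
```

Memory-mapping on the read side keeps large token files out of RAM until they are indexed. The dtype must be wide enough for the tokenizer's vocabulary, so a vocabulary larger than 65,536 entries would need uint32 in this sketch.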

Submodules