data
Datasets, data loaders, and config builders for use with the Trainer.
Overview
For text-based data, prepare your data by writing token IDs to numpy arrays on disk, for example with the Dolma toolkit.
Then configure and build your dataset and data loader using either the olmo_core.data.composable API or one of the NumpyDatasetConfigBase builders (e.g. NumpyFSLDatasetConfig) together with the NumpyDataLoaderConfig builder. The resulting data loader can be passed to TrainerConfig.build().
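As a rough illustration of the first step, the sketch below writes already-tokenized documents to a flat array on disk using plain numpy and reads them back with a memory map. It does not use the Dolma toolkit or any olmo_core API. The helper names (write_token_ids, read_token_ids), the file name, the EOS token ID of 0, the uint16 dtype, and the headerless raw layout are all assumptions made for the example; check the dataset config documentation for the layout and dtype your setup actually expects.

```python
import os
import tempfile

import numpy as np

# Hypothetical values for illustration only: use your tokenizer's real EOS ID
# and a dtype wide enough to hold every ID in its vocabulary.
EOS_TOKEN_ID = 0
DTYPE = np.uint16


def write_token_ids(documents, path, eos_token_id=EOS_TOKEN_ID, dtype=DTYPE):
    """Concatenate tokenized documents, appending an EOS ID after each one,
    and write the result to `path` as a flat, headerless array.

    Returns the number of token IDs written.
    """
    flat = []
    for doc in documents:
        flat.extend(doc)
        flat.append(eos_token_id)
    # Refuse to silently truncate IDs that do not fit in the chosen dtype.
    if flat and max(flat) > np.iinfo(dtype).max:
        raise ValueError(f"token ID {max(flat)} does not fit in {np.dtype(dtype).name}")
    arr = np.asarray(flat, dtype=dtype)
    arr.tofile(path)  # raw bytes only, no .npy header
    return int(arr.size)


def read_token_ids(path, dtype=DTYPE):
    """Memory-map the raw array so token IDs are read lazily from disk."""
    return np.memmap(path, dtype=dtype, mode="r")


# Two toy "documents" of token IDs.
docs = [[5, 6, 7], [8, 9]]
path = os.path.join(tempfile.mkdtemp(), "tokens.bin")  # hypothetical file name
num_tokens = write_token_ids(docs, path)
tokens = read_token_ids(path)
```

Memory-mapping keeps reads lazy, which matters once files hold billions of tokens; the dtype used for reading must match the one used for writing, since a headerless file records neither dtype nor shape.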
Submodules
- data.numpy_dataset: NumpyDatasetBase, NumpyFSLDatasetBase, NumpyFSLDataset, NumpyFSLDatasetMixture, NumpyPaddedFSLDataset, NumpyPackedFSLDataset, NumpyInterleavedFSLDataset, VSLCurriculum, VSLNaturalCurriculum, VSLGrowthCurriculum, VSLGrowP2Curriculum, VSLGrowLinearCurriculum, NumpyVSLDataset, NumpyDatasetConfig, NumpyFSLDatasetConfig, NumpyPaddedFSLDatasetConfig, NumpyPackedFSLDatasetConfig, NumpyInterleavedFSLDatasetConfig, NumpyVSLDatasetConfig, VSLCurriculumType, VSLCurriculumConfig
- data.source_mixture
- data.collator
- data.composable
  - Overview
  - Basic Examples
  - Working with numpy source files
  - Ratio-based mixing
  - Up-sampling or targeted repetition
  - Curriculum learning
  - Reference: SourceABC, TokenSource, TokenSourceConfig, DocumentSource, DocumentSourceConfig, TokenRange, InstanceSource, InstanceSourceConfig, Instance, ComposableDataLoader, ComposableDataLoaderConfig, InMemoryTokenSource, ConcatenatedTokenSource, ConcatenatedTokenSourceConfig, SlicedTokenSource, SplitTokenSourceConfig, SamplingTokenSource, SamplingTokenSourceConfig, MixingTokenSource, MixingTokenSourceConfig, InMemoryDocumentSource, ConcatenatedDocumentSource, ConcatenatedDocumentSourceConfig, SamplingDocumentSource, SamplingDocumentSourceConfig, MixingDocumentSource, MixingDocumentSourceConfig, NumpyDocumentSource, NumpyDocumentSourceConfigBase, NumpyDocumentSourceConfig, NumpyDocumentSourceMixConfig, ConcatAndChunkInstanceSource, ConcatAndChunkInstanceSourceConfig, PackingInstanceSource, PackingInstanceSourceConfig, ConcatenatedInstanceSource, ConcatenatedInstanceSourceConfig, SlicedInstanceSource, SplitInstanceSourceConfig, SamplingInstanceSource, SamplingInstanceSourceConfig, MixingInstanceSource, MixingInstanceSourceConfig, RandomInstanceSource, RandomInstanceSourceConfig, InstanceFilterConfig, LongDocStrategy, ShuffleStrategy, MixingInstanceSourceSpec, MixingInstanceSourceSpecConfig, MixingTokenSourceSpec, MixingTokenSourceSpecConfig, MixingDocumentSourceSpec, MixingDocumentSourceSpecConfig, set_composable_seed(), reset_composable_seed()
- data.mixes
- data.tokenizer
- data.data_loader
- data.types
- data.utils: split_batch(), melt_batch(), truncate_batch(), write_document_indices(), iter_document_indices(), iter_document_indices_with_max_sequence_length(), get_document_indices(), load_array_slice(), load_array_slice_into_tensor(), get_document_lengths(), get_cumulative_document_lengths(), memmap_to_write(), write_array_to_disk(), bucket_documents(), segment_documents_into_instances(), find_end_first_consecutive_true(), find_start_last_consecutive_true(), group_consecutive_values(), RepetitionTuple, find_periodic_sequences(), chunked(), SegmentTreeNode, pack_documents_into_instances(), attention_mask_to_cache_leftpad()