data.source_mixture
- class olmo_core.data.source_mixture.SourceMixtureConfig(source_name, target_ratio, paths, max_repetition_ratio=1.0, max_source_fraction=1.0, _resolved_paths=None)
Bases: Config
Configuration for a single data source within a mixture.
This class defines how a data source should be sampled and weighted when creating a training dataset from multiple sources. It allows control over the target proportion, repetition limits, and maximum usage fraction of the source data.
- class olmo_core.data.source_mixture.SourceMixtureList(sources)
Bases: Config
A list of source configurations for building a mixture dataset. This class ensures that the target ratios of the sources sum to 1.0.
The purpose of this class is to make managing sources independent from the details of materializing those sources with SourceMixtureDatasetConfig.build().
With this separation, we can define a list of sources in a YAML file without also needing to specify parameters like requested_tokens, global_batch_size, or processes.
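The ratio constraint can be illustrated with a minimal, self-contained sketch. `SourceSpec` and `check_ratios` are hypothetical stand-ins and not part of olmo_core; only the field names `source_name`, `target_ratio`, and `paths` come from the signature above, and the tolerance is an assumption.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class SourceSpec:
    """Hypothetical stand-in mirroring a few SourceMixtureConfig fields."""
    source_name: str
    target_ratio: float
    paths: List[str]


def check_ratios(sources: List[SourceSpec], tol: float = 1e-6) -> None:
    """Raise ValueError unless the target ratios sum to 1.0 within tol (assumed tolerance)."""
    total = sum(s.target_ratio for s in sources)
    if not math.isclose(total, 1.0, abs_tol=tol):
        raise ValueError(f"target ratios sum to {total}, expected 1.0")


# The paths are placeholders, not real dataset locations.
sources = [
    SourceSpec("web", 0.7, ["web/part-0.npy"]),
    SourceSpec("code", 0.3, ["code/part-0.npy"]),
]
check_ratios(sources)  # passes: the ratios sum to 1.0 within the tolerance
```

Because the check depends only on the list of sources, it can run before any of the dataset-level parameters are known.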
- class olmo_core.data.source_mixture.SourceMixtureDatasetConfig(source_list, requested_tokens, global_batch_size, processes=1, seed=42, render_tables=True, quiet=False)
Bases: Config
Configuration for building a dataset from a fractionalized mixture of sources.
This class manages the creation of training datasets by combining multiple data sources according to specified target ratios. It handles token counting and source selection, and ensures the final mixture meets the requested dataset size while maintaining the desired proportions across sources.
The build process will:
1. Count available tokens in each source
2. Calculate token allocations based on target ratios
3. Validate that sources have sufficient data
4. Generate a mixture that respects repetition and fraction limits
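The four steps can be sketched in plain Python. This is an illustration under stated assumptions, not the olmo_core implementation: `SourceSpec` and `allocate_tokens` are hypothetical names, token counts are passed in rather than read from files, and the cap formula (available tokens × `max_source_fraction` × `max_repetition_ratio`) is one assumed reading of the two limits.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SourceSpec:
    """Hypothetical stand-in for a source config (not the olmo_core class)."""
    source_name: str
    target_ratio: float
    max_repetition_ratio: float = 1.0
    max_source_fraction: float = 1.0


def allocate_tokens(
    sources: List[SourceSpec],
    available: Dict[str, int],  # step 1: token counts, assumed already computed
    requested_tokens: int,
) -> Dict[str, int]:
    allocation: Dict[str, int] = {}
    for src in sources:
        # Step 2: this source's share of the requested total.
        needed = int(requested_tokens * src.target_ratio)
        # Step 3: assumed cap = usable fraction of the source times allowed repetition.
        cap = int(
            available[src.source_name]
            * src.max_source_fraction
            * src.max_repetition_ratio
        )
        if needed > cap:
            raise ValueError(
                f"{src.source_name}: needs {needed} tokens, at most {cap} allowed"
            )
        # Step 4: record the allocation that respects both limits.
        allocation[src.source_name] = needed
    return allocation


sources = [
    SourceSpec("web", 0.75),
    SourceSpec("code", 0.25, max_repetition_ratio=2.0),
]
available = {"web": 1_000_000, "code": 200_000}
allocation = allocate_tokens(sources, available, requested_tokens=1_000_000)
print(allocation)  # {'web': 750000, 'code': 250000}
```

In this example the `code` source holds only 200,000 tokens but is asked for 250,000, so it validates only because its assumed repetition limit of 2.0 raises the cap to 400,000.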
- source_list: SourceMixtureList
A list of source configurations contained in a SourceMixtureList.
- requested_tokens: int
The desired dataset size, in tokens. This is used to determine the number of tokens to select from each source. The total dataset size will be greater than or equal to this value, depending on rounding.
- global_batch_size: int
The global batch size for training, in tokens. Used to determine the total number of requested instances.
- seed: int = 42
The seed used to generate the dataset. Specifically, this seed is used when sampling the instances to use from each source.
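The effect of a fixed seed can be shown with a small stand-alone sketch. It uses the standard library's `random.Random` as a stand-in; how olmo_core samples instances internally is not described here, and `sample_instances` is a hypothetical helper.

```python
import random
from typing import List


def sample_instances(num_available: int, num_needed: int, seed: int = 42) -> List[int]:
    """Pick num_needed distinct instance indices reproducibly (illustrative only)."""
    rng = random.Random(seed)  # local generator, so global random state is untouched
    return sorted(rng.sample(range(num_available), num_needed))


first = sample_instances(1000, 5, seed=42)
second = sample_instances(1000, 5, seed=42)
assert first == second  # the same seed yields the same selection
```

Keeping the generator local to the function means the selection depends only on the seed and the arguments, not on any other code that draws random numbers.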
- get_paths_and_tokens_for_source(source_config, token_details, npdtype)
Get the paths and resulting token count for a source.
- Return type: List[SourcePathTokens]
-
source_list: