data.source_mixture

class olmo_core.data.source_mixture.SourceMixtureConfig(source_name, target_ratio, paths, max_repetition_ratio=1.0, max_source_fraction=1.0, _resolved_paths=None)

Bases: Config

Configuration for a single data source within a mixture.

This class defines how a data source should be sampled and weighted when creating a training dataset from multiple sources. It allows control over the target proportion, repetition limits, and maximum usage fraction of the source data.

source_name: str

The name of the source.

target_ratio: float

The target ratio of the source in the mixture.

paths: List[str]

A list of paths to the source data.

max_repetition_ratio: float = 1.0

The maximum repetition ratio for the source data in the mixture. Setting this above 1.0 allows the source data to be upsampled.

max_source_fraction: float = 1.0

The maximum fraction of the source data to include in the mixture.
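
As a rough illustration of how the two limits might interact, the sketch below assumes that a source's usable token budget is its token count scaled by max_source_fraction and then by max_repetition_ratio. Both that formula and the helper name max_tokens_for_source are assumptions made for illustration; they are not taken from the library.

```python
def max_tokens_for_source(
    available_tokens: int,
    max_source_fraction: float = 1.0,
    max_repetition_ratio: float = 1.0,
) -> int:
    """Upper bound on the tokens a source may contribute (assumed formula)."""
    # Only a fraction of the source may be used...
    usable = available_tokens * max_source_fraction
    # ...and that usable portion may be repeated up to the given ratio.
    return int(usable * max_repetition_ratio)


# Defaults: the whole source may be used once.
full = max_tokens_for_source(1_000_000)
# Half of the source, repeated up to twice.
capped = max_tokens_for_source(
    1_000_000, max_source_fraction=0.5, max_repetition_ratio=2.0
)
# The whole source, upsampled up to three times.
upsampled = max_tokens_for_source(1_000_000, max_repetition_ratio=3.0)
```

Under this assumption, a repetition ratio above 1.0 raises the budget beyond the source's own size, while a source fraction below 1.0 lowers it.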

validate()

Validate fields in self. This may modify self in-place.

property resolved_paths: List[str]

Resolve the paths, expanding any globs and validating existence. Caches the result after the first access.
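
The resolve-once-then-cache behavior described above can be sketched with the standard library. This is a minimal local-filesystem sketch, not the library's implementation: the LocalSource class is hypothetical, and the real property may also handle remote paths.

```python
import glob
import os
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LocalSource:
    """Hypothetical stand-in for a source config; local files only."""

    paths: List[str]
    _resolved_paths: Optional[List[str]] = field(default=None, repr=False)

    @property
    def resolved_paths(self) -> List[str]:
        # Return the cached result if the paths were already resolved.
        if self._resolved_paths is not None:
            return self._resolved_paths

        resolved: List[str] = []
        for path in self.paths:
            if any(ch in path for ch in "*?["):
                # Expand the glob and require at least one match.
                matches = sorted(glob.glob(path))
                if not matches:
                    raise FileNotFoundError(f"Glob matched no files: {path}")
                resolved.extend(matches)
            elif os.path.exists(path):
                resolved.append(path)
            else:
                raise FileNotFoundError(f"Path does not exist: {path}")

        # Cache so later accesses skip the filesystem entirely.
        self._resolved_paths = resolved
        return resolved
```

Because the result is cached, files added after the first access are not picked up by later accesses in this sketch.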

class olmo_core.data.source_mixture.SourceMixtureList(sources)

Bases: Config

A list of source configurations for building a mixture dataset. This class ensures that the target ratios of the sources sum to 1.0.

The purpose of this class is to keep the management of sources independent of the details of materializing those sources with SourceMixtureDatasetConfig.build().

With this separation, we can define a list of sources in a YAML file without also needing to specify parameters like requested_tokens, global_batch_size, or processes.
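
The sum-to-1.0 requirement can be illustrated with a small standalone check. This is a sketch only: validate_target_ratios, the tolerance, and the example source entries are hypothetical, and the library's own validate() may compare the sum differently.

```python
import math
from typing import Dict, List


def validate_target_ratios(sources: List[Dict], tol: float = 1e-6) -> None:
    """Raise ValueError unless the target ratios sum to 1.0 (within tol)."""
    total = sum(s["target_ratio"] for s in sources)
    if not math.isclose(total, 1.0, abs_tol=tol):
        raise ValueError(f"target ratios sum to {total}, expected 1.0")


# The shape a YAML file of sources might deserialize to (hypothetical values).
sources = [
    {"source_name": "web", "target_ratio": 0.7, "paths": ["data/web/*.npy"]},
    {"source_name": "code", "target_ratio": 0.3, "paths": ["data/code/*.npy"]},
]
validate_target_ratios(sources)  # passes: the ratios sum to 1.0
```

Note that the entries carry no requested_tokens, global_batch_size, or processes, which matches the separation described above.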

validate()

Validate fields in self. This may modify self in-place.

class olmo_core.data.source_mixture.SourceMixtureDatasetConfig(source_list, requested_tokens, global_batch_size, processes=1, seed=42, render_tables=True, quiet=False)

Bases: Config

Configuration for building a dataset from a fractionalized mixture of sources.

This class manages the creation of training datasets by combining multiple data sources according to specified target ratios. It handles token counting, source selection, and ensures the final mixture meets the requested dataset size while maintaining the desired proportions across sources.

The build process will:

1. Count available tokens in each source.
2. Calculate token allocations based on target ratios.
3. Validate that sources have sufficient data.
4. Generate a mixture that respects repetition and fraction limits.
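
The build steps can be sketched in plain Python. Everything here is illustrative: the Source dataclass and allocate_tokens are hypothetical names, token counts are supplied directly rather than read from files, rounding up is a choice of this sketch, and the per-source limit formula is an assumption rather than the library's documented behavior.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Source:
    name: str
    target_ratio: float
    available_tokens: int  # step 1 would count these from the source files
    max_repetition_ratio: float = 1.0
    max_source_fraction: float = 1.0


def allocate_tokens(sources: List[Source], requested_tokens: int) -> Dict[str, int]:
    allocations: Dict[str, int] = {}
    for src in sources:
        # Step 2: tokens this source should contribute, rounded up.
        needed = math.ceil(requested_tokens * src.target_ratio)
        # Steps 3 and 4: the most this source can supply under its limits
        # (assumed formula: count * fraction * repetition ratio).
        limit = int(
            src.available_tokens * src.max_source_fraction * src.max_repetition_ratio
        )
        if needed > limit:
            raise ValueError(
                f"source {src.name!r} needs {needed} tokens but can supply {limit}"
            )
        allocations[src.name] = needed
    return allocations


sources = [
    Source("web", target_ratio=0.75, available_tokens=2_000),
    # Too small on its own, so it is allowed to repeat up to three times.
    Source("code", target_ratio=0.25, available_tokens=100, max_repetition_ratio=3.0),
]
allocations = allocate_tokens(sources, requested_tokens=1_000)
```

In this example the "code" source holds only 100 tokens but is asked for 250, which succeeds only because its repetition ratio lifts its limit to 300.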

source_list: SourceMixtureList

A list of source configurations contained in a SourceMixtureList.

requested_tokens: int

The desired dataset size, in tokens. This is used to determine the number of tokens to select from each source. The total dataset size will be greater than or equal to this value, depending on rounding.

global_batch_size: int

The global batch size for training, in tokens. Used to determine the total number of requested instances.

processes: int = 1

The number of processes to use for counting tokens in parallel.

seed: int = 42

The seed used to generate the dataset. Specifically, this seed is used when sampling the actual instances to use from each source.

render_tables: bool = True

Whether to render tables of the mixture outcome.

validate()

Validate fields in self. This may modify self in-place.

get_paths_and_tokens_for_source(source_config, token_details, npdtype)

Get the paths and resulting token count for a source.

Return type:

List[SourcePathTokens]

render_mixture_outcome_tables(results)

Render tables enumerating the global and per-source mixture outcomes.

Return type:

None

table_to_text(table)

Generate an ASCII-formatted representation of a Rich table. Eliminates column styling.

Return type:

Text