data.mixes

class olmo_core.data.mixes.DataMixBase(value)

Bases: StrEnum

Base class for enumeration of data mixes.

abstract build(base_dir, tokenizer)

Construct the data mix.

Parameters:
  • base_dir (str) – Where the mix is stored, e.g. “s3://ai2-llm” or “/weka/oe-training-default/ai2-llm”.

  • tokenizer (str) – The tokenizer identifier.

Return type:

Tuple[List[str], List[str]]

Returns:

A list of paths/URLs to the tokenized numpy data files in the mix, and a list of the corresponding labels.
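The contract of build() can be illustrated with a minimal, self-contained sketch. It does not import olmo_core: ToyDataMix is a hypothetical stand-in that uses a plain str-based Enum in place of DataMixBase, and the directory layout, labels, and file names are assumptions made up for the illustration, not the layout of any real mix.

```python
from enum import Enum
from typing import List, Tuple


class ToyDataMix(str, Enum):
    """Hypothetical stand-in for a DataMixBase subclass (not part of olmo_core)."""

    tiny = "tiny"

    def build(self, base_dir: str, tokenizer: str) -> Tuple[List[str], List[str]]:
        # Normalize the base location so joining never produces a double slash.
        base_dir = base_dir.rstrip("/")
        # Assumed labels and layout, chosen only for this example.
        labels = ["wiki", "code"]
        paths = [
            f"{base_dir}/preprocessed/{label}/{tokenizer}/part-0.npy"
            for label in labels
        ]
        # Two lists are returned: the data files and their labels.
        return paths, labels


paths, labels = ToyDataMix.tiny.build("s3://ai2-llm/", "my-tokenizer")
```

In this sketch the two returned lists have the same length, so the label at index i describes the file at index i.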

class olmo_core.data.mixes.DataMix(value)

Bases: DataMixBase

An enumeration of data mix names.

build(base_dir, tokenizer)

Construct the data mix.

Parameters:
  • base_dir (str) – Where the mix is stored, e.g. “s3://ai2-llm” or “/weka/oe-training-default/ai2-llm”.

  • tokenizer (str) – The tokenizer identifier.

Return type:

Tuple[List[str], List[str]]

Returns:

A list of paths/URLs to the tokenized numpy data files in the mix, and a list of the corresponding labels.
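One way to consume the returned pair is to group the paths by label. The sketch below is a hypothetical helper, not part of olmo_core; it assumes the two lists are aligned index by index, and the sample paths and labels are invented for the illustration.

```python
from collections import defaultdict
from typing import Dict, List


def group_paths_by_label(paths: List[str], labels: List[str]) -> Dict[str, List[str]]:
    """Group data file paths under their labels (hypothetical helper)."""
    if len(paths) != len(labels):
        raise ValueError("paths and labels must have the same length")
    grouped: Dict[str, List[str]] = defaultdict(list)
    # Pair each path with the label at the same index.
    for path, label in zip(paths, labels):
        grouped[label].append(path)
    return dict(grouped)


# Invented sample values standing in for the result of build().
sample_paths = ["/data/a/part-0.npy", "/data/a/part-1.npy", "/data/b/part-0.npy"]
sample_labels = ["a", "a", "b"]
grouped = group_paths_by_label(sample_paths, sample_labels)
```

The length check fails early if the two lists are misaligned, rather than letting zip() silently drop the extra entries.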