data.composable

Overview

A composable data-loading API for fixed-sequence-length text data.

┌─────────────┐       ┌────────────────┐       ┌──────────────────────┐
│ TokenSource │ ⇢ ⋯ ⇢ │ InstanceSource │ ⇢ ⋯ ⇢ │ ComposableDataLoader │
└─────────────┘       └────────────────┘       └──────────────────────┘

This API consists of a series of simple, composable elements, including:

  1. TokenSource / DocumentSource: Token sources provide access to tokenized text data, while document sources are special token sources that also provide information on where the document boundaries are. Examples used on this page include InMemoryTokenSource, MixingTokenSource, and NumpyDocumentSource.

  2. InstanceSource: Instance sources convert token sources (or in some cases other instance sources) into fixed-length instances. Examples used on this page include ConcatAndChunkInstanceSource, MixingInstanceSource, and SamplingInstanceSource.

  3. ComposableDataLoader: A data loader for OLMo-core’s Trainer that takes one or more instance sources.

Tip

Use InstanceSource.visualize() to print out a recursive visualization of an instance source and all its sub-sources.

Basic Examples

Create a simple instance source that chunks up in-memory token sources:

from olmo_core.data.composable import *

work_dir = "/tmp/dataset-common"
source = ConcatAndChunkInstanceSource(
    InMemoryTokenSource(list(range(100)), work_dir=work_dir),
    sequence_length=10,
    work_dir=work_dir,
)
source.visualize()
ConcatAndChunkInstanceSource(ee7a76d): 100 tokens
└─ InMemoryTokenSource(73b91ee): 100 tokens
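
The chunking itself can be pictured with plain Python lists. The following is a minimal sketch of the idea only: it does not use OLMo-core, concat_and_chunk is a hypothetical helper, and dropping a trailing partial window is an assumption rather than documented behavior.

```python
from typing import Iterable, List, Sequence


def concat_and_chunk(
    token_arrays: Iterable[Sequence[int]], sequence_length: int
) -> List[List[int]]:
    """Join token arrays end to end, then cut fixed-length instances.

    Hypothetical helper for illustration. Dropping a trailing partial
    window is an assumption, not documented library behavior.
    """
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    # Concatenate all token arrays into one flat list.
    tokens: List[int] = [tok for arr in token_arrays for tok in arr]
    # Keep only complete windows of sequence_length tokens.
    num_instances = len(tokens) // sequence_length
    return [
        tokens[i * sequence_length : (i + 1) * sequence_length]
        for i in range(num_instances)
    ]


instances = concat_and_chunk([list(range(100))], sequence_length=10)
```

With 100 tokens and sequence_length=10 this gives 10 instances of 10 tokens each, in line with the 100 tokens reported by the visualization above.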

Split the source into train and test sets:

train_source, test_source = source.split(0.8)
train_source.visualize()
test_source.visualize()
SlicedInstanceSource(d01d0e2): 80 tokens
└─ ConcatAndChunkInstanceSource(ee7a76d): 100 tokens
   └─ InMemoryTokenSource(73b91ee): 100 tokens

SlicedInstanceSource(a5a511f): 20 tokens
└─ ConcatAndChunkInstanceSource(ee7a76d): 100 tokens
   └─ InMemoryTokenSource(73b91ee): 100 tokens

Sample a subset of a source:

train_source = train_source.sample(max_tokens=50)
train_source.visualize()
SamplingInstanceSource(77d8031): 50 tokens
└─ SlicedInstanceSource(d01d0e2): 80 tokens
   └─ ConcatAndChunkInstanceSource(ee7a76d): 100 tokens
      └─ InMemoryTokenSource(73b91ee): 100 tokens

Create a mix of token sources:

tokens1 = InMemoryTokenSource(list(range(100)), work_dir=work_dir, label="source1")
tokens2 = InMemoryTokenSource(list(range(100, 200)), work_dir=work_dir, label="source2")
tokens_mix = MixingTokenSource(
    MixingTokenSource.Spec(source=tokens1, ratio=0.5),
    MixingTokenSource.Spec(source=tokens2, ratio=0.5),
    work_dir=work_dir,
)
source = ConcatAndChunkInstanceSource(tokens_mix, sequence_length=10, work_dir=work_dir)
source.visualize()
ConcatAndChunkInstanceSource(4820826): 200 tokens
└─ MixingTokenSource(5fc211a): 200 tokens
   ├─ SamplingTokenSource(7adca21): 100 tokens [source1]
   │  └─ InMemoryTokenSource(73b91ee): 100 tokens [source1]
   └─ SamplingTokenSource(baf2e4f): 100 tokens [source2]
      └─ InMemoryTokenSource(a9e49e1): 100 tokens [source2]

Working with numpy source files

You can use the same tokenized numpy source files that the dataset classes in olmo_core.data.numpy_dataset consume by starting with a NumpyDocumentSource or its corresponding config class, NumpyDocumentSourceConfig.

For example:

source_config = NumpyDocumentSource.Config(
    source_paths=[
        "gs://ai2-llm/preprocessed/midtraining-reasoning/tinyMATH/PoT/processed_data/processed-decon-sparkle-motion/allenai/dolma2-tokenizer/*.npy"
    ],
    tokenizer=tokenizer,
)
sources = source_config.build(work_dir)
[
    NumpyDocumentSource('gs://ai2-llm/preprocessed/midtraining-reasoning/tinyMATH/PoT/processed_data/processed-decon-sparkle-motion/allenai/dolma2-tokenizer/part-00-00000.npy',),
    NumpyDocumentSource('gs://ai2-llm/preprocessed/midtraining-reasoning/tinyMATH/PoT/processed_data/processed-decon-sparkle-motion/allenai/dolma2-tokenizer/part-01-00000.npy',),
    NumpyDocumentSource('gs://ai2-llm/preprocessed/midtraining-reasoning/tinyMATH/PoT/processed_data/processed-decon-sparkle-motion/allenai/dolma2-tokenizer/part-02-00000.npy',),
    NumpyDocumentSource('gs://ai2-llm/preprocessed/midtraining-reasoning/tinyMATH/PoT/processed_data/processed-decon-sparkle-motion/allenai/dolma2-tokenizer/part-03-00000.npy',),
    NumpyDocumentSource('gs://ai2-llm/preprocessed/midtraining-reasoning/tinyMATH/PoT/processed_data/processed-decon-sparkle-motion/allenai/dolma2-tokenizer/part-04-00000.npy',),
    NumpyDocumentSource('gs://ai2-llm/preprocessed/midtraining-reasoning/tinyMATH/PoT/processed_data/processed-decon-sparkle-motion/allenai/dolma2-tokenizer/part-05-00000.npy',),
    NumpyDocumentSource('gs://ai2-llm/preprocessed/midtraining-reasoning/tinyMATH/PoT/processed_data/processed-decon-sparkle-motion/allenai/dolma2-tokenizer/part-06-00000.npy',),
    NumpyDocumentSource('gs://ai2-llm/preprocessed/midtraining-reasoning/tinyMATH/PoT/processed_data/processed-decon-sparkle-motion/allenai/dolma2-tokenizer/part-07-00000.npy',),
    NumpyDocumentSource('gs://ai2-llm/preprocessed/midtraining-reasoning/tinyMATH/PoT/processed_data/processed-decon-sparkle-motion/allenai/dolma2-tokenizer/part-08-00000.npy',),
    NumpyDocumentSource('gs://ai2-llm/preprocessed/midtraining-reasoning/tinyMATH/PoT/processed_data/processed-decon-sparkle-motion/allenai/dolma2-tokenizer/part-09-00000.npy',)
]

Ratio-based mixing

Here’s a more useful example where we create several groups of numpy sources, two code and two math, using NumpyDocumentSourceConfig.from_source_groups():

sequence_length = 8192
token_sources = NumpyDocumentSource.Config.from_source_groups(
    {
        "code_fim": [
            "gs://ai2-llm/preprocessed/stack-edu/sample-fim-weighted-pl-edu-score-decon/**/**/*.npy",
        ],
        "swallowcode": [
            "gs://ai2-llm/preprocessed/tokyotech-llm/swallowcode/scor_final_data-decon-sparkle-motion/allenai/dolma2-tokenizer/*.npy"
        ],
        "megamath": [
            "gs://ai2-llm/preprocessed/megamath_web_pro_max/beaker_rewrites-decon-sparkle-motion/**/allenai/dolma2-tokenizer/*.npy"
        ],
        "dolminos2math": [
            "gs://ai2-llm/preprocessed/tokyotech-llm/swallowmath/beaker_outputs-decon-sparkle-motion-withids/allenai/dolma2-tokenizer/*.npy",
            "gs://ai2-llm/preprocessed/midtraining-reasoning/flat_dolmino_math-decon-sparkle-motion/allenai/dolma2-tokenizer/*.npy",
            "gs://ai2-llm/preprocessed/midtraining-reasoning/OpenMathReasoning/OpenMathReasoning-rewrite-full-thoughts/jsonls-decon-sparkle-motion/allenai/dolma2-tokenizer/*.npy",
            "gs://ai2-llm/preprocessed/midtraining-reasoning/tinyMATH/MIND/data/processed-decon-sparkle-motion/allenai/dolma2-tokenizer/*.npy",
            "gs://ai2-llm/preprocessed/midtraining-reasoning/tinyMATH/PoT/processed_data/processed-decon-sparkle-motion/allenai/dolma2-tokenizer/*.npy",
        ],
    },
    tokenizer=tokenizer,
)

And then mix them together at the instance level in a hierarchical fashion with a MixingInstanceSource to get a source with 30B tokens:

def make_instance_source(label: str) -> InstanceSourceConfig:
    return ConcatAndChunkInstanceSource.Config(
        sources=[token_sources[label]], label=label, sequence_length=sequence_length
    )


mix_config = MixingInstanceSource.Config(
    source_specs=[
        ################
        # code sources #
        ################
        MixingInstanceSource.Spec.Config(
            source=MixingInstanceSource.Config(
                source_specs=[
                    MixingInstanceSource.Spec.Config(
                        source=make_instance_source("code_fim"),
                        ratio=0.5,
                        label="code_fim",
                    ),
                    MixingInstanceSource.Spec.Config(
                        source=make_instance_source("swallowcode"),
                        ratio=0.5,
                        label="swallowcode",
                    ),
                ]
            ),
            ratio=0.5,
            label="code",
        ),
        ################
        # math sources #
        ################
        MixingInstanceSource.Spec.Config(
            source=MixingInstanceSource.Config(
                source_specs=[
                    MixingInstanceSource.Spec.Config(
                        source=make_instance_source("megamath"),
                        ratio=0.1,
                        label="megamath",
                    ),
                    MixingInstanceSource.Spec.Config(
                        source=make_instance_source("dolminos2math"),
                        ratio=0.9,
                        label="dolminos2math",
                    ),
                ]
            ),
            ratio=0.5,
            label="math",
        ),
    ],
    num_tokens=30_000_000_000,
)

mix = mix_config.build("/tmp/dataset-common")
mix.visualize()
MixingInstanceSource(e421147): 30.0B tokens
├─ SamplingInstanceSource(c65cde2): 15.0B tokens [code]
│  └─ MixingInstanceSource(73dfe43): 37.7B tokens
│     ├─ SamplingInstanceSource(85521bf): 18.8B tokens [code_fim]
│     │  └─ ConcatAndChunkInstanceSource(adb4562): 21.4B tokens [code_fim]
│     │     └─ NumpyDocumentSource x 474: 21.4B tokens [code_fim]
│     └─ SamplingInstanceSource(8d7c840): 18.8B tokens [swallowcode]
│        └─ ConcatAndChunkInstanceSource(b2f2ef4): 18.8B tokens [swallowcode]
│           └─ NumpyDocumentSource x 128: 18.8B tokens [swallowcode]
└─ SamplingInstanceSource(03941ca): 15.0B tokens [math]
   └─ MixingInstanceSource(39aa7de): 20.3B tokens
      ├─ SamplingInstanceSource(cbc20a2): 2.0B tokens [megamath]
      │  └─ ConcatAndChunkInstanceSource(2b6a324): 3.9B tokens [megamath]
      │     └─ NumpyDocumentSource x 264: 3.9B tokens [megamath]
      └─ SamplingInstanceSource(857de5e): 18.3B tokens [dolminos2math]
         └─ ConcatAndChunkInstanceSource(b768b9a): 18.3B tokens [dolminos2math]
            └─ NumpyDocumentSource x 415: 18.3B tokens [dolminos2math]
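
The sizes of the two inner mixes are consistent with a simple rule: when a mix has no num_tokens target, its size is the largest total that satisfies the ratios without repeating any source, i.e. the minimum of available tokens divided by ratio over its sources. This is an inference from the numbers shown, not a statement about the implementation, and max_mix_size is a hypothetical helper:

```python
from typing import Dict, Tuple


def max_mix_size(sources: Dict[str, Tuple[float, float]]) -> float:
    """Largest mix size that needs no repetition of any source.

    `sources` maps a label to (available_tokens, ratio), with the ratios
    assumed to sum to 1. Hypothetical helper for illustration only.
    """
    return min(available / ratio for available, ratio in sources.values())


# Token counts in billions, copied from the (rounded) visualization.
code_total = max_mix_size({"code_fim": (21.4, 0.5), "swallowcode": (18.8, 0.5)})
math_total = max_mix_size({"megamath": (3.9, 0.1), "dolminos2math": (18.3, 0.9)})
```

From these rounded inputs the code mix comes out at 37.6B and the math mix at about 20.3B. The 37.7B shown for the code mix is consistent with the displayed counts being rounded.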

Tip

The ratios (e.g. MixingInstanceSourceSpec.ratio) for each source within a mix don’t necessarily need to sum to 1.0, but if they don’t, you’ll see a warning and they’ll be normalized before being applied:

UserWarning: Target mixing ratios don't sum to 1. They will be normalized as follows:
 ❯ Source 'math': target ratio adjusted from 0.7 to 0.7368421052631579
 ❯ Source 'code': target ratio adjusted from 0.25 to 0.2631578947368421
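
The adjusted values in that warning are what you get by dividing each ratio by the sum of all ratios. A minimal sketch, where normalize_ratios is a hypothetical helper and not part of the library:

```python
from typing import Dict


def normalize_ratios(ratios: Dict[str, float]) -> Dict[str, float]:
    """Scale the ratios so that they sum to 1 (hypothetical helper)."""
    total = sum(ratios.values())
    if total <= 0:
        raise ValueError("ratios must sum to a positive number")
    return {label: ratio / total for label, ratio in ratios.items()}


normalized = normalize_ratios({"math": 0.7, "code": 0.25})
```

Here 0.7 / 0.95 and 0.25 / 0.95 reproduce the adjusted ratios for 'math' and 'code' reported in the warning.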

Up-sampling or targeted repetition

Suppose we wanted to simulate training for 3 epochs on the mixture above, i.e. training on 3 repetitions of the data. In general, you can do exact up-sampling by wrapping a source in a SamplingInstanceSource (or a SamplingTokenSource or SamplingDocumentSource), or by calling the .sample() / .resize() methods:

upsampled_mix = mix.resize(3.0)
upsampled_mix.visualize()
SamplingInstanceSource(de59d5e): 90.0B tokens
└─ MixingInstanceSource(e421147): 30.0B tokens
   ├─ SamplingInstanceSource(c65cde2): 15.0B tokens [code]
   │  └─ MixingInstanceSource(73dfe43): 37.7B tokens
   │     ├─ SamplingInstanceSource(85521bf): 18.8B tokens [code_fim]
   │     │  └─ ConcatAndChunkInstanceSource(adb4562): 21.4B tokens [code_fim]
   │     │     └─ NumpyDocumentSource x 474: 21.4B tokens [code_fim]
   │     └─ SamplingInstanceSource(8d7c840): 18.8B tokens [swallowcode]
   │        └─ ConcatAndChunkInstanceSource(b2f2ef4): 18.8B tokens [swallowcode]
   │           └─ NumpyDocumentSource x 128: 18.8B tokens [swallowcode]
   └─ SamplingInstanceSource(03941ca): 15.0B tokens [math]
      └─ MixingInstanceSource(39aa7de): 20.3B tokens
         ├─ SamplingInstanceSource(cbc20a2): 2.0B tokens [megamath]
         │  └─ ConcatAndChunkInstanceSource(2b6a324): 3.9B tokens [megamath]
         │     └─ NumpyDocumentSource x 264: 3.9B tokens [megamath]
         └─ SamplingInstanceSource(857de5e): 18.3B tokens [dolminos2math]
            └─ ConcatAndChunkInstanceSource(b768b9a): 18.3B tokens [dolminos2math]
               └─ NumpyDocumentSource x 415: 18.3B tokens [dolminos2math]
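
One way to think about exact up-sampling is as index selection: every instance appears once per whole repetition, and any fractional remainder is filled by sampling without replacement. The sketch below illustrates that idea under those assumptions; resize_indices is a hypothetical helper and may not match how SamplingInstanceSource selects instances internally.

```python
import random
from typing import List, Optional


def resize_indices(
    num_instances: int, factor: float, seed: Optional[int] = 0
) -> List[int]:
    """Pick instance indices for a source resized by `factor`.

    Hypothetical helper. Whole repetitions include every index once;
    the fractional remainder is a random sample without replacement.
    """
    if num_instances <= 0 or factor <= 0:
        raise ValueError("num_instances and factor must be positive")
    target = int(num_instances * factor)
    full_passes, remainder = divmod(target, num_instances)
    indices = list(range(num_instances)) * full_passes
    if remainder:
        rng = random.Random(seed)
        indices.extend(rng.sample(range(num_instances), remainder))
    return indices


upsampled = resize_indices(10, 3.0)
```

With a factor of 3.0 every index appears exactly three times, which mirrors the 30.0B to 90.0B token growth shown above.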

Curriculum learning

The composable API also enables curriculum learning. Suppose we want the first half of training to focus on 25% code + 75% math, and the second half to focus on 75% code + 25% math.

We’ll start by randomly splitting each of our sources. Since we’ll want to set RNG seeds in multiple places, we’ll use the helper function set_composable_seed() to set the global starting seed, so that we don’t have to set a different seed explicitly everywhere one is required:

set_composable_seed(42)

instance_sources = {
    "code_fim": make_instance_source("code_fim").random_split(0.25),
    "swallowcode": make_instance_source("swallowcode").random_split(0.25),
    "megamath": make_instance_source("megamath").random_split(0.75),
    "dolminos2math": make_instance_source("dolminos2math").random_split(0.75),
}

And then we can create two separate mixes with the splits:

def make_source_spec(label: str, split: int, ratio: float) -> MixingInstanceSourceSpecConfig:
    return MixingInstanceSource.Spec.Config(
        source=instance_sources[label][split],
        ratio=ratio,
        label=label,
    )

mix_config1 = MixingInstanceSource.Config(
    source_specs=[
        MixingInstanceSource.Spec.Config(
            source=MixingInstanceSource.Config(
                source_specs=[
                    make_source_spec("code_fim", 0, 0.5),
                    make_source_spec("swallowcode", 0, 0.5),
                ]
            ),
            ratio=0.25,
            label="code",
        ),
        MixingInstanceSource.Spec.Config(
            source=MixingInstanceSource.Config(
                source_specs=[
                    make_source_spec("megamath", 0, 0.1),
                    make_source_spec("dolminos2math", 0, 0.9),
                ]
            ),
            ratio=0.75,
            label="math",
        ),
    ],
)

mix_config2 = MixingInstanceSource.Config(
    source_specs=[
        MixingInstanceSource.Spec.Config(
            source=MixingInstanceSource.Config(
                source_specs=[
                    make_source_spec("code_fim", 1, 0.5),
                    make_source_spec("swallowcode", 1, 0.5),
                ]
            ),
            ratio=0.75,
            label="code",
        ),
        MixingInstanceSource.Spec.Config(
            source=MixingInstanceSource.Config(
                source_specs=[
                    make_source_spec("megamath", 1, 0.1),
                    make_source_spec("dolminos2math", 1, 0.9),
                ]
            ),
            ratio=0.25,
            label="math",
        ),
    ],
)

mix1 = mix_config1.build("/tmp/dataset-common")
mix1.visualize()
mix2 = mix_config2.build("/tmp/dataset-common")
mix2.visualize()
MixingInstanceSource(6544ac1): 20.3B tokens
├─ SamplingInstanceSource(d28e959): 5.1B tokens [code]
│  └─ MixingInstanceSource(08c8aa6): 9.4B tokens
│     ├─ SamplingInstanceSource(c8e7179): 4.7B tokens [code_fim]
│     │  └─ SlicedInstanceSource(02637e6): 5.3B tokens [code_fim]
│     │     └─ ConcatAndChunkInstanceSource(adb4562): 21.4B tokens [code_fim]
│     │        └─ NumpyDocumentSource x 474: 21.4B tokens [code_fim]
│     └─ SamplingInstanceSource(520fff5): 4.7B tokens [swallowcode]
│        └─ SlicedInstanceSource(fcaebbc): 4.7B tokens [swallowcode]
│           └─ ConcatAndChunkInstanceSource(b2f2ef4): 18.8B tokens [swallowcode]
│              └─ NumpyDocumentSource x 128: 18.8B tokens [swallowcode]
└─ SamplingInstanceSource(b33b3f1): 15.2B tokens [math]
   └─ MixingInstanceSource(47322cf): 15.2B tokens
      ├─ SamplingInstanceSource(ccadff0): 1.5B tokens [megamath]
      │  └─ SlicedInstanceSource(c4cd38d): 2.9B tokens [megamath]
      │     └─ ConcatAndChunkInstanceSource(2b6a324): 3.9B tokens [megamath]
      │        └─ NumpyDocumentSource x 264: 3.9B tokens [megamath]
      └─ SamplingInstanceSource(476779e): 13.7B tokens [dolminos2math]
         └─ SlicedInstanceSource(75c18b6): 13.7B tokens [dolminos2math]
            └─ ConcatAndChunkInstanceSource(b768b9a): 18.3B tokens [dolminos2math]
               └─ NumpyDocumentSource x 415: 18.3B tokens [dolminos2math]

MixingInstanceSource(02f0d21): 20.3B tokens
├─ SamplingInstanceSource(76da23b): 15.2B tokens [code]
│  └─ MixingInstanceSource(cb963a6): 28.3B tokens
│     ├─ SamplingInstanceSource(5c8e643): 14.1B tokens [code_fim]
│     │  └─ SlicedInstanceSource(f0ca032): 16.0B tokens [code_fim]
│     │     └─ ConcatAndChunkInstanceSource(adb4562): 21.4B tokens [code_fim]
│     │        └─ NumpyDocumentSource x 474: 21.4B tokens [code_fim]
│     └─ SamplingInstanceSource(79187df): 14.1B tokens [swallowcode]
│        └─ SlicedInstanceSource(0ae4650): 14.1B tokens [swallowcode]
│           └─ ConcatAndChunkInstanceSource(b2f2ef4): 18.8B tokens [swallowcode]
│              └─ NumpyDocumentSource x 128: 18.8B tokens [swallowcode]
└─ SamplingInstanceSource(8e32820): 5.1B tokens [math]
   └─ MixingInstanceSource(25a8bb7): 5.1B tokens
      ├─ SamplingInstanceSource(70d1102): 507.6M tokens [megamath]
      │  └─ SlicedInstanceSource(d3ca4bc): 970.6M tokens [megamath]
      │     └─ ConcatAndChunkInstanceSource(2b6a324): 3.9B tokens [megamath]
      │        └─ NumpyDocumentSource x 264: 3.9B tokens [megamath]
      └─ SamplingInstanceSource(afd6a12): 4.6B tokens [dolminos2math]
         └─ SlicedInstanceSource(b53996f): 4.6B tokens [dolminos2math]
            └─ ConcatAndChunkInstanceSource(b768b9a): 18.3B tokens [dolminos2math]
               └─ NumpyDocumentSource x 415: 18.3B tokens [dolminos2math]

When we build our ComposableDataLoader we’ll pass it both of those mixes, in order, and specify the ShuffleStrategy as intra_source so that each mix is shuffled independently during its phase of training:

data_loader = ComposableDataLoader.Config(
    tokenizer=tokenizer,
    global_batch_size=512 * sequence_length,
    shuffle_strategy=ShuffleStrategy.intra_source,
).build(mix1, mix2, work_dir="/tmp/dataloader-common")

Alternatively, you could set sources_per_epoch=1 to tell the data loader to use only the first source for the first epoch, the second source for the second epoch, and so on:

data_loader = ComposableDataLoader.Config(
    tokenizer=tokenizer,
    global_batch_size=512 * sequence_length,
    sources_per_epoch=1,
).build(mix1, mix2, work_dir="/tmp/dataloader-common")
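
Both configurations set global_batch_size=512 * sequence_length because global_batch_size is measured in tokens across all data parallel ranks (see the ComposableDataLoader reference). The arithmetic can be sketched as follows; batch_shape is a hypothetical helper, dp_world_size=8 is an arbitrary example value, and requiring exact divisibility is an assumption made to keep the numbers unambiguous.

```python
from typing import Tuple


def batch_shape(
    global_batch_size: int, sequence_length: int, dp_world_size: int = 1
) -> Tuple[int, int]:
    """Return (instances per global batch, instances per data parallel rank).

    Hypothetical helper; the divisibility checks are an assumption.
    """
    if global_batch_size % sequence_length != 0:
        raise ValueError("global_batch_size must be a multiple of sequence_length")
    instances = global_batch_size // sequence_length
    if instances % dp_world_size != 0:
        raise ValueError("instances per batch must divide evenly across ranks")
    return instances, instances // dp_world_size


sequence_length = 8192
global_instances, per_rank_instances = batch_shape(
    512 * sequence_length, sequence_length, dp_world_size=8
)
```

So each global batch holds 512 instances of 8192 tokens, or 64 instances per rank in this 8-rank example.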

Reference

class olmo_core.data.composable.SourceABC(*, work_dir, label=None)[source]

Bases: object

Abstract base class for source types.

Parameters:
  • work_dir (Union[Path, PathLike, str]) – A common local working directory that can be used for caching files during preprocessing.

  • label (Optional[str], default: None) – An optional label for this source, useful for debugging and visualizing.

property common_work_dir: Path

The common working directory, usually the parent of work_dir.

property work_dir: Path

The class-specific local working directory that can be used by the source for caching files during preprocessing.

property fs_local_rank: int

The local rank of the current process with respect to filesystem access of the working directory.

property rank: int

The global rank of the current process across the entire distributed job.

property label: str | None

The label assigned to this source.

abstract property fingerprint: str

A unique, deterministic string representing the ordered contents of the source.

abstract property num_tokens: int

The number of tokens available from this source.

abstract children()[source]

Get the child sources that make up this source, if any.

Return type:

Iterable[SourceABC]

property is_leaf: bool

Check if this source is a leaf node (i.e. has no children).

class olmo_core.data.composable.TokenSource(*, work_dir, label=None)[source]

Bases: SourceABC

An abstract base class for a source of tokens, usually consumed by an InstanceSource. It essentially represents an array of tokens.

At a minimum, a TokenSource must implement the following methods and properties: (1) num_tokens, (2) get_token_range(), (3) fingerprint, and (4) children().

__len__()[source]

The number of tokens available from this source, same as self.num_tokens.

Return type:

int

abstract get_token_range(start_idx, end_idx)[source]

Get a range of contiguous tokens starting from start_idx (0-based, inclusive) to end_idx (exclusive).

Since a TokenSource isn’t necessarily aware of document boundaries (see DocumentSource), the token range could start in the middle of a document and span multiple documents. It’s up to the consumers of a token source (e.g. an InstanceSource) to get ranges that make sense for their use case.

Return type:

TokenRange

__getitem__(key)[source]

Get a range of tokens using either an integer index (for a singular token range) or a slice.

Return type:

TokenRange
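
The indexing semantics of get_token_range() and __getitem__() can be illustrated with a stand-alone class. This is a sketch only: ListTokenSource is hypothetical, it does not subclass the real TokenSource (so work_dir, fingerprint, and children() are omitted), and its handling of negative indices and non-contiguous slices is an assumption.

```python
from typing import Dict, List, Sequence, Union


class ListTokenSource:
    """Stand-alone illustration backed by a Python list (hypothetical)."""

    def __init__(self, tokens: Sequence[int]) -> None:
        self._tokens = list(tokens)

    @property
    def num_tokens(self) -> int:
        return len(self._tokens)

    def __len__(self) -> int:
        return self.num_tokens

    def get_token_range(self, start_idx: int, end_idx: int) -> Dict[str, List[int]]:
        # start_idx is inclusive and 0-based, end_idx is exclusive.
        if not 0 <= start_idx <= end_idx <= self.num_tokens:
            raise IndexError(f"invalid token range [{start_idx}, {end_idx})")
        return {"input_ids": self._tokens[start_idx:end_idx]}

    def __getitem__(self, key: Union[int, slice]) -> Dict[str, List[int]]:
        if isinstance(key, slice):
            start, stop, step = key.indices(self.num_tokens)
            if step != 1:
                raise ValueError("only contiguous slices are supported")
            return self.get_token_range(start, max(start, stop))
        if key < 0:
            key += self.num_tokens
        # An integer index is treated as a range holding a single token.
        return self.get_token_range(key, key + 1)


src = ListTokenSource(range(100))
window = src.get_token_range(10, 13)
```

The returned dictionaries follow the shape of a TokenRange with only input_ids set; label_mask is optional and left out here.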

__add__(other)[source]

Add two token sources together into a ConcatenatedTokenSource or ConcatenatedDocumentSource depending on the type of self and other.

Return type:

ConcatenatedTokenSource

__mul__(factor)[source]

Re-size this source by a given factor by sampling tokens from it.

Return type:

SamplingTokenSource

sample(*, max_tokens, seed=0)[source]

Sample a contiguous chunk of tokens from this source.

See also

resize()

Parameters:
  • max_tokens (int) – The maximum number of tokens to sample.

  • seed (Optional[int], default: 0) – A seed to use to randomize the sampling.

Return type:

SamplingTokenSource

resize(factor, seed=0)[source]

Re-size this source by a given factor by sampling a contiguous chunk of tokens from it.

See also

sample()

Parameters:
  • factor (float) – The factor to resize the source by. For example, 0.5 will create a source with half the number of tokens, and 2.0 will create a source with twice the number of tokens.

  • seed (Optional[int], default: 0) – A seed to use to randomize the sampling.

Return type:

SamplingTokenSource

split(ratio)[source]

Split this source into two disjoint sources according to the given ratio.

Parameters:

ratio (float) – The ratio of the first split to the original source. E.g., 0.8 means the first split will have 80% of the tokens and the second split will have 20%.

Return type:

Tuple[SlicedTokenSource, SlicedTokenSource]

class olmo_core.data.composable.TokenSourceConfig[source]

Bases: Config

A base config class for configuring and building a TokenSource.

abstract build(work_dir)[source]

Build the token source.

Return type:

List[TokenSource]

__add__(other)[source]

Add two token source configs together into a ConcatenatedTokenSourceConfig or ConcatenatedDocumentSourceConfig depending on the type of self and other.

Return type:

TokenSourceConfig

__mul__(factor)[source]

Re-size this source by a given factor by sampling tokens from it.

Return type:

SamplingTokenSourceConfig

sample(*, max_tokens, seed=0)[source]

Sample a contiguous chunk of tokens from this source.

Parameters:
  • max_tokens (int) – The maximum number of tokens to sample.

  • seed (Optional[int], default: 0) – A seed to use to randomize the sampling.

Return type:

SamplingTokenSourceConfig

resize(factor, seed=0)[source]

Re-size this source by a given factor by sampling a contiguous chunk of tokens from it.

Parameters:
  • factor (float) – The factor to resize the source by. For example, 0.5 will create a source with half the number of tokens, and 2.0 will create a source with twice the number of tokens.

  • seed (Optional[int], default: 0) – A seed to use to randomize the sampling.

Return type:

SamplingTokenSourceConfig

split(ratio)[source]

Split this source into two disjoint sources according to the given ratio.

Parameters:

ratio (float) – The ratio of the first split to the original source. E.g., 0.8 means the first split will have 80% of the tokens and the second split will have 20%.

Return type:

Tuple[SplitTokenSourceConfig, SplitTokenSourceConfig]

class olmo_core.data.composable.DocumentSource(*, work_dir, label=None)[source]

Bases: TokenSource

An abstract base class for a particular type of TokenSource that’s aware of document boundaries. This class has one additional abstract method: get_document_offsets().

sample_by_docs(*, max_tokens, seed=0)[source]

Sample documents from this source.

Parameters:
  • max_tokens (int) – The maximum number of tokens to sample.

  • seed (Optional[int], default: 0) – A seed to use to randomize the sampling.

Return type:

SamplingDocumentSource

resize_by_docs(factor, seed=0)[source]

Re-size this source by a given factor by sampling documents from it.

Parameters:
  • factor (float) – The factor to resize the source by. For example, 0.5 will create a source with half the number of tokens, and 2.0 will create a source with twice the number of tokens.

  • seed (Optional[int], default: 0) – A seed to use to randomize the sampling.

Return type:

SamplingDocumentSource

abstract get_document_offsets()[source]

Get the start (inclusive) and end (exclusive) token indices of each document, in order.

Return type:

Iterable[tuple[int, int]]
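
To make the offset convention concrete, the sketch below derives (start, end) pairs from a flat token list. It assumes that each document ends with an end-of-sequence token that counts as part of that document; that convention, the eos_token_id parameter, and the document_offsets helper are illustrative assumptions and not taken from the library.

```python
from typing import Iterator, Sequence, Tuple


def document_offsets(
    tokens: Sequence[int], eos_token_id: int
) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) token indices per document, end exclusive.

    Hypothetical helper. Assumes documents are terminated by
    eos_token_id, which is counted as part of the document.
    """
    start = 0
    for idx, tok in enumerate(tokens):
        if tok == eos_token_id:
            yield (start, idx + 1)
            start = idx + 1
    # A trailing run without a terminating EOS is treated as a document.
    if start < len(tokens):
        yield (start, len(tokens))


offsets = list(document_offsets([5, 6, 0, 7, 0, 8, 9], eos_token_id=0))
```

Consecutive pairs share a boundary, so the offsets cover the token array in order without gaps or overlap.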

class olmo_core.data.composable.DocumentSourceConfig[source]

Bases: TokenSourceConfig

A base config class for configuring and building a DocumentSource.

abstract build(work_dir)[source]

Build the document source.

Return type:

List[DocumentSource]

sample_by_docs(*, max_tokens, seed=0)[source]

Sample documents from this source.

Parameters:
  • max_tokens (int) – The maximum number of tokens to sample.

  • seed (Optional[int], default: 0) – A seed to use to randomize the sampling.

Return type:

SamplingDocumentSourceConfig

resize_by_docs(factor, seed=0)[source]

Re-size this source by a given factor by sampling documents from it.

Parameters:
  • factor (float) – The factor to resize the source by. For example, 0.5 will create a source with half the number of tokens, and 2.0 will create a source with twice the number of tokens.

  • seed (Optional[int], default: 0) – A seed to use to randomize the sampling.

Return type:

SamplingDocumentSourceConfig

class olmo_core.data.composable.TokenRange[source]

Bases: TypedDict

A token range is just a dictionary that should include input_ids of the range and optionally a corresponding label_mask.

input_ids: Sequence[int]

The token IDs for the range.

label_mask: NotRequired[Sequence[bool]]

An optional mask indicating which tokens should contribute to the loss.

class olmo_core.data.composable.InstanceSource(*, work_dir, sequence_length, max_sequence_length=None, label=None)[source]

Bases: SourceABC

An abstract base class for a source of instances, usually consumed by a ComposableDataLoader. It essentially represents an array of instances, where each instance is a sequence of sequence_length tokens.

Parameters:
  • sequence_length (int) – The length of each sequence (instance) to produce.

  • max_sequence_length (Optional[int], default: None) – Only applies to sources that support it. If you intend to increase the sequence length in the middle of an epoch, set this to the maximum sequence length that you’ll train on; this guarantees that you can restart the run with the same data order after changing the sequence length. When implementing this in a subclass, take care to ensure that the exact same tokens are produced when sequence_length changes but max_sequence_length stays fixed.
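
The requirement on max_sequence_length can be illustrated with a simple chunking scheme: decide which tokens are usable in whole windows of max_sequence_length, then cut that fixed prefix into instances of sequence_length. This is a sketch of the property only; chunk_with_max is a hypothetical helper, and real sources may satisfy the requirement differently.

```python
from typing import List, Optional, Sequence


def chunk_with_max(
    tokens: Sequence[int],
    sequence_length: int,
    max_sequence_length: Optional[int] = None,
) -> List[List[int]]:
    """Chunk tokens so the token stream depends only on max_sequence_length.

    Hypothetical helper for illustration only.
    """
    max_len = sequence_length if max_sequence_length is None else max_sequence_length
    if max_len % sequence_length != 0:
        raise ValueError("max_sequence_length must be a multiple of sequence_length")
    # Keep only whole windows of max_len tokens, so the usable prefix
    # does not change when sequence_length changes.
    usable = (len(tokens) // max_len) * max_len
    return [
        list(tokens[start : start + sequence_length])
        for start in range(0, usable, sequence_length)
    ]


tokens = list(range(30))
short = chunk_with_max(tokens, sequence_length=4, max_sequence_length=8)
long = chunk_with_max(tokens, sequence_length=8, max_sequence_length=8)
```

Both settings yield the same 24 tokens in the same order, whereas chunking at sequence_length=4 without max_sequence_length would use 28 tokens and no longer line up with the longer setting.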

property sequence_length: int

The sequence length of each instance that this source will produce.

property max_sequence_length: int

Typically the same as sequence_length, though in some cases it can be greater, such as when the sequence length will be increased in the middle of an epoch.

property num_tokens: int

The number of tokens available from this source.

abstract __len__()[source]

The number of instances available from this source.

Return type:

int

abstract __getitem__(idx)[source]

Get an instance by index.

Return type:

Instance

__iter__()[source]

Iterate over all instances in the source.

Return type:

Generator[Instance, None, None]

__add__(other)[source]

Add two instance sources together into a ConcatenatedInstanceSource.

Return type:

ConcatenatedInstanceSource

__mul__(factor)[source]

Re-size this source by a given factor by sampling instances from it.

Return type:

SamplingInstanceSource

sample(*, max_tokens=None, max_instances=None, seed=0)[source]

Sample instances from this source.

Parameters:
  • max_tokens (Optional[int], default: None) – The maximum number of tokens to sample from this source. Mutually exclusive with max_instances.

  • max_instances (Optional[int], default: None) – The maximum number of instances to sample from this source. Mutually exclusive with max_tokens.

  • seed (Optional[int], default: 0) – A random seed for sampling. If None, no shuffling is done and instances are taken in order.

Return type:

SamplingInstanceSource

resize(factor, seed=0)[source]

Re-size this source by a given factor by sampling instances from it.

Parameters:
  • factor (float) – The factor by which to resize this source.

  • seed (Optional[int], default: 0) – A random seed for sampling.

Return type:

SamplingInstanceSource

split(ratio, seed=None)[source]

Split this source into two disjoint sources according to the given ratio.

Parameters:
  • ratio (float) – The ratio of the first split to the original source. E.g., 0.8 means the first split will have 80% of the instances and the second split will have 20%.

  • seed (Optional[int], default: None) – A seed to use to randomize the split.

Return type:

Tuple[SlicedInstanceSource, SlicedInstanceSource]

random_split(ratio, seed=0)[source]

Like split() but always a random split.

Return type:

Tuple[SlicedInstanceSource, SlicedInstanceSource]
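
A random split can be pictured as shuffling the instance indices with a seeded RNG and cutting the shuffled list at the ratio. The sketch below shows that idea with a hypothetical random_split_indices helper; rounding to the nearest instance and the use of Python's random module are assumptions, not the library's implementation.

```python
import random
from typing import List, Tuple


def random_split_indices(
    num_instances: int, ratio: float, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Shuffle indices with a seed, then cut at `ratio` (hypothetical helper)."""
    if not 0.0 < ratio < 1.0:
        raise ValueError("ratio must be strictly between 0 and 1")
    indices = list(range(num_instances))
    random.Random(seed).shuffle(indices)
    cut = round(num_instances * ratio)
    return indices[:cut], indices[cut:]


first, second = random_split_indices(100, 0.25, seed=0)
```

The two halves are disjoint and together cover every index, and reusing the same seed reproduces the same split.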

visualize(icons=True)[source]

Print a visualization of this source and its children, recursively.

Parameters:

icons (bool, default: True) – Whether to use icons in the visualization.

Important

Some icons used in the visualization require a Nerd Font to render properly.

class olmo_core.data.composable.InstanceSourceConfig[source]

Bases: Config

A base config class for configuring and building an InstanceSource.

abstract build(work_dir)[source]

Build the InstanceSource.

Return type:

InstanceSource

__add__(other)[source]

Add two instance source configs together into a ConcatenatedInstanceSourceConfig.

Return type:

ConcatenatedInstanceSourceConfig

__mul__(factor)[source]

Re-size this source by a given factor by sampling instances from it.

Return type:

SamplingInstanceSourceConfig

sample(*, max_tokens=None, max_instances=None, seed=0)[source]

Sample instances from this source.

Parameters:
  • max_tokens (Optional[int], default: None) – The maximum number of tokens to sample from this source. Mutually exclusive with max_instances.

  • max_instances (Optional[int], default: None) – The maximum number of instances to sample from this source. Mutually exclusive with max_tokens.

  • seed (Optional[int], default: 0) – A random seed for sampling. If None, no shuffling is done and instances are taken in order.

Return type:

SamplingInstanceSourceConfig

resize(factor, seed=0)[source]

Re-size this source by a given factor by sampling instances from it.

Parameters:
  • factor (float) – The factor by which to resize this source.

  • seed (Optional[int], default: 0) – A random seed for sampling.

Return type:

SamplingInstanceSourceConfig

split(ratio, seed=None)[source]

Split this source into two disjoint sources according to the given ratio.

Parameters:
  • ratio (float) – The ratio of the first split to the original source. E.g., 0.8 means the first split will have 80% of the instances and the second split will have 20%.

  • seed (Optional[int], default: None) – A seed to use to randomize the split.

Return type:

Tuple[SplitInstanceSourceConfig, SplitInstanceSourceConfig]

random_split(ratio, seed=0)[source]

Like split() but always a random split.

Return type:

Tuple[SplitInstanceSourceConfig, SplitInstanceSourceConfig]

class olmo_core.data.composable.Instance[source]

Bases: TypedDict

An instance is just a dictionary that should include input_ids and optionally a corresponding label_mask.

input_ids: Sequence[int]

The token IDs for this instance.

label_mask: NotRequired[Sequence[bool]]

An optional mask indicating which tokens should contribute to the loss.
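To make the shape of an instance concrete, here is a minimal standalone sketch that mirrors the two documented keys. The local `Instance` class is a stand-in defined for this example (it uses `total=False` inheritance rather than `NotRequired` so that it also runs on older Python versions); it is not imported from OLMo-core.

```python
from typing import Sequence, TypedDict


class _InstanceRequired(TypedDict):
    input_ids: Sequence[int]


class Instance(_InstanceRequired, total=False):
    # Optional key, playing the role of NotRequired[Sequence[bool]].
    label_mask: Sequence[bool]


plain: Instance = {"input_ids": [1, 2, 3, 4]}
masked: Instance = {
    "input_ids": [1, 2, 3, 4],
    "label_mask": [False, False, True, True],
}

# Tokens that would contribute to the loss in the masked instance.
loss_tokens = [
    tok for tok, keep in zip(masked["input_ids"], masked["label_mask"]) if keep
]
```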

class olmo_core.data.composable.ComposableDataLoader(*sources, collator, tokenizer, work_dir, global_batch_size, dp_world_size=1, dp_rank=0, fs_local_rank=None, seed=0, shuffle=True, shuffle_strategy=None, sources_per_epoch=-1, num_threads=None, num_workers=0, prefetch_factor=None, target_device_type='cpu', generate_doc_lengths=False, instance_filter_config=None, display_source_visualization=True)[source]

Bases: TextDataLoaderBase

A data loader for composable instance sources.

Parameters:
  • sources (InstanceSource) – One or more instance sources to draw data from. All sources must have the same sequence_length and max_sequence_length.

  • collator (DataCollator) – The data collator to use to form batches.

  • tokenizer (TokenizerConfig) – The config of the tokenizer used to create the underlying data.

  • work_dir (Union[Path, PathLike, str]) – A common local working directory that can be used for caching.

  • global_batch_size (int) – The total batch size (in tokens) across all data parallel ranks.

  • dp_world_size (int, default: 1) – The number of data parallel ranks.

  • dp_rank (int, default: 0) – The data parallel rank of the current process.

  • fs_local_rank (Optional[int], default: None) – The local rank of the current process with respect to filesystem access of the working directory.

  • seed (int, default: 0) – The random seed to use when shuffling data.

  • shuffle (bool, default: True) – Whether to shuffle data at the start of each epoch.

  • shuffle_strategy (Optional[ShuffleStrategy], default: None) – How to shuffle the data. Defaults to ShuffleStrategy.inter_source.

  • sources_per_epoch (int, default: -1) – The number of sources to use per epoch. If -1, all sources are used.

  • num_threads (Optional[int], default: None) – The number of threads to use for loading data within each worker process.

  • num_workers (int, default: 0) – The number of worker processes to use for loading data.

  • prefetch_factor (Optional[int], default: None) – The number of batches to prefetch from each worker process.

  • target_device_type (str, default: 'cpu') – The type of device that batches will be sent to, typically either “cpu” or “cuda”.

  • generate_doc_lengths (bool, default: False) – Whether to generate document lengths for each instance needed for intra-document masking.

  • instance_filter_config (Optional[InstanceFilterConfig], default: None) – Optional configuration for filtering instances based on long sequences of repeated ngrams.

  • display_source_visualization (bool, default: True) – Whether to display a visualization of each source to stdout from rank 0.
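Since global_batch_size is measured in tokens rather than instances, it can help to work out how many instances each data parallel rank would see. The helper below is a hypothetical back-of-the-envelope calculation, not part of the library; the divisibility checks are assumptions about what a sensible configuration looks like.

```python
def instances_per_rank(
    global_batch_size: int, sequence_length: int, dp_world_size: int
) -> int:
    """Hypothetical helper: instances each rank receives per batch."""
    # Assumption: the token batch size is a whole number of instances.
    if global_batch_size % sequence_length != 0:
        raise ValueError("global_batch_size must be a multiple of sequence_length")
    global_instances = global_batch_size // sequence_length
    # Assumption: instances are divided evenly across data parallel ranks.
    if global_instances % dp_world_size != 0:
        raise ValueError("instances per batch must divide evenly across ranks")
    return global_instances // dp_world_size


per_rank = instances_per_rank(
    global_batch_size=64 * 4096, sequence_length=4096, dp_world_size=8
)
```

In this example a batch of 64 × 4096 tokens is 64 instances, or 8 instances per rank across 8 ranks.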

Config

alias of ComposableDataLoaderConfig

property total_batches: int | None

The total number of batches that the dataset will produce over the course of the current epoch, if known. Otherwise this should return None.

batches_in_epoch(epoch)[source]

By default this is the same as the total_batches property, though some data loaders might generate a different number of batches per epoch.

Return type:

Optional[int]

state_dict()[source]

Get a state dictionary for checkpointing.

Return type:

Dict[str, Any]

load_state_dict(state_dict)[source]

Load a state dict from state_dict() to restore the data loader’s state.

reshuffle(epoch=None, **kwargs)[source]

Reshuffle for a new epoch. Should be called before starting the epoch, regardless of whether or not you’ve called load_state_dict().

Parameters:

epoch (Optional[int], default: None) – The epoch number.
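The ordering described above (restore state, then reshuffle before the epoch starts) can be sketched with a toy loader. `ToyLoader` and the keys in its state dictionary are hypothetical and exist only for this example; the real ComposableDataLoader tracks different state, and the seeding rule here is an assumption.

```python
import random
from typing import Any, Dict, List, Optional


class ToyLoader:
    """Toy stand-in for a data loader; not the ComposableDataLoader."""

    def __init__(self, num_instances: int, seed: int = 0) -> None:
        self.num_instances = num_instances
        self.seed = seed
        self.epoch = 0
        self.batches_processed = 0
        self.order: List[int] = []

    def state_dict(self) -> Dict[str, Any]:
        return {"epoch": self.epoch, "batches_processed": self.batches_processed}

    def load_state_dict(self, state_dict: Dict[str, Any]) -> None:
        self.epoch = state_dict["epoch"]
        self.batches_processed = state_dict["batches_processed"]

    def reshuffle(self, epoch: Optional[int] = None) -> None:
        if epoch is not None:
            self.epoch = epoch
        order = list(range(self.num_instances))
        # Assumption: the order depends only on the seed and the epoch number.
        random.Random(self.seed + self.epoch).shuffle(order)
        self.order = order


loader = ToyLoader(num_instances=8)
loader.reshuffle(epoch=1)
loader.batches_processed = 3
checkpoint = loader.state_dict()

# Resume: restore the saved state first, then reshuffle before iterating.
resumed = ToyLoader(num_instances=8)
resumed.load_state_dict(checkpoint)
resumed.reshuffle(epoch=checkpoint["epoch"])
```

Because the toy order is a pure function of seed and epoch, the resumed loader reproduces the same instance order and can skip the batches it has already processed.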

get_mock_batch()[source]

Return a batch with arbitrary data. This can just be random data as it’s only used by the trainer to do a dry-run of the forward and backward pass before training officially starts.

Return type:

Dict[str, Any]

class olmo_core.data.composable.ComposableDataLoaderConfig(tokenizer=None, global_batch_size=None, seed=<factory>, work_dir=None, shuffle=True, shuffle_strategy=None, sources_per_epoch=-1, num_threads=None, num_workers=0, prefetch_factor=None, target_device_type=None, generate_doc_lengths=False, instance_filter_config=None, display_source_visualization=True, *, type=None)[source]

Bases: DataLoaderConfig[ComposableDataLoader]

A configuration class for building ComposableDataLoader data loaders.

build(*sources, collator=None, work_dir=None, mesh=None, dp_process_group=None, tokenizer=None, global_batch_size=None)[source]

Construct the ComposableDataLoader.

Parameters:
  • sources (InstanceSource) – The instance sources.

  • collator (Optional[DataCollator], default: None) – An optional data collator. If not provided, a default will be created.

  • work_dir (Union[Path, PathLike, str, None], default: None) – A working directory for caching.

  • mesh (Optional[DeviceMesh], default: None) – An optional DeviceMesh that defines the data parallel dimensions. Ideally you should create this mesh using build_world_mesh(). Alternatively you can pass the dp_process_group instead.

  • dp_process_group (Optional[ProcessGroup], default: None) – The data parallel process group.

Return type:

ComposableDataLoader

registered_base

alias of DataLoaderConfig

class olmo_core.data.composable.InMemoryTokenSource(tokens, *, work_dir, label_mask=None, label=None)[source]

Bases: TokenSource

An in-memory implementation of a TokenSource. Primarily meant for testing.

property fingerprint: str

A unique, deterministic string representing the ordered contents of the source.

property num_tokens: int

The number of tokens available from this source.

get_token_range(start_idx, end_idx)[source]

Get a range of contiguous tokens starting from start_idx (0-based, inclusive) to end_idx (exclusive).

Since a TokenSource isn’t necessarily aware of document boundaries (see DocumentSource), the token range could start in the middle of a document and span multiple documents. It’s up to the consumers of a token source (e.g. an InstanceSource) to get ranges that make sense for their use case.

Return type:

TokenRange
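The half-open indexing convention can be illustrated with a toy source. `ToyTokenSource` is a hypothetical class written for this example; unlike the real method, which returns a TokenRange, it simply returns a list of token IDs.

```python
from typing import Iterable, List


class ToyTokenSource:
    """Toy stand-in showing the [start_idx, end_idx) convention."""

    def __init__(self, tokens: Iterable[int]) -> None:
        self._tokens = list(tokens)

    @property
    def num_tokens(self) -> int:
        return len(self._tokens)

    def get_token_range(self, start_idx: int, end_idx: int) -> List[int]:
        # start_idx is inclusive, end_idx is exclusive.
        if not 0 <= start_idx <= end_idx <= self.num_tokens:
            raise IndexError("token range out of bounds")
        return self._tokens[start_idx:end_idx]


source = ToyTokenSource(range(100))
chunk = source.get_token_range(10, 20)
```

The returned chunk holds tokens 10 through 19, so consecutive calls such as (0, 10) and (10, 20) tile the source without overlap.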

children()[source]

Get the child sources that make up this source, if any.

class olmo_core.data.composable.ConcatenatedTokenSource(*sources, work_dir, label=None)[source]

Bases: TokenSource

A token source that can be created from concatenating multiple other token sources.

Config

alias of ConcatenatedTokenSourceConfig

children()[source]

Get the child sources that make up this source, if any.

property fingerprint: str

A unique, deterministic string representing the ordered contents of the source.

property num_tokens: int

The number of tokens available from this source.

get_token_range(start_idx, end_idx)[source]

Get a range of contiguous tokens starting from start_idx (0-based, inclusive) to end_idx (exclusive).

Since a TokenSource isn’t necessarily aware of document boundaries (see DocumentSource), the token range could start in the middle of a document and span multiple documents. It’s up to the consumers of a token source (e.g. an InstanceSource) to get ranges that make sense for their use case.

Return type:

TokenRange

class olmo_core.data.composable.ConcatenatedTokenSourceConfig(sources, label=None)[source]

Bases: TokenSourceConfig

A base config class for configuring and building a ConcatenatedTokenSource.

build(work_dir)[source]

Build the token source.

Return type:

List[ConcatenatedTokenSource]

class olmo_core.data.composable.SlicedTokenSource(source, source_slice, *, work_dir, label=None)[source]

Bases: TokenSource

A token source that provides a slice of another token source.

property fingerprint: str

A unique, deterministic string representing the ordered contents of the source.

property num_tokens: int

The number of tokens available from this source.

get_token_range(start_idx, end_idx)[source]

Get a range of contiguous tokens starting from start_idx (0-based, inclusive) to end_idx (exclusive).

Since a TokenSource isn’t necessarily aware of document boundaries (see DocumentSource), the token range could start in the middle of a document and span multiple documents. It’s up to the consumers of a token source (e.g. an InstanceSource) to get ranges that make sense for their use case.

Return type:

TokenRange

children()[source]

Get the child sources that make up this source, if any.

class olmo_core.data.composable.SplitTokenSourceConfig(source, ratio, idx)[source]

Bases: TokenSourceConfig

A base config class for configuring and building a split TokenSource.

build(work_dir)[source]

Build the token source.

Return type:

List[SlicedTokenSource]

class olmo_core.data.composable.SamplingTokenSource(*sources, max_tokens, seed=0, work_dir, label=None)[source]

Bases: TokenSource

A token source that samples contiguous chunks of tokens from other token sources. This can be used to adjust the effective size of a source.

Tip

Unlike SamplingDocumentSource, this class doesn’t take document boundaries into account when sampling, but is much faster to set up.

Parameters:
  • sources (TokenSource) – The sources to sample tokens from.

  • max_tokens (int) – The maximum number of tokens to sample.

  • seed (Optional[int], default: 0) – An optional seed for sampling. If None, the first N_s tokens are taken from each source where N_s is proportional to the size of the source.

Warning

Generally you should prefer to use SamplingDocumentSource with random sampling (a seed provided) to preserve the distribution of child sources. This is a quick and dirty alternative.
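The proportional allocation used when seed is None can be approximated as follows. `proportional_budget` is a hypothetical helper, and rounding each share down is an assumption; the library may distribute remainders differently.

```python
from typing import List, Sequence


def proportional_budget(source_sizes: Sequence[int], max_tokens: int) -> List[int]:
    """Hypothetical helper: tokens to take from the start of each source."""
    total = sum(source_sizes)
    if total == 0:
        return [0 for _ in source_sizes]
    # Assumption: each source's share is rounded down to a whole token count.
    return [max_tokens * size // total for size in source_sizes]


budgets = proportional_budget([600, 300, 100], max_tokens=500)
```

For sources of 600, 300, and 100 tokens and a 500-token budget, the shares are 300, 150, and 50 tokens.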

Config

alias of SamplingTokenSourceConfig

property num_tokens: int

The number of tokens available from this source.

property fingerprint: str

A unique, deterministic string representing the ordered contents of the source.

get_token_range(start_idx, end_idx)[source]

Get a range of contiguous tokens starting from start_idx (0-based, inclusive) to end_idx (exclusive).

Since a TokenSource isn’t necessarily aware of document boundaries (see DocumentSource), the token range could start in the middle of a document and span multiple documents. It’s up to the consumers of a token source (e.g. an InstanceSource) to get ranges that make sense for their use case.

Return type:

TokenRange

children()[source]

Get the child sources that make up this source, if any.

class olmo_core.data.composable.SamplingTokenSourceConfig(sources, max_tokens=None, factor=None, seed=<factory>, label=None)[source]

Bases: TokenSourceConfig

A config for building a SamplingTokenSource.

build(work_dir)[source]

Build the token source.

Return type:

List[SamplingTokenSource]

class olmo_core.data.composable.MixingTokenSource(*source_specs, work_dir, seed=0, label=None, num_tokens=None)[source]

Bases: TokenSource

A token source for mixing other token sources together with arbitrary ratios. Sampling within each source is done using SamplingTokenSource, which samples a consecutive chunk of tokens.

See also

Important

Sampling is done in a way that minimizes the number of dropped and repeated tokens while matching the target ratios and respecting the MixingTokenSourceSpec.max_repetition_factor values.

If num_tokens is not specified, then the number of tokens this source produces will always be less than or equal to the sum of tokens across all of its immediate children defined in the source_specs.

If num_tokens is specified, this class will try to match that size but may raise an OLMoConfigurationError if it’s not possible with the given max_repetition_factor values.
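One plausible reading of these constraints is sketched below. This is not the library's algorithm: `ToySpec`, its `ratio` field, and `mix_budgets` are hypothetical names, and ValueError stands in for OLMoConfigurationError. The sketch only shows how ratios and repetition caps jointly bound the size of the mix.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class ToySpec:
    num_tokens: int  # size of the child source
    ratio: float  # target share of the mix
    max_repetition_factor: float = 1.0  # how far the source may be upsampled


def mix_budgets(specs: Sequence[ToySpec], num_tokens: Optional[int] = None) -> List[int]:
    """Hypothetical helper: per-source token budgets for a mix."""
    total_ratio = sum(s.ratio for s in specs)
    ratios = [s.ratio / total_ratio for s in specs]
    # Largest mix size for which no source exceeds its repetition cap.
    feasible = min(
        s.num_tokens * s.max_repetition_factor / r for s, r in zip(specs, ratios)
    )
    if num_tokens is None:
        # Without a target, never exceed the combined size of the children.
        target = min(feasible, sum(s.num_tokens for s in specs))
    elif num_tokens > feasible:
        raise ValueError("target size not reachable with the given repetition caps")
    else:
        target = num_tokens
    return [int(target * r) for r in ratios]


specs = [ToySpec(1000, 0.5), ToySpec(200, 0.5, max_repetition_factor=2.0)]
budgets = mix_budgets(specs)
```

In this example the small source can supply at most 400 tokens (200 tokens repeated twice), so an even mix is capped at 800 tokens: 400 from each source, with 600 tokens of the large source dropped.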

Parameters:
  • source_specs (MixingTokenSourceSpec) – The sources and how to sample from them.

  • num_tokens (Optional[int], default: None) – An optional target number of tokens for the mixed source.

Warning

Generally you should prefer to use MixingDocumentSource with random sampling (a seed provided) to preserve the distribution of child sources. This is a quick and dirty alternative.

Config

alias of MixingTokenSourceConfig

Spec

alias of MixingTokenSourceSpec

property fingerprint: str

A unique, deterministic string representing the ordered contents of the source.

property num_tokens: int

The number of tokens available from this source.

get_token_range(start_idx, end_idx)[source]

Get a range of contiguous tokens starting from start_idx (0-based, inclusive) to end_idx (exclusive).

Since a TokenSource isn’t necessarily aware of document boundaries (see DocumentSource), the token range could start in the middle of a document and span multiple documents. It’s up to the consumers of a token source (e.g. an InstanceSource) to get ranges that make sense for their use case.

Return type:

TokenRange

children()[source]

Get the child sources that make up this source, if any.

class olmo_core.data.composable.MixingTokenSourceConfig(source_specs, seed=<factory>, label=None, num_tokens=None)[source]

Bases: TokenSourceConfig

A config for MixingTokenSource.

source_specs: List[MixingTokenSourceSpecConfig]

Mixing source specs.

seed: Optional[int]

A random seed for sampling.

label: Optional[str] = None

An optional label for this source.

num_tokens: Optional[int] = None

An optional target number of tokens for the mixed source.

build(work_dir)[source]

Build the token source.

Return type:

List[MixingTokenSource]

class olmo_core.data.composable.InMemoryDocumentSource(tokens, *, tokenizer, work_dir, label_mask=None, label=None)[source]

Bases: InMemoryTokenSource, DocumentSource

An in-memory implementation of a DocumentSource. Primarily meant for testing.

property fingerprint: str

A unique, deterministic string representing the ordered contents of the source.

children()[source]

Get the child sources that make up this source, if any.

get_document_offsets()[source]

Get the start (inclusive) and end (exclusive) token indices of each document, in order.

Return type:

Iterable[tuple[int, int]]
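The offsets follow the same half-open convention as token ranges. The sketch below derives them from a flat token list under the assumption that each document ends with an end-of-sequence token that belongs to the document; `document_offsets` and the EOS ID of 0 are hypothetical and chosen only for illustration.

```python
from typing import Iterator, Sequence, Tuple


def document_offsets(
    tokens: Sequence[int], eos_token_id: int
) -> Iterator[Tuple[int, int]]:
    """Hypothetical helper: yield (start, end) with end exclusive, in order."""
    start = 0
    for idx, tok in enumerate(tokens):
        if tok == eos_token_id:
            # Assumption: the EOS token is the last token of its document.
            yield (start, idx + 1)
            start = idx + 1
    if start < len(tokens):
        # Trailing tokens without a closing EOS form a final document.
        yield (start, len(tokens))


offsets = list(document_offsets([5, 6, 0, 7, 0, 8, 9], eos_token_id=0))
```

Each end offset equals the next start offset, so the documents cover the token stream without gaps or overlap.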

class olmo_core.data.composable.ConcatenatedDocumentSource(*sources, work_dir, label=None)[source]

Bases: ConcatenatedTokenSource, DocumentSource

A document source that can be created from concatenating multiple other document sources.

Config

alias of ConcatenatedDocumentSourceConfig

get_document_offsets()[source]

Get the start (inclusive) and end (exclusive) token indices of each document, in order.

Return type:

Iterable[tuple[int, int]]

children()[source]

Get the child sources that make up this source, if any.

class olmo_core.data.composable.ConcatenatedDocumentSourceConfig(sources, label=None)[source]

Bases: DocumentSourceConfig

A base config class for configuring and building a ConcatenatedDocumentSource.

build(work_dir)[source]

Build the document source.

Return type:

List[ConcatenatedDocumentSource]

class olmo_core.data.composable.SamplingDocumentSource(*sources, max_tokens, seed=0, work_dir, label=None)[source]

Bases: DocumentSource

A document source that samples documents from other document sources. This can be used to adjust the effective size of a source.

Parameters:
  • sources (DocumentSource) – The sources to sample documents from.

  • max_tokens (int) – The maximum number of tokens to sample. The resulting source will have at most this many tokens, but potentially fewer because only whole documents are sampled.

  • seed (Optional[int], default: 0) – An optional seed for sampling documents. If None, no shuffling is done and the first documents are taken up to max_tokens.

Warning

It’s recommended to set a seed to ensure that the distribution of documents in child sources is preserved.
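A minimal sketch of whole-document sampling under a token budget is shown below. It is an approximation for illustration, not the library's implementation: `sample_documents` is a hypothetical helper, and stopping at the first document that does not fit is an assumption.

```python
import random
from typing import Iterable, List, Optional, Tuple


def sample_documents(
    doc_offsets: Iterable[Tuple[int, int]],
    max_tokens: int,
    seed: Optional[int] = 0,
) -> List[Tuple[int, int]]:
    """Hypothetical helper: pick whole documents without exceeding max_tokens."""
    docs = list(doc_offsets)
    if seed is not None:
        # With a seed, documents are visited in a deterministic random order.
        random.Random(seed).shuffle(docs)
    picked: List[Tuple[int, int]] = []
    total = 0
    for start, end in docs:
        length = end - start
        if total + length > max_tokens:
            break  # assumption: stop at the first document that does not fit
        picked.append((start, end))
        total += length
    return picked


picked = sample_documents([(0, 4), (4, 10), (10, 13)], max_tokens=11, seed=None)
```

With seed=None the first two documents are taken (10 tokens in total), and the third is skipped because it would push the total past the 11-token budget.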

Config

alias of SamplingDocumentSourceConfig

property num_tokens: int

The number of tokens available from this source.

property fingerprint: str

A unique, deterministic string representing the ordered contents of the source.

get_token_range(start_idx, end_idx)[source]

Get a range of contiguous tokens starting from start_idx (0-based, inclusive) to end_idx (exclusive).

Since a TokenSource isn’t necessarily aware of document boundaries (see DocumentSource), the token range could start in the middle of a document and span multiple documents. It’s up to the consumers of a token source (e.g. an InstanceSource) to get ranges that make sense for their use case.

Return type:

TokenRange

get_document_offsets()[source]

Get the start (inclusive) and end (exclusive) token indices of each document, in order.

Return type:

Iterable[tuple[int, int]]

children()[source]

Get the child sources that make up this source, if any.

class olmo_core.data.composable.SamplingDocumentSourceConfig(sources, max_tokens=None, factor=None, seed=<factory>, label=None)[source]

Bases: DocumentSourceConfig

A config for building a SamplingDocumentSource.

build(work_dir)[source]

Build the document source.

Return type:

List[SamplingDocumentSource]

class olmo_core.data.composable.MixingDocumentSource(*source_specs, work_dir, seed=0, label=None, num_tokens=None)[source]

Bases: DocumentSource

A document source for mixing other document sources together with arbitrary ratios. Sampling within each source is done using SamplingDocumentSource, which samples whole documents.

See also

Important

Sampling is done in a way that minimizes the number of dropped and repeated tokens while matching the target ratios and respecting the MixingDocumentSourceSpec.max_repetition_factor values.

If num_tokens is not specified, then the number of tokens this source produces will always be less than or equal to the sum of tokens across all of its immediate children defined in the source_specs.

If num_tokens is specified, this class will try to match that size but may raise an OLMoConfigurationError if it’s not possible with the given max_repetition_factor values.

Parameters:
  • source_specs (MixingDocumentSourceSpec) – The sources and how to sample from them.

  • num_tokens (Optional[int], default: None) – An optional target number of tokens for the mixed source.

Config

alias of MixingDocumentSourceConfig

Spec

alias of MixingDocumentSourceSpec

property fingerprint: str

A unique, deterministic string representing the ordered contents of the source.

property num_tokens: int

The number of tokens available from this source.

get_token_range(start_idx, end_idx)[source]

Get a range of contiguous tokens starting from start_idx (0-based, inclusive) to end_idx (exclusive).

Since a TokenSource isn’t necessarily aware of document boundaries (see DocumentSource), the token range could start in the middle of a document and span multiple documents. It’s up to the consumers of a token source (e.g. an InstanceSource) to get ranges that make sense for their use case.

Return type:

TokenRange

get_document_offsets()[source]

Get the start (inclusive) and end (exclusive) token indices of each document, in order.

Return type:

Iterable[tuple[int, int]]

children()[source]

Get the child sources that make up this source, if any.

class olmo_core.data.composable.MixingDocumentSourceConfig(source_specs, seed=<factory>, label=None, num_tokens=None)[source]

Bases: DocumentSourceConfig

A config for MixingDocumentSource.

source_specs: List[MixingDocumentSourceSpecConfig]

Mixing source specs.

seed: Optional[int]

A random seed for sampling.

label: Optional[str] = None

An optional label for this source.

num_tokens: Optional[int] = None

An optional target number of tokens for the mixed source.

build(work_dir)[source]

Build the document source.

Return type:

List[MixingDocumentSource]

class olmo_core.data.composable.NumpyDocumentSource(*, source_paths, dtype, work_dir, tokenizer, label_mask_paths=None, label=None, max_document_length=None, long_doc_strategy='truncate', _source_sizes=None, _label_mask_sizes=None)[source]

Bases: DocumentSource

A DocumentSource that reads tokens from one or more tokenized numpy source files.

Important

There’s some overhead when instantiating this class because it needs to query the sizes of all the source files. If you want to create multiple sources from the same set of files, consider first creating a single source and then splitting it up using split_by_source(), which will be much more efficient than creating multiple sources directly since the sizes of the source files will only need to be queried once and will be done so concurrently with a thread pool.

Parameters:
  • source_paths (Sequence[Union[Path, PathLike, str]]) – The paths/URLs to the numpy token ID arrays.

  • dtype (Union[Type[uint8], Type[uint16], Type[uint32], Type[uint64]]) – The numpy datatype of the token ID arrays in the source paths.

  • tokenizer (TokenizerConfig) – The config of the tokenizer that was used to tokenize the source files.

  • label_mask_paths (Optional[Sequence[Union[Path, PathLike, str]]], default: None) – The paths/URLs to numpy bool files indicating which tokens should be masked.

  • max_document_length (Optional[int], default: None) – The maximum document length to use when iterating over documents. If not None, documents longer than this will either be fragmented or truncated depending on the long_doc_strategy.

  • long_doc_strategy (LongDocStrategy, default: 'truncate') – How to handle long documents when max_document_length is set.
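The difference between truncating and fragmenting a long document can be sketched as follows. `apply_long_doc_strategy` is a hypothetical helper, and the string "fragment" is an assumed name for the fragmenting strategy (only 'truncate' appears in the signature above).

```python
from typing import List, Sequence


def apply_long_doc_strategy(
    doc: Sequence[int], max_document_length: int, strategy: str = "truncate"
) -> List[List[int]]:
    """Hypothetical helper: return the pieces a document is turned into."""
    if len(doc) <= max_document_length:
        return [list(doc)]
    if strategy == "truncate":
        # Keep only the first max_document_length tokens; drop the rest.
        return [list(doc[:max_document_length])]
    if strategy == "fragment":  # assumed strategy name
        # Split into consecutive pieces of at most max_document_length tokens.
        return [
            list(doc[i : i + max_document_length])
            for i in range(0, len(doc), max_document_length)
        ]
    raise ValueError(f"unknown strategy: {strategy!r}")


doc = list(range(10))
truncated = apply_long_doc_strategy(doc, 4, "truncate")
fragmented = apply_long_doc_strategy(doc, 4, "fragment")
```

Truncation keeps 4 of the 10 tokens, while fragmentation keeps all 10 tokens spread across three shorter documents.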

Config

alias of NumpyDocumentSourceConfig

MixConfig

alias of NumpyDocumentSourceMixConfig

property fingerprint: str

A unique, deterministic string representing the ordered contents of the source.

property num_tokens: int

The number of tokens available from this source.

split_by_source(group_size=1)[source]

Split the source up into multiple smaller sources from groups of source files.

Return type:

List[NumpyDocumentSource]

get_token_range(start_idx, end_idx)[source]

Get a range of contiguous tokens starting from start_idx (0-based, inclusive) to end_idx (exclusive).

Since a TokenSource isn’t necessarily aware of document boundaries (see DocumentSource), the token range could start in the middle of a document and span multiple documents. It’s up to the consumers of a token source (e.g. an InstanceSource) to get ranges that make sense for their use case.

Return type:

TokenRange

get_document_offsets()[source]

Get the start (inclusive) and end (exclusive) token indices of each document, in order.

Return type:

Iterable[tuple[int, int]]

children()[source]

Get the child sources that make up this source, if any.

class olmo_core.data.composable.NumpyDocumentSourceConfigBase(*, tokenizer, dtype=None, source_permutation_seed=None, source_group_size=1, label=None, max_document_length=None, long_doc_strategy='truncate')[source]

Bases: DocumentSourceConfig

Base config class for NumpyDocumentSourceConfig and NumpyDocumentSourceMixConfig.

tokenizer: TokenizerConfig

The config of the tokenizer that was used to tokenize the source files.

dtype: Optional[NumpyDatasetDType] = None

The numpy datatype of the token ID arrays in the source paths.

source_permutation_seed: Optional[int] = None

Used to shuffle the source files before grouping/building the document sources.

source_group_size: int = 1

The number of npy source files to group together into a single source.

label: Optional[str] = None

An optional label to assign for logging and debugging.

max_document_length: Optional[int] = None

The maximum document length to use when iterating over documents. If not None, documents longer than this will either be fragmented or truncated depending on the long_doc_strategy.

long_doc_strategy: LongDocStrategy = 'truncate'

How to handle long documents when max_document_length is set.

class olmo_core.data.composable.NumpyDocumentSourceConfig(*, tokenizer, dtype=None, source_permutation_seed=None, source_group_size=1, label=None, max_document_length=None, long_doc_strategy='truncate', source_paths, label_mask_paths=None, expand_glob=None)[source]

Bases: NumpyDocumentSourceConfigBase

Config class for building one or more NumpyDocumentSource directly from source paths.

source_paths: List[str]

The paths/URLs to the numpy token ID arrays.

label_mask_paths: Optional[List[str]] = None

The paths/URLs to numpy bool files indicating which tokens should be masked.

expand_glob: Optional[bool] = None

If true, treat source/label paths as glob patterns and expand them when building the sources.

classmethod from_source_groups(source_path_groups, *, tokenizer, label_mask_path_groups=None, expand_glob=None, **kwargs)[source]

A more efficient way to create multiple configs from groups of source paths. This will use a thread pool to expand all globs concurrently, which can be substantially faster especially when some of the globs point to cloud storage URLs.

Parameters:
  • source_path_groups (Dict[str, List[Union[Path, PathLike, str]]]) – Groups of source paths to use. Each group will be put into its own config with the corresponding label.

  • tokenizer (TokenizerConfig) – The tokenizer config to use.

  • label_mask_path_groups (Optional[Dict[str, List[Union[Path, PathLike, str]]]], default: None) – Optional groups of label mask paths to use. Each group should correspond to the group in source_path_groups at the same key.

Return type:

Dict[str, NumpyDocumentSourceConfig]

build(work_dir)[source]

Build the sources.

Return type:

List[NumpyDocumentSource]

Note

The number of sources returned depends on the length of source_paths and the value of source_group_size.
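The relationship between the number of paths, source_group_size, and the number of resulting sources can be sketched as below. `group_paths` is a hypothetical helper and the file names are made up; keeping a smaller trailing group is an assumption.

```python
from typing import List, Sequence


def group_paths(paths: Sequence[str], group_size: int = 1) -> List[List[str]]:
    """Hypothetical helper: consecutive groups of at most group_size paths."""
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    # Assumption: a final, smaller group is kept rather than dropped.
    return [list(paths[i : i + group_size]) for i in range(0, len(paths), group_size)]


paths = [f"part-{i}.npy" for i in range(5)]  # made-up file names
groups = group_paths(paths, group_size=2)
```

Five paths with a group size of 2 give three groups, and therefore three sources under this assumption.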

class olmo_core.data.composable.NumpyDocumentSourceMixConfig(*, tokenizer, dtype=None, source_permutation_seed=None, source_group_size=1, label=None, max_document_length=None, long_doc_strategy='truncate', mix, mix_base_dir)[source]

Bases: NumpyDocumentSourceConfigBase

Config class for building one or more NumpyDocumentSource from a predefined source mix.

mix: Union[str, DataMixBase]

The name of a data mix (e.g. "dolma17").

mix_base_dir: str

The base directory of the data mix.

build(work_dir)[source]

Build the sources.

Return type:

List[NumpyDocumentSource]

Note

The number of sources returned depends on the number of paths in the mix and the value of source_group_size.

class olmo_core.data.composable.ConcatAndChunkInstanceSource(*sources, sequence_length, work_dir, max_sequence_length=None, label=None)[source]

Bases: InstanceSource

The basic instance source that simply chunks up token sources without regard for document boundaries, just like the NumpyFSLDataset.
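The chunking idea can be sketched in a few lines. `concat_and_chunk` is a hypothetical helper; concatenating all sources before chunking and dropping a trailing partial chunk are both assumptions, and the real class may handle source boundaries and remainders differently.

```python
from typing import Iterable, List


def concat_and_chunk(
    sources: Iterable[Iterable[int]], sequence_length: int
) -> List[List[int]]:
    """Hypothetical helper: fixed-length instances from concatenated tokens."""
    tokens = [tok for source in sources for tok in source]
    # Assumption: tokens that do not fill a whole instance are dropped.
    num_instances = len(tokens) // sequence_length
    return [
        tokens[i * sequence_length : (i + 1) * sequence_length]
        for i in range(num_instances)
    ]


instances = concat_and_chunk([range(0, 15), range(15, 25)], sequence_length=10)
```

Here 25 tokens yield two instances of 10 tokens each, and the second instance spans the boundary between the two sources.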

Config

alias of ConcatAndChunkInstanceSourceConfig

property fingerprint: str

A unique, deterministic string representing the ordered contents of the source.

__len__()[source]

The number of instances available from this source.

Return type:

int

__getitem__(idx)[source]

Get an instance by index.

Return type:

Instance

children()[source]

Get the child sources that make up this source, if any.

class olmo_core.data.composable.ConcatAndChunkInstanceSourceConfig(sources, sequence_length, max_sequence_length=None, label=None)[source]

Bases: InstanceSourceConfig

Config for ConcatAndChunkInstanceSource.

classmethod from_npy(*npy_paths, tokenizer, sequence_length, max_sequence_length=None, dtype=None, source_permutation_seed=None, source_group_size=1, label_mask_paths=None, expand_glob=None, label=None)[source]

Create a ConcatAndChunkInstanceSourceConfig from one or more tokenized .npy source files.

Return type:

ConcatAndChunkInstanceSourceConfig

build(work_dir)[source]

Build the InstanceSource.

Return type:

ConcatAndChunkInstanceSource

class olmo_core.data.composable.PackingInstanceSource(*sources, sequence_length, work_dir, tokenizer, max_sequence_length=None, long_doc_strategy='truncate', source_group_size=1, label=None)[source]

Bases: InstanceSource

Like the NumpyPackedFSLDataset, this instance source packs documents from each DocumentSource into instances using the Optimized Best-Fit Decreasing (OBFD) algorithm described in Fewer Truncations Improve Language Modeling. The resulting instances will all have exactly sequence_length tokens, using padding if needed.

Note

By default OBFD is applied to each source separately since source files from the Dolma toolkit are usually large enough for OBFD to achieve very good compactness (minimal padding tokens) and so that we can parallelize the packing. However, you can pack instances from multiple consecutive sources together by setting source_group_size to a value greater than 1.

Parameters:
  • sources (DocumentSource) – Sources of documents to pack.

  • sequence_length (int) – The sequence length of each instance, i.e. the maximum number of tokens that can be packed into each instance.

  • tokenizer (TokenizerConfig) – The tokenizer configuration.

  • max_sequence_length (Optional[int], default: None) – This must be equal to sequence_length if given.

  • long_doc_strategy (LongDocStrategy, default: 'truncate') – The strategy to use for documents longer than sequence_length.

  • source_group_size (int, default: 1) – The number of consecutive sources to pack together.
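For intuition, here is a naive best-fit decreasing packer. It is a simplified quadratic-time sketch, not the optimized OBFD implementation the class refers to, and `best_fit_decreasing` is a hypothetical helper. It assumes documents longer than sequence_length have already been truncated or fragmented.

```python
from typing import List, Optional, Sequence


def best_fit_decreasing(
    doc_lengths: Sequence[int], sequence_length: int
) -> List[List[int]]:
    """Hypothetical helper: group document indices into instances."""
    bins: List[List[int]] = []  # document indices per instance
    remaining: List[int] = []  # free space per instance
    # Visit documents from longest to shortest.
    order = sorted(range(len(doc_lengths)), key=lambda i: doc_lengths[i], reverse=True)
    for doc_idx in order:
        length = doc_lengths[doc_idx]
        if length > sequence_length:
            raise ValueError("truncate or fragment long documents first")
        # Best fit: the instance with the least free space that still fits.
        best: Optional[int] = None
        for b, space in enumerate(remaining):
            if space >= length and (best is None or space < remaining[best]):
                best = b
        if best is None:
            bins.append([doc_idx])
            remaining.append(sequence_length - length)
        else:
            bins[best].append(doc_idx)
            remaining[best] -= length
    return bins


packed = best_fit_decreasing([6, 3, 5, 4, 2], sequence_length=10)
```

In this example the five documents fit into two instances with no free space left over; any free space that did remain would be filled with padding.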

Config

alias of PackingInstanceSourceConfig

property fingerprint: str

A unique, deterministic string representing the ordered contents of the source.

children()[source]

Get the child sources that make up this source, if any.

__len__()[source]

The number of instances available from this source.

Return type:

int

__getitem__(idx)[source]

Get an instance by index.

Return type:

Instance

class olmo_core.data.composable.PackingInstanceSourceConfig(sources, sequence_length, tokenizer, max_sequence_length=None, long_doc_strategy='truncate', source_group_size=1, label=None)[source]

Bases: InstanceSourceConfig

Config for PackingInstanceSource.

classmethod from_npy(*npy_paths, tokenizer, sequence_length, max_sequence_length=None, dtype=None, source_permutation_seed=None, source_group_size=1, label_mask_paths=None, expand_glob=None, label=None, long_doc_strategy='truncate')[source]

Create a PackingInstanceSourceConfig from one or more tokenized .npy source files.

Return type:

PackingInstanceSourceConfig

build(work_dir)[source]

Build the InstanceSource.

Return type:

PackingInstanceSource

class olmo_core.data.composable.ConcatenatedInstanceSource(*sources, work_dir, label=None)[source]

Bases: InstanceSource

An instance source that concatenates multiple instance sources together end-to-end.
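End-to-end concatenation amounts to mapping a global instance index to a child source and a local index. The sketch below shows one common way to do that with cumulative lengths and binary search; `locate` is a hypothetical helper and not necessarily how the class is implemented.

```python
import bisect
from itertools import accumulate
from typing import Sequence, Tuple


def locate(idx: int, source_lengths: Sequence[int]) -> Tuple[int, int]:
    """Hypothetical helper: map a global index to (source index, local index)."""
    ends = list(accumulate(source_lengths))  # exclusive end index of each source
    total = ends[-1] if ends else 0
    if not 0 <= idx < total:
        raise IndexError("instance index out of range")
    # First source whose exclusive end lies beyond idx.
    source_idx = bisect.bisect_right(ends, idx)
    start = ends[source_idx - 1] if source_idx > 0 else 0
    return source_idx, idx - start


position = locate(4, [3, 2, 4])
```

With child sources of 3, 2, and 4 instances, global index 4 is the second instance (local index 1) of the second source (source index 1).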

Config

alias of ConcatenatedInstanceSourceConfig

property fingerprint: str

A unique, deterministic string representing the ordered contents of the source.

__len__()[source]

The number of instances available from this source.

Return type:

int

__getitem__(idx)[source]

Get an instance by index.

Return type:

Instance

children()[source]

Get the child sources that make up this source, if any.

class olmo_core.data.composable.ConcatenatedInstanceSourceConfig(sources)[source]

Bases: InstanceSourceConfig

A config for a ConcatenatedInstanceSource.

build(work_dir)[source]

Build the InstanceSource.

Return type:

ConcatenatedInstanceSource

class olmo_core.data.composable.SlicedInstanceSource(source, source_slice, *, seed=None, work_dir)[source]

Bases: InstanceSource

An instance source that provides a slice of another instance source.

property fingerprint: str

A unique, deterministic string representing the ordered contents of the source.

__len__()[source]

The number of instances available from this source.

Return type:

int

__getitem__(idx)[source]

Get an instance by index.

Return type:

Instance

children()[source]

Get the child sources that make up this source, if any.

class olmo_core.data.composable.SplitInstanceSourceConfig(source, ratio, idx, seed=None)[source]

Bases: InstanceSourceConfig

A base config class for configuring and building a split InstanceSource.

build(work_dir)[source]

Build the InstanceSource.

Return type:

InstanceSource

class olmo_core.data.composable.SamplingInstanceSource(*sources, max_tokens=None, max_instances=None, work_dir, seed=0, label=None)[source]

Bases: InstanceSource

An instance source that samples instances from other instance sources. This can be used to adjust the effective size of a source.

Parameters:
  • sources (InstanceSource) – The sources to sample instances from.

  • max_tokens (Optional[int], default: None) – The maximum number of tokens to sample. Mutually exclusive with max_instances.

  • max_instances (Optional[int], default: None) – The maximum number of instances to sample. Mutually exclusive with max_tokens.

  • seed (Optional[int], default: 0) – An optional seed for sampling. If None, the first N_s instances are taken from each source, where N_s is proportional to the size of the source.

Warning

It’s recommended to set a seed so that the distribution of instances within each child source is preserved.
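The difference between seeded and unseeded sampling can be sketched with plain index arithmetic. The example below is a hedged illustration, not the library's algorithm: `plan_sample` is a hypothetical helper, it only handles downsampling by instance count, and the rounding of N_s and the per-source seeding are assumptions.

```python
import random
from typing import List, Optional, Sequence, Tuple


def plan_sample(
    source_sizes: Sequence[int], max_instances: int, seed: Optional[int] = 0
) -> List[Tuple[int, int]]:
    """Hypothetical helper: pick (source index, instance index) pairs to keep."""
    total = sum(source_sizes)
    if total == 0 or max_instances > total:
        raise ValueError("this sketch only supports downsampling non-empty sources")
    picks: List[Tuple[int, int]] = []
    for source_idx, size in enumerate(source_sizes):
        # N_s: this source's quota, proportional to its size (rounded down).
        n_s = max_instances * size // total
        if seed is None:
            # No seed: take the first N_s instances of the source.
            chosen = list(range(n_s))
        else:
            # With a seed: draw N_s instances uniformly from the whole source.
            rng = random.Random(seed + source_idx)
            chosen = sorted(rng.sample(range(size), n_s))
        picks.extend((source_idx, i) for i in chosen)
    return picks


head_only = plan_sample([60, 40], max_instances=10, seed=None)
spread_out = plan_sample([60, 40], max_instances=10, seed=0)
print(head_only)
print(spread_out)
```

Without a seed, only the head of each source is ever seen, which can skew the sample if a source is ordered in some meaningful way. A seeded draw spreads the picks across the whole source.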

Config

alias of SamplingInstanceSourceConfig

property fingerprint: str

A unique, deterministic string representing the ordered contents of the source.

children()[source]

Get the child sources that make up this source, if any.

__len__()[source]

The number of instances available from this source.

Return type:

int

__getitem__(idx)[source]

Get an instance by index.

Return type:

Instance

class olmo_core.data.composable.SamplingInstanceSourceConfig(sources, max_tokens=None, max_instances=None, factor=None, seed=<factory>, label=None)[source]

Bases: InstanceSourceConfig

Config for SamplingInstanceSource.

build(work_dir)[source]

Build the InstanceSource.

Return type:

SamplingInstanceSource

class olmo_core.data.composable.MixingInstanceSource(*source_specs, work_dir, seed=0, label=None, num_tokens=None, num_instances=None)[source]

Bases: InstanceSource

An instance source for mixing other instance sources together with arbitrary ratios. Sampling within each source is done using SamplingInstanceSource, which samples whole instances.

See also

Important

Sampling is done in a way that minimizes the number of dropped instances while matching the target ratios and respecting the MixingInstanceSourceSpec.max_repetition_factor values.

If neither num_tokens nor num_instances is specified, then the number of instances this source produces will always be less than or equal to the sum of instances across all of its immediate children defined in the source_specs.

If num_tokens or num_instances is specified, this class will try to match that size but may raise an OLMoConfigurationError if it’s not possible with the given max_repetition_factor values.

Parameters:
  • source_specs (MixingInstanceSourceSpec) – The sources and how to sample from them.

  • num_tokens (Optional[int], default: None) – An optional target number of tokens for the mixed source. Mutually exclusive with num_instances.

  • num_instances (Optional[int], default: None) – An optional target number of instances for the mixed source. Mutually exclusive with num_tokens.
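The interaction between ratios, repetition limits, and an optional target size can be worked through with a little arithmetic. The sketch below is one possible way to satisfy the constraints described above, not the library's algorithm: `plan_mix` is a hypothetical helper, and `ValueError` stands in for the configuration error the real class may raise.

```python
from typing import List, Optional, Sequence


def plan_mix(
    sizes: Sequence[int],
    ratios: Sequence[float],
    max_repetition_factors: Sequence[float],
    num_instances: Optional[int] = None,
) -> List[int]:
    """Hypothetical helper: number of instances to draw from each source."""
    if any(r <= 0 for r in ratios):
        raise ValueError("ratios must be positive")
    total_ratio = sum(ratios)
    shares = [r / total_ratio for r in ratios]  # normalize ratios to sum to 1
    # Largest mix that every source can support at its share, with repetition.
    capacity = min(
        size * rep / share
        for size, rep, share in zip(sizes, max_repetition_factors, shares)
    )
    if num_instances is None:
        # Without a target, never exceed the combined size of the children.
        target = min(sum(sizes), int(capacity))
    elif num_instances > capacity:
        raise ValueError("target size not reachable with these repetition factors")
    else:
        target = num_instances
    return [int(target * share) for share in shares]


print(plan_mix([100, 50], [1, 1], [1.0, 1.0]))
print(plan_mix([100, 50], [1, 1], [1.0, 2.0]))
```

In the first call the smaller source limits the mix to 50 instances per source, so half of the larger source is dropped. Allowing the smaller source to repeat up to twice raises each count to 75, which keeps the total at the combined size of the children.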

Config

alias of MixingInstanceSourceConfig

Spec

alias of MixingInstanceSourceSpec

children()[source]

Get the child sources that make up this source, if any.

property fingerprint: str

A unique, deterministic string representing the ordered contents of the source.

__len__()[source]

The number of instances available from this source.

Return type:

int

__getitem__(idx)[source]

Get an instance by index.

Return type:

Instance

class olmo_core.data.composable.MixingInstanceSourceConfig(source_specs, seed=<factory>, label=None, num_tokens=None, num_instances=None)[source]

Bases: InstanceSourceConfig

A config for MixingInstanceSource.

source_specs: List[MixingInstanceSourceSpecConfig]

Mixing source specs.

seed: Optional[int]

A random seed for sampling.

label: Optional[str] = None

An optional label for this source.

num_tokens: Optional[int] = None

An optional target number of tokens for the mixed source.

num_instances: Optional[int] = None

An optional target number of instances for the mixed source.

build(work_dir)[source]

Build the InstanceSource.

Return type:

MixingInstanceSource

class olmo_core.data.composable.RandomInstanceSource(*, tokenizer, sequence_length, avg_document_length, seed=0, num_instances=None, num_tokens=None, max_sequence_length=None, label=None, work_dir)[source]

Bases: InstanceSource

An instance source that generates random instances. Useful for benchmarking.

Config

alias of RandomInstanceSourceConfig

property num_tokens: int

The number of tokens available from this source.

property fingerprint: str

A unique, deterministic string representing the ordered contents of the source.

__len__()[source]

The number of instances available from this source.

Return type:

int

__getitem__(idx)[source]

Get an instance by index.

Return type:

Instance

children()[source]

Get the child sources that make up this source, if any.

class olmo_core.data.composable.RandomInstanceSourceConfig(tokenizer, sequence_length, avg_document_length, seed=<factory>, num_instances=None, num_tokens=None, max_sequence_length=None, label=None)[source]

Bases: InstanceSourceConfig

Config for RandomInstanceSource.

build(work_dir)[source]

Build the InstanceSource.

Return type:

RandomInstanceSource

class olmo_core.data.composable.InstanceFilterConfig(repetition_max_period=13, repetition_min_period=1, repetition_max_count=32)[source]

Bases: Config

Config for instance filtering.

class olmo_core.data.composable.LongDocStrategy(value)[source]

Bases: StrEnum

Specifies how to handle documents that are longer than the max sequence length when packing.

truncate = 'truncate'

Long docs are truncated and the excess tokens are discarded.

fragment = 'fragment'

Long docs are split into smaller docs so that no tokens are discarded, but you end up with fragmented docs.
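The two strategies can be compared on a single over-long document. This is a minimal sketch of the behavior described above, using a hypothetical `handle_long_doc` helper that operates on plain lists of token IDs; it is not the library's packing code.

```python
from typing import List


def handle_long_doc(
    doc: List[int], max_sequence_length: int, strategy: str
) -> List[List[int]]:
    """Hypothetical helper: apply a long-document strategy to one document."""
    if strategy not in ("truncate", "fragment"):
        raise ValueError(f"unknown strategy: {strategy!r}")
    if len(doc) <= max_sequence_length:
        return [doc]  # short documents pass through unchanged
    if strategy == "truncate":
        # Keep the first max_sequence_length tokens and discard the rest.
        return [doc[:max_sequence_length]]
    # "fragment": split into consecutive pieces so no tokens are lost.
    return [
        doc[start : start + max_sequence_length]
        for start in range(0, len(doc), max_sequence_length)
    ]


long_doc = list(range(10))
print(handle_long_doc(long_doc, 4, "truncate"))
print(handle_long_doc(long_doc, 4, "fragment"))
```

With a maximum length of 4, truncation keeps 4 of the 10 tokens, while fragmentation keeps all 10 but produces three pieces, the last of which holds only 2 tokens.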

class olmo_core.data.composable.ShuffleStrategy(value)[source]

Bases: StrEnum

Defines how the data is shuffled.

inter_source = 'inter_source'

Shuffle across all sources as if they were one big source.

intra_source = 'intra_source'

Shuffle within each source, then concatenate the sources in order. This can be used to create a data curriculum.

interleaved_source = 'interleaved_source'

Shuffle within each source and then interleave instances from each source.
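The three strategies differ only in where the shuffle happens and how the sources are recombined. The sketch below illustrates this on (source index, instance index) pairs; `shuffled_order` is a hypothetical helper, and the round-robin interleaving is an assumption, since the exact interleaving scheme is not specified here.

```python
import random
from itertools import chain, zip_longest
from typing import List, Sequence, Tuple

Item = Tuple[int, int]  # (source index, instance index)


def shuffled_order(source_sizes: Sequence[int], strategy: str, seed: int = 0) -> List[Item]:
    """Hypothetical helper: order instances under a given shuffle strategy."""
    if strategy not in ("inter_source", "intra_source", "interleaved_source"):
        raise ValueError(f"unknown strategy: {strategy!r}")
    rng = random.Random(seed)
    per_source = [[(s, i) for i in range(n)] for s, n in enumerate(source_sizes)]
    if strategy == "inter_source":
        # Treat all sources as one big pool and shuffle it.
        order = list(chain.from_iterable(per_source))
        rng.shuffle(order)
        return order
    for items in per_source:
        rng.shuffle(items)  # shuffle inside each source only
    if strategy == "intra_source":
        # Keep the sources in their original order (a simple curriculum).
        return list(chain.from_iterable(per_source))
    # "interleaved_source": assumed round-robin over the shuffled sources.
    return [
        item
        for group in zip_longest(*per_source)
        for item in group
        if item is not None
    ]


for name in ("inter_source", "intra_source", "interleaved_source"):
    print(name, shuffled_order([3, 2], name))
```

Under intra_source, every instance of the first source comes before any instance of the second, which is what makes it usable for a curriculum. The other two strategies mix the sources throughout the epoch.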

class olmo_core.data.composable.MixingInstanceSourceSpec(source, ratio, max_repetition_factor=1.0, label=None)[source]

Bases: object

Defines a source and its associated mixing ratio for MixingInstanceSource.

Config

alias of MixingInstanceSourceSpecConfig

source: InstanceSource

The source.

ratio: float

The relative target ratio for this source. If the ratios across all source specs don’t sum to 1.0 then they’ll be normalized.

max_repetition_factor: float = 1.0

The maximum amount of repetition allowed, expressed as a factor greater than or equal to 1.0. A factor of 1.0 means no repetition is allowed. A factor of 2.0 means each instance could be repeated at most once (i.e., seen twice).

label: Optional[str] = None

An optional label for this source.

class olmo_core.data.composable.MixingInstanceSourceSpecConfig(source, ratio, max_repetition_factor=1.0, label=None)[source]

Bases: Config

Config for MixingInstanceSourceSpec.

class olmo_core.data.composable.MixingTokenSourceSpec(source, ratio, max_repetition_factor=1.0, label=None)[source]

Bases: object

Defines a source and its associated mixing ratio for MixingTokenSource.

Config

alias of MixingTokenSourceSpecConfig

source: TokenSource

The source.

ratio: float

The relative target ratio for this source. If the ratios across all source specs don’t sum to 1.0 then they’ll be normalized.

max_repetition_factor: float = 1.0

The maximum amount of repetition allowed, expressed as a factor greater than or equal to 1.0. A factor of 1.0 means no repetition is allowed. A factor of 2.0 means each token could be repeated at most once (i.e., seen twice).

label: Optional[str] = None

An optional label for this source.

class olmo_core.data.composable.MixingTokenSourceSpecConfig(source, ratio, max_repetition_factor=1.0, label=None)[source]

Bases: Config

Config for MixingTokenSourceSpec.

class olmo_core.data.composable.MixingDocumentSourceSpec(source, ratio, max_repetition_factor=1.0, label=None)[source]

Bases: object

Defines a source and its associated mixing ratio for MixingDocumentSource.

Config

alias of MixingDocumentSourceSpecConfig

source: DocumentSource

The source.

ratio: float

The relative target ratio for this source. If the ratios across all source specs don’t sum to 1.0 then they’ll be normalized.

max_repetition_factor: float = 1.0

The maximum amount of repetition allowed, expressed as a factor greater than or equal to 1.0. A factor of 1.0 means no repetition is allowed. A factor of 2.0 means each document could be repeated at most once (i.e., seen twice).

label: Optional[str] = None

An optional label for this source.

class olmo_core.data.composable.MixingDocumentSourceSpecConfig(source, ratio, max_repetition_factor=1.0, label=None)[source]

Bases: Config

Config for MixingDocumentSourceSpec.

olmo_core.data.composable.set_composable_seed(seed)[source]

Set the global seed for the composable module.

olmo_core.data.composable.reset_composable_seed()[source]

Reset the global seed for the composable module.