data.composable
Overview
A composable data loading API for fixed sequence length text data.
┌─────────────┐       ┌────────────────┐       ┌──────────────────────┐
│ TokenSource │ ⇢ ⋯ ⇢ │ InstanceSource │ ⇢ ⋯ ⇢ │ ComposableDataLoader │
└─────────────┘       └────────────────┘       └──────────────────────┘
This API consists of a series of simple, composable elements, including:
- TokenSource/DocumentSource: Token sources provide access to tokenized text data, while document sources are special token sources that also provide information on where the document boundaries are. Examples include:
  - InMemoryTokenSource/InMemoryDocumentSource: A simple token/document source that holds all tokens in memory.
  - ConcatenatedTokenSource/ConcatenatedDocumentSource: A token/document source that combines multiple sources into one.
  - SlicedTokenSource: A token source that provides a slice into another token source.
  - NumpyDocumentSource: A document source that reads tokens from one or more numpy source files, like those created by the dolma toolkit.
  - SamplingTokenSource/SamplingDocumentSource: A token/document source that samples tokens/documents from one or more other token/document sources.
  - MixingTokenSource/MixingDocumentSource: A token/document source that mixes other token/document sources together.
- InstanceSource: Instance sources convert token sources (or in some cases other instance sources) into fixed-length instances. Examples include:
  - ConcatAndChunkInstanceSource: The simplest instance source, which chunks up token sources without regard for document boundaries, just like the NumpyFSLDataset.
  - PackingInstanceSource: An instance source that packs documents from one or more document sources into instances using an optimized packing algorithm, just like the NumpyPackedFSLDataset.
  - ConcatenatedInstanceSource: An instance source that combines instances from other instance sources.
  - SlicedInstanceSource: An instance source that provides a slice into another instance source.
  - SamplingInstanceSource: An instance source that samples instances from other instance sources.
  - MixingInstanceSource: An instance source that mixes other instance sources together.
  - RandomInstanceSource: An instance source for generating random instances.
- ComposableDataLoader: A data loader for OLMo-core’s Trainer that takes one or more instance sources.
Tip
Use InstanceSource.visualize() to print out a recursive visualization of an instance
source and all its sub-sources.
Basic Examples
Create a simple instance source that chunks up in-memory token sources:
from olmo_core.data.composable import *
work_dir = "/tmp/dataset-common"
source = ConcatAndChunkInstanceSource(
    InMemoryTokenSource(list(range(100)), work_dir=work_dir),
    sequence_length=10,
    work_dir=work_dir,
)
source.visualize()
ConcatAndChunkInstanceSource(ee7a76d): 100 tokens
└─ InMemoryTokenSource(73b91ee): 100 tokens
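To make "chunking without regard for document boundaries" concrete, here is a minimal plain-Python sketch of the idea. It does not use olmo_core; the function name `concat_and_chunk` is hypothetical, and dropping a trailing partial chunk is an assumption of this sketch rather than confirmed library behavior.

```python
from typing import Iterable, List, Sequence


def concat_and_chunk(
    sources: Iterable[Sequence[int]], sequence_length: int
) -> List[List[int]]:
    """Concatenate token sequences and cut them into fixed-length instances.

    Document boundaries are ignored. A trailing partial chunk is dropped,
    which is an assumption of this sketch.
    """
    # Flatten all sources into one long token stream, in order.
    tokens: List[int] = [tok for source in sources for tok in source]
    num_instances = len(tokens) // sequence_length
    return [
        tokens[i * sequence_length : (i + 1) * sequence_length]
        for i in range(num_instances)
    ]


# Mirrors the example above: 100 tokens, sequence length 10.
instances = concat_and_chunk([list(range(100))], sequence_length=10)
print(len(instances))
print(instances[0])
```

With 100 tokens and a sequence length of 10 this yields 10 instances, which matches the "100 tokens" total in the visualization.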
Split the source into train and test sets:
train_source, test_source = source.split(0.8)
train_source.visualize()
test_source.visualize()
SlicedInstanceSource(d01d0e2): 80 tokens
└─ ConcatAndChunkInstanceSource(ee7a76d): 100 tokens
   └─ InMemoryTokenSource(73b91ee): 100 tokens
SlicedInstanceSource(a5a511f): 20 tokens
└─ ConcatAndChunkInstanceSource(ee7a76d): 100 tokens
   └─ InMemoryTokenSource(73b91ee): 100 tokens
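The split above can be pictured as partitioning instance indices into two disjoint slices. The following is a plain-Python sketch under stated assumptions, not the library's implementation: `split_indices` is a hypothetical helper, the cut point is rounded to the nearest instance, and shuffling only happens when a seed is given.

```python
import random
from typing import List, Optional, Tuple


def split_indices(
    num_instances: int, ratio: float, seed: Optional[int] = None
) -> Tuple[List[int], List[int]]:
    """Partition instance indices into two disjoint groups by `ratio`."""
    indices = list(range(num_instances))
    if seed is not None:
        # Assumption: a seed requests a random (but reproducible) split.
        random.Random(seed).shuffle(indices)
    cut = round(num_instances * ratio)
    return indices[:cut], indices[cut:]


sequence_length = 10
train_idx, test_idx = split_indices(num_instances=10, ratio=0.8)
print(len(train_idx) * sequence_length, len(test_idx) * sequence_length)
```

For 10 instances of 10 tokens each and a ratio of 0.8, the two halves hold 80 and 20 tokens, as in the visualization.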
Sample a subset of a source:
train_source = train_source.sample(max_tokens=50)
train_source.visualize()
SamplingInstanceSource(77d8031): 50 tokens
└─ SlicedInstanceSource(d01d0e2): 80 tokens
   └─ ConcatAndChunkInstanceSource(ee7a76d): 100 tokens
      └─ InMemoryTokenSource(73b91ee): 100 tokens
Create a mix of token sources:
tokens1 = InMemoryTokenSource(list(range(100)), work_dir=work_dir, label="source1")
tokens2 = InMemoryTokenSource(list(range(100, 200)), work_dir=work_dir, label="source2")
tokens_mix = MixingTokenSource(
    MixingTokenSource.Spec(source=tokens1, ratio=0.5),
    MixingTokenSource.Spec(source=tokens2, ratio=0.5),
    work_dir=work_dir,
)
source = ConcatAndChunkInstanceSource(tokens_mix, sequence_length=10, work_dir=work_dir)
source.visualize()
ConcatAndChunkInstanceSource(4820826): 200 tokens
└─ MixingTokenSource(5fc211a): 200 tokens
   ├─ SamplingTokenSource(7adca21): 100 tokens [source1]
   │  └─ InMemoryTokenSource(73b91ee): 100 tokens [source1]
   └─ SamplingTokenSource(baf2e4f): 100 tokens [source2]
      └─ InMemoryTokenSource(a9e49e1): 100 tokens [source2]
Working with numpy source files
You can use the same numpy tokenized source files that the dataset classes in olmo_core.data.numpy_dataset
consume by starting with the NumpyDocumentSource or its corresponding config class, NumpyDocumentSourceConfig.
For example:
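# `tokenizer` is assumed to be a tokenizer config created earlier;
# `work_dir` is the working directory defined in the basic examples.
source_config = NumpyDocumentSource.Config(
    source_paths=[
        "gs://ai2-llm/preprocessed/midtraining-reasoning/tinyMATH/PoT/processed_data/processed-decon-sparkle-motion/allenai/dolma2-tokenizer/*.npy"
    ],
    tokenizer=tokenizer,
)
sources = source_config.build(work_dir)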
[
NumpyDocumentSource('gs://ai2-llm/preprocessed/midtraining-reasoning/tinyMATH/PoT/processed_data/processed-decon-sparkle-motion/allenai/dolma2-tokenizer/part-00-00000.npy',),
NumpyDocumentSource('gs://ai2-llm/preprocessed/midtraining-reasoning/tinyMATH/PoT/processed_data/processed-decon-sparkle-motion/allenai/dolma2-tokenizer/part-01-00000.npy',),
NumpyDocumentSource('gs://ai2-llm/preprocessed/midtraining-reasoning/tinyMATH/PoT/processed_data/processed-decon-sparkle-motion/allenai/dolma2-tokenizer/part-02-00000.npy',),
NumpyDocumentSource('gs://ai2-llm/preprocessed/midtraining-reasoning/tinyMATH/PoT/processed_data/processed-decon-sparkle-motion/allenai/dolma2-tokenizer/part-03-00000.npy',),
NumpyDocumentSource('gs://ai2-llm/preprocessed/midtraining-reasoning/tinyMATH/PoT/processed_data/processed-decon-sparkle-motion/allenai/dolma2-tokenizer/part-04-00000.npy',),
NumpyDocumentSource('gs://ai2-llm/preprocessed/midtraining-reasoning/tinyMATH/PoT/processed_data/processed-decon-sparkle-motion/allenai/dolma2-tokenizer/part-05-00000.npy',),
NumpyDocumentSource('gs://ai2-llm/preprocessed/midtraining-reasoning/tinyMATH/PoT/processed_data/processed-decon-sparkle-motion/allenai/dolma2-tokenizer/part-06-00000.npy',),
NumpyDocumentSource('gs://ai2-llm/preprocessed/midtraining-reasoning/tinyMATH/PoT/processed_data/processed-decon-sparkle-motion/allenai/dolma2-tokenizer/part-07-00000.npy',),
NumpyDocumentSource('gs://ai2-llm/preprocessed/midtraining-reasoning/tinyMATH/PoT/processed_data/processed-decon-sparkle-motion/allenai/dolma2-tokenizer/part-08-00000.npy',),
NumpyDocumentSource('gs://ai2-llm/preprocessed/midtraining-reasoning/tinyMATH/PoT/processed_data/processed-decon-sparkle-motion/allenai/dolma2-tokenizer/part-09-00000.npy',)
]
Ratio-based mixing
Here’s a more useful example where we create several groups of numpy sources, two code and two math,
using NumpyDocumentSourceConfig.from_source_groups():
sequence_length = 8192
token_sources = NumpyDocumentSource.Config.from_source_groups(
    {
        "code_fim": [
            "gs://ai2-llm/preprocessed/stack-edu/sample-fim-weighted-pl-edu-score-decon/**/**/*.npy",
        ],
        "swallowcode": [
            "gs://ai2-llm/preprocessed/tokyotech-llm/swallowcode/scor_final_data-decon-sparkle-motion/allenai/dolma2-tokenizer/*.npy"
        ],
        "megamath": [
            "gs://ai2-llm/preprocessed/megamath_web_pro_max/beaker_rewrites-decon-sparkle-motion/**/allenai/dolma2-tokenizer/*.npy"
        ],
        "dolminos2math": [
            "gs://ai2-llm/preprocessed/tokyotech-llm/swallowmath/beaker_outputs-decon-sparkle-motion-withids/allenai/dolma2-tokenizer/*.npy",
            "gs://ai2-llm/preprocessed/midtraining-reasoning/flat_dolmino_math-decon-sparkle-motion/allenai/dolma2-tokenizer/*.npy",
            "gs://ai2-llm/preprocessed/midtraining-reasoning/OpenMathReasoning/OpenMathReasoning-rewrite-full-thoughts/jsonls-decon-sparkle-motion/allenai/dolma2-tokenizer/*.npy",
            "gs://ai2-llm/preprocessed/midtraining-reasoning/tinyMATH/MIND/data/processed-decon-sparkle-motion/allenai/dolma2-tokenizer/*.npy",
            "gs://ai2-llm/preprocessed/midtraining-reasoning/tinyMATH/PoT/processed_data/processed-decon-sparkle-motion/allenai/dolma2-tokenizer/*.npy",
        ],
    },
    tokenizer=tokenizer,
)
And then mix them together at the instance level in a hierarchical fashion with a MixingInstanceSource
to get a source with 30B tokens:
def make_instance_source(label: str) -> InstanceSourceConfig:
    return ConcatAndChunkInstanceSource.Config(
        sources=[token_sources[label]], label=label, sequence_length=sequence_length
    )

mix_config = MixingInstanceSource.Config(
    source_specs=[
        ################
        # code sources #
        ################
        MixingInstanceSource.Spec.Config(
            source=MixingInstanceSource.Config(
                source_specs=[
                    MixingInstanceSource.Spec.Config(
                        source=make_instance_source("code_fim"),
                        ratio=0.5,
                        label="code_fim",
                    ),
                    MixingInstanceSource.Spec.Config(
                        source=make_instance_source("swallowcode"),
                        ratio=0.5,
                        label="swallowcode",
                    ),
                ]
            ),
            ratio=0.5,
            label="code",
        ),
        ################
        # math sources #
        ################
        MixingInstanceSource.Spec.Config(
            source=MixingInstanceSource.Config(
                source_specs=[
                    MixingInstanceSource.Spec.Config(
                        source=make_instance_source("megamath"),
                        ratio=0.1,
                        label="megamath",
                    ),
                    MixingInstanceSource.Spec.Config(
                        source=make_instance_source("dolminos2math"),
                        ratio=0.9,
                        label="dolminos2math",
                    ),
                ]
            ),
            ratio=0.5,
            label="math",
        ),
    ],
    num_tokens=30_000_000_000,
)
mix = mix_config.build("/tmp/dataset-common")
mix.visualize()
MixingInstanceSource(e421147): 30.0B tokens
├─ SamplingInstanceSource(c65cde2): 15.0B tokens [code]
│  └─ MixingInstanceSource(73dfe43): 37.7B tokens
│     ├─ SamplingInstanceSource(85521bf): 18.8B tokens [code_fim]
│     │  └─ ConcatAndChunkInstanceSource(adb4562): 21.4B tokens [code_fim]
│     │     └─ NumpyDocumentSource x 474: 21.4B tokens [code_fim]
│     └─ SamplingInstanceSource(8d7c840): 18.8B tokens [swallowcode]
│        └─ ConcatAndChunkInstanceSource(b2f2ef4): 18.8B tokens [swallowcode]
│           └─ NumpyDocumentSource x 128: 18.8B tokens [swallowcode]
└─ SamplingInstanceSource(03941ca): 15.0B tokens [math]
   └─ MixingInstanceSource(39aa7de): 20.3B tokens
      ├─ SamplingInstanceSource(cbc20a2): 2.0B tokens [megamath]
      │  └─ ConcatAndChunkInstanceSource(2b6a324): 3.9B tokens [megamath]
      │     └─ NumpyDocumentSource x 264: 3.9B tokens [megamath]
      └─ SamplingInstanceSource(857de5e): 18.3B tokens [dolminos2math]
         └─ ConcatAndChunkInstanceSource(b768b9a): 18.3B tokens [dolminos2math]
            └─ NumpyDocumentSource x 415: 18.3B tokens [dolminos2math]
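The sizes of the inner mixes in this visualization are consistent with a simple rule, sketched below in plain Python. This is an inference from the printed numbers, not a description of the library's implementation: when a mix is given no explicit num_tokens, it appears to be as large as it can be without repeating any child, that is, the minimum of size / ratio over its children, and each child then contributes ratio * total. The helper names `max_mix_size` and `allocate` are hypothetical.

```python
from typing import Dict


def max_mix_size(sizes: Dict[str, float], ratios: Dict[str, float]) -> float:
    """Largest mix total such that no child needs more tokens than it has."""
    return min(sizes[name] / ratios[name] for name in ratios)


def allocate(total: float, ratios: Dict[str, float]) -> Dict[str, float]:
    """Tokens drawn from each child for a mix of `total` tokens."""
    return {name: total * ratio for name, ratio in ratios.items()}


# Sizes in billions of tokens, read off the visualization (rounded to 0.1B).
math_sizes = {"megamath": 3.9, "dolminos2math": 18.3}
math_ratios = {"megamath": 0.1, "dolminos2math": 0.9}

math_total = max_mix_size(math_sizes, math_ratios)
math_alloc = allocate(math_total, math_ratios)
print(round(math_total, 1), {k: round(v, 1) for k, v in math_alloc.items()})
```

Here dolminos2math is the limiting source (18.3 / 0.9 is smaller than 3.9 / 0.1), which reproduces the 20.3B math mix with about 2.0B tokens of megamath. Applying the same rule to the code group with the rounded sizes gives 37.6B rather than the displayed 37.7B; the difference is presumably rounding in the printed sizes. The top-level mix is different because num_tokens=30_000_000_000 is set explicitly, so each half is simply 0.5 * 30B = 15B.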
Tip
The ratios (e.g. MixingInstanceSourceSpec.ratio) for each source within a mix don’t necessarily need to sum to 1.0, but if they don’t, you’ll see
a warning and they’ll be normalized before being applied:
UserWarning: Target mixing ratios don't sum to 1. They will be normalized as follows:
❯ Source 'math': target ratio adjusted from 0.7 to 0.7368421052631579
❯ Source 'code': target ratio adjusted from 0.25 to 0.2631578947368421
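The adjusted values in that warning are what you get by dividing each ratio by the sum of all ratios. A minimal sketch of that arithmetic (the function name `normalize_ratios` is hypothetical and not part of the API):

```python
from typing import Dict


def normalize_ratios(ratios: Dict[str, float]) -> Dict[str, float]:
    """Scale ratios so that they sum to 1.0, preserving their proportions."""
    total = sum(ratios.values())
    if total <= 0:
        raise ValueError("ratios must sum to a positive number")
    return {name: ratio / total for name, ratio in ratios.items()}


normalized = normalize_ratios({"math": 0.7, "code": 0.25})
print(normalized)
```

With 0.7 and 0.25 the sum is 0.95, so the normalized ratios come out at roughly 0.737 and 0.263, matching the warning.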
Up-sampling or targeted repetition
Suppose we wanted to simulate training 3 epochs on the mixture above, i.e. training on 3 repetitions
of the data.
In general you can do exact up-sampling by wrapping a source in
a SamplingInstanceSource (or SamplingTokenSource, SamplingDocumentSource),
or by calling the .sample() / .resize() methods:
upsampled_mix = mix.resize(3.0)
upsampled_mix.visualize()
SamplingInstanceSource(de59d5e): 90.0B tokens
└─ MixingInstanceSource(e421147): 30.0B tokens
   ├─ SamplingInstanceSource(c65cde2): 15.0B tokens [code]
   │  └─ MixingInstanceSource(73dfe43): 37.7B tokens
   │     ├─ SamplingInstanceSource(85521bf): 18.8B tokens [code_fim]
   │     │  └─ ConcatAndChunkInstanceSource(adb4562): 21.4B tokens [code_fim]
   │     │     └─ NumpyDocumentSource x 474: 21.4B tokens [code_fim]
   │     └─ SamplingInstanceSource(8d7c840): 18.8B tokens [swallowcode]
   │        └─ ConcatAndChunkInstanceSource(b2f2ef4): 18.8B tokens [swallowcode]
   │           └─ NumpyDocumentSource x 128: 18.8B tokens [swallowcode]
   └─ SamplingInstanceSource(03941ca): 15.0B tokens [math]
      └─ MixingInstanceSource(39aa7de): 20.3B tokens
         ├─ SamplingInstanceSource(cbc20a2): 2.0B tokens [megamath]
         │  └─ ConcatAndChunkInstanceSource(2b6a324): 3.9B tokens [megamath]
         │     └─ NumpyDocumentSource x 264: 3.9B tokens [megamath]
         └─ SamplingInstanceSource(857de5e): 18.3B tokens [dolminos2math]
            └─ ConcatAndChunkInstanceSource(b768b9a): 18.3B tokens [dolminos2math]
               └─ NumpyDocumentSource x 415: 18.3B tokens [dolminos2math]
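One way to picture "exact" up-sampling is that every instance is repeated the same whole number of times, and only the fractional remainder is drawn at random. The sketch below illustrates that idea in plain Python; it is an assumption about the behavior, not the library's code, and `resize_indices` is a hypothetical helper.

```python
import random
from collections import Counter
from typing import List


def resize_indices(num_instances: int, factor: float, seed: int = 0) -> List[int]:
    """Pick instance indices so the result has about num_instances * factor items.

    Whole passes over the source are taken in full; any fractional remainder
    is a random subset without replacement (assumption of this sketch).
    """
    target = round(num_instances * factor)
    full_passes, remainder = divmod(target, num_instances)
    indices: List[int] = []
    for _ in range(full_passes):
        indices.extend(range(num_instances))
    extra = list(range(num_instances))
    random.Random(seed).shuffle(extra)
    indices.extend(extra[:remainder])
    return indices


counts = Counter(resize_indices(num_instances=10, factor=3.0))
print(sorted(set(counts.values())))
```

With a factor of 3.0 every instance appears exactly three times, which is the "3 epochs" reading of the 90.0B-token source above. The same helper also covers down-sampling, since a factor below 1.0 produces zero full passes and a random subset.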
Curriculum learning
The composable API also enables curriculum learning. Suppose we want the first half of training to focus on 25% code + 75% math, and the second half to focus on 75% code + 25% math.
We’ll start by randomly splitting each of our sources, and since we’ll want to set RNG seeds in
multiple places, we’ll use the helper function set_composable_seed() to set the global starting seed
so that we don’t have to set a different seed explicitly everywhere one is required:
set_composable_seed(42)
instance_sources = {
    "code_fim": make_instance_source("code_fim").random_split(0.25),
    "swallowcode": make_instance_source("swallowcode").random_split(0.25),
    "megamath": make_instance_source("megamath").random_split(0.75),
    "dolminos2math": make_instance_source("dolminos2math").random_split(0.75),
}
And then we can create two separate mixes with the splits:
def make_source_spec(label: str, split: int, ratio: float) -> MixingInstanceSourceSpecConfig:
    return MixingInstanceSource.Spec.Config(
        source=instance_sources[label][split],
        ratio=ratio,
        label=label,
    )

mix_config1 = MixingInstanceSource.Config(
    source_specs=[
        MixingInstanceSource.Spec.Config(
            source=MixingInstanceSource.Config(
                source_specs=[
                    make_source_spec("code_fim", 0, 0.5),
                    make_source_spec("swallowcode", 0, 0.5),
                ]
            ),
            ratio=0.25,
            label="code",
        ),
        MixingInstanceSource.Spec.Config(
            source=MixingInstanceSource.Config(
                source_specs=[
                    make_source_spec("megamath", 0, 0.1),
                    make_source_spec("dolminos2math", 0, 0.9),
                ]
            ),
            ratio=0.75,
            label="math",
        ),
    ],
)
mix_config2 = MixingInstanceSource.Config(
    source_specs=[
        MixingInstanceSource.Spec.Config(
            source=MixingInstanceSource.Config(
                source_specs=[
                    make_source_spec("code_fim", 1, 0.5),
                    make_source_spec("swallowcode", 1, 0.5),
                ]
            ),
            ratio=0.75,
            label="code",
        ),
        MixingInstanceSource.Spec.Config(
            source=MixingInstanceSource.Config(
                source_specs=[
                    make_source_spec("megamath", 1, 0.1),
                    make_source_spec("dolminos2math", 1, 0.9),
                ]
            ),
            ratio=0.25,
            label="math",
        ),
    ],
)
mix1 = mix_config1.build("/tmp/dataset-common")
mix1.visualize()
mix2 = mix_config2.build("/tmp/dataset-common")
mix2.visualize()
MixingInstanceSource(6544ac1): 20.3B tokens
├─ SamplingInstanceSource(d28e959): 5.1B tokens [code]
│  └─ MixingInstanceSource(08c8aa6): 9.4B tokens
│     ├─ SamplingInstanceSource(c8e7179): 4.7B tokens [code_fim]
│     │  └─ SlicedInstanceSource(02637e6): 5.3B tokens [code_fim]
│     │     └─ ConcatAndChunkInstanceSource(adb4562): 21.4B tokens [code_fim]
│     │        └─ NumpyDocumentSource x 474: 21.4B tokens [code_fim]
│     └─ SamplingInstanceSource(520fff5): 4.7B tokens [swallowcode]
│        └─ SlicedInstanceSource(fcaebbc): 4.7B tokens [swallowcode]
│           └─ ConcatAndChunkInstanceSource(b2f2ef4): 18.8B tokens [swallowcode]
│              └─ NumpyDocumentSource x 128: 18.8B tokens [swallowcode]
└─ SamplingInstanceSource(b33b3f1): 15.2B tokens [math]
   └─ MixingInstanceSource(47322cf): 15.2B tokens
      ├─ SamplingInstanceSource(ccadff0): 1.5B tokens [megamath]
      │  └─ SlicedInstanceSource(c4cd38d): 2.9B tokens [megamath]
      │     └─ ConcatAndChunkInstanceSource(2b6a324): 3.9B tokens [megamath]
      │        └─ NumpyDocumentSource x 264: 3.9B tokens [megamath]
      └─ SamplingInstanceSource(476779e): 13.7B tokens [dolminos2math]
         └─ SlicedInstanceSource(75c18b6): 13.7B tokens [dolminos2math]
            └─ ConcatAndChunkInstanceSource(b768b9a): 18.3B tokens [dolminos2math]
               └─ NumpyDocumentSource x 415: 18.3B tokens [dolminos2math]
MixingInstanceSource(02f0d21): 20.3B tokens
├─ SamplingInstanceSource(76da23b): 15.2B tokens [code]
│  └─ MixingInstanceSource(cb963a6): 28.3B tokens
│     ├─ SamplingInstanceSource(5c8e643): 14.1B tokens [code_fim]
│     │  └─ SlicedInstanceSource(f0ca032): 16.0B tokens [code_fim]
│     │     └─ ConcatAndChunkInstanceSource(adb4562): 21.4B tokens [code_fim]
│     │        └─ NumpyDocumentSource x 474: 21.4B tokens [code_fim]
│     └─ SamplingInstanceSource(79187df): 14.1B tokens [swallowcode]
│        └─ SlicedInstanceSource(0ae4650): 14.1B tokens [swallowcode]
│           └─ ConcatAndChunkInstanceSource(b2f2ef4): 18.8B tokens [swallowcode]
│              └─ NumpyDocumentSource x 128: 18.8B tokens [swallowcode]
└─ SamplingInstanceSource(8e32820): 5.1B tokens [math]
   └─ MixingInstanceSource(25a8bb7): 5.1B tokens
      ├─ SamplingInstanceSource(70d1102): 507.6M tokens [megamath]
      │  └─ SlicedInstanceSource(d3ca4bc): 970.6M tokens [megamath]
      │     └─ ConcatAndChunkInstanceSource(2b6a324): 3.9B tokens [megamath]
      │        └─ NumpyDocumentSource x 264: 3.9B tokens [megamath]
      └─ SamplingInstanceSource(afd6a12): 4.6B tokens [dolminos2math]
         └─ SlicedInstanceSource(b53996f): 4.6B tokens [dolminos2math]
            └─ ConcatAndChunkInstanceSource(b768b9a): 18.3B tokens [dolminos2math]
               └─ NumpyDocumentSource x 415: 18.3B tokens [dolminos2math]
When we build our ComposableDataLoader we’ll pass it both of those mixes, in order, and specify
the ShuffleStrategy as intra_source
so that each mix is shuffled independently during its phase of training:
data_loader = ComposableDataLoader.Config(
    tokenizer=tokenizer,
    global_batch_size=512 * sequence_length,
    shuffle_strategy=ShuffleStrategy.intra_source,
).build(mix1, mix2, work_dir="/tmp/dataloader-common")
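The difference between the two shuffle strategies can be sketched in plain Python. The intra_source behavior follows the description above (each source shuffled independently, source order preserved); the inter_source behavior shown here, a single shuffle across all sources, is an assumption based on the name. Both function names are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

# Each entry is (source index, instance index within that source).
Order = List[Tuple[int, int]]


def intra_source_order(source_sizes: Sequence[int], seed: int = 0) -> Order:
    """Shuffle within each source, keeping the sources themselves in order."""
    rng = random.Random(seed)
    order: Order = []
    for source_idx, size in enumerate(source_sizes):
        local = list(range(size))
        rng.shuffle(local)
        order.extend((source_idx, i) for i in local)
    return order


def inter_source_order(source_sizes: Sequence[int], seed: int = 0) -> Order:
    """Shuffle all instances from all sources together (assumed semantics)."""
    order: Order = [
        (source_idx, i)
        for source_idx, size in enumerate(source_sizes)
        for i in range(size)
    ]
    random.Random(seed).shuffle(order)
    return order


intra = intra_source_order([4, 3])
inter = inter_source_order([4, 3])
print([source_idx for source_idx, _ in intra])
```

In the intra-source order all instances of the first source come before any instance of the second, which is what makes the two mixes behave as consecutive curriculum phases.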
Alternatively, you could set sources_per_epoch=1 to tell the data loader to use only the first source
for the first epoch, the second source for the second epoch, and so on:
data_loader = ComposableDataLoader.Config(
    tokenizer=tokenizer,
    global_batch_size=512 * sequence_length,
    sources_per_epoch=1,
).build(mix1, mix2, work_dir="/tmp/dataloader-common")
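Because global_batch_size is measured in tokens, the expression 512 * sequence_length corresponds to 512 instances per batch. The sketch below works through that arithmetic; the data-parallel world size of 8 is a hypothetical value, the 20.3B figure is the rounded size of one mix from the visualization, and dropping a trailing partial batch is an assumption.

```python
from typing import Dict


def batch_plan(
    global_batch_size: int, sequence_length: int, dp_world_size: int
) -> Dict[str, int]:
    """Break a token-denominated global batch size into instance counts."""
    if global_batch_size % sequence_length != 0:
        raise ValueError("global_batch_size must be a multiple of sequence_length")
    instances_per_batch = global_batch_size // sequence_length
    if instances_per_batch % dp_world_size != 0:
        raise ValueError("instances per batch must divide evenly across ranks")
    return {
        "tokens_per_batch": global_batch_size,
        "instances_per_batch": instances_per_batch,
        "instances_per_rank": instances_per_batch // dp_world_size,
    }


def num_full_batches(total_tokens: int, global_batch_size: int) -> int:
    """Number of complete batches (a partial final batch is dropped here)."""
    return total_tokens // global_batch_size


sequence_length = 8192
plan = batch_plan(512 * sequence_length, sequence_length, dp_world_size=8)
print(plan)
print(num_full_batches(20_300_000_000, 512 * sequence_length))
```

Under these assumptions each batch holds 4,194,304 tokens, each of the 8 ranks sees 64 instances per step, and one 20.3B-token mix lasts a little under five thousand steps.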
Reference
- class olmo_core.data.composable.SourceABC(*, work_dir, label=None)
  Bases: object
  Abstract base class for source types.
  - property work_dir: Path
    The class-specific local working directory that can be used by the source for caching files during preprocessing.
  - property fs_local_rank: int
    The local rank of the current process with respect to filesystem access of the working directory.
  - abstract property fingerprint: str
    A unique, deterministic string representing the ordered contents of the source.
- class olmo_core.data.composable.TokenSource(*, work_dir, label=None)
  Bases: SourceABC
  An abstract base class for a source of tokens, usually consumed by an InstanceSource. It essentially represents an array of tokens.
  At a minimum, a TokenSource must implement the methods/properties (1) num_tokens(), (2) get_token_range(), (3) fingerprint(), and (4) children().
  - __len__()
    The number of tokens available from this source, same as self.num_tokens.
  - abstract get_token_range(start_idx, end_idx)
    Get a range of contiguous tokens starting from start_idx (0-based, inclusive) to end_idx (exclusive).
    Since a TokenSource isn’t necessarily aware of document boundaries (see DocumentSource), the token range could start in the middle of a document and span multiple documents. It’s up to the consumers of a token source (e.g. an InstanceSource) to get ranges that make sense for their use case.
  - __getitem__(key)
    Get a range of tokens using either an integer index (for a singular token range) or a slice.
  - __add__(other)
    Add two token sources together into a ConcatenatedTokenSource or ConcatenatedDocumentSource depending on the type of self and other.
  - __mul__(factor)
    Re-size this source by a given factor by sampling tokens from it.
  - sample(*, max_tokens, seed=0)
    Sample a contiguous chunk of tokens from this source.
  - resize(factor, seed=0)
    Re-size this source by a given factor by sampling a contiguous chunk of tokens from it.
- class olmo_core.data.composable.TokenSourceConfig
  Bases: Config
  A base config class for configuring and building a TokenSource.
  - __add__(other)
    Add two token source configs together into a ConcatenatedTokenSourceConfig or ConcatenatedDocumentSourceConfig depending on the type of self and other.
  - __mul__(factor)
    Re-size this source by a given factor by sampling tokens from it.
  - sample(*, max_tokens, seed=0)
    Sample a contiguous chunk of tokens from this source.
  - resize(factor, seed=0)
    Re-size this source by a given factor by sampling a contiguous chunk of tokens from it.
- class olmo_core.data.composable.DocumentSource(*, work_dir, label=None)
  Bases: TokenSource
  An abstract base class for a particular type of TokenSource that’s aware of document boundaries. This class has one additional abstract method: get_document_offsets().
  - sample_by_docs(*, max_tokens, seed=0)
    Sample documents from this source.
- class olmo_core.data.composable.DocumentSourceConfig
  Bases: TokenSourceConfig
  A base config class for configuring and building a DocumentSource.
  - sample_by_docs(*, max_tokens, seed=0)
    Sample documents from this source.
- class olmo_core.data.composable.TokenRange
  Bases: TypedDict
  A token range is just a dictionary that should include input_ids of the range and optionally a corresponding label_mask.
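To illustrate the shape of a token range and the half-open indexing of get_token_range(), here is a standalone sketch that does not subclass the real TokenSource. The class names `TokenRangeSketch` and `ListTokenSource` are hypothetical, and plain Python lists stand in for whatever array type the library actually uses.

```python
from typing import List, Sequence, TypedDict


class TokenRangeSketch(TypedDict, total=False):
    """Stand-in for TokenRange: input_ids plus an optional label_mask."""

    input_ids: List[int]
    label_mask: List[bool]


class ListTokenSource:
    """A toy token source backed by a Python list (illustration only)."""

    def __init__(self, tokens: Sequence[int]):
        self._tokens = list(tokens)

    @property
    def num_tokens(self) -> int:
        return len(self._tokens)

    def __len__(self) -> int:
        return self.num_tokens

    def get_token_range(self, start_idx: int, end_idx: int) -> TokenRangeSketch:
        # start_idx is inclusive and end_idx is exclusive, as in the reference.
        if not 0 <= start_idx <= end_idx <= self.num_tokens:
            raise IndexError(f"invalid token range [{start_idx}, {end_idx})")
        return {"input_ids": self._tokens[start_idx:end_idx]}


toy_source = ListTokenSource(range(100))
print(toy_source.get_token_range(10, 20)["input_ids"])
```

The returned range covers tokens 10 through 19, and nothing in the toy source knows or cares where documents begin or end.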
- class olmo_core.data.composable.InstanceSource(*, work_dir, sequence_length, max_sequence_length=None, label=None)
  Bases: SourceABC
  An abstract base class for a source of instances, usually consumed by a ComposableDataLoader. It essentially represents an array of instances, where each instance is a sequence of sequence_length tokens.
  Parameters:
  - sequence_length (int) – The length of each sequence (instance) to produce.
  - max_sequence_length (Optional[int], default: None) – For sources that support this. If you intend to increase the sequence length in the middle of an epoch, you should set this to the maximum sequence length that you’ll train on to guarantee that you can restart the run with the same data order after changing sequence length. Care needs to be taken when implementing this in a subclass to ensure that the exact same tokens will be produced when sequence_length is changed but max_sequence_length is fixed.
  - property max_sequence_length: int
    Typically the same as sequence_length, though in some cases it can be greater, such as when the sequence length will be increased in the middle of an epoch.
  - __add__(other)
    Add two instance sources together into a ConcatenatedInstanceSource.
  - __mul__(factor)
    Re-size this source by a given factor by sampling instances from it.
  - sample(*, max_tokens=None, max_instances=None, seed=0)
    Sample instances from this source.
    Parameters:
    - max_tokens (Optional[int], default: None) – The maximum number of tokens to sample from this source. Mutually exclusive with max_instances.
    - max_instances (Optional[int], default: None) – The maximum number of instances to sample from this source. Mutually exclusive with max_tokens.
    - seed (Optional[int], default: 0) – A random seed for sampling. If None, no shuffling is done and instances are taken in order.
  - resize(factor, seed=0)
    Re-size this source by a given factor by sampling instances from it.
  - split(ratio, seed=None)
    Split this source into two disjoint sources according to the given ratio.
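The max_sequence_length guarantee described for InstanceSource can be illustrated with a small plain-Python sketch. It assumes one possible way to satisfy the requirement, namely ordering data in blocks of max_sequence_length tokens and subdividing each block into instances; this is not necessarily how any particular subclass implements it, and `instance_ranges` and `token_order` are hypothetical names.

```python
import random
from typing import List, Tuple


def instance_ranges(
    num_tokens: int, sequence_length: int, max_sequence_length: int, seed: int = 0
) -> List[Tuple[int, int]]:
    """Return (start, end) token ranges for each instance, in training order."""
    if max_sequence_length % sequence_length != 0:
        raise ValueError("max_sequence_length must be a multiple of sequence_length")
    # Order is decided at the granularity of max_sequence_length blocks...
    blocks = list(range(num_tokens // max_sequence_length))
    random.Random(seed).shuffle(blocks)
    per_block = max_sequence_length // sequence_length
    ranges: List[Tuple[int, int]] = []
    for block in blocks:
        base = block * max_sequence_length
        # ...and each block is then cut into sequence_length pieces, in order.
        for j in range(per_block):
            start = base + j * sequence_length
            ranges.append((start, start + sequence_length))
    return ranges


def token_order(ranges: List[Tuple[int, int]]) -> List[int]:
    """Flatten instance ranges into the order in which tokens are seen."""
    return [tok for start, end in ranges for tok in range(start, end)]


short = instance_ranges(64, sequence_length=4, max_sequence_length=8)
long = instance_ranges(64, sequence_length=8, max_sequence_length=8)
print(token_order(short) == token_order(long))  # True
```

Because the shuffle depends only on max_sequence_length and the seed, switching sequence_length from 4 to 8 changes how many instances there are but not the order in which tokens are consumed.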
- class olmo_core.data.composable.InstanceSourceConfig
  Bases: Config
  A base config class for configuring and building an InstanceSource.
  - abstract build(work_dir)
    Build the InstanceSource.
  - __add__(other)
    Add two instance source configs together into a ConcatenatedInstanceSourceConfig.
  - __mul__(factor)
    Re-size this source by a given factor by sampling instances from it.
  - sample(*, max_tokens=None, max_instances=None, seed=0)
    Sample instances from this source.
    Parameters:
    - max_tokens (Optional[int], default: None) – The maximum number of tokens to sample from this source. Mutually exclusive with max_instances.
    - max_instances (Optional[int], default: None) – The maximum number of instances to sample from this source. Mutually exclusive with max_tokens.
    - seed (Optional[int], default: 0) – A random seed for sampling. If None, no shuffling is done and instances are taken in order.
  - resize(factor, seed=0)
    Re-size this source by a given factor by sampling instances from it.
  - split(ratio, seed=None)
    Split this source into two disjoint sources according to the given ratio.
- class olmo_core.data.composable.Instance
  Bases: TypedDict
  An instance is just a dictionary that should include input_ids and optionally a corresponding label_mask.
- class olmo_core.data.composable.ComposableDataLoader(*sources, collator, tokenizer, work_dir, global_batch_size, dp_world_size=1, dp_rank=0, fs_local_rank=None, seed=0, shuffle=True, shuffle_strategy=None, sources_per_epoch=-1, num_threads=None, num_workers=0, prefetch_factor=None, target_device_type='cpu', generate_doc_lengths=False, instance_filter_config=None, display_source_visualization=True)
  Bases: TextDataLoaderBase
  A data loader for composable instance sources.
  Parameters:
  - sources (InstanceSource) – One or more instance sources to draw data from. All sources must have the same sequence_length and max_sequence_length.
  - collator (DataCollator) – The data collator to use to form batches.
  - tokenizer (TokenizerConfig) – The config of the tokenizer used to create the underlying data.
  - work_dir (Union[Path, PathLike, str]) – A common local working directory that can be used for caching.
  - global_batch_size (int) – The total batch size (in tokens) across all data parallel ranks.
  - dp_world_size (int, default: 1) – The number of data parallel ranks.
  - dp_rank (int, default: 0) – The data parallel rank of the current process.
  - fs_local_rank (Optional[int], default: None) – The local rank of the current process with respect to filesystem access of the working directory.
  - seed (int, default: 0) – The random seed to use when shuffling data.
  - shuffle (bool, default: True) – Whether to shuffle data at the start of each epoch.
  - shuffle_strategy (Optional[ShuffleStrategy], default: None) – How to shuffle the data. Defaults to ShuffleStrategy.inter_source.
  - sources_per_epoch (int, default: -1) – The number of sources to use per epoch. If -1, all sources are used.
  - num_threads (Optional[int], default: None) – The number of threads to use for loading data within each worker process.
  - num_workers (int, default: 0) – The number of worker processes to use for loading data.
  - prefetch_factor (Optional[int], default: None) – The number of batches to prefetch from each worker process.
  - target_device_type (str, default: 'cpu') – The type of device that batches will be sent to, typically either “cpu” or “cuda”.
  - generate_doc_lengths (bool, default: False) – Whether to generate document lengths for each instance needed for intra-document masking.
  - instance_filter_config (Optional[InstanceFilterConfig], default: None) – Optional configuration for filtering instances based on long sequences of repeated ngrams.
  - display_source_visualization (bool, default: True) – Whether to display a visualization of each source to stdout from rank 0.
  - Config
    alias of ComposableDataLoaderConfig
  - property total_batches: int | None
    The total number of batches that the dataset will produce over the course of the current epoch, if known. Otherwise this should return None.
  - batches_in_epoch(epoch)
    By default this is the same as total_batches, though some data loaders might generate a different number of batches per epoch.
  - load_state_dict(state_dict)
    Load a state dict from state_dict() to restore the data loader’s state.
  - reshuffle(epoch=None, **kwargs)
    Reshuffle for a new epoch. Should be called before starting the epoch, regardless of whether or not you’ve called load_state_dict().
- class olmo_core.data.composable.ComposableDataLoaderConfig(tokenizer=None, global_batch_size=None, seed=<factory>, work_dir=None, shuffle=True, shuffle_strategy=None, sources_per_epoch=-1, num_threads=None, num_workers=0, prefetch_factor=None, target_device_type=None, generate_doc_lengths=False, instance_filter_config=None, display_source_visualization=True, *, type=None)
  Bases: DataLoaderConfig[ComposableDataLoader]
  A configuration class for building ComposableDataLoader data loaders.
  - build(*sources, collator=None, work_dir=None, mesh=None, dp_process_group=None, tokenizer=None, global_batch_size=None)
    Construct the ComposableDataLoader.
    Parameters:
    - sources (InstanceSource) – The instance sources.
    - collator (Optional[DataCollator], default: None) – An optional data collator. If not provided, a default will be created.
    - work_dir (Union[Path, PathLike, str, None], default: None) – A working directory for caching.
    - mesh (Optional[DeviceMesh], default: None) – An optional DeviceMesh that defines the data parallel dimensions. Ideally you should create this mesh using build_world_mesh(). Alternatively you can pass the dp_process_group instead.
    - dp_process_group (Optional[ProcessGroup], default: None) – The data parallel process group.
  - registered_base
    alias of DataLoaderConfig
- class olmo_core.data.composable.InMemoryTokenSource(tokens, *, work_dir, label_mask=None, label=None)
  Bases: TokenSource
  An in-memory implementation of a TokenSource. Primarily meant for testing.
  - property fingerprint: str
    A unique, deterministic string representing the ordered contents of the source.
  - get_token_range(start_idx, end_idx)
    Get a range of contiguous tokens starting from start_idx (0-based, inclusive) to end_idx (exclusive).
    Since a TokenSource isn’t necessarily aware of document boundaries (see DocumentSource), the token range could start in the middle of a document and span multiple documents. It’s up to the consumers of a token source (e.g. an InstanceSource) to get ranges that make sense for their use case.
- class olmo_core.data.composable.ConcatenatedTokenSource(*sources, work_dir, label=None)
  Bases: TokenSource
  A token source that can be created from concatenating multiple other token sources.
  - Config
    alias of ConcatenatedTokenSourceConfig
  - property fingerprint: str
    A unique, deterministic string representing the ordered contents of the source.
  - get_token_range(start_idx, end_idx)
    Get a range of contiguous tokens starting from start_idx (0-based, inclusive) to end_idx (exclusive).
    Since a TokenSource isn’t necessarily aware of document boundaries (see DocumentSource), the token range could start in the middle of a document and span multiple documents. It’s up to the consumers of a token source (e.g. an InstanceSource) to get ranges that make sense for their use case.
- class olmo_core.data.composable.ConcatenatedTokenSourceConfig(sources, label=None)
  Bases: TokenSourceConfig
  A base config class for configuring and building a ConcatenatedTokenSource.
- class olmo_core.data.composable.SlicedTokenSource(source, source_slice, *, work_dir, label=None)
  Bases: TokenSource
  A token source that provides a slice of another token source.
  - property fingerprint: str
    A unique, deterministic string representing the ordered contents of the source.
  - get_token_range(start_idx, end_idx)
    Get a range of contiguous tokens starting from start_idx (0-based, inclusive) to end_idx (exclusive).
    Since a TokenSource isn’t necessarily aware of document boundaries (see DocumentSource), the token range could start in the middle of a document and span multiple documents. It’s up to the consumers of a token source (e.g. an InstanceSource) to get ranges that make sense for their use case.
- class olmo_core.data.composable.SplitTokenSourceConfig(source, ratio, idx)
  Bases: TokenSourceConfig
  A base config class for configuring and building a split TokenSource.
- class olmo_core.data.composable.SamplingTokenSource(*sources, max_tokens, seed=0, work_dir, label=None)[source]¶
Bases:
TokenSourceA token source that samples contiguous chunks of tokens from other token sources. This can be used to adjust the effective size of a source.
Tip
Unlike SamplingDocumentSource, this class doesn’t take document boundaries into account when sampling, but is much faster to set up.
- Parameters:
sources (TokenSource) – The sources to sample tokens from.
max_tokens (int) – The maximum number of tokens to sample.
seed (Optional[int], default: 0) – An optional seed for sampling. If None, the first N_s tokens are taken from each source, where N_s is proportional to the size of the source.
Warning
Generally you should prefer to use SamplingDocumentSource with random sampling (a seed provided) to preserve the distribution of child sources. This is a quick and dirty alternative.
- Config¶
alias of
SamplingTokenSourceConfig
- property fingerprint: str¶
A unique, deterministic string representing the ordered contents of the source.
- get_token_range(start_idx, end_idx)[source]¶
Get a range of contiguous tokens starting from start_idx (0-based, inclusive) to end_idx (exclusive).
Since a TokenSource isn’t necessarily aware of document boundaries (see DocumentSource), the token range could start in the middle of a document and span multiple documents. It’s up to the consumers of a token source (e.g. an InstanceSource) to get ranges that make sense for their use case.
- Return type:
- class olmo_core.data.composable.SamplingTokenSourceConfig(sources, max_tokens=None, factor=None, seed=<factory>, label=None)[source]¶
Bases:
TokenSourceConfigA config for building a
SamplingTokenSource.
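The unseeded behaviour described for SamplingTokenSource (taking the first N_s tokens from each source, with N_s proportional to the source size) can be sketched as follows. This is an illustration only: `proportional_prefix_sizes` is a hypothetical helper, and capping the budget at the total number of available tokens is a simplifying assumption, not documented behaviour.

```python
from typing import List


def proportional_prefix_sizes(source_sizes: List[int], max_tokens: int) -> List[int]:
    """Split a token budget across sources in proportion to their sizes (hypothetical helper)."""
    total = sum(source_sizes)
    if total == 0:
        return [0] * len(source_sizes)
    # Assumption: never ask for more tokens than the sources hold.
    budget = min(max_tokens, total)
    # Floor division keeps every N_s within the size of its source.
    return [size * budget // total for size in source_sizes]
```

For two sources of 100 and 300 tokens and a budget of 200, the first 50 tokens of the first source and the first 150 tokens of the second would be taken.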
- class olmo_core.data.composable.MixingTokenSource(*source_specs, work_dir, seed=0, label=None, num_tokens=None)[source]¶
Bases:
TokenSourceA token source for mixing other token sources together with arbitrary ratios. Sampling within each source is done using
SamplingTokenSource, which samples a consecutive chunk of tokens.See also
MixingDocumentSourcefor mixing document sources by sampling whole documents.MixingInstanceSourcefor mixing instance sources.
Important
Sampling is done in a way that minimizes the number of dropped and repeated tokens while matching the target ratios and respecting the MixingTokenSourceSpec.max_repetition_factor values.
If num_tokens is not specified, then the number of tokens this source produces will always be less than or equal to the sum of tokens across all of its immediate children defined in the source_specs.
If num_tokens is specified, this class will try to match that size but may raise an OLMoConfigurationError if it’s not possible with the given max_repetition_factor values.
- Parameters:
source_specs (MixingTokenSourceSpec) – The sources and how to sample from them.
num_tokens (Optional[int], default: None) – An optional target number of tokens for the mixed source.
Warning
Generally you should prefer to use MixingDocumentSource with random sampling (a seed provided) to preserve the distribution of child sources. This is a quick and dirty alternative.
- Config¶
alias of
MixingTokenSourceConfig
- Spec¶
alias of
MixingTokenSourceSpec
- property fingerprint: str¶
A unique, deterministic string representing the ordered contents of the source.
- get_token_range(start_idx, end_idx)[source]¶
Get a range of contiguous tokens starting from start_idx (0-based, inclusive) to end_idx (exclusive).
Since a TokenSource isn’t necessarily aware of document boundaries (see DocumentSource), the token range could start in the middle of a document and span multiple documents. It’s up to the consumers of a token source (e.g. an InstanceSource) to get ranges that make sense for their use case.
- Return type:
- class olmo_core.data.composable.MixingTokenSourceConfig(source_specs, seed=<factory>, label=None, num_tokens=None)[source]¶
Bases:
TokenSourceConfigA config for
MixingTokenSource.-
source_specs:
List[MixingTokenSourceSpecConfig]¶ Mixing source specs.
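The interaction between target ratios, max_repetition_factor, and num_tokens can be reasoned about with a small feasibility calculation. The sketch below is one plausible reading of the constraints, not the library's algorithm: `mixture_token_counts` is a hypothetical helper, it assumes a source of size S with repetition factor R can contribute at most S × R tokens, and it raises a plain ValueError where the real class would raise OLMoConfigurationError.

```python
from typing import List, Optional, Sequence


def mixture_token_counts(
    sizes: Sequence[int],
    ratios: Sequence[float],
    max_repetition_factors: Sequence[float],
    num_tokens: Optional[int] = None,
) -> List[int]:
    """Tokens to draw from each source for a target mixture (hypothetical helper)."""
    total_ratio = sum(ratios)
    if total_ratio <= 0:
        raise ValueError("ratios must sum to a positive number")
    weights = [r / total_ratio for r in ratios]

    # Largest mixture size each source could support on its own:
    # weight * mixture_size <= size * max_repetition_factor.
    limits = [
        size * rep / w
        for size, rep, w in zip(sizes, max_repetition_factors, weights)
        if w > 0
    ]
    largest_feasible = int(min(limits))

    if num_tokens is None:
        # Assumption: without a target, never exceed the tokens actually available.
        target = min(sum(sizes), largest_feasible)
    elif num_tokens > largest_feasible:
        raise ValueError(
            f"cannot reach {num_tokens} tokens; at most {largest_feasible} are feasible"
        )
    else:
        target = num_tokens
    return [int(w * target) for w in weights]
```

With two sources of 100 tokens each and ratios 3:1, no repetition limits the mixture to 133 tokens, while allowing the first source to repeat up to twice lets the mixture reach the full 200.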
- class olmo_core.data.composable.InMemoryDocumentSource(tokens, *, tokenizer, work_dir, label_mask=None, label=None)[source]¶
Bases:
InMemoryTokenSource,DocumentSourceAn in-memory implementation of a
DocumentSource. Primarily meant for testing.
- class olmo_core.data.composable.ConcatenatedDocumentSource(*sources, work_dir, label=None)[source]¶
Bases:
ConcatenatedTokenSource,DocumentSourceA document source that can be created from concatenating multiple other document sources.
- Config¶
alias of
ConcatenatedDocumentSourceConfig
- class olmo_core.data.composable.ConcatenatedDocumentSourceConfig(sources, label=None)[source]¶
Bases:
DocumentSourceConfigA base config class for configuring and building a
ConcatenatedDocumentSource.
- class olmo_core.data.composable.SamplingDocumentSource(*sources, max_tokens, seed=0, work_dir, label=None)[source]¶
Bases:
DocumentSourceA document source that samples documents from other document sources. This can be used to adjust the effective size of a source.
- Parameters:
sources (DocumentSource) – The sources to sample documents from.
max_tokens (int) – The maximum number of tokens to sample. The resulting source will have at most this many tokens, but potentially fewer because only whole documents are sampled.
seed (Optional[int], default: 0) – An optional seed for sampling documents. If None, no shuffling is done and the first documents are taken up to max_tokens.
Warning
It’s recommended to set a seed to ensure that the distribution of documents in child sources is preserved.
- Config¶
alias of
SamplingDocumentSourceConfig
- property fingerprint: str¶
A unique, deterministic string representing the ordered contents of the source.
- get_token_range(start_idx, end_idx)[source]¶
Get a range of contiguous tokens starting from start_idx (0-based, inclusive) to end_idx (exclusive).
Since a TokenSource isn’t necessarily aware of document boundaries (see DocumentSource), the token range could start in the middle of a document and span multiple documents. It’s up to the consumers of a token source (e.g. an InstanceSource) to get ranges that make sense for their use case.
- Return type:
- class olmo_core.data.composable.SamplingDocumentSourceConfig(sources, max_tokens=None, factor=None, seed=<factory>, label=None)[source]¶
Bases:
DocumentSourceConfigA config for building a
SamplingDocumentSource.
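Sampling whole documents up to a token budget, with and without a seed, might look like the sketch below. It is not the library's code: `sample_documents` is a hypothetical helper that works on document lengths only, and stopping at the first document that does not fit is an assumption.

```python
import random
from typing import List, Optional


def sample_documents(
    doc_lengths: List[int], max_tokens: int, seed: Optional[int] = 0
) -> List[int]:
    """Return indices of whole documents whose total length fits in max_tokens (hypothetical helper)."""
    order = list(range(len(doc_lengths)))
    if seed is not None:
        # Seeded: visit documents in a reproducible random order.
        random.Random(seed).shuffle(order)
    chosen: List[int] = []
    used = 0
    for idx in order:
        # Only whole documents are taken, so the total may fall short of max_tokens.
        if used + doc_lengths[idx] > max_tokens:
            break
        chosen.append(idx)
        used += doc_lengths[idx]
    return chosen
```

With seed=None the result is simply a prefix of the documents, which is why a seed is recommended when the order of documents in a source is not itself random.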
- class olmo_core.data.composable.MixingDocumentSource(*source_specs, work_dir, seed=0, label=None, num_tokens=None)[source]¶
Bases:
DocumentSourceA document source for mixing other document sources together with arbitrary ratios. Sampling within each source is done using
SamplingDocumentSource, which samples whole documents.See also
MixingTokenSourcefor mixing token sources in a way that’s agnostic of document boundaries.MixingInstanceSourcefor mixing instance sources.
Important
Sampling is done in a way that minimizes the number of dropped and repeated tokens while matching the target ratios and respecting the MixingDocumentSourceSpec.max_repetition_factor values.
If num_tokens is not specified, then the number of tokens this source produces will always be less than or equal to the sum of tokens across all of its immediate children defined in the source_specs.
If num_tokens is specified, this class will try to match that size but may raise an OLMoConfigurationError if it’s not possible with the given max_repetition_factor values.
- Parameters:
source_specs (MixingDocumentSourceSpec) – The sources and how to sample from them.
num_tokens (Optional[int], default: None) – An optional target number of tokens for the mixed source.
- Config¶
alias of
MixingDocumentSourceConfig
- Spec¶
alias of
MixingDocumentSourceSpec
- property fingerprint: str¶
A unique, deterministic string representing the ordered contents of the source.
- get_token_range(start_idx, end_idx)[source]¶
Get a range of contiguous tokens starting from start_idx (0-based, inclusive) to end_idx (exclusive).
Since a TokenSource isn’t necessarily aware of document boundaries (see DocumentSource), the token range could start in the middle of a document and span multiple documents. It’s up to the consumers of a token source (e.g. an InstanceSource) to get ranges that make sense for their use case.
- Return type:
- class olmo_core.data.composable.MixingDocumentSourceConfig(source_specs, seed=<factory>, label=None, num_tokens=None)[source]¶
Bases:
DocumentSourceConfigA config for
MixingDocumentSource.-
source_specs:
List[MixingDocumentSourceSpecConfig]¶ Mixing source specs.
- class olmo_core.data.composable.NumpyDocumentSource(*, source_paths, dtype, work_dir, tokenizer, label_mask_paths=None, label=None, max_document_length=None, long_doc_strategy='truncate', _source_sizes=None, _label_mask_sizes=None)[source]¶
Bases:
DocumentSource
A DocumentSource that reads tokens from one or more tokenized numpy source files.
Important
There’s some overhead when instantiating this class because it needs to query the sizes of all the source files. If you want to create multiple sources from the same set of files, consider first creating a single source and then splitting it up using split_by_source(). This is much more efficient than creating multiple sources directly, since the sizes of the source files only need to be queried once, and the queries run concurrently in a thread pool.
- Parameters:
source_paths (Sequence[Union[Path, PathLike, str]]) – The paths/URLs to the numpy token ID arrays.
dtype (Union[Type[uint8], Type[uint16], Type[uint32], Type[uint64]]) – The numpy datatype of the token ID arrays in the source paths.
tokenizer (TokenizerConfig) – The config of the tokenizer that was used to tokenize the source files.
label_mask_paths (Optional[Sequence[Union[Path, PathLike, str]]], default: None) – The paths/URLs to numpy bool files indicating which tokens should be masked.
max_document_length (Optional[int], default: None) – The maximum document length to use when iterating over documents. If not None, documents longer than this will either be fragmented or truncated, depending on long_doc_strategy.
long_doc_strategy (LongDocStrategy, default: 'truncate') – How to handle long documents when max_document_length is set.
- Config¶
alias of
NumpyDocumentSourceConfig
- MixConfig¶
alias of
NumpyDocumentSourceMixConfig
- property fingerprint: str¶
A unique, deterministic string representing the ordered contents of the source.
- split_by_source(group_size=1)[source]¶
Split the source up into multiple smaller sources from groups of source files.
- Return type:
- get_token_range(start_idx, end_idx)[source]¶
Get a range of contiguous tokens starting from start_idx (0-based, inclusive) to end_idx (exclusive).
Since a TokenSource isn’t necessarily aware of document boundaries (see DocumentSource), the token range could start in the middle of a document and span multiple documents. It’s up to the consumers of a token source (e.g. an InstanceSource) to get ranges that make sense for their use case.
- Return type:
- class olmo_core.data.composable.NumpyDocumentSourceConfigBase(*, tokenizer, dtype=None, source_permutation_seed=None, source_group_size=1, label=None, max_document_length=None, long_doc_strategy='truncate')[source]¶
Bases:
DocumentSourceConfigBase config class for
NumpyDocumentSourceConfigandNumpyDocumentSourceMixConfig.-
tokenizer:
TokenizerConfig¶ The config of the tokenizer that was used to tokenize the source files.
-
dtype:
Optional[NumpyDatasetDType] = None¶ The numpy datatype of the token ID arrays in the source paths.
-
source_permutation_seed:
Optional[int] = None¶ Used to shuffle the source files before grouping/building the document sources.
-
max_document_length:
Optional[int] = None¶ The maximum document length to use when iterating over documents. If not None, documents longer than this will either be fragmented or truncated, depending on long_doc_strategy.
-
long_doc_strategy:
LongDocStrategy = 'truncate'¶ How to handle long documents when max_document_length is set.
- class olmo_core.data.composable.NumpyDocumentSourceConfig(*, tokenizer, dtype=None, source_permutation_seed=None, source_group_size=1, label=None, max_document_length=None, long_doc_strategy='truncate', source_paths, label_mask_paths=None, expand_glob=None)[source]¶
Bases:
NumpyDocumentSourceConfigBaseConfig class for building one or more
NumpyDocumentSourcedirectly from source paths.-
label_mask_paths:
Optional[List[str]] = None¶ The paths/URLs to numpy bool files indicating which tokens should be masked.
-
expand_glob:
Optional[bool] = None¶ If true, treat source/label paths as glob patterns and expand them when building the sources.
- classmethod from_source_groups(source_path_groups, *, tokenizer, label_mask_path_groups=None, expand_glob=None, **kwargs)[source]¶
A more efficient way to create multiple configs from groups of source paths. This will use a thread pool to expand all globs concurrently, which can be substantially faster especially when some of the globs point to cloud storage URLs.
- Parameters:
source_path_groups (Dict[str, List[Union[Path, PathLike, str]]]) – Groups of source paths to use. Each group will be put into its own config with the corresponding label.
tokenizer (TokenizerConfig) – The tokenizer config to use.
label_mask_path_groups (Optional[Dict[str, List[Union[Path, PathLike, str]]]], default: None) – Optional groups of label mask paths to use. Each group should correspond to the group in source_path_groups at the same key.
- Return type:
- build(work_dir)[source]¶
Build the sources.
- Return type:
List[NumpyDocumentSource]
Note
The number of sources returned depends on the length of source_paths and the value of source_group_size.
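How the number of built sources can depend on the number of paths and on source_group_size is sketched below, together with the optional shuffle controlled by source_permutation_seed. This is an illustration under assumptions: `group_source_paths` is a hypothetical helper, and letting the last group be smaller than the others is a guess about how a remainder is handled.

```python
import random
from typing import List, Optional, Sequence


def group_source_paths(
    paths: Sequence[str],
    group_size: int = 1,
    permutation_seed: Optional[int] = None,
) -> List[List[str]]:
    """Group source file paths into chunks of group_size (hypothetical helper)."""
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    ordered = list(paths)
    if permutation_seed is not None:
        # Shuffle the files before grouping, reproducibly.
        random.Random(permutation_seed).shuffle(ordered)
    # Assumption: a final, smaller group holds any leftover paths.
    return [ordered[i : i + group_size] for i in range(0, len(ordered), group_size)]
```

Under these assumptions, three paths with a group size of 2 yield two sources, and the default group size of 1 yields one source per file.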
- class olmo_core.data.composable.NumpyDocumentSourceMixConfig(*, tokenizer, dtype=None, source_permutation_seed=None, source_group_size=1, label=None, max_document_length=None, long_doc_strategy='truncate', mix, mix_base_dir)[source]¶
Bases:
NumpyDocumentSourceConfigBaseConfig class for building one or more
NumpyDocumentSourcefrom a predefined source mix.-
mix:
Union[str,DataMixBase]¶ The name of a data mix (e.g.
"dolma17").
- build(work_dir)[source]¶
Build the sources.
- Return type:
List[NumpyDocumentSource]
Note
The number of sources returned depends on the number of paths in the mix and the value of source_group_size.
- class olmo_core.data.composable.ConcatAndChunkInstanceSource(*sources, sequence_length, work_dir, max_sequence_length=None, label=None)[source]¶
Bases:
InstanceSourceThe basic instance source that simply chunks up token sources without regard for document boundaries, just like the
NumpyFSLDataset.- Config¶
alias of
ConcatAndChunkInstanceSourceConfig
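The concat-and-chunk idea can be shown in a few lines of plain Python. This is a conceptual sketch rather than the library's implementation: `concat_and_chunk` is a hypothetical helper, and both concatenating across all sources and dropping a trailing partial chunk are assumptions.

```python
from itertools import chain
from typing import Iterable, List


def concat_and_chunk(sources: Iterable[List[int]], sequence_length: int) -> List[List[int]]:
    """Concatenate token lists and cut them into fixed-length instances (hypothetical helper)."""
    if sequence_length < 1:
        raise ValueError("sequence_length must be at least 1")
    tokens = list(chain.from_iterable(sources))
    # Assumption: a trailing partial chunk is dropped.
    num_instances = len(tokens) // sequence_length
    return [
        tokens[i * sequence_length : (i + 1) * sequence_length]
        for i in range(num_instances)
    ]
```

Because chunk boundaries ignore document boundaries, an instance can begin in the middle of one document and end in the middle of another.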
- class olmo_core.data.composable.ConcatAndChunkInstanceSourceConfig(sources, sequence_length, max_sequence_length=None, label=None)[source]¶
Bases:
InstanceSourceConfigConfig for
ConcatAndChunkInstanceSource.- classmethod from_npy(*npy_paths, tokenizer, sequence_length, max_sequence_length=None, dtype=None, source_permutation_seed=None, source_group_size=1, label_mask_paths=None, expand_glob=None, label=None)[source]¶
Create a
ConcatAndChunkInstanceSourceConfigfrom one or more tokenized.npysource files.- Return type:
- build(work_dir)[source]¶
Build the
InstanceSource.- Return type:
- class olmo_core.data.composable.PackingInstanceSource(*sources, sequence_length, work_dir, tokenizer, max_sequence_length=None, long_doc_strategy='truncate', source_group_size=1, label=None)[source]¶
Bases:
InstanceSource
Like the NumpyPackedFSLDataset, this instance source packs documents from each DocumentSource into instances using the Optimized Best-Fit Decreasing (OBFD) algorithm described in Fewer Truncations Improve Language Modeling. The resulting instances will all have exactly sequence_length tokens, using padding if needed.
Note
By default, OBFD is applied to each source separately. Source files from the Dolma toolkit are usually large enough for OBFD to achieve very good compactness (minimal padding tokens), and packing per source allows the work to be parallelized. However, you can pack instances from multiple consecutive sources together by setting source_group_size to a value greater than 1.
- Parameters:
sources (DocumentSource) – Sources of documents to pack.
sequence_length (int) – The sequence length of each instance, i.e. the maximum number of tokens that can be packed into each instance.
tokenizer (TokenizerConfig) – The tokenizer configuration.
max_sequence_length (Optional[int], default: None) – This must be equal to sequence_length if given.
long_doc_strategy (LongDocStrategy, default: 'truncate') – The strategy to use for documents longer than sequence_length.
source_group_size (int, default: 1) – The number of consecutive sources to pack together.
- Config¶
alias of
PackingInstanceSourceConfig
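For intuition about the packing step, the sketch below implements plain best-fit decreasing bin packing over document lengths. It is deliberately the simple quadratic variant, not the optimized OBFD algorithm the class uses, and `best_fit_decreasing` is a hypothetical helper that assumes long documents have already been truncated or fragmented to fit.

```python
from typing import List


def best_fit_decreasing(doc_lengths: List[int], sequence_length: int) -> List[List[int]]:
    """Pack document indices into instances of capacity sequence_length (hypothetical helper)."""
    bins: List[List[int]] = []   # document indices per instance
    remaining: List[int] = []    # free tokens per instance
    # Visit documents from longest to shortest.
    order = sorted(range(len(doc_lengths)), key=lambda i: doc_lengths[i], reverse=True)
    for idx in order:
        length = doc_lengths[idx]
        if length > sequence_length:
            raise ValueError(f"document {idx} is longer than sequence_length")
        # Best fit: the instance with the least free space that still fits the document.
        best = -1
        for b, free in enumerate(remaining):
            if free >= length and (best == -1 or free < remaining[best]):
                best = b
        if best == -1:
            bins.append([idx])
            remaining.append(sequence_length - length)
        else:
            bins[best].append(idx)
            remaining[best] -= length
    return bins
```

Whatever space is left in an instance after packing would be filled with padding tokens, so a tighter packing means fewer padding tokens.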
- class olmo_core.data.composable.PackingInstanceSourceConfig(sources, sequence_length, tokenizer, max_sequence_length=None, long_doc_strategy='truncate', source_group_size=1, label=None)[source]¶
Bases:
InstanceSourceConfigConfig for
PackingInstanceSource.- classmethod from_npy(*npy_paths, tokenizer, sequence_length, max_sequence_length=None, dtype=None, source_permutation_seed=None, source_group_size=1, label_mask_paths=None, expand_glob=None, label=None, long_doc_strategy='truncate')[source]¶
Create a
PackingInstanceSourceConfigfrom one or more tokenized.npysource files.- Return type:
- build(work_dir)[source]¶
Build the
InstanceSource.- Return type:
- class olmo_core.data.composable.ConcatenatedInstanceSource(*sources, work_dir, label=None)[source]¶
Bases:
InstanceSourceAn instance source that concatenates multiple instance sources together end-to-end.
- Config¶
alias of
ConcatenatedInstanceSourceConfig
- class olmo_core.data.composable.ConcatenatedInstanceSourceConfig(sources)[source]¶
Bases:
InstanceSourceConfigA config for a
ConcatenatedInstanceSource.- build(work_dir)[source]¶
Build the
InstanceSource.- Return type:
- class olmo_core.data.composable.SlicedInstanceSource(source, source_slice, *, seed=None, work_dir)[source]¶
Bases:
InstanceSourceAn instance source that provides a slice of another instance source.
- class olmo_core.data.composable.SplitInstanceSourceConfig(source, ratio, idx, seed=None)[source]¶
Bases:
InstanceSourceConfigA base config class for configuring and building a split
InstanceSource.- build(work_dir)[source]¶
Build the
InstanceSource.- Return type:
- class olmo_core.data.composable.SamplingInstanceSource(*sources, max_tokens=None, max_instances=None, work_dir, seed=0, label=None)[source]¶
Bases:
InstanceSourceAn instance source that samples instances from other instance sources. This can be used to adjust the effective size of a source.
- Parameters:
sources (InstanceSource) – The sources to sample instances from.
max_tokens (Optional[int], default: None) – The maximum number of tokens to sample. Alternatively you can specify max_instances.
max_instances (Optional[int], default: None) – The maximum number of instances to sample. Mutually exclusive with max_tokens.
seed (Optional[int], default: 0) – An optional seed for sampling. If None, the first N_s instances are taken from each source, where N_s is proportional to the size of the source.
Warning
It’s recommended to set a seed to ensure that the distribution of instances in child sources is preserved.
- Config¶
alias of
SamplingInstanceSourceConfig
- class olmo_core.data.composable.SamplingInstanceSourceConfig(sources, max_tokens=None, max_instances=None, factor=None, seed=<factory>, label=None)[source]¶
Bases:
InstanceSourceConfigConfig for
SamplingInstanceSource.- build(work_dir)[source]¶
Build the
InstanceSource.- Return type:
- class olmo_core.data.composable.MixingInstanceSource(*source_specs, work_dir, seed=0, label=None, num_tokens=None, num_instances=None)[source]¶
Bases:
InstanceSourceAn instance source for mixing other instance sources together with arbitrary ratios. Sampling within each source is done using
SamplingInstanceSource, which samples whole instances.See also
MixingTokenSourcefor mixing token sources in a way that’s agnostic of document boundaries.MixingDocumentSourcefor mixing document sources by sampling whole documents.
Important
Sampling is done in a way that minimizes the number of dropped instances while matching the target ratios and respecting the MixingInstanceSourceSpec.max_repetition_factor values.
If neither num_tokens nor num_instances is specified, then the number of instances this source produces will always be less than or equal to the sum of instances across all of its immediate children defined in the source_specs.
If num_tokens or num_instances is specified, this class will try to match that size but may raise an OLMoConfigurationError if it’s not possible with the given max_repetition_factor values.
- Parameters:
source_specs (MixingInstanceSourceSpec) – The sources and how to sample from them.
num_tokens (Optional[int], default: None) – An optional target number of tokens for the mixed source. Mutually exclusive with num_instances.
num_instances (Optional[int], default: None) – An optional target number of instances for the mixed source. Mutually exclusive with num_tokens.
- Config¶
alias of
MixingInstanceSourceConfig
- Spec¶
alias of
MixingInstanceSourceSpec
- class olmo_core.data.composable.MixingInstanceSourceConfig(source_specs, seed=<factory>, label=None, num_tokens=None, num_instances=None)[source]¶
Bases:
InstanceSourceConfigA config for
MixingInstanceSource.-
source_specs:
List[MixingInstanceSourceSpecConfig]¶ Mixing source specs.
- build(work_dir)[source]¶
Build the
InstanceSource.- Return type:
- class olmo_core.data.composable.RandomInstanceSource(*, tokenizer, sequence_length, avg_document_length, seed=0, num_instances=None, num_tokens=None, max_sequence_length=None, label=None, work_dir)[source]¶
Bases:
InstanceSourceAn instance source that generates random instances. Useful for benchmarking.
- Config¶
alias of
RandomInstanceSourceConfig
- class olmo_core.data.composable.RandomInstanceSourceConfig(tokenizer, sequence_length, avg_document_length, seed=<factory>, num_instances=None, num_tokens=None, max_sequence_length=None, label=None)[source]¶
Bases:
InstanceSourceConfigConfig for
RandomInstanceSource.- build(work_dir)[source]¶
Build the
InstanceSource.- Return type:
- class olmo_core.data.composable.InstanceFilterConfig(repetition_max_period=13, repetition_min_period=1, repetition_max_count=32)[source]¶
Bases:
ConfigConfig for instance filtering.
- class olmo_core.data.composable.LongDocStrategy(value)[source]¶
Bases:
StrEnumSpecifies how to handle documents that are longer than the max sequence length when packing.
- truncate = 'truncate'¶
Long docs are truncated and the excess tokens are discarded.
- fragment = 'fragment'¶
Long docs are split into smaller docs so that no tokens are discarded, but you end up with fragmented docs.
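The difference between the two strategies is easy to see on a toy document. The sketch below is illustrative only: `split_long_document` is a hypothetical helper operating on a plain list of token IDs, not a function of the library.

```python
from typing import List


def split_long_document(
    tokens: List[int], max_length: int, strategy: str = "truncate"
) -> List[List[int]]:
    """Apply a long-document strategy to one document (hypothetical helper)."""
    if max_length < 1:
        raise ValueError("max_length must be at least 1")
    if len(tokens) <= max_length:
        return [tokens]
    if strategy == "truncate":
        # Keep the head of the document and discard the excess tokens.
        return [tokens[:max_length]]
    if strategy == "fragment":
        # Keep every token, at the cost of splitting the document into pieces.
        return [tokens[i : i + max_length] for i in range(0, len(tokens), max_length)]
    raise ValueError(f"unknown strategy: {strategy!r}")
```

For a five-token document and a maximum length of 2, truncation keeps two tokens and drops three, while fragmentation yields three pieces of lengths 2, 2, and 1.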
- class olmo_core.data.composable.ShuffleStrategy(value)[source]¶
Bases:
StrEnumDefines how the data is shuffled.
- inter_source = 'inter_source'¶
Shuffle across all sources as if they were one big source.
- intra_source = 'intra_source'¶
Shuffle within each source, then concatenate the sources in order. This can be used to create a data curriculum.
- interleaved_source = 'interleaved_source'¶
Shuffle within each source and then interleave instances from each source.
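The three strategies can be contrasted on small lists of instance labels. The sketch below is a simplified model, not the data loader's implementation: `shuffle_instances` is a hypothetical helper, and interleaving by simple round-robin is an assumption (the real interleaving scheme may weight sources differently).

```python
import random
from typing import List


def shuffle_instances(sources: List[List[str]], strategy: str, seed: int = 0) -> List[str]:
    """Order instances from several sources under a shuffle strategy (hypothetical helper)."""
    rng = random.Random(seed)
    if strategy == "inter_source":
        # Treat all sources as one big source and shuffle everything together.
        merged = [item for source in sources for item in source]
        rng.shuffle(merged)
        return merged

    # The remaining strategies shuffle each source independently first.
    shuffled = []
    for source in sources:
        copy = list(source)
        rng.shuffle(copy)
        shuffled.append(copy)

    if strategy == "intra_source":
        # Keep sources in order: all of source 0, then all of source 1, and so on.
        return [item for source in shuffled for item in source]
    if strategy == "interleaved_source":
        # Assumption: round-robin over sources until every source is exhausted.
        longest = max((len(source) for source in shuffled), default=0)
        return [source[i] for i in range(longest) for source in shuffled if i < len(source)]
    raise ValueError(f"unknown strategy: {strategy!r}")
```

With intra_source, the source order is preserved, which is what makes it usable for a data curriculum; with inter_source, any such ordering is lost.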
- class olmo_core.data.composable.MixingInstanceSourceSpec(source, ratio, max_repetition_factor=1.0, label=None)[source]¶
Bases:
objectDefines a source and its associated mixing ratio for
MixingInstanceSource.- Config¶
alias of
MixingInstanceSourceSpecConfig
-
source:
InstanceSource¶ The source.
-
ratio:
float¶ The relative target ratio for this source. If the ratios across all source specs don’t sum to 1.0 then they’ll be normalized.
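Normalizing relative ratios so they sum to 1.0 amounts to dividing each by the total. A minimal sketch, with `normalize_ratios` as a hypothetical helper rather than a library function:

```python
from typing import List, Sequence


def normalize_ratios(ratios: Sequence[float]) -> List[float]:
    """Scale relative ratios so that they sum to 1.0 (hypothetical helper)."""
    if any(r < 0 for r in ratios):
        raise ValueError("ratios must be non-negative")
    total = sum(ratios)
    if total <= 0:
        raise ValueError("at least one ratio must be positive")
    return [r / total for r in ratios]
```

Ratios of 2, 1, and 1 therefore describe the same mixture as 0.5, 0.25, and 0.25.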
- class olmo_core.data.composable.MixingInstanceSourceSpecConfig(source, ratio, max_repetition_factor=1.0, label=None)[source]¶
Bases:
ConfigConfig for
MixingInstanceSourceSpec.
- class olmo_core.data.composable.MixingTokenSourceSpec(source, ratio, max_repetition_factor=1.0, label=None)[source]¶
Bases:
objectDefines a source and its associated mixing ratio for
MixingTokenSource.- Config¶
alias of
MixingTokenSourceSpecConfig
-
source:
TokenSource¶ The source.
-
ratio:
float¶ The relative target ratio for this source. If the ratios across all source specs don’t sum to 1.0 then they’ll be normalized.
- class olmo_core.data.composable.MixingTokenSourceSpecConfig(source, ratio, max_repetition_factor=1.0, label=None)[source]¶
Bases:
ConfigConfig for
MixingTokenSourceSpec.
- class olmo_core.data.composable.MixingDocumentSourceSpec(source, ratio, max_repetition_factor=1.0, label=None)[source]¶
Bases:
objectDefines a source and its associated mixing ratio for
MixingDocumentSource.- Config¶
alias of
MixingDocumentSourceSpecConfig
-
source:
DocumentSource¶ The source.
-
ratio:
float¶ The relative target ratio for this source. If the ratios across all source specs don’t sum to 1.0 then they’ll be normalized.
- class olmo_core.data.composable.MixingDocumentSourceSpecConfig(source, ratio, max_repetition_factor=1.0, label=None)[source]¶
Bases:
ConfigConfig for
MixingDocumentSourceSpec.