data.tokenizer

class olmo_core.data.tokenizer.TokenizerConfig(vocab_size, eos_token_id, pad_token_id, bos_token_id=None, identifier=None)

Bases: Config

A configuration class that represents a tokenizer.

vocab_size: int

The number of tokens in the vocabulary.

eos_token_id: int

The end-of-sentence token ID.

pad_token_id: int

The padding token ID.

bos_token_id: Optional[int] = None

The begin-of-sentence token ID.

identifier: Optional[str] = None

The identifier of the tokenizer. This can be a path or a HuggingFace identifier.

padded_vocab_size(pad_multiple=128)

Returns the vocab size padded up to a multiple of pad_multiple. Setting the size of the model's embeddings to this number can increase throughput.

Return type:

int
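The padding arithmetic can be illustrated with a minimal sketch. TokenizerConfigSketch below is a hypothetical stand-in that mirrors the documented fields; it is not the library class, and the round-up formula is an assumption about how the padded size is computed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TokenizerConfigSketch:
    """Hypothetical stand-in mirroring the documented fields; not the library class."""

    vocab_size: int
    eos_token_id: int
    pad_token_id: int
    bos_token_id: Optional[int] = None
    identifier: Optional[str] = None

    def padded_vocab_size(self, pad_multiple: int = 128) -> int:
        # Assumed behavior: round vocab_size up to the next multiple of pad_multiple.
        return pad_multiple * ((self.vocab_size + pad_multiple - 1) // pad_multiple)


# Illustrative values: GPT-2 has 50257 tokens, with EOS at ID 50256.
# Reusing the EOS ID for padding is an assumption made for this example.
config = TokenizerConfigSketch(
    vocab_size=50257,
    eos_token_id=50256,
    pad_token_id=50256,
    identifier="gpt2",
)
padded = config.padded_vocab_size()
print(padded)  # 50304, i.e. 393 * 128
```

Under this assumption, a vocab size that is already a multiple of pad_multiple is returned unchanged.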

classmethod dolma2()

Get a dolma2 tokenizer config.

Return type:

TokenizerConfig

classmethod dolma2_sigdig()

Get a dolma2_sigdig tokenizer config.

Return type:

TokenizerConfig

classmethod gpt_neox_olmo_dolma_v1_5()

Get a gpt_neox_olmo_dolma_v1_5 tokenizer config.

Return type:

TokenizerConfig

classmethod gpt2()

Get a gpt2 tokenizer config.

Return type:

TokenizerConfig

classmethod from_hf(identifier)

Initialize a tokenizer config from a model on HuggingFace.

Parameters:

identifier (str) – The HF model identifier, e.g. “meta-llama/Llama-3.2-1B”.

Return type:

TokenizerConfig

class olmo_core.data.tokenizer.TokenizerLike(*args, **kwargs)

Bases: Protocol

class olmo_core.data.tokenizer.TokenizerName(value)

Bases: StrEnum

An enumeration of tokenizer identifiers commonly used by OLMo researchers.

dolma2 = 'allenai/dolma2-tokenizer'

The dolma2 tokenizer.

dolma2_sigdig = 'allenai/dolma2-tokenizer-sigdig'

The R2L dolma2 tokenizer.

gpt_neox_olmo_dolma_v1_5 = 'allenai/gpt-neox-olmo-dolma-v1_5'

A modified GPT NeoX tokenizer.

gpt2 = 'gpt2'

The base GPT2 tokenizer.
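Because the enumeration is string-valued, its members can be looked up by value and compared with plain identifier strings. The sketch below shows this with a hypothetical stand-in, TokenizerNameSketch, built on the standard library's Enum with a str mixin. It copies the member values listed above, but it is not the library's StrEnum class, so details such as string formatting may differ.

```python
from enum import Enum


class TokenizerNameSketch(str, Enum):
    """Hypothetical stand-in for the documented enumeration; not the library class."""

    dolma2 = "allenai/dolma2-tokenizer"
    dolma2_sigdig = "allenai/dolma2-tokenizer-sigdig"
    gpt_neox_olmo_dolma_v1_5 = "allenai/gpt-neox-olmo-dolma-v1_5"
    gpt2 = "gpt2"


# Look up a member from its identifier string.
name = TokenizerNameSketch("gpt2")

# Read the underlying identifier string from a member.
identifier = TokenizerNameSketch.dolma2.value

# With the str mixin, a member compares equal to its plain-string value.
matches = TokenizerNameSketch.dolma2 == "allenai/dolma2-tokenizer"

# Collect all identifiers in definition order.
all_identifiers = [member.value for member in TokenizerNameSketch]
```

A string-valued enumeration like this lets a member stand in wherever a plain identifier string is expected, such as the identifier field of a tokenizer config.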