data.tokenizer
- class olmo_core.data.tokenizer.TokenizerConfig(vocab_size, eos_token_id, pad_token_id, bos_token_id=None, identifier=None)
Bases: Config
A configuration class that represents a tokenizer.
- identifier: Optional[str] = None
The identifier of the tokenizer, either a path or a HuggingFace identifier.
- padded_vocab_size(pad_multiple=128)
Returns the vocab size padded to be a multiple of pad_multiple. Sizing the model's embeddings to this number can increase throughput.
Return type:
- classmethod dolma2_sigdig()
Get a dolma2_sigdig tokenizer config.
Return type:
- classmethod gpt_neox_olmo_dolma_v1_5()
Get a gpt_neox_olmo_dolma_v1_5 tokenizer config.
Return type:
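The rounding described for padded_vocab_size can be illustrated with a small standalone sketch. The helper name pad_to_multiple is hypothetical and is not part of olmo_core; it only shows the arithmetic of rounding a vocab size up to the next multiple of pad_multiple, and the library's actual implementation may differ.

```python
def pad_to_multiple(vocab_size: int, pad_multiple: int = 128) -> int:
    """Round vocab_size up to the nearest multiple of pad_multiple.

    Hypothetical helper for illustration only; not the olmo_core API.
    """
    if pad_multiple <= 0:
        raise ValueError("pad_multiple must be positive")
    # Ceiling division (via negated floor division), then scale back up.
    return -(-vocab_size // pad_multiple) * pad_multiple


# A vocab size that is not a multiple of 128 is rounded up.
print(pad_to_multiple(50257))  # 50304
# A vocab size that is already a multiple of 128 is left unchanged.
print(pad_to_multiple(50304))  # 50304
```

The extra embedding rows beyond the true vocab size are never indexed by real token IDs; they exist only so the embedding dimension lines up with a hardware-friendly multiple.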
- class olmo_core.data.tokenizer.TokenizerName(value)
Bases: StrEnum
An enumeration of tokenizer identifiers commonly used by OLMo researchers.
- dolma2 = 'allenai/dolma2-tokenizer'
The dolma2 tokenizer.
- dolma2_sigdig = 'allenai/dolma2-tokenizer-sigdig'
The R2L dolma2 tokenizer.
- gpt_neox_olmo_dolma_v1_5 = 'allenai/gpt-neox-olmo-dolma-v1_5'
A modified GPT NeoX tokenizer.
- gpt2 = 'gpt2'
The base GPT2 tokenizer.
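As a rough illustration of how a string-valued enumeration like this behaves, the sketch below defines a stand-in class, TokenizerNameSketch, with two of the values listed above. The stand-in is an assumption made so that the example runs without olmo_core installed; it uses Python's standard (str, Enum) pattern rather than the library's own StrEnum base.

```python
from enum import Enum


class TokenizerNameSketch(str, Enum):
    """Stand-in for illustration only; not the olmo_core class."""

    dolma2 = "allenai/dolma2-tokenizer"
    gpt2 = "gpt2"


# Look a member up by its string value.
name = TokenizerNameSketch("gpt2")
print(name is TokenizerNameSketch.gpt2)  # True

# Members compare equal to their plain string values, so they can be
# passed where an identifier string is expected.
print(TokenizerNameSketch.dolma2 == "allenai/dolma2-tokenizer")  # True

# The raw identifier string is available through .value.
identifier = TokenizerNameSketch.dolma2.value
print(identifier)  # allenai/dolma2-tokenizer
```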