Native Generation and Chat¶
OLMo-core includes a native generation module for autoregressive text generation with transformer models. This guide covers how to load a model from a checkpoint, generate text programmatically, and use the built-in interactive chat interface.
Loading a model from a checkpoint¶
The simplest way to get started is with from_checkpoint(),
which loads a transformer model and its weights from a checkpoint directory:
from olmo_core.generate import GenerationConfig, TransformerGenerationModule

generation_config = GenerationConfig(
    pad_token_id=0,
    eos_token_id=1,
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.9,
)

generation_module = TransformerGenerationModule.from_checkpoint(
    checkpoint_dir="path/to/checkpoint",
    generation_config=generation_config,
)
The checkpoint must contain a config.json with the model architecture (model key) and tokenizer
config (dataset.tokenizer key). If your checkpoint doesn’t include a config.json, you can pass
the TransformerConfig explicitly:
from olmo_core.nn.transformer import TransformerConfig

generation_module = TransformerGenerationModule.from_checkpoint(
    checkpoint_dir="path/to/checkpoint",
    transformer_config=TransformerConfig.olmo2_7B(),
    generation_config=generation_config,
)
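Before calling from_checkpoint(), it can help to check whether a checkpoint directory carries the two keys described above, so you know whether to pass a TransformerConfig yourself. The following is a minimal sketch using only the standard library. The helper name has_generation_config is hypothetical (it is not part of OLMo-core), and it assumes a local checkpoint directory with config.json at its top level.

```python
import json
from pathlib import Path


def has_generation_config(checkpoint_dir: str) -> bool:
    """Return True if config.json exists and has `model` and `dataset.tokenizer` keys.

    Hypothetical helper for illustration; assumes a local directory.
    """
    config_path = Path(checkpoint_dir) / "config.json"
    if not config_path.is_file():
        return False
    with config_path.open() as f:
        config = json.load(f)
    if not isinstance(config, dict):
        return False
    dataset = config.get("dataset")
    has_model = "model" in config
    has_tokenizer = isinstance(dataset, dict) and "tokenizer" in dataset
    return has_model and has_tokenizer
```

If this returns False, passing transformer_config explicitly, as shown above, is the documented fallback.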
You can also control the model dtype and attention backend:
from olmo_core.config import DType
from olmo_core.nn.attention import AttentionBackendName

generation_module = TransformerGenerationModule.from_checkpoint(
    checkpoint_dir="path/to/checkpoint",
    generation_config=generation_config,
    dtype=DType.bfloat16,
    attention_backend=AttentionBackendName.torch,
)
Generating text¶
Use generate_batch() to generate token IDs from
input prompts:
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/dolma2-tokenizer")

# Encode a prompt
input_ids = tokenizer.encode("The capital of France is", return_tensors="pt")

# Generate
generated_ids, logits, logprobs = generation_module.generate_batch(
    input_ids,
    completions_only=True,
)

# Decode
text = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
print(text)
generate_batch returns a tuple of (generated_ids, logits, logprobs). By default logits
and logprobs are None; set return_logits=True or return_logprobs=True to include them.
Setting completions_only=True returns only the newly generated tokens (excluding the prompt).
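To make the effect of completions_only concrete, the sketch below strips the prompt from a full sequence by hand. It uses plain Python lists in place of tensors, and the helper name strip_prompt is hypothetical. It assumes every row in the batch shares the same (padded) prompt length, so the completion starts at the same index in each row.

```python
def strip_prompt(full_ids: list[list[int]], prompt_len: int) -> list[list[int]]:
    """Drop the first `prompt_len` tokens of every row, keeping only new tokens."""
    return [row[prompt_len:] for row in full_ids]


# Two rows, each with a 2-token prompt followed by 3 generated tokens.
full = [[0, 5, 6, 7, 8], [3, 4, 9, 10, 1]]
completions = strip_prompt(full, prompt_len=2)
print(completions)  # [[6, 7, 8], [9, 10, 1]]
```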
Batched generation¶
generate_batch accepts batched inputs. When prompts have different lengths, use left-padding and
pass an attention mask:
prompts = ["Hello, world!", "The quick brown fox"]

encoded = tokenizer(
    prompts,
    return_tensors="pt",
    padding=True,
    padding_side="left",
)

generated_ids, _, _ = generation_module.generate_batch(
    encoded["input_ids"],
    attention_mask=encoded["attention_mask"],
    completions_only=True,
)

for i, ids in enumerate(generated_ids):
    print(f"Prompt: {prompts[i]}")
    print(f"Output: {tokenizer.decode(ids, skip_special_tokens=True)}")
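If you want to see what left-padding does to a batch, here is a minimal sketch in plain Python. The helper left_pad is hypothetical and works on lists of token IDs rather than tensors. The point it illustrates is that, with padding on the left, the last position of every row is a real prompt token, so newly generated tokens follow the prompt directly.

```python
def left_pad(sequences, pad_token_id):
    """Left-pad token ID lists to a common length and build the attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        # Padding goes in front; the mask is 0 over padding and 1 over real tokens.
        input_ids.append([pad_token_id] * n_pad + list(seq))
        attention_mask.append([0] * n_pad + [1] * len(seq))
    return input_ids, attention_mask


ids, mask = left_pad([[11, 12, 13], [21]], pad_token_id=0)
print(ids)   # [[11, 12, 13], [0, 0, 21]]
print(mask)  # [[1, 1, 1], [0, 0, 1]]
```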
Generation configuration¶
GenerationConfig controls how tokens are selected. Key parameters:
- max_new_tokens / max_length – limits the number of generated tokens.
- do_sample – set to False for greedy (deterministic) decoding.
- temperature – higher values produce more random outputs; 0.0 is equivalent to greedy.
- top_k – restrict sampling to the top-k highest-probability tokens (-1 disables).
- top_p – nucleus sampling; sample only from the smallest set of highest-probability tokens whose cumulative probability reaches this threshold.
- use_cache – enable KV-cache for faster autoregressive decoding (enabled by default).
- stop_token_ids – additional token IDs (beyond EOS) that stop generation.
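The sampling parameters above can be sketched in plain Python to show how they shape the distribution a token is drawn from. This is an illustration, not OLMo-core's implementation: the function name filtered_probs is hypothetical, and the order used here (temperature, then top-k, then top-p) is an assumption that this guide does not confirm.

```python
import math


def filtered_probs(logits, temperature=1.0, top_k=-1, top_p=1.0):
    """Return {token_id: probability} after temperature, top-k and top-p filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use greedy decoding instead")

    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank token IDs from most to least probable; top-k keeps the first k.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]
    mass = sum(probs[i] for i in order)

    # Top-p keeps the smallest prefix whose renormalized cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i] / mass
        if cumulative >= top_p:
            break

    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}


dist = filtered_probs([2.0, 1.0, 0.0, -1.0], top_p=0.8)
print(sorted(dist))  # [0, 1]
```

In the example, the two most likely tokens together carry about 88% of the probability mass, which is the smallest set that reaches the 0.8 threshold, so only they remain candidates.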
You can override any generation parameter per call:
# Greedy decoding for this call only
generated_ids, _, _ = generation_module.generate_batch(
    input_ids,
    do_sample=False,
)
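One way to think about per-call overrides is as a copy of the base configuration with a few fields replaced, leaving the original untouched for later calls. The sketch below shows that idea with a stand-in dataclass. SketchConfig and resolve are hypothetical names for illustration and say nothing about how OLMo-core implements overrides internally.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SketchConfig:
    """Stand-in for a generation config; not the real GenerationConfig."""

    max_new_tokens: int = 256
    temperature: float = 0.7
    do_sample: bool = True


def resolve(base: SketchConfig, **overrides) -> SketchConfig:
    """Return a copy of `base` with the given fields replaced for one call."""
    return replace(base, **overrides)


base = SketchConfig()
greedy = resolve(base, do_sample=False)
print(base.do_sample, greedy.do_sample)  # True False
```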
Using the config-based API¶
For more structured setups, use TransformerGenerationModuleConfig:
from olmo_core.generate import GenerationConfig, TransformerGenerationModuleConfig

config = TransformerGenerationModuleConfig(
    generation_config=GenerationConfig(
        pad_token_id=0,
        eos_token_id=1,
        max_new_tokens=512,
        temperature=0.8,
        top_p=0.95,
    ),
    compile_model=True,
)

generation_module = config.build(
    checkpoint_dir="path/to/checkpoint",
)
Merging multiple checkpoints¶
from_checkpoints() averages the weights from
multiple checkpoints before creating the generation module:
generation_module = TransformerGenerationModule.from_checkpoints(
    checkpoint_dirs=[
        "path/to/checkpoint1",
        "path/to/checkpoint2",
        "path/to/checkpoint3",
    ],
    generation_config=generation_config,
)
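Conceptually, averaging checkpoints means taking the element-wise mean of each parameter across all of them. The sketch below shows this on toy state dicts whose parameters are flat lists of floats. The function average_state_dicts is hypothetical, and uniform (unweighted) averaging is an assumption; this guide only says that the weights are averaged.

```python
def average_state_dicts(state_dicts):
    """Element-wise mean of parameters across state dicts with identical keys."""
    if not state_dicts:
        raise ValueError("need at least one state dict")
    keys = state_dicts[0].keys()
    for sd in state_dicts[1:]:
        if sd.keys() != keys:
            raise ValueError("all checkpoints must have the same parameter names")
    n = len(state_dicts)
    return {
        # zip(*...) lines up the i-th element of this parameter from every checkpoint.
        key: [sum(values) / n for values in zip(*(sd[key] for sd in state_dicts))]
        for key in keys
    }


a = {"w": [1.0, 2.0]}
b = {"w": [3.0, 4.0]}
print(average_state_dicts([a, b]))  # {'w': [2.0, 3.0]}
```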
Interactive chat interface¶
OLMo-core ships with a CLI chatbot that wraps the generation module in an interactive loop with conversation history and chat template support.
Basic usage¶
python -m olmo_core.generate.chat path/to/checkpoint
This loads the model, auto-detects the tokenizer from the checkpoint’s config.json, and starts
an interactive prompt.
Running on Mac (no Flash Attention)¶
If you’re running on a Mac (e.g. Apple Silicon) without Flash Attention, use the torch attention
backend and disable the KV cache:
python -m olmo_core.generate.chat path/to/checkpoint \
    --attention-backend torch --no-use-cache
You can also use a public checkpoint URL directly:
python -m olmo_core.generate.chat \
    https://olmo-checkpoints.org/ai2-llm/Olmo-3-1025-7B/stage2/step47684/ \
    --attention-backend torch --no-use-cache
Customizing generation parameters¶
python -m olmo_core.generate.chat path/to/checkpoint \
    --max-new-tokens 512 \
    --temperature 0.7 \
    --top-p 0.9

# Greedy decoding
python -m olmo_core.generate.chat path/to/checkpoint \
    --no-do-sample
Chat templates¶
By default the chat interface concatenates messages without any special formatting. For models
trained with a chat template (e.g. instruction-tuned models), pass a Jinja2 template string
via --chat-template:
python -m olmo_core.generate.chat path/to/checkpoint \
    --chat-template "{% for message in messages %}<|{{ message['role'] }}|>{{ message['content'] }}{% endfor %}<|assistant|>"
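To see what the example template above produces, here is a plain-Python equivalent of that one template. It assumes, as the template itself does, that each message is a mapping with role and content keys; the function name render_example_template is hypothetical.

```python
def render_example_template(messages):
    """Mirror the example Jinja2 template: <|role|>content per message, then <|assistant|>."""
    rendered = "".join(f"<|{m['role']}|>{m['content']}" for m in messages)
    # The trailing tag cues the model to produce the assistant's next turn.
    return rendered + "<|assistant|>"


messages = [
    {"role": "user", "content": "Hi!"},
    {"role": "assistant", "content": "Hello!"},
    {"role": "user", "content": "What is OLMo?"},
]
print(render_example_template(messages))
# <|user|>Hi!<|assistant|>Hello!<|user|>What is OLMo?<|assistant|>
```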
System prompts¶
Provide a system prompt that is prepended to every conversation:
python -m olmo_core.generate.chat path/to/checkpoint \
    --system-prompt "You are a helpful assistant."
In-chat commands¶
While in the chat session:
- /quit or /exit – exit the chatbot
- /clear – clear conversation history
- /help – show help
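The interaction between conversation history, the system prompt, and these commands can be sketched as a single-turn handler. This is an illustrative design, not the actual implementation of the chat module: handle_turn and generate_fn are hypothetical names, and keeping the system prompt outside the history (so that /clear does not drop it) is an assumption.

```python
def handle_turn(history, user_input, generate_fn, system_prompt=None):
    """Process one line of input and return (history, reply, keep_running)."""
    command = user_input.strip().lower()
    if command in ("/quit", "/exit"):
        return history, None, False
    if command == "/clear":
        return [], None, True
    if command == "/help":
        return history, "Commands: /quit, /exit, /clear, /help", True

    # Build the full message list: system prompt first, then prior turns, then the new input.
    messages = []
    if system_prompt is not None:
        messages.append({"role": "system", "content": system_prompt})
    user_message = {"role": "user", "content": user_input}
    messages += history + [user_message]

    reply = generate_fn(messages)
    new_history = history + [user_message, {"role": "assistant", "content": reply}]
    return new_history, reply, True
```

Here generate_fn stands in for whatever turns a list of messages into a reply, for example by rendering a chat template, tokenizing, and calling generate_batch.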
All CLI options¶
| Flag | Default | Description |
|---|---|---|
| (positional argument) | (required) | Path or URL to model checkpoint |
| --max-new-tokens | 1024 | Maximum tokens to generate per turn |
| | None | Maximum total length (prompt + generation); overrides |
| --temperature | 1.0 | Sampling temperature |
| | -1 | Top-k filtering (-1 = disabled) |
| --top-p | 0.7 | Nucleus sampling threshold |
| --no-do-sample | True | Enable/disable sampling |
| --no-use-cache | True | Enable/disable KV cache |
| --attention-backend | auto | Attention backend (e.g. torch) |
| | bfloat16 | Model parameter dtype |
| --system-prompt | None | System prompt for the conversation |
| | False | Show special tokens in output |
| --chat-template | concatenate | Jinja2 chat template string |
| | WARNING | Logging level |