Native Generation and Chat

OLMo-core includes a native generation module for autoregressive text generation with transformer models. This guide covers how to load a model from a checkpoint, generate text programmatically, and use the built-in interactive chat interface.

Loading a model from a checkpoint

The simplest way to get started is with from_checkpoint(), which loads a transformer model and its weights from a checkpoint directory:

from olmo_core.generate import GenerationConfig, TransformerGenerationModule

generation_config = GenerationConfig(
    pad_token_id=0,
    eos_token_id=1,
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.9,
)

generation_module = TransformerGenerationModule.from_checkpoint(
    checkpoint_dir="path/to/checkpoint",
    generation_config=generation_config,
)

The checkpoint must contain a config.json with the model architecture (model key) and tokenizer config (dataset.tokenizer key). If your checkpoint doesn’t include a config.json, you can pass the TransformerConfig explicitly:

from olmo_core.nn.transformer import TransformerConfig

generation_module = TransformerGenerationModule.from_checkpoint(
    checkpoint_dir="path/to/checkpoint",
    transformer_config=TransformerConfig.olmo2_7B(),
    generation_config=generation_config,
)
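To decide up front which of the two calls above you need, you can inspect a local checkpoint's config.json yourself. The following is a minimal sketch, not part of OLMo-core: the helper names `missing_keys` and `check_checkpoint` are hypothetical, and it assumes config.json sits directly inside the checkpoint directory and that `dataset.tokenizer` means a `tokenizer` entry nested under `dataset`.

```python
import json
from pathlib import Path

# Dotted keys the guide says config.json must provide.
REQUIRED_KEYS = ("model", "dataset.tokenizer")


def missing_keys(config):
    """Return the required dotted keys that are absent from a parsed config."""
    missing = []
    for dotted_key in REQUIRED_KEYS:
        node = config
        for part in dotted_key.split("."):
            if not isinstance(node, dict) or part not in node:
                missing.append(dotted_key)
                break
            node = node[part]
    return missing


def check_checkpoint(checkpoint_dir):
    """Check a local checkpoint directory (assumed layout: <dir>/config.json)."""
    config_path = Path(checkpoint_dir) / "config.json"
    if not config_path.is_file():
        return ["config.json"]
    return missing_keys(json.loads(config_path.read_text()))


print(missing_keys({"model": {}, "dataset": {"tokenizer": {}}}))  # []
print(missing_keys({"model": {}}))  # ['dataset.tokenizer']
```

If `check_checkpoint` reports `config.json` or `model` as missing, passing `transformer_config` explicitly is the route to take. The sketch only handles local paths, not checkpoint URLs.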

You can also control the model dtype and attention backend:

from olmo_core.config import DType
from olmo_core.nn.attention import AttentionBackendName

generation_module = TransformerGenerationModule.from_checkpoint(
    checkpoint_dir="path/to/checkpoint",
    generation_config=generation_config,
    dtype=DType.bfloat16,
    attention_backend=AttentionBackendName.torch,
)

Generating text

Use generate_batch() to generate token IDs from input prompts:

import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/dolma2-tokenizer")

# Encode a prompt
input_ids = tokenizer.encode("The capital of France is", return_tensors="pt")

# Generate
generated_ids, logits, logprobs = generation_module.generate_batch(
    input_ids,
    completions_only=True,
)

# Decode
text = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
print(text)

generate_batch returns a tuple of (generated_ids, logits, logprobs). By default logits and logprobs are None; set return_logits=True or return_logprobs=True to include them.
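To make the relationship between the two optional outputs concrete, here is a small pure-Python sketch of how per-token log-probabilities are conventionally derived from logits (a log-softmax followed by picking the chosen token). It illustrates the general technique only; the source does not say whether OLMo-core computes its logprobs before or after temperature scaling, so treat that detail as open.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of logits."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def token_logprobs(step_logits, token_ids):
    """Log-probability of the chosen token at each generation step."""
    return [log_softmax(logits)[tok] for logits, tok in zip(step_logits, token_ids)]


# Two generation steps over a toy 3-token vocabulary.
step_logits = [[2.0, 0.0, -1.0], [0.5, 0.5, 0.5]]
chosen = [0, 2]
logprobs = token_logprobs(step_logits, chosen)
print(logprobs)
```

In the second step all logits are equal, so the chosen token's log-probability is log(1/3).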

Setting completions_only=True returns only the newly generated tokens (excluding the prompt).

Batched generation

generate_batch accepts batched inputs. When prompts have different lengths, use left-padding and pass an attention mask:

prompts = ["Hello, world!", "The quick brown fox"]
encoded = tokenizer(
    prompts,
    return_tensors="pt",
    padding=True,
    padding_side="left",
)

generated_ids, _, _ = generation_module.generate_batch(
    encoded["input_ids"],
    attention_mask=encoded["attention_mask"],
    completions_only=True,
)

for i, ids in enumerate(generated_ids):
    print(f"Prompt: {prompts[i]}")
    print(f"Output: {tokenizer.decode(ids, skip_special_tokens=True)}")
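Left-padding matters because it aligns the last real token of every prompt in the final column, so each row's first generated token lands at the same position. The sketch below shows, on plain lists of toy token IDs, what the tokenizer call above produces and how the prompt columns can be dropped afterwards. The helpers `left_pad` and `completions` are hypothetical illustrations, not OLMo-core functions.

```python
def left_pad(sequences, pad_token_id):
    """Left-pad token ID lists to a common width and build the attention mask."""
    width = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = width - len(seq)
        input_ids.append([pad_token_id] * n_pad + list(seq))
        attention_mask.append([0] * n_pad + [1] * len(seq))
    return input_ids, attention_mask


def completions(full_sequences, prompt_width):
    """Drop the (padded) prompt columns, keeping only newly generated tokens."""
    return [seq[prompt_width:] for seq in full_sequences]


# Toy token IDs; 0 plays the role of pad_token_id.
prompts = [[11, 12, 13], [21]]
input_ids, attention_mask = left_pad(prompts, pad_token_id=0)
print(input_ids)       # [[11, 12, 13], [0, 0, 21]]
print(attention_mask)  # [[1, 1, 1], [0, 0, 1]]

# Pretend each row was extended by two generated tokens.
full = [input_ids[0] + [31, 32], input_ids[1] + [41, 42]]
print(completions(full, prompt_width=3))  # [[31, 32], [41, 42]]
```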

Generation configuration

GenerationConfig controls how tokens are selected. Key parameters:

max_new_tokens / max_length – limit the number of newly generated tokens and the total sequence length (prompt + generation), respectively.
do_sample – set to False for greedy (deterministic) decoding.
temperature – higher values produce more random outputs; 0.0 is equivalent to greedy.
top_k – restrict sampling to the top-k highest-probability tokens (-1 disables).
top_p – nucleus sampling; sample only from the smallest set of highest-probability tokens whose cumulative probability reaches this threshold.
use_cache – enable KV-cache for faster autoregressive decoding (enabled by default).
stop_token_ids – additional token IDs (beyond EOS) that stop generation.
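The interaction of temperature, top_k and top_p listed above can be sketched in a few lines of pure Python. This illustrates the standard techniques on a toy vocabulary and is not OLMo-core's implementation: the function names are hypothetical, and details such as the order of the filters or whether probabilities are renormalized between them vary between libraries.

```python
import math
import random


def filtered_probs(logits, temperature=1.0, top_k=-1, top_p=1.0):
    """Return {token_id: probability} after temperature, top-k and top-p filtering."""
    if temperature <= 0.0:
        # Zero temperature is treated as greedy: all mass on the argmax.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return {best: 1.0}

    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    ranked = sorted(
        ((i, w / total) for i, w in enumerate(weights)),
        key=lambda pair: pair[1],
        reverse=True,
    )

    if top_k > 0:  # -1 disables top-k
        ranked = ranked[:top_k]

    # Nucleus: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for token_id, prob in ranked:
        kept.append((token_id, prob))
        cumulative += prob
        if cumulative >= top_p:
            break

    norm = sum(prob for _, prob in kept)
    return {token_id: prob / norm for token_id, prob in kept}


def sample(logits, rng, **kwargs):
    """Draw one token ID from the filtered distribution."""
    probs = filtered_probs(logits, **kwargs)
    token_ids = list(probs)
    return rng.choices(token_ids, weights=[probs[t] for t in token_ids], k=1)[0]


# Toy distribution with probabilities of roughly 0.5, 0.3 and 0.2.
toy_logits = [math.log(0.5), math.log(0.3), math.log(0.2)]
print(sorted(filtered_probs(toy_logits, top_p=0.7)))  # [0, 1]
print(sample(toy_logits, random.Random(0), top_k=1))  # 0
```

With top_p=0.7 the third token is dropped because the first two already cover about 0.8 of the probability mass; the survivors are renormalized before sampling.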

You can override any generation parameter per call:

# Greedy decoding for this call only
generated_ids, _, _ = generation_module.generate_batch(
    input_ids,
    do_sample=False,
)
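One way to picture a per-call override is as a copy of the base configuration with a few fields replaced, leaving the base untouched for later calls. The sketch below uses a hypothetical stand-in dataclass (`SketchGenerationConfig` is not olmo_core's `GenerationConfig`) and the standard library's `dataclasses.replace`; how OLMo-core merges overrides internally is not stated in this guide.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SketchGenerationConfig:
    """Hypothetical stand-in, NOT olmo_core's GenerationConfig."""

    max_new_tokens: int = 256
    do_sample: bool = True
    temperature: float = 0.7
    top_p: float = 0.9


def resolve(base, **overrides):
    """Return a per-call config; the base config is left unchanged."""
    return replace(base, **overrides)


base = SketchGenerationConfig()
greedy = resolve(base, do_sample=False)
print(greedy.do_sample, base.do_sample)  # False True
```

Because `replace` rejects unknown field names with a `TypeError`, a typo in an override surfaces immediately instead of being silently ignored.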

Using the config-based API

For more structured setups, use TransformerGenerationModuleConfig:

from olmo_core.generate import GenerationConfig, TransformerGenerationModuleConfig

config = TransformerGenerationModuleConfig(
    generation_config=GenerationConfig(
        pad_token_id=0,
        eos_token_id=1,
        max_new_tokens=512,
        temperature=0.8,
        top_p=0.95,
    ),
    compile_model=True,
)

generation_module = config.build(
    checkpoint_dir="path/to/checkpoint",
)

Merging multiple checkpoints

from_checkpoints() averages the weights from multiple checkpoints before creating the generation module:

generation_module = TransformerGenerationModule.from_checkpoints(
    checkpoint_dirs=[
        "path/to/checkpoint1",
        "path/to/checkpoint2",
        "path/to/checkpoint3",
    ],
    generation_config=generation_config,
)
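Conceptually, averaging checkpoints means taking the element-wise mean of every parameter that shares a name. The following pure-Python sketch shows that idea on toy parameters stored as flat lists of floats. It assumes a uniform (unweighted) average and identical parameter names and shapes across checkpoints; `average_state_dicts` is a hypothetical helper, not an OLMo-core function.

```python
def average_state_dicts(state_dicts):
    """Uniformly average parameters that share the same name across checkpoints."""
    if not state_dicts:
        raise ValueError("need at least one checkpoint")
    names = state_dicts[0].keys()
    for state_dict in state_dicts[1:]:
        if state_dict.keys() != names:
            raise ValueError("checkpoints must have identical parameter names")
    count = len(state_dicts)
    return {
        # zip(*...) walks the same position of this parameter in every checkpoint.
        name: [sum(values) / count for values in zip(*(sd[name] for sd in state_dicts))]
        for name in names
    }


checkpoints = [
    {"w": [1.0, 2.0], "b": [0.0]},
    {"w": [3.0, 4.0], "b": [1.0]},
]
print(average_state_dicts(checkpoints))  # {'w': [2.0, 3.0], 'b': [0.5]}
```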

Interactive chat interface

OLMo-core ships with a CLI chatbot that wraps the generation module in an interactive loop with conversation history and chat template support.

Basic usage

python -m olmo_core.generate.chat path/to/checkpoint

This loads the model, auto-detects the tokenizer from the checkpoint’s config.json, and starts an interactive prompt.

Running on Mac (no Flash Attention)

If you’re running on a Mac (e.g. Apple Silicon) without Flash Attention, use the torch attention backend and disable the KV cache:

python -m olmo_core.generate.chat path/to/checkpoint \
    --attention-backend torch --no-use-cache

You can also use a public checkpoint URL directly:

python -m olmo_core.generate.chat \
    https://olmo-checkpoints.org/ai2-llm/Olmo-3-1025-7B/stage2/step47684/ \
    --attention-backend torch --no-use-cache

Customizing generation parameters

python -m olmo_core.generate.chat path/to/checkpoint \
    --max-new-tokens 512 \
    --temperature 0.7 \
    --top-p 0.9

# Greedy decoding
python -m olmo_core.generate.chat path/to/checkpoint \
    --no-do-sample

Chat templates

By default the chat interface concatenates messages without any special formatting. For models trained with a chat template (e.g. instruction-tuned models), pass a Jinja2 template string via --chat-template:

python -m olmo_core.generate.chat path/to/checkpoint \
    --chat-template "{% for message in messages %}<|{{ message['role'] }}|>{{ message['content'] }}{% endfor %}<|assistant|>"
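To see what prompt string the template in the command above produces, here is a plain-Python equivalent of that specific Jinja2 template (written without Jinja2 so it runs with the standard library alone). The function name `render_chat` and the sample messages are illustrative only.

```python
def render_chat(messages):
    """Plain-Python equivalent of the Jinja2 template in the command above."""
    body = "".join(f"<|{m['role']}|>{m['content']}" for m in messages)
    # The template ends with an open assistant tag so the model writes the reply.
    return body + "<|assistant|>"


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hi!"},
]
print(render_chat(messages))
# <|system|>You are a helpful assistant.<|user|>Hi!<|assistant|>
```

The role tags used here come from the example template; a real model expects whatever template it was trained with.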

System prompts

Provide a system prompt that is prepended to every conversation:

python -m olmo_core.generate.chat path/to/checkpoint \
    --system-prompt "You are a helpful assistant."

In-chat commands

While in the chat session:

/quit or /exit – exit the chatbot
/clear – clear conversation history
/help – show help
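The commands above suggest a simple dispatch step in front of the model call. The sketch below is a hypothetical illustration of that pattern, not the chatbot's actual code: `handle_input` and its return values are made up for this example, and details such as whether /clear preserves the system prompt are not stated in this guide.

```python
def handle_input(line, history):
    """Process one line of user input and return what happened.

    Returns "quit", "cleared", "help", or "message".
    """
    command = line.strip()
    if command in ("/quit", "/exit"):
        return "quit"
    if command == "/clear":
        history.clear()  # Forget all previous turns.
        return "cleared"
    if command == "/help":
        return "help"
    # Anything else is a normal user turn that would be sent to the model.
    history.append({"role": "user", "content": line})
    return "message"


history = []
print(handle_input("Hello there", history))  # message
print(handle_input("/clear", history))       # cleared
print(len(history))                          # 0
print(handle_input("/exit", history))        # quit
```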

All CLI options

| Flag | Default | Description |
| --- | --- | --- |
| `checkpoint_dir` | (required) | Path or URL to model checkpoint |
| `--max-new-tokens` | 1024 | Maximum tokens to generate per turn |
| `--max-length` | None | Maximum total length (prompt + generation); overrides `--max-new-tokens` |
| `--temperature` | 1.0 | Sampling temperature |
| `--top-k` | -1 | Top-k filtering (-1 = disabled) |
| `--top-p` | 0.7 | Nucleus sampling threshold |
| `--do-sample / --no-do-sample` | True | Enable/disable sampling |
| `--use-cache / --no-use-cache` | True | Enable/disable KV cache |
| `--attention-backend` | auto | Attention backend (e.g. `torch`, `flash`) |
| `--dtype` | bfloat16 | Model parameter dtype |
| `--system-prompt` | None | System prompt for the conversation |
| `--show-special-tokens` | False | Show special tokens in output |
| `--chat-template` | concatenate | Jinja2 chat template string |
| `--verbosity` | WARNING | Logging level |
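The table says `--max-length` counts prompt plus generation and overrides `--max-new-tokens`. One plausible reading of that rule is sketched below: when a total length is given, the budget for new tokens is whatever remains after the prompt. This is an interpretation of the table, not confirmed library behavior, and `new_token_budget` is a hypothetical helper.

```python
def new_token_budget(prompt_len, max_new_tokens=1024, max_length=None):
    """Number of tokens that may still be generated for a prompt of prompt_len tokens."""
    if max_length is None:
        return max_new_tokens
    # max_length is a total (prompt + generation), so subtract the prompt.
    return max(0, max_length - prompt_len)


print(new_token_budget(prompt_len=100))                   # 1024
print(new_token_budget(prompt_len=100, max_length=300))   # 200
print(new_token_budget(prompt_len=400, max_length=300))   # 0
```

In a multi-turn chat the prompt grows with the conversation history, so under this reading a fixed `--max-length` leaves less room for each successive reply.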