eval.evaluator

class olmo_core.eval.evaluator.Evaluator(*, name, batches=None, batches_factory=None, device=None, deterministic=True)

Bases: object

Base class for in-loop evaluators.

See also

This can be used with an EvaluatorCallback to run an evaluator within the training loop.

Parameters:
  • name (str) – A name to assign to the evaluator.

  • batches (Optional[Iterable[Dict[str, Any]]], default: None) – An iterable of batches for the evaluator. Each batch should include at least the “input_ids” field and may contain any other fields as well.

  • batches_factory (Optional[Callable[[], Iterable[Dict[str, Any]]]], default: None) – A callable that returns an iterable over batches. This is an alternative to providing the batches argument directly.

  • device (Optional[device], default: None) – The device to compute/reduce metrics on.

  • deterministic (bool, default: True) – When True and batches is a DataLoaderBase, each evaluation pass resets the data loader and reshuffles with epoch=1 so repeated evals read the same batches in the same order. This is useful when eval loops are truncated via Duration. When False, the data loader still resets to batch 0 before each pass, but reshuffles without pinning the epoch so the batch order may change between eval runs. This does not implement a moving window across evals; if an eval is truncated, different reshuffles may result in different instances being evaluated each time.
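To illustrate the difference between batches and batches_factory, here is a minimal sketch in plain Python. The function make_batches and the toy token IDs are hypothetical, and the sketch does not call olmo_core; how Evaluator consumes the factory internally is not described in this section. It only shows one plausible reason to prefer a factory: a one-shot iterable such as a generator is exhausted after a single pass, while a factory can hand out a fresh iterable for each evaluation pass.

```python
from typing import Any, Dict, Iterator


def make_batches() -> Iterator[Dict[str, Any]]:
    """Hypothetical factory: yields toy batches that carry an "input_ids" field."""
    for start in (0, 4):
        # Each batch holds two sequences of two made-up token IDs.
        yield {"input_ids": [[start, start + 1], [start + 2, start + 3]]}


# Passing a generator object directly: it can only be iterated once.
one_shot = make_batches()
first_pass = list(one_shot)
second_pass = list(one_shot)  # empty, the generator is already exhausted

# Calling the factory again yields a fresh iterable with the same batches.
fresh_first = list(make_batches())
fresh_second = list(make_batches())
```

A list or a data loader that can be re-iterated would not have this limitation, so the factory form matters mainly for lazily generated batches.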

property total_batches: int | None

Get the total number of batches in an eval loop if it’s known ahead of time.

abstract update_metrics(batch, ce_loss, logits)

Update metrics from the batch just processed and the corresponding logits.

Parameters:
  • batch (Dict[str, Any]) – A batch generated from batches.

  • ce_loss (Optional[Tensor]) – The un-reduced, per-token cross-entropy loss of the batch. This will have shape (batch_size, seq_len - 1).

  • logits (Optional[Tensor]) – The logits generated from the forward pass of the model.

Return type:

None
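The ce_loss shape of (batch_size, seq_len - 1) is consistent with next-token prediction, where position t is scored against the token at position t + 1 and the final position has no target. The sketch below assumes that convention (the source does not state how the loss is computed) and uses nested Python lists in place of tensors. The helper per_token_ce_loss is hypothetical and not part of olmo_core.

```python
import math
from typing import List


def per_token_ce_loss(
    input_ids: List[List[int]], logits: List[List[List[float]]]
) -> List[List[float]]:
    """Un-reduced next-token cross-entropy, one value per predicted token."""
    losses = []
    for ids, seq_logits in zip(input_ids, logits):
        row = []
        # Position t predicts token t + 1, so only seq_len - 1 positions have a target.
        for t in range(len(ids) - 1):
            scores = seq_logits[t]
            # Numerically stable log-sum-exp over the vocabulary.
            m = max(scores)
            log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
            row.append(log_norm - scores[ids[t + 1]])
        losses.append(row)
    return losses


# Toy batch: 2 sequences of length 3, vocabulary of 4 tokens, all-zero logits.
input_ids = [[0, 1, 2], [3, 2, 1]]
logits = [[[0.0] * 4 for _ in range(3)] for _ in range(2)]
ce_loss = per_token_ce_loss(input_ids, logits)
```

With uniform logits over 4 tokens, every entry equals log(4), and the result has 2 rows of 2 values each, matching (batch_size, seq_len - 1).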

abstract compute_metrics()

Compute the final value of the metrics for the current evaluation loop. The metrics returned should already be reduced, if needed.

Return type:

Dict[str, Tensor]

abstract reset_metrics()

Reset metrics. Should be called after compute_metrics().

Return type:

None
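The three abstract methods form an update, compute, reset cycle. The following is a minimal sketch of that cycle, not the library's implementation: EvaluatorSketch is a hypothetical stand-in that mirrors only the method names documented above, MeanLossEvaluator and the loss values are made up, and plain floats stand in for tensors. A real evaluator would subclass olmo_core.eval.evaluator.Evaluator and return Dict[str, Tensor] from compute_metrics().

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Iterable, List, Optional


class EvaluatorSketch(ABC):
    """Hypothetical stand-in for the documented interface (not olmo_core's class)."""

    def __init__(self, *, name: str, batches: Iterable[Dict[str, Any]]):
        self.name = name
        self.batches = batches

    @abstractmethod
    def update_metrics(
        self,
        batch: Dict[str, Any],
        ce_loss: Optional[List[List[float]]],
        logits: Optional[Any],
    ) -> None: ...

    @abstractmethod
    def compute_metrics(self) -> Dict[str, float]: ...

    @abstractmethod
    def reset_metrics(self) -> None: ...


class MeanLossEvaluator(EvaluatorSketch):
    """Accumulates per-token losses and reports their mean."""

    def __init__(self, *, name: str, batches: Iterable[Dict[str, Any]]):
        super().__init__(name=name, batches=batches)
        self.reset_metrics()

    def update_metrics(self, batch, ce_loss, logits):
        if ce_loss is None:  # ce_loss is optional in the documented signature
            return
        for row in ce_loss:
            self.loss_sum += sum(row)
            self.num_tokens += len(row)

    def compute_metrics(self):
        # Guard against division by zero when no tokens were seen.
        return {"mean_ce_loss": self.loss_sum / max(self.num_tokens, 1)}

    def reset_metrics(self):
        self.loss_sum = 0.0
        self.num_tokens = 0


batches = [{"input_ids": [[1, 2, 3]]}, {"input_ids": [[4, 5, 6]]}]
fake_losses = [[[1.0, 3.0]], [[2.0, 2.0]]]  # made-up per-token losses

evaluator = MeanLossEvaluator(name="toy", batches=batches)
for batch, ce_loss in zip(evaluator.batches, fake_losses):
    evaluator.update_metrics(batch, ce_loss, logits=None)
metrics = evaluator.compute_metrics()
evaluator.reset_metrics()  # clear state so the next eval pass starts clean
```

Keeping running sums rather than per-batch means ensures that batches with different token counts are weighted correctly in the final average.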