Add tvd_mi metric (LLM-as-a-judge, corpus-level)

Summary
Introduce a new corpus-level metric tvd_mi into lighteval. This metric implements the TVD-MI approach from the paper Let’s Measure Information Step-by-Step: LLM-Based Evaluation Beyond Vibes. It estimates a lower bound on the total-variation mutual information between model responses by asking an LLM critic to distinguish pairs of responses drawn from the same item from pairs drawn from different items.

Implementation

Sample-level judge

  • Adds JudgeLLMTVDMI (a subclass of JudgeLLM) configured with gpt-4o-2024-08-06 via the openai backend.
  • Implements prompt generation via get_judge_prompt_tvdmi(response_a, response_b, ...), which asks the judge to distinguish A: SAME TASK/SOURCE from B: DIFFERENT TASK/SOURCE.
  • Adds process_judge_response_tvdmi(...) to map judge responses to binary predictions: A → 1, B → 0, with case/whitespace normalization; unrecognized responses fall back to 0 with a warning.
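
For illustration, here is a minimal sketch of the two sample-level helpers, assuming the names given above; the exact prompt wording and the JudgeLLM plumbing belong to the actual implementation.

import logging

logger = logging.getLogger(__name__)


def get_judge_prompt_tvdmi(response_a: str, response_b: str, **kwargs) -> list[dict]:
    """Build the critic prompt asking whether two responses share the same task/source."""
    instruction = (
        "You will see two model responses. Decide whether they were produced for the "
        "SAME task/source or for DIFFERENT tasks/sources. Reply with exactly one letter.\n"
        "A: SAME TASK/SOURCE\n"
        "B: DIFFERENT TASK/SOURCE\n\n"
        f"Response 1:\n{response_a}\n\n"
        f"Response 2:\n{response_b}\n\n"
        "Answer:"
    )
    return [{"role": "user", "content": instruction}]


def process_judge_response_tvdmi(response: str) -> int:
    """Map the judge's reply to a binary prediction: A -> 1 (same), B -> 0 (different)."""
    answer = response.strip().upper()
    if answer.startswith("A"):
        return 1
    if answer.startswith("B"):
        return 0
    # Unrecognized output: fall back to 0 and warn, as described above.
    logger.warning("Could not parse judge response %r; falling back to 0", response)
    return 0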

Corpus-level aggregation

  • Adds CorpusLevelTVDMI, which accepts sample-dicts of the form { "label": 0 or 1, "pred": 0 or 1, … }.
  • Computes:

TVD_MI = TPR + TNR − 1

where TPR = P(pred=1 | label=1) and TNR = P(pred=0 | label=0).

  • If either class is missing (no label=1 or no label=0 samples), the metric returns NaN.
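
A minimal sketch of the aggregation under the sample-dict shape above; the real class additionally plugs into lighteval's corpus-level metric interface.

import math


class CorpusLevelTVDMI:
    """Aggregate paired-critic predictions into TVD-MI = TPR + TNR - 1."""

    def compute(self, items: list[dict]) -> float:
        positives = [it for it in items if it["label"] == 1]
        negatives = [it for it in items if it["label"] == 0]
        # Both classes must be present, otherwise the bound is undefined.
        if not positives or not negatives:
            return math.nan
        tpr = sum(it["pred"] == 1 for it in positives) / len(positives)
        tnr = sum(it["pred"] == 0 for it in negatives) / len(negatives)
        return tpr + tnr - 1

A perfect critic gives TPR = TNR = 1 and therefore 1.0, while a critic whose predictions are independent of the labels gives TPR + TNR ≈ 1 and therefore ≈ 0.0, which is what the tests below check.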

Metric registration

  • Extends the Metrics enum with a new entry:
      • metric_name = "tvd_mi"
      • sample_level_fn = JudgeLLMTVDMI()
      • corpus_level_fn = CorpusLevelTVDMI()
      • category = SamplingMethod.GENERATIVE
      • higher_is_better = True
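
Hypothetical sketch of the new enum entry, using exactly the fields listed above; the wrapper class name (CorpusLevelMetric) and the import locations are assumptions about lighteval's existing registration pattern, not a copy of the diff.

# Assumed wrapper and field layout; the values are taken from the bullet list above.
tvd_mi = CorpusLevelMetric(
    metric_name="tvd_mi",
    sample_level_fn=JudgeLLMTVDMI(),
    corpus_level_fn=CorpusLevelTVDMI(),
    category=SamplingMethod.GENERATIVE,
    higher_is_better=True,
)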

Tests

New file: tests/unit/metrics/test_tvd_mi.py, covering:

  • Prompt construction checks (both responses injected, expected prompt structure)
  • Response parser normalization and mapping tests
  • Corpus-level correctness: perfect critic → ~1.0, random critic → ~0.0, missing class → NaN
  • Judge computation wiring via monkey-patching (no live API calls), verifying output keys and labels
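
For illustration, a hedged sketch of the corpus-level correctness checks; the real test file may structure these differently, and CorpusLevelTVDMI here refers to the aggregation sketched above.

import math
import random


def test_tvd_mi_corpus_levels():
    metric = CorpusLevelTVDMI()  # imported from the metric module in the real tests

    # Perfect critic: prediction always equals the label -> TVD-MI = 1.0.
    perfect = [{"label": lab, "pred": lab} for lab in (0, 1) for _ in range(50)]
    assert metric.compute(perfect) == 1.0

    # Random critic: predictions independent of labels -> TVD-MI close to 0.
    rng = random.Random(0)
    noisy = [{"label": lab, "pred": rng.randint(0, 1)} for lab in (0, 1) for _ in range(500)]
    assert abs(metric.compute(noisy)) < 0.1

    # Missing class -> NaN.
    assert math.isnan(metric.compute([{"label": 1, "pred": 1}]))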

Documentation

Updates metric list (LLM-as-Judge section) with:

tvd_mi: Corpus-level LLM-as-a-judge metric that estimates a lower bound on total variation mutual information using paired responses. Assumes each example has two responses and a binary label (1 = same item, 0 = different), and computes TPR + TNR − 1.

Usage

To enable the metric in a task config:

metrics:
- name: tvd_mi

Assumes the task formatter yields docs with:

  • response_a: str
  • response_b: str
  • pair_label: int (1 or 0)
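
As an illustration, a hedged sketch of the pairing logic a task formatter might use to produce such docs; the function name and the grouped-responses input shape are hypothetical.

import random


def make_pair_docs(grouped_responses: list[list[str]], seed: int = 0) -> list[dict]:
    """Build paired docs from responses grouped by source item.

    Positives pair two responses to the same item (pair_label=1); negatives pair
    responses from different items (pair_label=0). Names and shapes are illustrative.
    """
    rng = random.Random(seed)
    docs = []
    for i, group in enumerate(grouped_responses):
        if len(group) >= 2:
            a, b = rng.sample(group, 2)
            docs.append({"response_a": a, "response_b": b, "pair_label": 1})
        if group and len(grouped_responses) >= 2:
            j = rng.choice([k for k in range(len(grouped_responses)) if k != i])
            if grouped_responses[j]:
                docs.append(
                    {"response_a": group[0], "response_b": grouped_responses[j][0], "pair_label": 0}
                )
    return docs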

Validation

  1. Unit tests passed:
pytest tests/unit/metrics/test_tvd_mi.py -q
pytest tests/unit/metrics -q \
    --ignore=tests/unit/metrics/test_metric_requests.py \
    -k "not extractiveness"
  2. Manual smoke test performed locally with synthetic pairs and a live judge: sample-level outputs and the corpus-level result behaved as expected.

Ready for review.
Feedback welcome on naming, default judge model, and placement within the metrics taxonomy.
