Add tvd_mi metric (LLM-as-a-judge, corpus-level)

Summary
Introduce a new corpus-level metric tvd_mi into lighteval. This metric implements the TVD-MI approach from the paper Let’s Measure Information Step-by-Step: LLM-Based Evaluation Beyond Vibes. It estimates a lower bound on the total-variation mutual information between model responses by asking an LLM critic to distinguish pairs of responses drawn from the same item from pairs drawn from different items.

Implementation

Sample-level judge

  • Adds JudgeLLMTVDMI (a subclass of JudgeLLM) configured with gpt-4o-2024-08-06 via the openai backend.
  • Implements prompt generation via get_judge_prompt_tvdmi(response_a, response_b, ...), which asks the judge to distinguish A: SAME TASK/SOURCE from B: DIFFERENT TASK/SOURCE.
  • Adds process_judge_response_tvdmi(...) to map judge responses to binary predictions: A → 1, B → 0, with case/whitespace normalization; unrecognized responses fall back to 0 with a warning.
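
For illustration, here is a minimal sketch of the two sample-level helpers, assuming the names given above; the exact prompt wording and the JudgeLLM plumbing belong to the actual implementation.

import logging

logger = logging.getLogger(__name__)


def get_judge_prompt_tvdmi(response_a: str, response_b: str, **kwargs) -> list[dict]:
    """Build the critic prompt asking whether two responses share the same task/source."""
    instruction = (
        "You will see two model responses. Decide whether they were produced for the "
        "SAME task/source or for DIFFERENT tasks/sources. Reply with exactly one letter.\n"
        "A: SAME TASK/SOURCE\n"
        "B: DIFFERENT TASK/SOURCE\n\n"
        f"Response 1:\n{response_a}\n\n"
        f"Response 2:\n{response_b}\n\n"
        "Answer:"
    )
    return [{"role": "user", "content": instruction}]


def process_judge_response_tvdmi(response: str) -> int:
    """Map the judge's reply to a binary prediction: A -> 1 (same), B -> 0 (different)."""
    answer = response.strip().upper()
    if answer.startswith("A"):
        return 1
    if answer.startswith("B"):
        return 0
    # Unrecognized output: fall back to 0 and warn, as described above.
    logger.warning("Could not parse judge response %r; falling back to 0", response)
    return 0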

Corpus-level aggregation

  • Adds CorpusLevelTVDMI, which accepts sample-dicts of the form { "label": 0 or 1, "pred": 0 or 1, … }.
  • Computes:

TVD_MI = TPR + TNR − 1

where TPR = P(pred=1 | label=1) and TNR = P(pred=0 | label=0).

  • If either class is missing (no label=1 or no label=0 samples), the metric returns NaN.
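
A minimal sketch of the aggregation under the sample-dict shape above; the real class additionally plugs into lighteval's corpus-level metric interface.

import math


class CorpusLevelTVDMI:
    """Aggregate paired-critic predictions into TVD-MI = TPR + TNR - 1."""

    def compute(self, items: list[dict]) -> float:
        positives = [it for it in items if it["label"] == 1]
        negatives = [it for it in items if it["label"] == 0]
        # Both classes must be present, otherwise the bound is undefined.
        if not positives or not negatives:
            return math.nan
        tpr = sum(it["pred"] == 1 for it in positives) / len(positives)
        tnr = sum(it["pred"] == 0 for it in negatives) / len(negatives)
        return tpr + tnr - 1

A perfect critic gives TPR = TNR = 1 and therefore 1.0, while a critic whose predictions are independent of the labels gives TPR + TNR ≈ 1 and therefore ≈ 0.0, which is what the tests below check.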

Metric registration

  • Extends the Metrics enum with a new entry:
      • metric_name = "tvd_mi"
      • sample_level_fn = JudgeLLMTVDMI()
      • corpus_level_fn = CorpusLevelTVDMI()
      • category = SamplingMethod.GENERATIVE
      • higher_is_better = True
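
Hypothetical sketch of the new enum entry, using exactly the fields listed above; the wrapper class name (CorpusLevelMetric) and the import locations are assumptions about lighteval's existing registration pattern, not a copy of the diff.

# Assumed wrapper and field layout; the values are taken from the bullet list above.
tvd_mi = CorpusLevelMetric(
    metric_name="tvd_mi",
    sample_level_fn=JudgeLLMTVDMI(),
    corpus_level_fn=CorpusLevelTVDMI(),
    category=SamplingMethod.GENERATIVE,
    higher_is_better=True,
)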

Tests

New file: tests/unit/metrics/test_tvd_mi.py, covering:

  • Prompt construction checks (both responses injected, expected prompt structure)
  • Response parser normalization and mapping tests
  • Corpus-level correctness: perfect critic → ~1.0, random critic → ~0.0, missing class → NaN
  • Judge computation wiring via monkey-patching (no live API calls), verifying output keys and labels
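
For illustration, a hedged sketch of the corpus-level correctness checks; the real test file may structure these differently, and CorpusLevelTVDMI here refers to the aggregation sketched above.

import math
import random


def test_tvd_mi_corpus_levels():
    metric = CorpusLevelTVDMI()  # imported from the metric module in the real tests

    # Perfect critic: prediction always equals the label -> TVD-MI = 1.0.
    perfect = [{"label": lab, "pred": lab} for lab in (0, 1) for _ in range(50)]
    assert metric.compute(perfect) == 1.0

    # Random critic: predictions independent of labels -> TVD-MI close to 0.
    rng = random.Random(0)
    noisy = [{"label": lab, "pred": rng.randint(0, 1)} for lab in (0, 1) for _ in range(500)]
    assert abs(metric.compute(noisy)) < 0.1

    # Missing class -> NaN.
    assert math.isnan(metric.compute([{"label": 1, "pred": 1}]))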

Documentation

Updates metric list (LLM-as-Judge section) with:

tvd_mi: Corpus-level LLM-as-a-judge metric that estimates a lower bound on total variation mutual information using paired responses. Assumes each example has two responses and a binary label (1 = same item, 0 = different), and computes TPR + TNR − 1.

Usage

To enable the metric in a task config:

metrics:
- name: tvd_mi

Assumes the task formatter yields docs with:

  • response_a: str
  • response_b: str
  • pair_label: int (1 or 0)
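
As an illustration, a hedged sketch of the pairing logic a task formatter might use to produce such docs; the function name and the grouped-responses input shape are hypothetical.

import random


def make_pair_docs(grouped_responses: list[list[str]], seed: int = 0) -> list[dict]:
    """Build paired docs from responses grouped by source item.

    Positives pair two responses to the same item (pair_label=1); negatives pair
    responses from different items (pair_label=0). Names and shapes are illustrative.
    """
    rng = random.Random(seed)
    docs = []
    for i, group in enumerate(grouped_responses):
        if len(group) >= 2:
            a, b = rng.sample(group, 2)
            docs.append({"response_a": a, "response_b": b, "pair_label": 1})
        if group and len(grouped_responses) >= 2:
            j = rng.choice([k for k in range(len(grouped_responses)) if k != i])
            if grouped_responses[j]:
                docs.append(
                    {"response_a": group[0], "response_b": grouped_responses[j][0], "pair_label": 0}
                )
    return docs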

Validation

  1. Unit tests passed:
pytest tests/unit/metrics/test_tvd_mi.py -q
pytest tests/unit/metrics -q \
    --ignore=tests/unit/metrics/test_metric_requests.py \
    -k "not extractiveness"
  2. Manual smoke test performed locally with synthetic pairs and a live judge: sample-level outputs and the corpus-level result behaved as expected.

Ready for review.
Feedback welcome on naming, default judge model, and placement within the metrics taxonomy.
