# Add `tvd_mi` metric (LLM-as-a-judge, corpus-level)

## Summary
Introduce a new corpus-level metric, `tvd_mi`, into lighteval. The metric implements the TVD-MI approach from the paper *Let's Measure Information Step-by-Step: LLM-Based Evaluation Beyond Vibes*. It estimates a lower bound on the total variation mutual information between model responses, using paired responses and an LLM critic.

## Implementation
### Sample-level judge

- `JudgeLLMTVDMI` (subclass of `JudgeLLM`), configured with `gpt-4o-2024-08-06` via the `openai` backend.
- `get_judge_prompt_tvdmi(response_a, response_b, ...)`, which asks the judge to distinguish A: SAME TASK/SOURCE vs. B: DIFFERENT TASK/SOURCE.
- `process_judge_response_tvdmi(...)`, which maps judge responses to binary predictions: A → 1, B → 0; case and whitespace are normalized, and unknown responses fall back to 0 with a warning.

A minimal sketch of the prompt and parsing is given after this list.
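A minimal sketch of the two helpers, using the function names above; the exact prompt wording and parsing rules in the PR may differ:

```python
import logging

logger = logging.getLogger(__name__)


def get_judge_prompt_tvdmi(response_a: str, response_b: str) -> str:
    """Ask the judge whether the two responses share a task/source (sketch)."""
    return (
        "You will see two responses.\n\n"
        f"Response A:\n{response_a}\n\n"
        f"Response B:\n{response_b}\n\n"
        "Were these responses produced for the SAME TASK/SOURCE (answer 'A') "
        "or a DIFFERENT TASK/SOURCE (answer 'B')? Answer with a single letter."
    )


def process_judge_response_tvdmi(response: str) -> int:
    """Map the judge verdict to a binary prediction: A -> 1, B -> 0."""
    verdict = response.strip().upper()
    if verdict.startswith("A"):
        return 1
    if verdict.startswith("B"):
        return 0
    # Unknown verdicts fall back to 0, with a warning, as described above.
    logger.warning("Unrecognized judge response %r, falling back to 0", response)
    return 0
```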
### Corpus-level aggregation

- `CorpusLevelTVDMI`, which accepts sample dicts of the form `{"label": 0 or 1, "pred": 0 or 1, …}` and computes the estimate as TPR + TNR − 1, where TPR = P(pred=1 | label=1) and TNR = P(pred=0 | label=0).
- If either class has no samples, the aggregation returns `NaN`.
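A sketch of the aggregation under the contract above (the standalone function name is illustrative; in the PR this logic lives in `CorpusLevelTVDMI`):

```python
import math


def aggregate_tvd_mi(items: list[dict]) -> float:
    """Estimate TVD-MI as TPR + TNR - 1 from binary judge predictions."""
    pos = [item for item in items if item["label"] == 1]
    neg = [item for item in items if item["label"] == 0]
    if not pos or not neg:
        # Either class is empty: the rates are undefined, so return NaN.
        return math.nan
    tpr = sum(item["pred"] == 1 for item in pos) / len(pos)
    tnr = sum(item["pred"] == 0 for item in neg) / len(neg)
    return tpr + tnr - 1.0
```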
### Metric registration

New entry in the `Metrics` enum with:

- `metric_name = "tvd_mi"`
- `sample_level_fn = JudgeLLMTVDMI()`
- `corpus_level_fn = CorpusLevelTVDMI()`
- `category = SamplingMethod.GENERATIVE`
- `higher_is_better = True`
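Sketched as an entry below; the field values come from the list above, but the wrapper class name is an assumption, not lighteval's exact API:

```python
# Hypothetical registration sketch; the actual metric wrapper class and
# surrounding enum in lighteval may differ.
tvd_mi = CorpusLevelMetric(
    metric_name="tvd_mi",
    sample_level_fn=JudgeLLMTVDMI(),
    corpus_level_fn=CorpusLevelTVDMI(),
    category=SamplingMethod.GENERATIVE,
    higher_is_better=True,
)
```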
## Tests

New file `tests/unit/metrics/test_tvd_mi.py`, covering (among other cases) the `NaN` behavior for empty classes.

## Documentation
Updates the metric list (LLM-as-Judge section) with an entry for `tvd_mi`.
## Usage

To enable the metric in a task config:
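A hedged sketch, assuming lighteval's `LightevalTaskConfig` and these import paths; the task name, dataset repo, and `paired_prompt_fn` (sketched in the next section) are placeholders:

```python
from lighteval.metrics.metrics import Metrics
from lighteval.tasks.lighteval_task import LightevalTaskConfig

# Placeholder task definition; the point here is metrics=[Metrics.tvd_mi].
task = LightevalTaskConfig(
    name="my_paired_task",
    prompt_function=paired_prompt_fn,  # see the formatter sketch below
    suite=["community"],
    hf_repo="my-org/my-paired-dataset",
    hf_subset="default",
    metrics=[Metrics.tvd_mi],
)
```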
Assumes the task formatter yields docs with:

- `response_a: str`
- `response_b: str`
- `pair_label: int` (`1` or `0`)

A formatter sketch is given after this list.
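A hypothetical formatter satisfying that contract; the dataset column names are placeholders, and whether the fields live in the `Doc`'s `specific` dict (as shown) or elsewhere is an assumption:

```python
from lighteval.tasks.requests import Doc  # assumed import path


def paired_prompt_fn(line: dict, task_name: str) -> Doc:
    """Build a doc carrying a response pair and its same/different label."""
    return Doc(
        task_name=task_name,
        query=line["prompt"],
        choices=[],
        gold_index=0,
        specific={
            "response_a": line["response_a"],
            "response_b": line["response_b"],
            "pair_label": int(line["pair_label"]),  # 1 = same source, 0 = different
        },
    )
```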
## Validation

```bash
pytest tests/unit/metrics/test_tvd_mi.py -q
pytest tests/unit/metrics -q \
  --ignore=tests/unit/metrics/test_metric_requests.py \
  -k "not extractiveness"
```

Ready for review.
Feedback welcome on naming, default judge model, and placement within the metrics taxonomy.