Describe the bug
`CorpusLevelTranslationMetric` passes references to sacrebleu in `[sent_id][ref_id]` shape (a per-sample list of refs), but sacrebleu expects `[ref_id][sent_id]` (reference streams). As a result, chrF/chrF++/TER score only the first N hypotheses, where N is the minimum number of references per sample, and compare each of them against references pooled from across the entire dataset. BLEU is special-cased but still drops all references after the first.
To Reproduce
Minimal example showing the shape issue (mirrors how lighteval currently passes refs):
```python
from lighteval.metrics.metrics_corpus import CorpusLevelTranslationMetric
from lighteval.metrics.sample_preparator import GenerativeCorpusMetricInput
from lighteval.utils.utils import as_list

items = [
    GenerativeCorpusMetricInput(golds=["GOOD"], preds=["GOOD"]),
    GenerativeCorpusMetricInput(golds=["REF2"], preds=["PRED2"]),
]
metric = CorpusLevelTranslationMetric("chrf++")

# Mirrors compute_corpus(): each i.golds is Sequence[str], so this produces
# Sequence[Sequence[str]] in per-sample orientation.
golds = [i.golds for i in items]  # [sent_id][ref_id]
preds = [as_list(i.preds)[0] for i in items]

# Shows only one hypothesis is being scored:
stats = metric.get_metric()._extract_corpus_statistics(preds, golds)
print(len(stats))  # 1 (should be 2)

score_wrong = metric.get_metric().corpus_score(preds, golds).score
print(score_wrong)  # 100 despite the 2nd hyp being wrong (0 for TER)
```
Expected behavior
Each hypothesis should be scored against its own reference set, and corpus statistics should include all hypotheses (`len(stats) == len(hypotheses)`).
Version info
- lighteval: 0.13.0
- Python: 3.13
- Dependencies: sacrebleu 2.5.1
Suspected root cause
- `GenerativeCorpusMetricInput.golds` is `list[str]` (per-sample refs) (src/lighteval/metrics/sample_preparator.py).
- `compute_corpus()` does `golds = [i.golds for i in items]`, producing `list[list[str]]`. The type check passes, but the orientation is per-sample, not per-reference (src/lighteval/metrics/metrics_corpus.py).
- sacrebleu expects `[ref_id][sent_id]` and builds per-segment refs via `zip(*references)`, then pairs them with hypotheses using `zip(hypotheses, ref_cache)`, truncating to the number of refs (sacrebleu/metrics/base.py).
- chrF++ picks the best ref among those provided (`_compute_segment_statistics` in sacrebleu/metrics/chrf.py, where `best_f_score` is updated per ref). The best match is usually the corresponding reference (e.g., ref1 for hyp1) but not necessarily, which can inflate scores. TER uses the same base machinery (sacrebleu/metrics/ter.py + sacrebleu/metrics/base.py).
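The truncation described above can be seen with plain Python, independent of sacrebleu (a minimal sketch of the `zip(*references)` / `zip(hypotheses, ref_cache)` behavior; the names are illustrative):

```python
hyps = ["PRED1", "PRED2"]
golds = [["REF1"], ["REF2"]]  # per-sample orientation: [sent_id][ref_id]

# sacrebleu treats the outer list as reference streams and zips per segment,
# so the two samples' refs get pooled into a single "segment":
ref_cache = list(zip(*golds))  # [("REF1", "REF2")]

# Pairing with hypotheses then silently truncates to the shorter sequence:
pairs = list(zip(hyps, ref_cache))
print(len(pairs))  # 1 -- only the first hypothesis is ever scored
```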
Suggested fix
In metrics_corpus.py, in the `CorpusLevelTranslationMetric` class, transpose `golds` before calling sacrebleu so it matches `[ref_id][sent_id]`:

```python
from itertools import zip_longest  # handles a variable number of refs per sample

# inside compute_corpus(), before corpus_score(...)
golds = [list(refs) for refs in zip_longest(*golds, fillvalue=None)]
```
We can also consider applying the same transpose for BLEU to keep multi-reference support instead of dropping to `gold[0]`.
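For illustration, a quick check of the transpose with a variable number of references per sample (the values are hypothetical):

```python
from itertools import zip_longest

golds = [["g1a", "g1b"], ["g2a"]]  # [sent_id][ref_id]; sample 2 has one ref
transposed = [list(refs) for refs in zip_longest(*golds, fillvalue=None)]
print(transposed)  # [['g1a', 'g2a'], ['g1b', None]] -- [ref_id][sent_id]

# Every reference stream now has exactly one entry per sentence:
assert all(len(stream) == len(golds) for stream in transposed)
```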
If I missed something or this is intended behavior, please let me know.