Add vision evaluation metrics (exact_match, relaxed_accuracy, word_sort_ratio) by jiafatom · Pull Request #2476 · microsoft/Olive

jiafatom · 2026-05-27T20:25:34Z

Summary

Extends Olive's evaluator framework with three vision-oriented accuracy sub-metrics for VQA, ChartQA, and OCR evaluation, following the existing pattern used for speech metrics (PR #2444).

New Metrics

| Metric | Task Type | Suitable Benchmarks |
| --- | --- | --- |
| `exact_match` | `vision-vqa` | AI2D, ScienceQA, TextVQA, MathVista, MMMU, InterGPS |
| `relaxed_accuracy` | `vision-chart-qa` | ChartQA (±5% tolerance for numeric answers) |
| `word_sort_ratio` | `vision-ocr` | OCR benchmarks (word-level overlap) |
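As a rough illustration of what the three metrics compare, here is a minimal sketch. It is not the code from `olive/evaluator/accuracy.py`: the function names, the number parsing, and especially the reading of `word_sort_ratio` as a similarity ratio over sorted word lists are assumptions made for this example.

```python
from difflib import SequenceMatcher
from typing import Optional


def exact_match(pred: str, target: str) -> bool:
    """Case-insensitive string equality after trimming whitespace."""
    return pred.strip().lower() == target.strip().lower()


def _to_float(text: str) -> Optional[float]:
    """Parse a number, tolerating a trailing percent sign (assumed behavior)."""
    try:
        return float(text.strip().rstrip("%"))
    except ValueError:
        return None


def relaxed_accuracy(pred: str, target: str, tolerance: float = 0.05) -> bool:
    """Accept numeric answers within a relative tolerance; otherwise fall back to exact match."""
    p, t = _to_float(pred), _to_float(target)
    if p is None or t is None:
        return exact_match(pred, target)
    if t == 0:
        return p == 0
    return abs(p - t) / abs(t) <= tolerance


def word_sort_ratio(pred: str, target: str) -> float:
    """Similarity of the two sorted word lists (one possible reading of 'word-level overlap')."""
    a = sorted(pred.lower().split())
    b = sorted(target.lower().split())
    return SequenceMatcher(None, a, b).ratio()
```

Sorting the words first makes the score insensitive to word order, which is one plausible reason for the metric's name; the actual implementation may differ.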

Public HuggingFace Datasets

These metrics are designed to work with publicly available datasets:

| Metric | Recommended Dataset | HuggingFace ID |
| --- | --- | --- |
| `exact_match` | TextVQA | `facebook/textvqa` |
| `relaxed_accuracy` | ChartQA | `HuggingFaceM4/ChartQA` |
| `word_sort_ratio` | DocumentVQA | `HuggingFaceM4/DocumentVQA` |

Example configuration snippets are provided in docs/source/how-to/configure-workflows/metrics-configuration.md.

Changes

- `olive/evaluator/metric.py`: Adds `EXACT_MATCH`, `RELAXED_ACCURACY`, `WORD_SORT_RATIO` to the `AccuracySubType` enum
- `olive/evaluator/accuracy.py`: Implements the three metric classes with multi-answer support
- `olive/evaluator/olive_evaluator.py`: Adds vision inference path and task-metric validation
- `olive/data/component/pre_process_data.py`: Adds `vision_vqa_pre_process` component
- `olive/data/component/dataloader.py`: Adds `vision_vqa_dataloader` with custom `collate_fn` for PIL images
- `olive/data/container/huggingface_container.py`: Registers `vision-vqa`, `vision-chart-qa`, `vision-ocr` task types with the appropriate dataloader
- `olive/olive_config.json`: Adds `vision` extras (`pillow`)
- `docs/source/how-to/configure-workflows/metrics-configuration.md`: Adds vision metrics documentation with public dataset examples
- `test/evaluator/test_accuracy.py`: Unit tests covering all new metrics

Design

- Vision metrics are text-based: each compares the predicted answer string to the ground truth, and the applicable metric depends on the task
- Multiple valid answers are supported via the `|` separator; a prediction counts as correct if it matches any valid answer
- Task-metric validation ensures that incompatible combinations raise `ValueError`
- Custom `vision_vqa_dataloader` handles PIL images with a `collate_fn` that avoids PyTorch default collation issues
- PyTorch path: the model processor handles images natively
- ONNX path: a single forward pass is assumed (classification-style VQA); for autoregressive models, use the PyTorch evaluator with a generation loop
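The multi-answer rule can be sketched as follows. This is an illustrative sketch only: `split_answers`, `matches_any`, and `accuracy` are hypothetical helper names, and the case-insensitive comparison is borrowed from the `exact_match` description rather than copied from the PR's code.

```python
from typing import Callable, List


def split_answers(target: str, sep: str = "|") -> List[str]:
    """Split a ground-truth string into its valid answers, dropping empty pieces."""
    return [part.strip() for part in target.split(sep) if part.strip()]


def matches_any(pred: str, target: str, compare: Callable[[str, str], bool]) -> bool:
    """A prediction is correct if it matches at least one valid answer."""
    return any(compare(pred, answer) for answer in split_answers(target))


def case_insensitive_equal(pred: str, answer: str) -> bool:
    return pred.strip().lower() == answer.strip().lower()


def accuracy(preds: List[str], targets: List[str], compare: Callable[[str, str], bool]) -> float:
    """Fraction of predictions that match any of their valid answers."""
    if not preds:
        return 0.0
    hits = sum(matches_any(p, t, compare) for p, t in zip(preds, targets))
    return hits / len(preds)


if __name__ == "__main__":
    # "Cat" matches the first valid answer; "dog" matches neither "bird" nor "sparrow".
    print(accuracy(["Cat", "dog"], ["cat|kitten", "bird|sparrow"], case_insensitive_equal))
```

Passing the comparison function in as a parameter lets the same any-of logic wrap each of the three metrics, which is one way to get multi-answer support without repeating it per metric.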

Copilot

Pull request overview

This PR extends Olive’s evaluator framework with three vision-oriented, text-based accuracy sub-metrics intended for VQA/ChartQA/OCR-style evaluation, adding corresponding task registrations, data pre-processing, and unit tests.

Changes:

- Add three new `AccuracySubType` values (`exact_match`, `relaxed_accuracy`, `word_sort_ratio`) and implement their metric logic.
- Introduce a vision string-inference path in both ONNX and PyTorch evaluators, including task↔metric compatibility validation.
- Register new HuggingFace task types and add a vision VQA pre-process component plus unit tests for the new metrics.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| `olive/evaluator/metric.py` | Adds new vision accuracy sub-types to the enum. |
| `olive/evaluator/accuracy.py` | Implements `ExactMatch`, `RelaxedAccuracy`, and `WordSortRatio`. |
| `olive/evaluator/olive_evaluator.py` | Adds vision inference paths and task/metric validation for vision metrics. |
| `olive/data/component/pre_process_data.py` | Adds `vision_vqa_pre_process` that emits (image, question) inputs and string answers. |
| `olive/data/container/huggingface_container.py` | Registers new vision task types mapping to the vision pre-process component. |
| `olive/olive_config.json` | Adds a `vision` extra dependency (`pillow`). |
| `test/evaluator/test_accuracy.py` | Adds unit tests for all new metric classes. |

Copilot

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

+    Note: This returns raw PIL images and question strings. For the PyTorch evaluator,
+    the model's own processor/tokenizer should be applied in the post_func or within
+    the model's forward method. For the ONNX evaluator, provide a custom pre-process
+    component that applies the appropriate processor/tokenizer to produce numeric
+    tensors matching the model's io_config.
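Because the pre-process step returns raw PIL images and strings, the default PyTorch collation (which tries to stack samples into tensors) does not apply. A minimal sketch of a list-preserving `collate_fn` follows; the sample layout `({"image": ..., "question": ...}, answer)` and the name `vision_collate` are assumptions for illustration, not the PR's actual `vision_vqa_dataloader` code.

```python
from typing import Any, Dict, List, Tuple

Sample = Tuple[Dict[str, Any], str]


def vision_collate(batch: List[Sample]) -> Tuple[Dict[str, List[Any]], List[str]]:
    """Group a batch into parallel lists instead of stacking it into tensors.

    Images stay as the original objects (e.g. PIL images) so the model's
    processor can handle them later; answers stay as plain strings.
    """
    images = [inputs["image"] for inputs, _ in batch]
    questions = [inputs["question"] for inputs, _ in batch]
    answers = [answer for _, answer in batch]
    return {"image": images, "question": questions}, answers


if __name__ == "__main__":
    # Plain objects stand in for PIL images so the sketch has no dependencies.
    img_a, img_b = object(), object()
    batch = [
        ({"image": img_a, "question": "What color is the sign?"}, "red"),
        ({"image": img_b, "question": "How many cars?"}, "2|two"),
    ]
    inputs, answers = vision_collate(batch)
    print(inputs["question"], answers)
```

A function of this shape can be passed as the `collate_fn` argument of `torch.utils.data.DataLoader`.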


+        # Extract task from pre_process_data_config params, which is how HuggingfaceContainer
+        # maps task types (e.g., "vision-vqa", "vision-chart-qa", "vision-ocr") to components.
+        pre_process_config = metric.data_config.pre_process_data_config
+        if pre_process_config and pre_process_config.params:
+            task_type = pre_process_config.params.get("task")
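To make the task lookup and the subsequent compatibility check concrete, here is a hedged sketch. The one-to-one task-to-metric mapping is inferred from the "New Metrics" table, and both the function name `validate_task_metric` and the choice to skip validation when no task is set are assumptions; the PR's `_validate_vision_task_metric` may behave differently.

```python
from typing import Dict, Iterable, Optional, Set

# Assumed mapping, taken from the task types listed in the PR description.
ALLOWED_SUBTYPES: Dict[str, Set[str]] = {
    "vision-vqa": {"exact_match"},
    "vision-chart-qa": {"relaxed_accuracy"},
    "vision-ocr": {"word_sort_ratio"},
}


def validate_task_metric(params: Optional[dict], sub_type_names: Iterable[str]) -> None:
    """Raise ValueError when a vision task is paired with an incompatible sub-type."""
    task = (params or {}).get("task")
    if task not in ALLOWED_SUBTYPES:
        # Assumption: a missing or non-vision task means there is nothing to check.
        return
    bad = sorted(set(sub_type_names) - ALLOWED_SUBTYPES[task])
    if bad:
        raise ValueError(
            f"Sub-type(s) {bad} are not compatible with task '{task}'; "
            f"expected one of {sorted(ALLOWED_SUBTYPES[task])}."
        )


if __name__ == "__main__":
    validate_task_metric({"task": "vision-vqa"}, ["exact_match"])  # passes silently
    try:
        validate_task_metric({"task": "vision-ocr"}, ["exact_match"])
    except ValueError as err:
        print(err)
```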


+def _is_vision_metric(metric: "Metric") -> bool:
+    """Check if metric uses vision accuracy sub-types (exact_match, relaxed_accuracy, word_sort_ratio).
+
+    Raises ValueError if vision sub-types are mixed with non-vision sub-types,
+    as they require different inference paths.
+    """
+    if metric.type != MetricType.ACCURACY:
+        return False
+    vision_based = [sub.name in _VISION_ACCURACY_SUBTYPES for sub in metric.sub_types]
+    if any(vision_based) and not all(vision_based):
+        raise ValueError(
+            "Cannot mix vision accuracy sub-types (exact_match, relaxed_accuracy, word_sort_ratio) "
+            "with other sub-types in the same metric. Please define them as separate metrics."
+        )
+    return all(vision_based)
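The behavior of this check can be exercised outside Olive with small stand-ins. In the sketch below, `MetricType`, `SubType`, and `Metric` are simplified hypothetical replacements for Olive's real classes, and `is_vision_metric` mirrors the logic of the snippet above rather than importing it.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class MetricType(str, Enum):
    ACCURACY = "accuracy"
    LATENCY = "latency"


VISION_SUBTYPES = frozenset({"exact_match", "relaxed_accuracy", "word_sort_ratio"})


@dataclass
class SubType:
    name: str


@dataclass
class Metric:
    type: MetricType
    sub_types: List[SubType]


def is_vision_metric(metric: Metric) -> bool:
    """Mirror of the check above: all-vision is True, none is False, mixed raises."""
    if metric.type != MetricType.ACCURACY:
        return False
    flags = [sub.name in VISION_SUBTYPES for sub in metric.sub_types]
    if any(flags) and not all(flags):
        raise ValueError("Cannot mix vision accuracy sub-types with other sub-types.")
    return all(flags)


if __name__ == "__main__":
    print(is_vision_metric(Metric(MetricType.ACCURACY, [SubType("exact_match")])))
    print(is_vision_metric(Metric(MetricType.ACCURACY, [SubType("accuracy_score")])))
```

One detail worth noting: because `all([])` is `True` in Python, an accuracy metric with an empty `sub_types` list would be classified as a vision metric by this logic, unless sub-types are required to be non-empty elsewhere.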


…rt_ratio)

Add vision evaluation metrics to the Olive evaluator framework, enabling VQA, ChartQA, and OCR model evaluation.

- exact_match: case-insensitive string equality for VQA tasks
- relaxed_accuracy: ±5% numeric tolerance for ChartQA
- word_sort_ratio: word-level overlap ratio for OCR

Changes:
- olive/evaluator/metric.py: Add EXACT_MATCH, RELAXED_ACCURACY, WORD_SORT_RATIO to AccuracySubType
- olive/evaluator/accuracy.py: Add ExactMatch, RelaxedAccuracy, WordSortRatio classes
- olive/evaluator/olive_evaluator.py: Add _inference_vision() path and task-metric validation
- olive/data/component/pre_process_data.py: Add vision_vqa_pre_process data component
- olive/data/container/huggingface_container.py: Add vision-vqa, vision-chart-qa, vision-ocr tasks
- olive/olive_config.json: Add vision extra dependencies (Pillow)
- test/evaluator/test_accuracy.py: Add 20 unit tests for vision metrics

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Document which datasets each vision metric is suitable for:
- exact_match: AI2D, ScienceQA, TextVQA, MathVista, MMMU, InterGPS
- relaxed_accuracy: ChartQA
- word_sort_ratio: OCR

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Fix _validate_vision_task_metric to extract task from pre_process_data_config.params['task'] instead of non-existent DataConfig attributes
- Wrap _VISION_ACCURACY_SUBTYPES across multiple lines for lint compliance
- Use lowercase 'pillow' in olive_config.json for consistency
- Add docstring note about ONNX vs PyTorch path for vision_vqa_pre_process

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Fix isinstance check to include tuple (shaahji)
- Support multiple valid answers via | separator instead of taking first only
- Add vision_vqa_dataloader with custom collate_fn for PIL images (Copilot)
- Register vision_vqa_dataloader for vision task types in HuggingfaceContainer
- Simplify relaxed_accuracy numeric comparison (shaahji)
- Add clarifying comment about single-pass vs generation models (jambayk)
- Add vision metrics documentation with public HuggingFace dataset examples (facebook/textvqa, HuggingFaceM4/ChartQA, HuggingFaceM4/DocumentVQA)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Fix misleading docstring about post_func for input processing
- Add clarifying comment in collate_fn about answer handling
- Improve _validate_vision_task_metric to handle missing task param
- Add unit tests for vision metric validation functions

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

shaahji · 2026-05-29T21:30:45Z

@jiafatom Closing this PR. Please open a new one directly in Olive repo. PRs from forked repos can't be built and verified without additional support from core others. PRs in Olive repo auto run the verification process.

jiafatom · 2026-05-29T21:36:25Z

> @jiafatom Closing this PR. Please open a new one directly in Olive repo. PRs from forked repos can't be built and verified without additional support from core others. PRs in Olive repo auto run the verification process.

I thought this is from Olive repo? https://github.com/microsoft/Olive/tree/jiafa/add-vision-eval-metrics

Copilot AI review requested due to automatic review settings May 27, 2026 20:25

Copilot started reviewing on behalf of jiafatom May 27, 2026 20:25

jiafatom mentioned this pull request May 27, 2026

Add vision evaluation metrics (exact_match, relaxed_accuracy, word_sort_ratio) #2474

Closed

Copilot AI reviewed May 27, 2026

Comment thread olive/data/container/huggingface_container.py

Comment thread olive/data/component/pre_process_data.py Outdated

jambayk reviewed May 27, 2026

Comment thread olive/evaluator/olive_evaluator.py

shaahji requested changes May 27, 2026

Comment thread olive/data/component/pre_process_data.py Outdated

Comment thread olive/data/component/pre_process_data.py

Comment thread olive/evaluator/accuracy.py Outdated

jiafatom requested review from Copilot, jambayk and shaahji May 28, 2026 21:14

Copilot started reviewing on behalf of jiafatom May 28, 2026 21:14

Copilot AI reviewed May 28, 2026

jiafatom and others added 6 commits May 29, 2026 17:52

Remove internal project references from comments

d26d59f

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

jiafatom force-pushed the jiafa/add-vision-eval-metrics branch from 6330ed9 to 8f1a4ba Compare May 29, 2026 17:52

github-advanced-security AI found potential problems May 29, 2026

Comment thread test/evaluator/test_olive_evaluator.py Fixed

Comment thread test/evaluator/test_olive_evaluator.py Fixed

Remove redundant MagicMock reimport in vision metric tests

78708f8

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

shaahji closed this May 29, 2026

shaahji reopened this May 29, 2026

shaahji approved these changes May 29, 2026

jiafatom enabled auto-merge (squash) May 29, 2026 21:55

jiafatom merged commit a0e8805 into main May 29, 2026
17 checks passed

jiafatom deleted the jiafa/add-vision-eval-metrics branch May 29, 2026 22:24

jiafatom merged 7 commits into main from jiafa/add-vision-eval-metrics

5 participants
