Add vision evaluation metrics (exact_match, relaxed_accuracy, word_sort_ratio) #2476
Pull request overview
This PR extends Olive’s evaluator framework with three vision-oriented, text-based accuracy sub-metrics intended for VQA/ChartQA/OCR-style evaluation, adding corresponding task registrations, data pre-processing, and unit tests.
Changes:
- Add three new `AccuracySubType` values (`exact_match`, `relaxed_accuracy`, `word_sort_ratio`) and implement their metric logic.
- Introduce a vision string-inference path in both ONNX and PyTorch evaluators, including task↔metric compatibility validation.
- Register new HuggingFace task types and add a vision VQA pre-process component plus unit tests for the new metrics.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `olive/evaluator/metric.py` | Adds new vision accuracy sub-types to the enum. |
| `olive/evaluator/accuracy.py` | Implements `ExactMatch`, `RelaxedAccuracy`, and `WordSortRatio`. |
| `olive/evaluator/olive_evaluator.py` | Adds vision inference paths and task/metric validation for vision metrics. |
| `olive/data/component/pre_process_data.py` | Adds `vision_vqa_pre_process` that emits (image, question) inputs and string answers. |
| `olive/data/container/huggingface_container.py` | Registers new vision task types mapping to the vision pre-process component. |
| `olive/olive_config.json` | Adds a `vision` extra dependency (pillow). |
| `test/evaluator/test_accuracy.py` | Adds unit tests for all new metric classes. |
Note: This returns raw PIL images and question strings. For the PyTorch evaluator,
the model's own processor/tokenizer should be applied in the post_func or within
the model's forward method. For the ONNX evaluator, provide a custom pre-process
component that applies the appropriate processor/tokenizer to produce numeric
tensors matching the model's io_config.
# Extract task from pre_process_data_config params, which is how HuggingfaceContainer
# maps task types (e.g., "vision-vqa", "vision-chart-qa", "vision-ocr") to components.
pre_process_config = metric.data_config.pre_process_data_config
if pre_process_config and pre_process_config.params:
    task_type = pre_process_config.params.get("task")
def _is_vision_metric(metric: "Metric") -> bool:
    """Check if metric uses vision accuracy sub-types (exact_match, relaxed_accuracy, word_sort_ratio).

    Raises ValueError if vision sub-types are mixed with non-vision sub-types,
    as they require different inference paths.
    """
    if metric.type != MetricType.ACCURACY:
        return False
    vision_based = [sub.name in _VISION_ACCURACY_SUBTYPES for sub in metric.sub_types]
    if any(vision_based) and not all(vision_based):
        raise ValueError(
            "Cannot mix vision accuracy sub-types (exact_match, relaxed_accuracy, word_sort_ratio) "
            "with other sub-types in the same metric. Please define them as separate metrics."
        )
    return all(vision_based)
…rt_ratio)
Add vision evaluation metrics to the Olive evaluator framework, enabling VQA, ChartQA, and OCR model evaluation.
- exact_match: case-insensitive string equality for VQA tasks
- relaxed_accuracy: ±5% numeric tolerance for ChartQA
- word_sort_ratio: word-level overlap ratio for OCR
Changes:
- olive/evaluator/metric.py: Add EXACT_MATCH, RELAXED_ACCURACY, WORD_SORT_RATIO to AccuracySubType
- olive/evaluator/accuracy.py: Add ExactMatch, RelaxedAccuracy, WordSortRatio classes
- olive/evaluator/olive_evaluator.py: Add _inference_vision() path and task-metric validation
- olive/data/component/pre_process_data.py: Add vision_vqa_pre_process data component
- olive/data/container/huggingface_container.py: Add vision-vqa, vision-chart-qa, vision-ocr tasks
- olive/olive_config.json: Add vision extra dependencies (Pillow)
- test/evaluator/test_accuracy.py: Add 20 unit tests for vision metrics
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
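For readers who have not opened `olive/evaluator/accuracy.py`, the three one-line metric descriptions in this commit message can be read roughly as the sketch below. This is not the PR's implementation: the function names, the whitespace trimming, the fallback to string comparison for non-numeric answers, the zero-reference handling, and the exact definition of "word-level overlap" are all assumptions made for illustration.

```python
from collections import Counter


def exact_match(prediction: str, reference: str) -> float:
    """Case-insensitive string equality (whitespace trimming is an assumption)."""
    return float(prediction.strip().lower() == reference.strip().lower())


def relaxed_accuracy(prediction: str, reference: str, tolerance: float = 0.05) -> float:
    """Numeric answers match within a relative tolerance (5% by default)."""
    try:
        pred_value, ref_value = float(prediction), float(reference)
    except ValueError:
        # Assumption: non-numeric answers fall back to exact string matching.
        return exact_match(prediction, reference)
    if ref_value == 0.0:
        # Assumption: a zero reference only matches a zero prediction.
        return float(pred_value == 0.0)
    return float(abs(pred_value - ref_value) <= tolerance * abs(ref_value))


def word_overlap_ratio(prediction: str, reference: str) -> float:
    """Shared words (order ignored, duplicates counted) over the longer word count."""
    pred_words = Counter(prediction.lower().split())
    ref_words = Counter(reference.lower().split())
    longest = max(sum(pred_words.values()), sum(ref_words.values()))
    if longest == 0:
        # Assumption: two empty strings count as a perfect match.
        return 1.0
    shared = sum((pred_words & ref_words).values())
    return shared / longest
```

Under these assumptions, `relaxed_accuracy("104", "100")` scores as a match while `relaxed_accuracy("106", "100")` does not, and `word_overlap_ratio` ignores word order entirely.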
Document which datasets each vision metric is suitable for:
- exact_match: AI2D, ScienceQA, TextVQA, MathVista, MMMU, InterGPS
- relaxed_accuracy: ChartQA
- word_sort_ratio: OCR
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Fix _validate_vision_task_metric to extract task from pre_process_data_config.params['task'] instead of non-existent DataConfig attributes
- Wrap _VISION_ACCURACY_SUBTYPES across multiple lines for lint compliance
- Use lowercase 'pillow' in olive_config.json for consistency
- Add docstring note about ONNX vs PyTorch path for vision_vqa_pre_process
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Fix isinstance check to include tuple (shaahji)
- Support multiple valid answers via | separator instead of taking first only
- Add vision_vqa_dataloader with custom collate_fn for PIL images (Copilot)
- Register vision_vqa_dataloader for vision task types in HuggingfaceContainer
- Simplify relaxed_accuracy numeric comparison (shaahji)
- Add clarifying comment about single-pass vs generation models (jambayk)
- Add vision metrics documentation with public HuggingFace dataset examples (facebook/textvqa, HuggingFaceM4/ChartQA, HuggingFaceM4/DocumentVQA)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
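As background for the custom collate_fn mentioned in this commit: PyTorch's default collation tries to stack batch elements into tensors and does not know how to handle PIL image objects, so a dataloader that yields raw images typically keeps them in plain Python lists instead. The sketch below shows that idea only. The function name `collate_vision_batch` and the assumed sample layout `((image, question), answer)` are hypothetical and may not match what `vision_vqa_dataloader` actually does.

```python
def collate_vision_batch(batch: list) -> tuple:
    """Group samples into lists without converting anything to tensors.

    Assumes each sample looks like ((image, question), answer), where image is
    any object (for example a PIL image) and the other two fields are strings.
    """
    images = [image for (image, _question), _answer in batch]
    questions = [question for (_image, question), _answer in batch]
    # Answers stay as raw strings so text-based metrics can compare them later.
    answers = [answer for (_image, _question), answer in batch]
    return (images, questions), answers
```

A function with this shape can be passed as the `collate_fn` argument of `torch.utils.data.DataLoader`; the model's processor is then responsible for turning the image and question lists into tensors.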
- Fix misleading docstring about post_func for input processing
- Add clarifying comment in collate_fn about answer handling
- Improve _validate_vision_task_metric to handle missing task param
- Add unit tests for vision metric validation functions
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
6330ed9 to 8f1a4ba
@jiafatom Closing this PR. Please open a new one directly in the Olive repo. PRs from forked repos can't be built and verified without additional support from core others. PRs in the Olive repo automatically run the verification process.
I thought this is from the Olive repo? https://github.com/microsoft/Olive/tree/jiafa/add-vision-eval-metrics
Summary
Extends Olive's evaluator framework with three vision-oriented accuracy sub-metrics for VQA, ChartQA, and OCR evaluation, following the existing pattern used for speech metrics (PR #2444).
New Metrics
| Metric | Task type |
|---|---|
| `exact_match` | `vision-vqa` |
| `relaxed_accuracy` | `vision-chart-qa` |
| `word_sort_ratio` | `vision-ocr` |
Public HuggingFace Datasets
These metrics are designed to work with publicly available datasets:
| Metric | Dataset |
|---|---|
| `exact_match` | `facebook/textvqa` |
| `relaxed_accuracy` | `HuggingFaceM4/ChartQA` |
| `word_sort_ratio` | `HuggingFaceM4/DocumentVQA` |
Example configuration snippets are provided in `docs/source/how-to/configure-workflows/metrics-configuration.md`.
Changes
- `olive/evaluator/metric.py`: Adds `EXACT_MATCH`, `RELAXED_ACCURACY`, `WORD_SORT_RATIO` to `AccuracySubType` enum
- `olive/evaluator/accuracy.py`: Implements the three metric classes with multi-answer support
- `olive/evaluator/olive_evaluator.py`: Adds vision inference path and task-metric validation
- `olive/data/component/pre_process_data.py`: Adds `vision_vqa_pre_process` component
- `olive/data/component/dataloader.py`: Adds `vision_vqa_dataloader` with custom collate_fn for PIL images
- `olive/data/container/huggingface_container.py`: Registers `vision-vqa`, `vision-chart-qa`, `vision-ocr` task types with appropriate dataloader
- `olive/olive_config.json`: Adds `vision` extras (pillow)
- `docs/source/how-to/configure-workflows/metrics-configuration.md`: Adds vision metrics documentation with public dataset examples
- `test/evaluator/test_accuracy.py`: Unit tests covering all new metrics
Design
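- Multiple valid answers are supported via the `|` separator (metrics match against any valid answer)
- Mixing vision and non-vision sub-types in the same metric raises `ValueError`
- `vision_vqa_dataloader` handles PIL images with a collate_fn that avoids PyTorch default collation issues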
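The `|` separator convention can be illustrated with a small sketch: the reference string is split into candidate answers and a prediction counts as correct if it matches any of them. The helper names `split_answers` and `exact_match_any`, and the choice to trim whitespace and drop empty candidates, are assumptions for illustration rather than the code in `olive/evaluator/accuracy.py`.

```python
def split_answers(reference: str, separator: str = "|") -> list:
    """Split a reference such as "gray | grey" into trimmed, non-empty candidates."""
    return [part.strip() for part in reference.split(separator) if part.strip()]


def exact_match_any(prediction: str, reference: str) -> float:
    """Return 1.0 if the prediction equals any candidate answer, ignoring case."""
    normalized = prediction.strip().lower()
    candidates = split_answers(reference)
    return float(any(normalized == candidate.lower() for candidate in candidates))
```

One consequence of this design is that a literal `|` cannot appear inside a valid answer, which may matter for some OCR-style ground truth.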