feat(infrastructure): add VLM base classes and utilities by davidberenstein1957 · Pull Request #638 · PrunaAI/pruna

davidberenstein1957 · 2026-04-25T12:52:15Z

Summary

Adds the VLM inference infrastructure used by all downstream VLM judge metrics:

BaseVLM
LitellmVLM
TransformersVLM
StatefulVLMMeanScoresMetric
shared batch/device helpers
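
For reviewers who want a mental model of the API/local split, here is a minimal sketch of one shared interface with swappable backends. It is illustrative only: `JudgeVLMSketch`, `EchoBackend`, `generate`, and `score_yes_no` are hypothetical names and do not reflect the actual signatures in vlm_base.py.

```python
from abc import ABC, abstractmethod


class JudgeVLMSketch(ABC):
    """Hypothetical stand-in for a shared VLM interface (not the PR's BaseVLM)."""

    @abstractmethod
    def generate(self, images: list, prompts: list[str]) -> list[str]:
        """Return one text response per (image, prompt) pair."""

    def score_yes_no(self, images: list, prompts: list[str]) -> list[float]:
        # Default scoring built only on generate(): 1.0 for a "yes" answer, else 0.0.
        answers = self.generate(images, prompts)
        return [1.0 if a.strip().lower().startswith("yes") else 0.0 for a in answers]


class EchoBackend(JudgeVLMSketch):
    """Toy backend that answers "Yes" when the prompt mentions "cat"."""

    def generate(self, images, prompts):
        if len(images) != len(prompts):
            raise ValueError("images and prompts must have the same length")
        return ["Yes" if "cat" in p.lower() else "No" for p in prompts]


backend = EchoBackend()
scores = backend.score_yes_no([None, None], ["Is there a cat?", "Is there a dog?"])
print(scores)
```

Under this assumption, an API-backed and a local backend would differ only in how `generate` is implemented, so downstream judge metrics can stay backend-agnostic.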

Stack Position

Base: PR feat(vendor): add LLM2Vec embedding model #637 (feat/vlm-pr-1-vendor)
Next: PR feat(text-metrics): split qa_accuracy #645 (feat/vlm-pr-3a-qa-accuracy)
Final integration: PR feat(e2e-tests): stacked e2e after split metrics #641 (feat/vlm-pr-5-e2e-tests)
Canonical umbrella reference: PR feat(evaluation): add VLMMetrics #545 (feat/metrics-vlm-support)

Files

src/pruna/evaluation/metrics/vlm_base.py
src/pruna/evaluation/metrics/vlm_utils.py
tests/evaluation/test_vlm_base_infrastructure.py
src/pruna/evaluation/metrics/utils.py
src/pruna/evaluation/metrics/__init__.py
pyproject.toml

Alignment Notes

This PR is intentionally based on feat/vlm-pr-1-vendor so reviewers only see infrastructure delta.

Test Plan

uv run pytest tests/evaluation/test_vlm_base_infrastructure.py -v

Review Focus

API/local VLM abstraction boundaries
Device handling and batching behavior
Stateful aggregation correctness
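
One thing worth checking under stateful aggregation is that the final mean is taken over all samples rather than over per-batch means. The sketch below illustrates the difference; `MeanScoreState` and its methods are hypothetical and are not the implementation of StatefulVLMMeanScoresMetric.

```python
class MeanScoreState:
    """Hypothetical running-mean accumulator used only for illustration."""

    def __init__(self) -> None:
        self.total = 0.0
        self.count = 0

    def update(self, batch_scores: list[float]) -> None:
        # Accumulate sum and count so the result does not depend on batch sizes.
        self.total += sum(batch_scores)
        self.count += len(batch_scores)

    def compute(self) -> float:
        if self.count == 0:
            raise ValueError("compute() called before any update()")
        return self.total / self.count

    def reset(self) -> None:
        self.total, self.count = 0.0, 0


state = MeanScoreState()
state.update([1.0, 0.0, 1.0])  # batch of 3
state.update([0.0])            # batch of 1
mean_score = state.compute()   # mean over all 4 samples

# Averaging per-batch means instead would weight the single-sample batch too heavily.
naive = (2 / 3 + 0.0) / 2
```

With uneven batch sizes the two values differ, which is the case a unit test for the aggregation would most likely want to cover.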

Review Flow (Order)

Review the stack in this exact order:

feat(vendor): add LLM2Vec embedding model #637 vendor
feat(infrastructure): add VLM base classes and utilities #638 infrastructure
feat(text-metrics): split qa_accuracy #645 qa_accuracy
feat(text-metrics): split oneig_alignment #646 oneig_alignment
feat(text-metrics): split text_score pair #647 text_score pair
feat(text-metrics): split oneig_reasoning #648 oneig_reasoning
feat(vision-metrics): split vqa #649 vqa
feat(vision-metrics): split vie_score #650 vie_score
feat(vision-metrics): split img_edit_score #651 img_edit_score
feat(e2e-tests): stacked e2e after split metrics #641 e2e tests

This PR in the flow (2/10)

Review after PR feat(vendor): add LLM2Vec embedding model #637.
Next PR to review: feat(text-metrics): split qa_accuracy #645.
Confirm this PR's tests and scope before continuing.

cursor

Cursor Bugbot has reviewed your changes and found 3 potential issues.

Reviewed by Cursor Bugbot for commit 21212de.

cursor · 2026-04-25T12:55:30Z

+                    top = getattr(tok, "top_logprobs", None) or []
+                    for t in top:
+                        token_str = (getattr(t, "token", "") or "").lower()
+                        lp = float(getattr(t, "logprob", -1e9) or -1e9)


Logprob zero treated as missing due to falsy check

Medium Severity

The expression float(getattr(t, "logprob", -1e9) or -1e9) uses the or operator to provide a fallback, but 0.0 is falsy in Python. A logprob of 0.0 means P = exp(0) = 1.0 (100% probability), yet 0.0 or -1e9 evaluates to -1e9, turning that into P ≈ 0. This silently corrupts probability scoring whenever a token has logprob exactly zero.
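
A small reproduction of the issue, with one possible fix using an explicit None check. The helper names `read_logprob_buggy` and `read_logprob_fixed` are hypothetical, and the token objects are simulated with `SimpleNamespace` rather than a real API response.

```python
import math
from types import SimpleNamespace


def read_logprob_buggy(t) -> float:
    # Mirrors the flagged pattern: 0.0 is falsy, so it falls through to -1e9.
    return float(getattr(t, "logprob", -1e9) or -1e9)


def read_logprob_fixed(t) -> float:
    # Only a missing or None logprob triggers the fallback.
    lp = getattr(t, "logprob", None)
    return -1e9 if lp is None else float(lp)


certain = SimpleNamespace(token="yes", logprob=0.0)  # P = exp(0) = 1.0
missing = SimpleNamespace(token="yes", logprob=None)

p_buggy = math.exp(read_logprob_buggy(certain))   # collapses to 0.0
p_fixed = math.exp(read_logprob_fixed(certain))   # stays at 1.0
p_missing = math.exp(read_logprob_fixed(missing)) # fallback still applies
```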

cursor · 2026-04-25T12:55:30Z

+        self.pooling_mode = pooling_mode
+        self.skip_instruction = skip_instruction
+        self.max_length = max_length
+        self.doc_max_length = 512


Constructor ignores doc_max_length parameter, hardcodes 512

Medium Severity

LLM2Vec.__init__ accepts a doc_max_length parameter (line 79) but line 88 assigns self.doc_max_length = 512 instead of self.doc_max_length = doc_max_length. The parameter value is silently discarded, so any doc_max_length loaded from llm2vec_config.json via from_pretrained or passed explicitly has no effect on document truncation behavior.
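
A sketch of the intended behavior, reduced to the two length fields. `EncoderConfigSketch` is a hypothetical stand-in for the real constructor, and the default values shown are assumptions rather than the vendored defaults.

```python
class EncoderConfigSketch:
    """Hypothetical stand-in; only the fields relevant to the report are modeled."""

    def __init__(self, max_length: int = 512, doc_max_length: int = 512) -> None:
        self.max_length = max_length
        # Store the argument instead of a hardcoded 512.
        self.doc_max_length = doc_max_length


# Values that would come from a config file or an explicit argument now take effect.
loaded_config = {"max_length": 512, "doc_max_length": 256}
cfg = EncoderConfigSketch(**loaded_config)
default_cfg = EncoderConfigSketch()
```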

cursor · 2026-04-25T12:55:30Z

    "peft>=0.18.0,<0.19.0",
    "trl<=0.21.0",
    "termcolor==2.3.0",
+    "realesrgan",


Heavy realesrgan moved from optional to core dependencies

Medium Severity

realesrgan was previously under the optional upscale extra but is now a core dependency in dependencies. This forces all users to install a heavy GPU-oriented package (with native compilation requirements) even if they never use upscaling. The upscale optional extra was simultaneously removed.
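
If the package returns to an optional extra, call sites usually guard the import and point users at that extra. A minimal sketch follows, assuming a hypothetical helper `require_optional`; the install hint format is also an assumption and is not taken from the repository.

```python
import importlib


def require_optional(module_name: str, extra: str):
    """Import an optional dependency or raise with an install hint."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"'{module_name}' is required for this feature. "
            f"Install it with: pip install 'pruna[{extra}]'"
        ) from exc


# Demonstrated with a module name that should not exist, so no heavy package is needed.
try:
    require_optional("definitely_not_installed_pkg_xyz", extra="upscale")
    hint = ""
except ImportError as exc:
    hint = str(exc)

# An installed module is returned unchanged.
math_module = require_optional("math", extra="upscale")
```

Deferring the import this way keeps the core install light while still giving a clear error at the point where upscaling is actually requested.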

github-actions · 2026-05-19T00:29:32Z

This PR has been inactive for 10 days and is now marked as stale.

davidberenstein1957 · 2026-06-02T17:30:45Z

Review follow-up

Logprob 0.0: Changed to an explicit None check in vlm_base.py.
LiteLLM: We use LiteLLM because it offers a simple, widely used interface and integrates with providers such as OpenAI. API key resolution is documented in the vlm_base module docstring (more user-facing docs in feat(e2e-tests): stacked e2e after split metrics #641).
realesrgan / [evaluation] extras: Rebase artifacts reverted; realesrgan back under [upscale], original evaluation extras preserved + VLM deps added.
CI on stack branches: tests.yaml now runs on each feat/vlm-pr-* base (listed explicitly). Remove those entries before merging the stack to main.
__init__.py: Infra-only exports (no downstream metrics).

Stack rebased and pushed.

github-actions · 2026-06-30T00:30:38Z

This PR has been inactive for 10 days and is now marked as stale. It will be closed in 7 days if there is no further activity.

- Add BaseVLM abstract interface
- Add LitellmVLM for API-based inference (OpenAI, Anthropic, etc.)
- Add TransformersVLM for local Hugging Face models
- Add StatefulVLMMeanScoresMetric base class for judge metrics
- Add vlm_utils.py with image/batch utilities
- Add pyproject.toml dependency pins (peft, litellm)
- Add unit tests for infrastructure

Drop the broken Intel uv index (aligned with main), fix QAAccuracy keyword-only aggregation syntax, pass single/y_gt call types correctly for OneIG alignment, and expose metric_units on results. Co-authored-by: Cursor <cursoragent@cursor.com>

cursor Bot reviewed Apr 25, 2026

davidberenstein1957 changed the base branch from main to feat/vlm-pr-1-vendor May 5, 2026 10:00

davidberenstein1957 force-pushed the feat/vlm-pr-1-vendor branch from f89b047 to fb6d967 May 8, 2026 09:01

davidberenstein1957 force-pushed the feat/vlm-pr-2-infrastructure branch from 21212de to 7054e53 May 8, 2026 09:01

github-actions Bot added the stale label May 19, 2026

github-actions Bot removed the stale label Jun 19, 2026

github-actions Bot added the stale label Jun 30, 2026

davidberenstein1957 and others added 8 commits July 2, 2026 08:43

fix(infra): scope metrics exports to infra-only symbols

8280bcf

Keep PR #638 focused on VLM infrastructure by removing exports for downstream metric classes and restoring Rapidata export from the base branch. Co-authored-by: Cursor <cursoragent@cursor.com>

fix(evaluation): VLM infra review fixes for stacked PR #638

ac1ff97

Logprob None check, shared OneIG grid helpers, pyproject extras restore, temporary CI on feat/vlm-pr-* bases, and clearer LiteLLM documentation. Co-authored-by: Cursor <cursoragent@cursor.com>

fix(ci): lint/docstrings and stack-appropriate VLM tests

ee08890

Replace forward-import VLM test module on pre-e2e branches with infrastructure-only tests; propagate docstring and conftest fixes. Co-authored-by: Cursor <cursoragent@cursor.com>

fix(ci): ruff on infra VLM test template

8b48dfa

Co-authored-by: Cursor <cursoragent@cursor.com>

chore: drop local-only scripts from PR scope

8e4f3d0

Remove verify helper and duplicate infra test template from scripts/; tests live under tests/evaluation/ only. Co-authored-by: Cursor <cursoragent@cursor.com>

fix(metrics): document Enum boundary in TorchMetrics docstring

a22c48b

Match AlgorithmTag numpydoc pattern so docstring checks pass on Python 3.11. Co-authored-by: Cursor <cursoragent@cursor.com>

davidberenstein1957 force-pushed the feat/vlm-pr-2-infrastructure branch from 4eb78b7 to a22c48b July 2, 2026 13:25

github-actions Bot removed the stale label Jul 3, 2026

davidberenstein1957 wants to merge 8 commits into feat/vlm-pr-1-vendor from feat/vlm-pr-2-infrastructure
