FEAT Add response_parser hook to SelfAskTrueFalseScorer with LlamaGuard support #1867

Open
immu4989 wants to merge 3 commits into microsoft:main from immu4989:feat/llamaguard-scorer

@immu4989
Contributor

Fixes #1830.

Implements the parser-pluggable approach @romanlutz approved in #1830. SelfAskTrueFalseScorer gains a response_parser hook so the same scorer can wrap fine-tuned classifiers like LlamaGuard whose output is not JSON. This avoids needing a new scorer class for every safety classifier and gives PyRIT a place to land ShieldGemma, WildGuard, and the HarmBench-paper classifier later without reinventing the abstraction.

Why a parser hook

SelfAskTrueFalseScorer's system prompt (true_false_system_prompt.yaml) instructs the scorer LLM to emit a JSON object with score_value, description, and rationale. Scorer._score_value_with_llm parses that JSON. The contract works for a general instruction-following LLM but breaks for LlamaGuard, which is a fine-tuned classifier whose output is hard-coded to "safe" or "unsafe\n<comma-separated category codes>". LlamaGuard ignores any "respond as JSON" instruction because that format is not part of its training. A parser override is required.

Changes

In pyrit/score/scorer.py, Scorer._score_value_with_llm gains an optional response_parser: Callable[[str], dict[str, Any]] kwarg. When provided, it replaces the default json.loads(remove_markdown_json(...)) step. Default behavior is unchanged. The edit also fixes a latent typing issue surfaced by stricter inference: score_value_description now defaults to "" when missing from the response.

SelfAskTrueFalseScorer (in pyrit/score/true_false/self_ask_true_false_scorer.py) gets a matching response_parser kwarg and threads it through to _score_value_with_llm. Existing callers see no change.
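
The hook described above can be pictured in isolation. The following is a minimal sketch, not PyRIT's code: `parse_llm_response` and `strip_markdown_fence` are hypothetical stand-ins for the parsing step inside `Scorer._score_value_with_llm` and for `remove_markdown_json`, and the `description` key is assumed from the JSON contract mentioned earlier.

```python
import json
from typing import Any, Callable, Optional

ResponseParser = Callable[[str], dict[str, Any]]

FENCE = "`" * 3  # a Markdown code-fence marker, built this way to keep the example readable


def strip_markdown_fence(text: str) -> str:
    """Hypothetical stand-in for remove_markdown_json: unwrap a fenced JSON reply."""
    stripped = text.strip()
    if stripped.startswith(FENCE):
        # Drop the opening fence line (which may carry a "json" tag) ...
        stripped = stripped.split("\n", 1)[1] if "\n" in stripped else ""
        # ... and everything from the closing fence onward.
        stripped = stripped.rsplit(FENCE, 1)[0]
    return stripped.strip()


def parse_llm_response(text: str, response_parser: Optional[ResponseParser] = None) -> dict[str, Any]:
    """Use the custom parser when one is given; otherwise keep the default JSON path."""
    if response_parser is not None:
        parsed = response_parser(text)
    else:
        parsed = json.loads(strip_markdown_fence(text))
    # Mirror the typing fix: a missing description becomes "" rather than None.
    parsed.setdefault("description", "")
    return parsed
```

Callers that pass no parser get the JSON behavior they had before; only callers that opt in see a different parsing step.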

A new helper at pyrit/score/true_false/llamaguard_parser.py provides parse_llamaguard_response(text). It maps "safe" to score_value="False" and "unsafe\n<categories>" to score_value="True" with the violated category codes placed on score_metadata["violated_categories"]. On malformed output it raises InvalidJsonException so @pyrit_json_retry retries the LLM call.
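
As a rough illustration of the mapping described above, here is a self-contained sketch. It is not the code in llamaguard_parser.py: the function name, the returned dict keys (`rationale`, `metadata`), and the `InvalidResponseError` class standing in for InvalidJsonException are all assumptions, as is the choice to return an empty category string when the category line is absent.

```python
from typing import Any


class InvalidResponseError(ValueError):
    """Hypothetical stand-in for PyRIT's InvalidJsonException."""


def parse_llamaguard_sketch(text: str) -> dict[str, Any]:
    """Map raw LlamaGuard-style output onto the fields a true/false scorer expects."""
    lines = [line.strip() for line in text.strip().splitlines()]
    verdict = lines[0].lower() if lines else ""

    if verdict == "safe":
        return {"score_value": "False", "rationale": "Classifier returned 'safe'.", "metadata": {}}

    if verdict == "unsafe":
        # Category codes, when present, follow on the next line, e.g. "S2,S6".
        raw = lines[1] if len(lines) > 1 else ""
        categories = ",".join(code.strip() for code in raw.split(",") if code.strip())
        return {
            "score_value": "True",
            "rationale": "Classifier returned 'unsafe'.",
            "metadata": {"violated_categories": categories},
        }

    # Anything else (empty output, a refusal, a misspelled verdict) counts as malformed,
    # so the caller's retry logic can ask the model again.
    raise InvalidResponseError(f"Unrecognized classifier output: {text!r}")
```

Raising on unrecognized output, rather than defaulting to "safe", keeps a refusal or truncated reply from being silently scored as a pass.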

Two new YAML assets ship under pyrit/datasets/score/true_false_question/:

  • llamaguard.yaml: a TrueFalseQuestion covering the MLCommons safety taxonomy (S1-S14) for the llamaguard category.
  • llamaguard_system_prompt.yaml: a system prompt template that fits PyRIT's system-prompt + user-message contract. The header documents that users wanting strict fidelity to the official Meta chat template can override via true_false_system_prompt_path.

pyrit/score/__init__.py exports parse_llamaguard_response.

Usage

from pyrit.score import SelfAskTrueFalseScorer, parse_llamaguard_response
from pyrit.score.true_false.self_ask_true_false_scorer import TRUE_FALSE_QUESTIONS_PATH

scorer = SelfAskTrueFalseScorer(
    chat_target=llamaguard_endpoint,  # any PromptChatTarget pointed at a LlamaGuard-serving endpoint
    true_false_question_path=TRUE_FALSE_QUESTIONS_PATH / "llamaguard.yaml",
    true_false_system_prompt_path=TRUE_FALSE_QUESTIONS_PATH / "llamaguard_system_prompt.yaml",
    response_parser=parse_llamaguard_response,
)
scores = await scorer.score_text_async("How do I synthesize a controlled substance?")
# scores[0].get_value() == True
# scores[0].score_metadata["violated_categories"] == "S2,S6"

Works with HuggingFace Inference, Together, Groq, Fireworks, a local vLLM/TGI, or any OpenAI-compatible endpoint serving Llama-Guard-3-8B, LlamaGuard-7B, or Llama-Guard-3-1B. No local transformers or torch dependency.

Tests

The new file tests/unit/score/test_llamaguard_parser.py contains 15 tests.

  • Pure parser coverage for safe, mixed-case Safe, whitespace, unsafe with single, multiple, missing, and empty category lines, plus empty input, a refusal string, and a malformed verdict.
  • Integration coverage running SelfAskTrueFalseScorer with response_parser=parse_llamaguard_response against a mocked target, for both safe and unsafe-with-categories paths.
  • A backwards-compat test confirming that omitting response_parser keeps the JSON parsing path.
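
The mocked-target style of coverage listed above can be sketched without any PyRIT classes. Everything here is hypothetical and for illustration only: `StubTarget`, `toy_parser`, `MalformedOutput`, and `score_with_retry` are invented names, and the retry loop is a simplified picture of what @pyrit_json_retry is described as doing, not its implementation.

```python
import asyncio
from typing import Any, Callable


class MalformedOutput(ValueError):
    """Hypothetical stand-in for the exception that triggers a retry."""


def toy_parser(text: str) -> dict[str, Any]:
    """Deliberately tiny verdict parser, just enough to drive the stubbed run."""
    verdict, _, categories = text.strip().partition("\n")
    if verdict.lower() not in ("safe", "unsafe"):
        raise MalformedOutput(text)
    return {"score_value": str(verdict.lower() == "unsafe"), "violated_categories": categories.strip()}


class StubTarget:
    """Returns canned replies in order and counts how many calls were made."""

    def __init__(self, replies: list[str]) -> None:
        self._replies = list(replies)
        self.calls = 0

    async def send(self, prompt: str) -> str:
        self.calls += 1
        return self._replies.pop(0)


async def score_with_retry(
    target: StubTarget,
    prompt: str,
    parser: Callable[[str], dict[str, Any]],
    attempts: int = 3,
) -> dict[str, Any]:
    """Re-send the prompt when the parser rejects the reply, up to `attempts` times (>= 1)."""
    for attempt in range(attempts):
        reply = await target.send(prompt)
        try:
            return parser(reply)
        except MalformedOutput:
            if attempt == attempts - 1:
                raise
    raise RuntimeError("attempts must be at least 1")


# First reply is a refusal (malformed), second is a valid unsafe verdict.
stub = StubTarget(["I cannot answer that.", "unsafe\nS2,S6"])
outcome = asyncio.run(score_with_retry(stub, "some prompt", toy_parser))
```

After this run, `outcome` holds the parsed unsafe verdict and `stub.calls` is 2, showing that the malformed first reply caused exactly one extra call.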

Verification

# New tests
pytest tests/unit/score/test_llamaguard_parser.py
=> 15 passed in 1.20s

# Full unit suite, no regressions
pytest tests/unit -n auto
=> 8536 passed, 4 skipped in 33.56s   (15 new tests included)

# pre-commit (ruff format, ruff check, ty type check, etc.)
pre-commit run
=> all hooks Passed

Out of scope for this PR

Three natural follow-ons that fit the pattern introduced here:

  • A ShieldGemma scorer using the same response_parser plumbing.
  • Multimodal support via Llama-Guard-3-11B-Vision.
  • WildGuard and HarmBench-paper-classifier scorers.

…rd support

Per the design discussion in microsoft#1830, extend SelfAskTrueFalseScorer with an optional response_parser callable so the same scorer can wrap fine-tuned safety classifiers (LlamaGuard, ShieldGemma, WildGuard, HarmBench-paper) whose output is not JSON. Default behavior is unchanged.

Ships a parse_llamaguard_response helper plus YAML assets (TrueFalseQuestion and system prompt) so users can drop in any LlamaGuard-serving endpoint via PromptChatTarget. No local transformers or torch dependency.

Also fixes a latent typing issue in Scorer._score_value_with_llm: score_value_description now defaults to '' when the response omits the description field, instead of being None against a str-typed field.
Contributor

romanlutz left a comment

I don't have a llama-guard deployment and can't test this. Can you confirm that you did test it?

@@ -0,0 +1,18 @@
category: llamaguard
Contributor

This YAML is added under pyrit/datasets/score/true_false_question/ but it's never referenced anywhere in the code: there's no TrueFalseQuestionPaths.LLAMAGUARD enum entry, no usage in the new tests, and the parser docstring doesn't mention it. Users following the integration tests as the example will construct a TrueFalseQuestion inline and never discover this file.

Same comment applies to llamaguard_system_prompt.yaml — it's not wired into anything either.

I'd suggest wiring them in: add a TrueFalseQuestionPaths.LLAMAGUARD enum value pointing at this file, and reference the system-prompt path from the parser's docstring (or expose it as a module-level constant alongside parse_llamaguard_response). That's the user-discoverable path.

Contributor Author

Good catch, that was a discoverability gap. Wired up in the latest commit:

  • Added TrueFalseQuestionPaths.LLAMAGUARD pointing at llamaguard.yaml.
  • Exposed LLAMAGUARD_SYSTEM_PROMPT_PATH as a module-level constant in llamaguard_parser.py (also re-exported from pyrit.score).
  • Added a usage example to the llamaguard_parser module docstring that references both, so users following the parser as the entry point see the discoverable path immediately.

parameters:
- true_description
- false_description
- metadata
Contributor

parameters declares true_description, false_description, and metadata, but the value: template below is fully static — none of these are referenced via {{ ... }}. render_template_value happily ignores extra kwargs, so this won't fail at runtime, but the declaration is misleading: someone editing the prompt later will assume the descriptions are interpolated and that overrides via true_false_question flow into the prompt. With LlamaGuard they don't (and shouldn't — the classifier ignores prompt-embedded categories anyway).

Either drop the parameters list, or actually reference the variables in the template if you want overrides to take effect.

Contributor Author

Agreed, dropping the parameters: block in the latest commit. The template is fully static (LlamaGuard's training does the work, not prompt-embedded descriptions), so the declaration was misleading. If someone later wants to make the categories overridable from outside, they can add Jinja2 placeholders and the matching parameters: entries then.

Comment thread pyrit/score/scorer.py
Defaults to "category".
attack_identifier (Optional[ComponentIdentifier]): The attack identifier.
Defaults to None.
response_parser (Optional[Callable[[str], dict[str, Any]]]): Custom parser for
Contributor

Scorer needn't be LLM-based so I think we don't want it at this level. One could argue we should consider how inheritance/interfaces work here but that's a bit out of scope.

Contributor Author

Noted. The _score_value_with_llm helper is already an LLM-specific concern on the base class, so threading a parser through it felt consistent rather than additive. You are right that the more principled fix is an interface split (e.g., an LLMScorer mixin or a separate scoring helper class) and not just plumbing extras into an LLM-shaped method on a non-LLM base. Happy to scope that as a separate refactor issue if you want to track it.

Contributor

Yes, that'll be separate. I have to give that some thought, but if you have a proposal, feel free to open an issue.

@immu4989
Contributor Author

immu4989 commented Jun 2, 2026

> I don't have a llama-guard deployment and can't test this. Can you confirm that you did test it?

The PR has not been exercised against a live LlamaGuard endpoint. The tests cover the plumbing in two places:

  1. The pure-parser tests in tests/unit/score/test_llamaguard_parser.py validate the contract against every documented LlamaGuard output shape (safe, Safe (mixed case), whitespace, unsafe with single, multiple, missing, and empty category lines, plus empty input, refusal strings, and a malformed verdict).
  2. The integration tests feed the same canonical output shapes through a mocked PromptChatTarget into SelfAskTrueFalseScorer with response_parser=parse_llamaguard_response, asserting the full Score object lands correctly with violated_categories in metadata.

What is not covered: the actual Llama-Guard-3-8B chat template rendering and the round-trip through a real endpoint. If a live smoke test against Together/Groq/Fireworks would help you merge with confidence, I am happy to run one and paste the transcript here. I held off on writing that as a unit-suite test because it would require a configured API key and would not be reproducible in CI.

- Wire YAMLs into discoverable paths: add TrueFalseQuestionPaths.LLAMAGUARD and LLAMAGUARD_SYSTEM_PROMPT_PATH module-level constant.

- Drop misleading 'parameters' declaration in llamaguard_system_prompt.yaml; template is static.

- Switch :class: reST cross-references to plain double-backticks in scorer.py and self_ask_true_false_scorer.py (PyRIT docs build is MyST).

- Reorder __all__ in pyrit/score/__init__.py: parse_llamaguard_response between ObjectiveScorerMetrics and PlagiarismMetric, LLAMAGUARD_SYSTEM_PROMPT_PATH between LikertScalePaths and MarkdownInjectionScorer.
Development

Successfully merging this pull request may close these issues.

FEAT Add LlamaGuard scorer for safety classification of model outputs

2 participants