FEAT Add response_parser hook to SelfAskTrueFalseScorer with LlamaGuard support#1867
immu4989 wants to merge 3 commits into
…rd support
Per the design discussion in microsoft#1830, extend SelfAskTrueFalseScorer with an optional response_parser callable so the same scorer can wrap fine-tuned safety classifiers (LlamaGuard, ShieldGemma, WildGuard, HarmBench-paper) whose output is not JSON. Default behavior is unchanged.
Ships a parse_llamaguard_response helper plus YAML assets (TrueFalseQuestion and system prompt) so users can drop in any LlamaGuard-serving endpoint via PromptChatTarget. No local transformers or torch dependency.
Also fixes a latent typing issue in Scorer._score_value_with_llm: score_value_description now defaults to '' when the response omits the description field, instead of being None against a str-typed field.
romanlutz left a comment
I don't have a llama-guard deployment and can't test this. Can you confirm that you did test it?
| @@ -0,0 +1,18 @@ | |||
| category: llamaguard | |||
This YAML is added under pyrit/datasets/score/true_false_question/ but it's never referenced anywhere in the code: there's no TrueFalseQuestionPaths.LLAMAGUARD enum entry, no usage in the new tests, and the parser docstring doesn't mention it. Users following the integration tests as the example will construct a TrueFalseQuestion inline and never discover this file.
Same comment applies to llamaguard_system_prompt.yaml — it's not wired into anything either.
I'd suggest wiring them in: add a TrueFalseQuestionPaths.LLAMAGUARD enum value pointing at this file, and reference the system-prompt path from the parser's docstring (or expose it as a module-level constant alongside parse_llamaguard_response). That's the user-discoverable path.
There was a problem hiding this comment.
Good catch, that was a discoverability gap. Wired up in the latest commit:
- Added TrueFalseQuestionPaths.LLAMAGUARD pointing at llamaguard.yaml.
- Exposed LLAMAGUARD_SYSTEM_PROMPT_PATH as a module-level constant in llamaguard_parser.py (also re-exported from pyrit.score).
- Added a usage example to the llamaguard_parser module docstring that references both, so users following the parser as the entry point see the discoverable path immediately.
| parameters: | ||
| - true_description | ||
| - false_description | ||
| - metadata |
parameters declares true_description, false_description, and metadata, but the value: template below is fully static — none of these are referenced via {{ ... }}. render_template_value happily ignores extra kwargs, so this won't fail at runtime, but the declaration is misleading: someone editing the prompt later will assume the descriptions are interpolated and that overrides via true_false_question flow into the prompt. With LlamaGuard they don't (and shouldn't — the classifier ignores prompt-embedded categories anyway).
Either drop the parameters list, or actually reference the variables in the template if you want overrides to take effect.
There was a problem hiding this comment.
Agreed, dropping the parameters: block in the latest commit. The template is fully static (LlamaGuard's training does the work, not prompt-embedded descriptions), so the declaration was misleading. If someone later wants to make the categories overridable from outside, they can add Jinja2 placeholders and the matching parameters: entries then.
| Defaults to "category". | ||
| attack_identifier (Optional[ComponentIdentifier]): The attack identifier. | ||
| Defaults to None. | ||
| response_parser (Optional[Callable[[str], dict[str, Any]]]): Custom parser for |
Scorer needn't be LLM-based so I think we don't want it at this level. One could argue we should consider how inheritance/interfaces work here but that's a bit out of scope.
There was a problem hiding this comment.
Noted. The _score_value_with_llm helper is already an LLM-specific concern on the base class, so threading a parser through it felt consistent rather than additive. You are right that the more principled fix is an interface split (e.g., an LLMScorer mixin or a separate scoring helper class) and not just plumbing extras into an LLM-shaped method on a non-LLM base. Happy to scope that as a separate refactor issue if you want to track it.
There was a problem hiding this comment.
Yes that'll be separate. I have to give that some thought but if you have a proposal feel free to open an issue.
The PR has not been exercised against a live LlamaGuard endpoint. The tests cover the plumbing in two places:
What is not covered: the actual LlamaGuard-3-8B chat template rendering and the round-trip through a real endpoint. If a live smoke test against Together/Groq/Fireworks would help you merge with confidence, I am happy to run one and paste the transcript here. I held off on writing that as a unit-suite test because it would require a configured API key and would not be reproducible in CI.
- Wire YAMLs into discoverable paths: add TrueFalseQuestionPaths.LLAMAGUARD and LLAMAGUARD_SYSTEM_PROMPT_PATH module-level constant.
- Drop misleading 'parameters' declaration in llamaguard_system_prompt.yaml; template is static.
- Switch :class: reST cross-references to plain double-backticks in scorer.py and self_ask_true_false_scorer.py (PyRIT docs build is MyST).
- Reorder __all__ in pyrit/score/__init__.py: parse_llamaguard_response between ObjectiveScorerMetrics and PlagiarismMetric, LLAMAGUARD_SYSTEM_PROMPT_PATH between LikertScalePaths and MarkdownInjectionScorer.
Fixes #1830.
Implements the parser-pluggable approach @romanlutz approved in #1830.
SelfAskTrueFalseScorer gains a response_parser hook so the same scorer can wrap fine-tuned classifiers like LlamaGuard whose output is not JSON. This avoids needing a new scorer class for every safety classifier and gives PyRIT a place to land ShieldGemma, WildGuard, and the HarmBench-paper classifier later without reinventing the abstraction.
Why a parser hook
SelfAskTrueFalseScorer's system prompt (true_false_system_prompt.yaml) instructs the scorer LLM to emit a JSON object with score_value, description, and rationale. Scorer._score_value_with_llm parses that JSON. The contract works for a general instruction-following LLM but breaks for LlamaGuard, which is a fine-tuned classifier whose output is hard-coded to "safe" or "unsafe\n<comma-separated category codes>". LlamaGuard ignores any "respond as JSON" instruction because that format is not part of its training. A parser override is required.
Changes
- In pyrit/score/scorer.py, Scorer._score_value_with_llm gains an optional response_parser: Callable[[str], dict[str, Any]] kwarg. When provided, it replaces the default json.loads(remove_markdown_json(...)) step. Default behavior is unchanged. The edit also fixes a latent typing issue surfaced by stricter inference: score_value_description now defaults to "" when missing from the response.
- SelfAskTrueFalseScorer (in pyrit/score/true_false/self_ask_true_false_scorer.py) gets a matching response_parser kwarg and threads it through to _score_value_with_llm. Existing callers see no change.
- A new helper at pyrit/score/true_false/llamaguard_parser.py provides parse_llamaguard_response(text). It maps "safe" to score_value="False" and "unsafe\n<categories>" to score_value="True" with the violated category codes placed on score_metadata["violated_categories"]. On malformed output it raises InvalidJsonException so @pyrit_json_retry retries the LLM call.
- Two new YAML assets ship under pyrit/datasets/score/true_false_question/:
  - llamaguard.yaml: a TrueFalseQuestion covering the MLCommons safety taxonomy (S1-S14) for the llamaguard category.
  - llamaguard_system_prompt.yaml: a system prompt template that fits PyRIT's system-prompt + user-message contract. The header documents that users wanting strict fidelity to the official Meta chat template can override via true_false_system_prompt_path.
- pyrit/score/__init__.py exports parse_llamaguard_response.
Usage
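For orientation, here is a minimal self-contained sketch of the parser contract and the hook dispatch described under Changes. It is not the PR's code: parse_guard_verdict, parse_response, and MalformedResponseError are hypothetical names standing in for parse_llamaguard_response, the parsing step inside Scorer._score_value_with_llm, and InvalidJsonException. The exact dict keys and the handling of an "unsafe" verdict with no category line are assumptions.

```python
import json
from typing import Any, Callable, Optional


class MalformedResponseError(ValueError):
    """Hypothetical stand-in for PyRIT's InvalidJsonException."""


def parse_guard_verdict(text: str) -> dict[str, Any]:
    """Map a LlamaGuard-style verdict to a scorer-shaped dict (key names assumed)."""
    # Drop blank lines and surrounding whitespace before inspecting the verdict.
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if not lines:
        raise MalformedResponseError("empty classifier response")

    verdict = lines[0].lower()
    if verdict == "safe":
        return {"score_value": "False", "rationale": "classifier returned safe"}
    if verdict == "unsafe":
        # Category codes, when present, sit on the second line, e.g. "S1,S10".
        codes = lines[1].split(",") if len(lines) > 1 else []
        categories = [code.strip() for code in codes if code.strip()]
        return {
            "score_value": "True",
            "rationale": "classifier returned unsafe",
            # Assumption: a missing category line yields an empty list.
            "score_metadata": {"violated_categories": categories},
        }
    # Anything else (a refusal, free text) is treated as malformed.
    raise MalformedResponseError(f"unrecognized verdict: {lines[0]!r}")


def parse_response(
    raw: str,
    response_parser: Optional[Callable[[str], dict[str, Any]]] = None,
) -> dict[str, Any]:
    """Use the custom parser when one is supplied, otherwise parse JSON."""
    if response_parser is not None:
        return response_parser(raw)
    return json.loads(raw)


if __name__ == "__main__":
    print(parse_response("unsafe\nS1,S10", parse_guard_verdict))
    print(parse_response('{"score_value": "False"}'))
```

In the PR itself the equivalent wiring would presumably pass response_parser=parse_llamaguard_response to SelfAskTrueFalseScorer, together with TrueFalseQuestionPaths.LLAMAGUARD and LLAMAGUARD_SYSTEM_PROMPT_PATH; the exact constructor arguments are not shown on this page.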
Works with HuggingFace Inference, Together, Groq, Fireworks, a local vLLM/TGI, or any OpenAI-compatible endpoint serving Llama-Guard-3-8B, LlamaGuard-7B, or Llama-Guard-3-1B. No local transformers or torch dependency.
Tests
The new file tests/unit/score/test_llamaguard_parser.py contains 15 tests.
- safe, mixed-case Safe, whitespace, unsafe with single, multiple, missing, and empty category lines, plus empty input, a refusal string, and a malformed verdict.
- SelfAskTrueFalseScorer with response_parser=parse_llamaguard_response against a mocked target, for both safe and unsafe-with-categories paths.
- … response_parser keeps the JSON parsing path.
Verification
Out of scope for this PR
Three natural follow-ons that fit the pattern introduced here:
- … response_parser plumbing.
- … Llama-Guard-3-11B-Vision.