FIX: Preserve DatasetConfiguration subclass when backend overrides dataset_names #1911
Open
varunj-msft wants to merge 1 commit into
…es dataset_names

ScenarioRunService._build_init_kwargs() used to construct a plain DatasetConfiguration whenever the caller passed dataset_names. This silently lost subclass-specific behavior such as EncodingDatasetConfiguration.get_all_seed_attack_groups(), which shapes each seed into a SeedAttackGroup with a synthetic objective.

The downstream symptom for the Encoding scenario was:

ValueError: SeedAttackGroup must have exactly one objective. Found 0.

raised during attack construction. Reproducible end-to-end against the real garak_slur_terms_en dataset.

Fix: when dataset_names is supplied, construct a fresh instance of the scenario's own default-dataset-config class so subclass overrides are preserved. Fall back to the plain DatasetConfiguration (with a logged warning) if a future subclass adds required __init__ kwargs we cannot populate.

The max_dataset_size-only path keeps reusing-and-mutating the throwaway introspection instance's default config (no behavior change).

Tests:
- 5 new regression tests, all of which fail against pre-fix code.
- All 30 existing tests still pass.
- Full backend suite: 619 passed, 4 skipped.
- Full scenario suite: 624 passed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Description
When dataset_names is passed through the backend (ScenarioRunService._build_init_kwargs), we used to always construct a plain DatasetConfiguration. That silently dropped subclass-specific behavior — most notably EncodingDatasetConfiguration.get_all_seed_attack_groups(), which shapes each seed into a SeedAttackGroup with a synthetic objective.
For garak.encoding this surfaced as a confusing runtime error during attack construction:
ValueError: SeedAttackGroup must have exactly one objective. Found 0.
Reproducible end-to-end against the real garak_slur_terms_en dataset.
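The failure mode can be illustrated with a toy model. The classes below are simplified stand-ins written for this example, not the project's real implementations: the constructor signatures, the `_load_seeds` helper, and the synthetic objective text are all assumptions. Only the class names, the `get_all_seed_attack_groups()` method name, and the error message come from this PR.

```python
class SeedAttackGroup:
    """Stand-in: rejects any group that does not carry exactly one objective."""

    def __init__(self, seeds, objectives):
        if len(objectives) != 1:
            raise ValueError(
                "SeedAttackGroup must have exactly one objective. "
                f"Found {len(objectives)}."
            )
        self.seeds = seeds
        self.objective = objectives[0]


class DatasetConfiguration:
    """Stand-in base class; the real constructor may differ."""

    def __init__(self, dataset_names=None, max_dataset_size=None):
        self.dataset_names = list(dataset_names or [])
        self.max_dataset_size = max_dataset_size

    def _load_seeds(self):
        # Hypothetical helper: pretend every dataset yields a single seed.
        return [f"seed-from-{name}" for name in self.dataset_names]

    def get_all_seed_attack_groups(self):
        # In this toy model, base-class seeds carry no objective.
        return [SeedAttackGroup([seed], []) for seed in self._load_seeds()]


class EncodingDatasetConfiguration(DatasetConfiguration):
    def get_all_seed_attack_groups(self):
        # The subclass attaches a synthetic objective to every seed.
        return [
            SeedAttackGroup([seed], [f"Decode and repeat: {seed}"])
            for seed in self._load_seeds()
        ]


def build_groups(config_cls, dataset_names):
    """Build attack groups using whichever configuration class is supplied."""
    return config_cls(dataset_names=dataset_names).get_all_seed_attack_groups()
```

In this model, `build_groups(DatasetConfiguration, ["garak_slur_terms_en"])` raises the ValueError quoted above, while `build_groups(EncodingDatasetConfiguration, ["garak_slur_terms_en"])` succeeds, which mirrors the pre-fix and post-fix behavior described here.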
Fix: when dataset_names is supplied, build a fresh instance of the scenario's own default-dataset-config class so subclass overrides are preserved. If a future subclass adds required init kwargs we can't populate, fall back to the plain DatasetConfiguration with a logged warning so the operator has a trail.
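A minimal sketch of that rebuild-with-fallback logic follows. It is not the code in this PR: `build_dataset_config`, `StrictDatasetConfiguration`, and the constructor signatures are hypothetical, and it assumes a missing required kwarg surfaces as a `TypeError` at construction time.

```python
import logging

logger = logging.getLogger(__name__)


class DatasetConfiguration:
    """Stand-in base class; the real constructor may differ."""

    def __init__(self, dataset_names=None, max_dataset_size=None):
        self.dataset_names = list(dataset_names or [])
        self.max_dataset_size = max_dataset_size


class EncodingDatasetConfiguration(DatasetConfiguration):
    """Stand-in subclass whose overrides must survive the rebuild."""


class StrictDatasetConfiguration(DatasetConfiguration):
    """Hypothetical future subclass with a required kwarg we cannot populate."""

    def __init__(self, *, required_extra, dataset_names=None, max_dataset_size=None):
        super().__init__(dataset_names=dataset_names, max_dataset_size=max_dataset_size)
        self.required_extra = required_extra


def build_dataset_config(default_config, dataset_names, max_dataset_size=None):
    """Rebuild the config with the scenario's own config class when possible."""
    config_cls = type(default_config)
    try:
        # Preferred path: a fresh instance of the subclass keeps its overrides.
        return config_cls(
            dataset_names=dataset_names, max_dataset_size=max_dataset_size
        )
    except TypeError:
        # The subclass needs init kwargs we cannot supply; leave a trail.
        logger.warning(
            "Could not construct %s from dataset_names alone; "
            "falling back to plain DatasetConfiguration.",
            config_cls.__name__,
        )
        return DatasetConfiguration(
            dataset_names=dataset_names, max_dataset_size=max_dataset_size
        )
```

Catching `TypeError` this broadly would also swallow a `TypeError` raised inside a subclass constructor for unrelated reasons, so the logged warning matters: it is the only signal that subclass behavior was dropped.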
The max_dataset_size-only path is unchanged — it still mutates the throwaway introspection instance's default config.
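This path never lost the subclass, because mutating an existing instance leaves its class untouched. A small sketch under assumptions: the stand-in classes, the `apply_max_dataset_size` helper, and a writable `max_dataset_size` attribute are all hypothetical and only illustrate the idea.

```python
class DatasetConfiguration:
    """Stand-in base class; the real constructor may differ."""

    def __init__(self, dataset_names=None, max_dataset_size=None):
        self.dataset_names = list(dataset_names or [])
        self.max_dataset_size = max_dataset_size


class EncodingDatasetConfiguration(DatasetConfiguration):
    """Stand-in subclass."""


def apply_max_dataset_size(default_config, max_dataset_size):
    """Reuse the scenario's default config and only adjust its size cap."""
    # In-place mutation: same object, same class, same dataset names.
    default_config.max_dataset_size = max_dataset_size
    return default_config
```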
This is the first in a series of small PRs for the Standardizing Scenarios work. It lands ahead of the Encoding scenario standardization PR, which depends on this fix to make the documented fast path usable via the API.
Tests and Documentation