FEAT Bijection Learning attack #1909
Open
u7k4rs6 wants to merge 2 commits into
Closes #1903.
Summary
Implements Bijection Learning (Huang et al., Haize Labs, arXiv:2410.01294, ICLR 2025), a scale-agnostic jailbreak that teaches a target model a randomly generated character mapping in-context, sends the objective encoded in that "bijection language," and decodes the response back to English. Because the mapping is random and unique per attempt, keyword- and pattern-based defenses tuned to one encoding do not transfer to the next, and the encoding complexity can be tuned to the target's capability. The paper reports up to an 86.3% attack success rate against Claude 3.5 Sonnet on HarmBench and finds that the attack grows stronger on more capable models. It is listed in the MLCommons jailbreak taxonomy.
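For reviewers unfamiliar with the idea, the core mechanism is an invertible character substitution. The following is a minimal standalone sketch, not the code in this PR; the helper names (`make_letter_bijection`, `encode`, `decode`) are hypothetical and only model the "letter" variant with a number of guaranteed fixed points.

```python
import random
import string
from typing import Dict, Optional


def make_letter_bijection(fixed_points: int, seed: Optional[int] = None) -> Dict[str, str]:
    """Random permutation of a-z where `fixed_points` chosen letters map to themselves.

    Hypothetical helper: the remaining letters are shuffled among themselves,
    so a few of them may also land on themselves by chance.
    """
    if not 0 <= fixed_points <= 25:
        raise ValueError("fixed_points must be between 0 and 25")
    rng = random.Random(seed)
    letters = list(string.ascii_lowercase)
    fixed = set(rng.sample(letters, fixed_points))
    moving = [c for c in letters if c not in fixed]
    shuffled = moving[:]
    rng.shuffle(shuffled)
    mapping = {c: c for c in fixed}
    mapping.update(zip(moving, shuffled))
    return mapping


def encode(text: str, mapping: Dict[str, str]) -> str:
    # Characters outside the mapping (spaces, punctuation) pass through unchanged.
    return "".join(mapping.get(ch, ch) for ch in text.lower())


def decode(text: str, mapping: Dict[str, str]) -> str:
    inverse = {v: k for k, v in mapping.items()}
    return "".join(inverse.get(ch, ch) for ch in text)


if __name__ == "__main__":
    m = make_letter_bijection(fixed_points=10, seed=42)
    secret = encode("hello world", m)
    print(secret, "->", decode(secret, m))
```

Because the mapping is a permutation, decoding is just a lookup in the inverted dictionary, which is why the same per-attempt mapping can serve both directions.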
Design
Two pieces:
- `BijectionConverter(PromptConverter)` is bidirectional via a `direction` parameter. In `"encode"` mode it generates the mapping, builds the teaching preamble (mapping table plus N benign example pairs), and encodes the objective. In `"decode"` mode it inverts a supplied mapping with no preamble, suitable for use as a response converter (auto-detects `digit_length` from the supplied mapping; requires `custom_mapping`). Encode mode follows the same shape as `CaesarConverter` / `MorseConverter` / `AtbashConverter`; decode mode runs through `convert_async` so it plugs into PyRIT's response-converter pipeline.
- `BijectionLearningAttack(PromptSendingAttack)` sends the plain objective and wires a fresh pair of converters per attempt for best-of-N. The encode converter is appended after any user-supplied request converters, so existing request converters run first and bijection encoding is last. A matching decode converter built from the same per-attempt mapping is prepended to the response converters, so decoding happens before any user response converters or the scorer see the text. Conversation setup, retry bookkeeping, and `AttackResult` construction are inherited from `PromptSendingAttack`.

The per-attempt mapping is the key constraint: encode and decode share the mapping for that attempt and are rebuilt independently each iteration.
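The converter ordering described above can be sketched with plain callables. This is an illustration only and assumes nothing about PyRIT's real converter classes: `make_converter_pair`, `run_chain`, and `attempt` are hypothetical names, and the "target" is a stand-in function that echoes its input.

```python
import random
import string
from typing import Callable, List, Tuple

Converter = Callable[[str], str]


def make_converter_pair(rng: random.Random) -> Tuple[Converter, Converter]:
    """Build an encode/decode pair that shares one freshly generated mapping."""
    letters = list(string.ascii_lowercase)
    forward = dict(zip(letters, rng.sample(letters, len(letters))))
    backward = {v: k for k, v in forward.items()}

    def encode(text: str) -> str:
        return "".join(forward.get(ch, ch) for ch in text)

    def decode(text: str) -> str:
        return "".join(backward.get(ch, ch) for ch in text)

    return encode, decode


def run_chain(chain: List[Converter], text: str) -> str:
    for convert in chain:
        text = convert(text)
    return text


def attempt(
    objective: str,
    target: Callable[[str], str],
    user_request: List[Converter],
    user_response: List[Converter],
    rng: random.Random,
) -> str:
    # A new pair per attempt: both converters close over the same mapping.
    encode, decode = make_converter_pair(rng)
    request_chain = [*user_request, encode]    # bijection encoding runs last
    response_chain = [decode, *user_response]  # bijection decoding runs first
    reply = target(run_chain(request_chain, objective))
    return run_chain(response_chain, reply)


if __name__ == "__main__":
    echo = lambda s: s  # stand-in target that returns what it receives
    print(attempt("  hello world ", echo, [str.strip], [str.upper], random.Random(0)))
```

In this toy setup the ordering is observable: if `str.upper` ran before `decode`, the lowercase-only inverse mapping would no longer match and the text would stay encoded.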
Parameters
The two complexity controls, `fixed_points` and `digit_length`, come from the paper and are exposed for per-target sweeping (the optimum is model-dependent; stronger models are jailbroken by more complex mappings):
- `direction`: `"encode"` (default) or `"decode"` for the response-converter role
- `mapping_type`: `"digit"` (each remapped letter maps to a zero-padded numeric code) or `"letter"` (permuted alphabet)
- `fixed_points`: number of letters that map to themselves, range 0 to 25 (lower = more complex; 26 is rejected because it produces the identity mapping)
- `digit_length`: numeric code length for the `"digit"` variant
- `num_teaching_shots`: number of benign example pairs in the teaching preamble
- `seed`: `None` for a fresh mapping per instance, an int for reproducibility
- `custom_mapping`: supply a mapping directly (required in decode mode; mutually exclusive with `seed` / `mapping_type` / `fixed_points` in encode mode)
- `append_description`: prepend the teaching preamble (encode mode only)

Usage
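As a standalone illustration of how `fixed_points`, `digit_length`, and `seed` could interact in the `"digit"` variant, here is a hedged sketch that does not use the classes from this PR. The helpers `make_digit_mapping`, `encode_digits`, and `decode_digits` are hypothetical, and the sketch assumes fixed-width codes written back to back and plaintext that contains no digits of its own.

```python
import random
import string
from typing import Dict, Optional


def make_digit_mapping(fixed_points: int, digit_length: int, seed: Optional[int] = None) -> Dict[str, str]:
    """Map each non-fixed letter to a unique zero-padded code of `digit_length` digits."""
    if not 0 <= fixed_points <= 25:
        raise ValueError("fixed_points must be between 0 and 25")
    moving_count = 26 - fixed_points
    if digit_length < 1 or 10 ** digit_length < moving_count:
        raise ValueError("digit_length is too small for the number of remapped letters")
    rng = random.Random(seed)
    letters = list(string.ascii_lowercase)
    fixed = set(rng.sample(letters, fixed_points))
    moving = [c for c in letters if c not in fixed]
    codes = rng.sample(range(10 ** digit_length), len(moving))
    mapping = {c: c for c in fixed}
    for letter, code in zip(moving, codes):
        mapping[letter] = str(code).zfill(digit_length)
    return mapping


def encode_digits(text: str, mapping: Dict[str, str]) -> str:
    return "".join(mapping.get(ch, ch) for ch in text.lower())


def decode_digits(encoded: str, mapping: Dict[str, str]) -> str:
    codes = {v: k for k, v in mapping.items() if v.isdigit()}
    if not codes:
        return encoded
    width = len(next(iter(codes)))  # code width is inferred from the mapping itself
    out, i = [], 0
    while i < len(encoded):
        chunk = encoded[i:i + width]
        if chunk in codes:
            out.append(codes[chunk])
            i += width
        else:
            # Fixed-point letters, spaces, and truncated trailing digits pass through.
            out.append(encoded[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    m = make_digit_mapping(fixed_points=10, digit_length=2, seed=7)
    enc = encode_digits("the quick brown fox", m)
    print(enc, "->", decode_digits(enc, m))
```

Inferring the width from the mapping mirrors the idea that decode mode needs only the mapping; a fixed-point letter sitting between two numeric codes is handled by the pass-through branch.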
Tests
71 new tests, all passing, no regressions in the existing converter and single-turn attack suites (1,211 passed, 38 skipped across both).
- `test_bijection_converter.py` (46): construction validation for both directions, `fixed_points=26` rejection, decode mode (required `custom_mapping`, auto digit-length detection, encode-only params ignored), letter and digit roundtrips, digit decode with a fixed-point letter between numeric codes, mixed plaintext-framing robustness, truncated trailing digit, teaching preamble rendering, edge cases.
- `test_bijection_learning.py` (25): plain-objective send (no pre-encoding), encode converter appended to the request chain, decode converter prepended to the response chain, shared mapping between paired converters, fresh mapping per attempt, ordering relative to user-supplied converters, parameter exclusions.

Files
New:
- `pyrit/prompt_converter/bijection_converter.py`
- `pyrit/datasets/prompt_converters/bijection_description.yaml`
- `pyrit/executor/attack/single_turn/bijection_learning.py`

Modified (exports):
- `pyrit/prompt_converter/__init__.py`
- `pyrit/executor/attack/single_turn/__init__.py`
- `pyrit/executor/attack/__init__.py`

Checklist
- `pre-commit` hooks pass