FEAT Bijection Learning attack #1909

Open: u7k4rs6 wants to merge 2 commits into microsoft:main from u7k4rs6:feat/bijection-learning

u7k4rs6 commented Jun 3, 2026

Closes #1903.

Summary

Implements Bijection Learning (Huang et al., Haize Labs, arXiv:2410.01294, ICLR 2025), a scale-agnostic jailbreak that teaches the target model a randomly generated character mapping in-context, sends the objective encoded in that "bijection language," and decodes the response back to English. Because the mapping is random and unique per attempt, keyword- and pattern-based defenses tuned to one encoding do not transfer to the next, and the encoding complexity can be tuned to the target's capability. The paper reports up to an 86.3% attack success rate against Claude 3.5 Sonnet on HarmBench and finds that the attack grows stronger on more capable models. It is listed in the MLCommons jailbreak taxonomy.
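For reviewers unfamiliar with the technique, the core encode/decode roundtrip can be sketched in a few lines of plain Python. This is an illustration only, not the code in this PR: the helper names (make_letter_bijection, apply_mapping, invert) are hypothetical, it covers only the permuted-alphabet case, and it omits the teaching preamble.

```python
import random
import string


def make_letter_bijection(rng: random.Random) -> dict[str, str]:
    """Return a random permutation of the lowercase alphabet as a dict."""
    letters = list(string.ascii_lowercase)
    shuffled = letters[:]
    rng.shuffle(shuffled)
    return dict(zip(letters, shuffled))


def apply_mapping(text: str, mapping: dict[str, str]) -> str:
    """Map each lowercase letter; leave every other character unchanged."""
    return "".join(mapping.get(ch, ch) for ch in text.lower())


def invert(mapping: dict[str, str]) -> dict[str, str]:
    """Invert the mapping (valid because no two letters share an image)."""
    return {value: key for key, value in mapping.items()}


# A seeded generator keeps the illustration reproducible.
mapping = make_letter_bijection(random.Random(0))
encoded = apply_mapping("hello world", mapping)
decoded = apply_mapping(encoded, invert(mapping))
```

Because the mapping is a bijection, applying the inverse always recovers the lowercased input.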

Design

Two pieces:

  • BijectionConverter(PromptConverter) is bidirectional via a direction parameter. In "encode" mode it generates the mapping, builds the teaching preamble (mapping table plus N benign example pairs), and encodes the objective. In "decode" mode it inverts a supplied mapping and emits no preamble, which makes it suitable as a response converter; this mode requires custom_mapping and auto-detects digit_length from it. Encode mode follows the same shape as CaesarConverter / MorseConverter / AtbashConverter; decode mode runs through convert_async so it plugs into PyRIT's response-converter pipeline.
  • BijectionLearningAttack(PromptSendingAttack) sends the plain objective and wires a fresh pair of converters per attempt for best-of-N. The encode converter is appended after any user-supplied request converters, so existing request converters run first and bijection encoding is last. A matching decode converter built from the same per-attempt mapping is prepended to the response converters, so decoding happens before any user response converters or the scorer see the text. Conversation setup, retry bookkeeping, and AttackResult construction are inherited from PromptSendingAttack.

The per-attempt mapping is the key constraint: within an attempt, the encode and decode converters share one mapping; across attempts, the pair is rebuilt from a fresh mapping.
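The ordering and pairing rules above can be illustrated with plain callables standing in for converters. This is a sketch under stated assumptions rather than the PyRIT wiring: build_pair, run_chain, attempt_chains, and the echoing stand-in target are all hypothetical names introduced here.

```python
import random
import string
from typing import Callable

Converter = Callable[[str], str]


def build_pair(rng: random.Random) -> tuple[Converter, Converter]:
    """Build an encode/decode pair that shares one freshly sampled letter mapping."""
    letters = string.ascii_lowercase
    shuffled = rng.sample(letters, k=len(letters))
    forward = dict(zip(letters, shuffled))
    backward = {value: key for key, value in forward.items()}

    def encode(text: str) -> str:
        return "".join(forward.get(ch, ch) for ch in text)

    def decode(text: str) -> str:
        return "".join(backward.get(ch, ch) for ch in text)

    return encode, decode


def run_chain(text: str, chain: list[Converter]) -> str:
    """Apply converters left to right."""
    for convert in chain:
        text = convert(text)
    return text


def attempt_chains(user_request: list[Converter], user_response: list[Converter],
                   rng: random.Random) -> tuple[list[Converter], list[Converter]]:
    """Wire a fresh pair around the user-supplied converters for one attempt."""
    encode, decode = build_pair(rng)
    request_chain = [*user_request, encode]    # bijection encoding runs last
    response_chain = [decode, *user_response]  # decoding runs first
    return request_chain, response_chain


rng = random.Random(7)
results = []
for _ in range(3):  # three attempts, each with its own mapping
    request_chain, response_chain = attempt_chains([str.strip], [str.upper], rng)
    sent = run_chain("  say hi  ", request_chain)
    reply = sent  # stand-in target that simply echoes the encoded prompt
    results.append(run_chain(reply, response_chain))
```

If str.upper ran before decode, the uppercase letters would miss the lowercase inverse mapping and pass through undecoded, which is why the decode converter is placed first.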

Parameters

The two complexity controls, fixed_points and digit_length, come from the paper and are exposed for per-target sweeping (the optimum is model-dependent; stronger models tend to be jailbroken by more complex mappings):

  • direction: "encode" (default) or "decode" for the response-converter role
  • mapping_type: "digit" (each remapped letter maps to a zero-padded numeric code) or "letter" (permuted alphabet)
  • fixed_points: letters that map to themselves, range 0 to 25 (lower = more complex; 26 is rejected because it produces the identity mapping)
  • digit_length: numeric code length for the "digit" variant
  • num_teaching_shots: number of benign example pairs in the teaching preamble
  • seed: None for a fresh mapping per instance, an int for reproducibility
  • custom_mapping: supply a mapping directly (required in decode mode; mutually exclusive with seed / mapping_type / fixed_points in encode mode)
  • append_description: whether to prepend the teaching preamble to the encoded prompt (encode mode only)
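To make the interaction between mapping_type, fixed_points, digit_length, and seed concrete, here is a minimal standalone sketch of mapping generation. It is an assumption about one reasonable way to do it, not a copy of BijectionConverter: the function names generate_mapping and encode are hypothetical, and the cyclic rotation used for the "letter" variant is a choice made here so that exactly fixed_points letters map to themselves.

```python
import random
import string
from typing import Optional


def generate_mapping(mapping_type: str = "digit", fixed_points: int = 13,
                     digit_length: int = 2, seed: Optional[int] = None) -> dict[str, str]:
    """Sample a mapping in which `fixed_points` letters map to themselves."""
    if not 0 <= fixed_points <= 25:
        raise ValueError("fixed_points must be in 0..25; 26 would be the identity")
    rng = random.Random(seed)  # seed=None gives a fresh mapping each call
    letters = list(string.ascii_lowercase)
    fixed = set(rng.sample(letters, k=fixed_points))
    remapped = [ch for ch in letters if ch not in fixed]
    mapping = {ch: ch for ch in fixed}

    if mapping_type == "digit":
        if 10 ** digit_length < len(remapped):
            raise ValueError("digit_length is too small for unique codes")
        codes = rng.sample(range(10 ** digit_length), k=len(remapped))
        # Zero-pad so that every code has exactly digit_length characters.
        mapping.update({ch: f"{n:0{digit_length}d}" for ch, n in zip(remapped, codes)})
    elif mapping_type == "letter":
        # Rotate a shuffled list by one so no remapped letter keeps its place
        # (only possible when at least two letters are remapped).
        shuffled = remapped[:]
        rng.shuffle(shuffled)
        mapping.update(zip(shuffled, shuffled[1:] + shuffled[:1]))
    else:
        raise ValueError("mapping_type must be 'digit' or 'letter'")
    return mapping


def encode(text: str, mapping: dict[str, str]) -> str:
    """Encode lowercase letters; pass other characters through unchanged."""
    return "".join(mapping.get(ch, ch) for ch in text.lower())
```

Lower fixed_points leaves fewer letters readable in the encoded text, and a larger digit_length lengthens every code, which is how the two knobs raise complexity in this sketch.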

Usage

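from pyrit.executor.attack import BijectionLearningAttack

# `target` and `scorer` are assumed to be an already configured prompt target and scorer.
attack = BijectionLearningAttack(
    objective_target=target,
    objective_scorer=scorer,
    mapping_type="digit",
    fixed_points=13,
    digit_length=2,
    num_teaching_shots=5,
)
# Run inside an async context (for example, a notebook cell).
result = await attack.execute_async(objective="...")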
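Since the optimum is model-dependent, a per-target sweep over the two complexity controls might be set up as a grid of keyword arguments. The grid values below are hypothetical placeholders, not recommendations from the paper or this PR; only the parameter names come from the usage example above.

```python
from itertools import product

# Hypothetical grid; choose values per target.
fixed_points_grid = [0, 5, 10, 15, 20]
digit_length_grid = [2, 3]

sweep = [
    {
        "mapping_type": "digit",
        "fixed_points": fixed_points,
        "digit_length": digit_length,
        "num_teaching_shots": 5,
    }
    for fixed_points, digit_length in product(fixed_points_grid, digit_length_grid)
]

# Each entry could then be unpacked into the constructor, for example:
#   BijectionLearningAttack(objective_target=target, objective_scorer=scorer, **config)
```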

Tests

71 new tests, all passing, with no regressions in the existing converter and single-turn attack suites (1,211 passed, 38 skipped across both).

  • test_bijection_converter.py (46): construction validation for both directions, fixed_points=26 rejection, decode mode (required custom_mapping, auto digit-length detection, encode-only params ignored), letter and digit roundtrips, digit decode with a fixed-point letter between numeric codes, mixed plaintext-framing robustness, truncated trailing digit, teaching preamble rendering, edge cases.
  • test_bijection_learning.py (25): plain-objective send (no pre-encoding), encode converter appended to the request chain, decode converter prepended to the response chain, shared mapping between paired converters, fresh mapping per attempt, ordering relative to user-supplied converters, parameter exclusions.
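The digit-decode edge cases named above (a fixed-point letter between numeric codes, plaintext framing, a truncated trailing digit) are easiest to see against a small reference decoder. The sketch below is hypothetical and not the implementation under test: detect_digit_length and decode_digits are names introduced here, and passing unmatched characters through unchanged is an assumed behavior.

```python
def detect_digit_length(mapping: dict[str, str]) -> int:
    """Infer the code width from the numeric values of the supplied mapping."""
    lengths = {len(value) for value in mapping.values() if value.isdigit()}
    if len(lengths) != 1:
        raise ValueError("expected numeric codes of one uniform length")
    return lengths.pop()


def decode_digits(text: str, mapping: dict[str, str]) -> str:
    """Decode fixed-width numeric codes; pass everything else through."""
    width = detect_digit_length(mapping)
    inverse = {value: key for key, value in mapping.items()}
    out = []
    i = 0
    while i < len(text):
        chunk = text[i:i + width]
        if len(chunk) == width and chunk in inverse:
            out.append(inverse[chunk])  # a complete, known code
            i += width
        else:
            out.append(text[i])  # fixed-point letter, framing, or stray digit
            i += 1
    return "".join(out)


# Tiny mapping for illustration: "b" is a fixed point.
example_mapping = {"a": "17", "b": "b", "c": "42"}
between_codes = decode_digits("17b42", example_mapping)
with_framing = decode_digits("Sure: 1742!", example_mapping)
truncated = decode_digits("17424", example_mapping)
```

In this sketch a leftover digit that cannot complete a code is kept verbatim rather than dropped, so truncated model output stays visible to the scorer.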

Files

New:

  • pyrit/prompt_converter/bijection_converter.py
  • pyrit/datasets/prompt_converters/bijection_description.yaml
  • pyrit/executor/attack/single_turn/bijection_learning.py

Modified (exports):

  • pyrit/prompt_converter/__init__.py
  • pyrit/executor/attack/single_turn/__init__.py
  • pyrit/executor/attack/__init__.py

Checklist

  • pre-commit hooks pass
  • Unit tests added and passing locally
  • No regressions in existing converter and attack tests
  • Docstrings on the new converter and attack
  • Docs or demo entry if the project expects one for new converters/attacks

u7k4rs6 commented Jun 3, 2026

@microsoft-github-policy-service agree
