FEAT Bijection Learning attack by u7k4rs6 · Pull Request #1909 · microsoft/PyRIT

u7k4rs6 · 2026-06-03T19:46:34Z

Closes #1903.

Summary

Implements Bijection Learning (Huang et al., Haize Labs, arXiv:2410.01294, ICLR 2025), a scale-agnostic jailbreak that teaches a target model a randomly generated character mapping in-context, sends the objective encoded in that "bijection language," and decodes the response back to English. Because the mapping is random and unique per attempt, keyword and pattern-based defenses do not transfer, and the encoding complexity can be tuned to the target's capability. The paper reports up to an 86.3% attack success rate against Claude 3.5 Sonnet on HarmBench and finds the attack grows stronger on more capable models. It is listed in the MLCommons jailbreak taxonomy.
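The core idea can be illustrated with a minimal, standard-library sketch of a random letter bijection and its inverse. This is not the code in this PR; `make_letter_bijection` and `apply_mapping` are hypothetical names used only for illustration.

```python
import random
import string


def make_letter_bijection(seed=None):
    """Return a random permutation of the lowercase alphabet as a dict."""
    rng = random.Random(seed)
    shuffled = list(string.ascii_lowercase)
    rng.shuffle(shuffled)
    return dict(zip(string.ascii_lowercase, shuffled))


def apply_mapping(text, mapping):
    """Substitute each lowercase letter; leave every other character unchanged."""
    return "".join(mapping.get(ch, ch) for ch in text.lower())


mapping = make_letter_bijection(seed=0)
# A bijection is invertible, so swapping keys and values yields the decoder.
inverse = {v: k for k, v in mapping.items()}

encoded = apply_mapping("hello world", mapping)
decoded = apply_mapping(encoded, inverse)
```

Because the mapping is a permutation, decoding with the inverse recovers the lowercased input exactly.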

Design

Two pieces:

BijectionConverter(PromptConverter) is bidirectional via a direction parameter. In "encode" mode it generates the mapping, builds the teaching preamble (mapping table plus N benign example pairs), and encodes the objective. In "decode" mode it inverts a supplied mapping and emits no preamble, which makes it suitable as a response converter; this mode requires custom_mapping and auto-detects digit_length from it. Encode mode follows the same shape as CaesarConverter / MorseConverter / AtbashConverter; decode mode runs through convert_async so it plugs into PyRIT's response-converter pipeline.
BijectionLearningAttack(PromptSendingAttack) sends the plain objective and wires a fresh pair of converters per attempt for best-of-N. The encode converter is appended after any user-supplied request converters, so existing request converters run first and bijection encoding is last. A matching decode converter built from the same per-attempt mapping is prepended to the response converters, so decoding happens before any user response converters or the scorer see the text. Conversation setup, retry bookkeeping, and AttackResult construction are inherited from PromptSendingAttack.
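As a rough illustration of the teaching preamble described above (a mapping table followed by benign example pairs), here is a sketch. The exact wording and layout in this PR live in bijection_description.yaml and are not reproduced here; `build_preamble` and its output format are assumptions.

```python
def build_preamble(mapping, examples):
    """Render a mapping table followed by plaintext/encoded example pairs."""
    table = "\n".join(f"{src} -> {dst}" for src, dst in sorted(mapping.items()))
    shots = []
    for text in examples:
        encoded = "".join(mapping.get(ch, ch) for ch in text.lower())
        shots.append(f"English: {text}\nBijection: {encoded}")
    return "Mapping:\n" + table + "\n\nExamples:\n" + "\n\n".join(shots)


# Toy three-letter mapping so the rendered preamble stays short.
toy_mapping = {"a": "b", "b": "c", "c": "a"}
preamble = build_preamble(toy_mapping, ["abc", "cab"])
```

The number of entries in `examples` plays the role of num_teaching_shots in this sketch.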

The per-attempt mapping is the key constraint: the encode and decode converters for an attempt share one mapping, and each attempt rebuilds the pair from a fresh mapping.
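The ordering and pairing rules can be sketched with plain functions standing in for converters. This is a simplified model, not PyRIT's converter configuration API; `build_chains`, `substitute`, and `run_chain` are hypothetical helpers.

```python
import random
import string


def fresh_mapping(rng):
    """Draw a new random letter permutation from the given generator."""
    shuffled = list(string.ascii_lowercase)
    rng.shuffle(shuffled)
    return dict(zip(string.ascii_lowercase, shuffled))


def substitute(mapping):
    """Return a text -> text function that applies the mapping."""
    return lambda text: "".join(mapping.get(ch, ch) for ch in text)


def build_chains(user_request, user_response, rng):
    """Build the request and response chains for a single attempt."""
    mapping = fresh_mapping(rng)
    inverse = {v: k for k, v in mapping.items()}
    # Encoding runs last on the request; decoding runs first on the response.
    request_chain = list(user_request) + [substitute(mapping)]
    response_chain = [substitute(inverse)] + list(user_response)
    return request_chain, response_chain


def run_chain(chain, text):
    for convert in chain:
        text = convert(text)
    return text


rng = random.Random(0)
request_chain, response_chain = build_chains([str.lower], [str.upper], rng)
sent = run_chain(request_chain, "Hello")
# Pretend the target simply echoes the encoded text back.
received = run_chain(response_chain, sent)
```

Calling `build_chains` again draws a new mapping, which mirrors the fresh pair per attempt; the user-supplied `str.upper` only ever sees decoded text.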

Parameters

Two of the parameters below, fixed_points and digit_length, are the complexity controls from the paper and are exposed for per-target sweeping (the optimum is model-dependent; stronger models are jailbroken by more complex mappings):

direction: "encode" (default) or "decode" for the response-converter role
mapping_type: "digit" (maps each remapped letter to a zero-padded numeric code) or "letter" (permuted alphabet)
fixed_points: letters that map to themselves, range 0 to 25 (lower = more complex; 26 is rejected because it produces the identity mapping)
digit_length: numeric code length for the "digit" variant
num_teaching_shots: number of benign example pairs in the teaching preamble
seed: None for a fresh mapping per instance, an int for reproducibility
custom_mapping: supply a mapping directly (required in decode mode; mutually exclusive with seed / mapping_type / fixed_points in encode mode)
append_description: prepend the teaching preamble (encode mode only)
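To make the interaction of fixed_points and digit_length concrete, here is a sketch of a digit mapping generator with the validation described above. `make_digit_bijection` is a hypothetical function, and the check that digit_length leaves enough distinct codes is an assumption rather than documented behavior of this PR.

```python
import random
import string


def make_digit_bijection(fixed_points, digit_length, seed=None):
    """Map (26 - fixed_points) letters to distinct zero-padded numeric codes."""
    if not 0 <= fixed_points <= 25:
        # 26 fixed points would be the identity mapping.
        raise ValueError("fixed_points must be between 0 and 25")
    remapped_count = 26 - fixed_points
    if 10 ** digit_length < remapped_count:
        raise ValueError("digit_length is too small for distinct codes")

    rng = random.Random(seed)
    letters = list(string.ascii_lowercase)
    fixed = set(rng.sample(letters, fixed_points))
    codes = iter(rng.sample(range(10 ** digit_length), remapped_count))
    return {
        ch: (ch if ch in fixed else str(next(codes)).zfill(digit_length))
        for ch in letters
    }


mapping = make_digit_bijection(fixed_points=13, digit_length=2, seed=0)
```

Lowering fixed_points remaps more letters, and raising digit_length lengthens every code, so both move the encoding toward higher complexity.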

Usage

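from pyrit.executor.attack import BijectionLearningAttack

# target and scorer are assumed to be an already-configured prompt target and scorer.
attack = BijectionLearningAttack(
    objective_target=target,
    objective_scorer=scorer,
    mapping_type="digit",
    fixed_points=13,
    digit_length=2,
    num_teaching_shots=5,
)
# Run inside an async function or a notebook cell that supports top-level await.
result = await attack.execute_async(objective="...")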

Tests

71 new tests, all passing, with no regressions in the existing converter and single-turn attack suites (1,211 passed, 38 skipped across both).

test_bijection_converter.py (46): construction validation for both directions, fixed_points=26 rejection, decode mode (required custom_mapping, auto digit-length detection, encode-only params ignored), letter and digit roundtrips, digit decode with a fixed-point letter between numeric codes, mixed plaintext-framing robustness, truncated trailing digit, teaching preamble rendering, edge cases.
test_bijection_learning.py (25): plain-objective send (no pre-encoding), encode converter appended to the request chain, decode converter prepended to the response chain, shared mapping between paired converters, fresh mapping per attempt, ordering relative to user-supplied converters, parameter exclusions.
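Several of the decode cases listed above (a fixed-point letter between numeric codes, plaintext framing, a truncated trailing digit, digit-length detection) can be illustrated with a small greedy decoder. This is a sketch under assumed behavior, not the converter's implementation; `detect_digit_length` and `decode_digits` are hypothetical names.

```python
def detect_digit_length(mapping):
    """Infer the code length from the numeric values of a supplied mapping."""
    lengths = {len(code) for code in mapping.values() if code.isdigit()}
    if len(lengths) != 1:
        raise ValueError("mapping must use numeric codes of one length")
    return lengths.pop()


def decode_digits(text, mapping):
    """Invert a digit mapping; pass through anything that is not a known code."""
    digit_length = detect_digit_length(mapping)
    inverse = {code: letter for letter, code in mapping.items() if code != letter}
    out = []
    i = 0
    while i < len(text):
        chunk = text[i:i + digit_length]
        if chunk in inverse:
            out.append(inverse[chunk])
            i += digit_length
        else:
            # Fixed-point letters, punctuation, and stray digits are kept as-is.
            out.append(text[i])
            i += 1
    return "".join(out)


toy_mapping = {"a": "07", "b": "b", "c": "42"}  # "b" is a fixed point
```

A greedy scan like this can misread digits that appear in plaintext framing as codes, which is one reason the mixed-framing cases deserve explicit tests.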

Files

New:

pyrit/prompt_converter/bijection_converter.py
pyrit/datasets/prompt_converters/bijection_description.yaml
pyrit/executor/attack/single_turn/bijection_learning.py

Modified (exports):

pyrit/prompt_converter/__init__.py
pyrit/executor/attack/single_turn/__init__.py
pyrit/executor/attack/__init__.py

Checklist

pre-commit hooks pass
Unit tests added and passing locally
No regressions in existing converter and attack tests
Docstrings on the new converter and attack
Docs or demo entry if the project expects one for new converters/attacks


u7k4rs6 · 2026-06-03T19:52:25Z

@microsoft-github-policy-service agree

u7k4rs6 and others added 2 commits June 4, 2026 01:14

FEAT add BijectionConverter and BijectionLearningAttack (closes micro…

edbfc15

…soft#1903)

Merge branch 'main' into feat/bijection-learning

edfda50

u7k4rs6 wants to merge 2 commits into microsoft:main from u7k4rs6:feat/bijection-learning
