FEAT: Add TatweelConverter for Arabic kashida insertion #1869

Merged
romanlutz merged 7 commits into microsoft:main from Raulster24:raulster24/add-tatweel-converter
Jun 2, 2026
Conversation

@Raulster24
Contributor

Description

Adds TatweelConverter, a deterministic PromptConverter that inserts the Arabic tatweel (kashida, U+0640) between adjacent Arabic letters. The tatweel is a connector that visually elongates a word without changing its meaning, so the output stays legible to a reader while changing the underlying code point and token sequence. Characters outside the main Arabic block, and Arabic letters not directly followed by another Arabic letter, are left untouched. tatweel_count controls how many tatweel are inserted per gap.
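The insertion rule described above can be sketched in a few lines of standalone Python. This is an illustration only, not the code merged in this PR: the names `insert_tatweel` and `is_arabic_letter` are hypothetical, and the letter test assumes the core alphabet ranges U+0621 to U+063A and U+0641 to U+064A, which may differ from the range the actual converter checks.

```python
TATWEEL = "\u0640"  # ARABIC TATWEEL (kashida)


def is_arabic_letter(ch: str) -> bool:
    # Assumption: "Arabic letter" means the core alphabet only.
    # U+0640 (the tatweel itself) is deliberately excluded.
    cp = ord(ch)
    return 0x0621 <= cp <= 0x063A or 0x0641 <= cp <= 0x064A


def insert_tatweel(text: str, tatweel_count: int = 1) -> str:
    # Reject non-positive counts, mirroring the invalid-count case
    # mentioned in the PR's test list.
    if tatweel_count < 1:
        raise ValueError("tatweel_count must be a positive integer")

    out = []
    for i, ch in enumerate(text):
        out.append(ch)
        has_next = i + 1 < len(text)
        # Insert only in the gap between two adjacent Arabic letters.
        if has_next and is_arabic_letter(ch) and is_arabic_letter(text[i + 1]):
            out.append(TATWEEL * tatweel_count)
    return "".join(out)


if __name__ == "__main__":
    print(insert_tatweel("\u0633\u0644\u0627\u0645"))       # one tatweel per gap
    print(insert_tatweel("\u0633\u0644\u0627\u0645", 3))    # three per gap
    print(insert_tatweel("hello"))                          # unchanged
```

Because the function has no randomness, the same input and `tatweel_count` always produce the same output, which is the determinism property the description refers to. Spaces, digits, and Latin text break the adjacency, so no tatweel is placed at word boundaries or in mixed-script gaps.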

Second in a small set of atomic Arabic-script converters, following BidiConverter (#1832).

cc @romanlutz

Tests and Documentation

  • Added tests/unit/prompt_converter/test_tatweel_converter.py: single-pair and multi-pair insertion, tatweel_count scaling, non-Arabic boundary handling, non-Arabic passthrough, empty input, determinism, invalid-count rejection, and unsupported-input-type rejection. All pass: uv run pytest tests/unit/prompt_converter/test_tatweel_converter.py
  • Registered in pyrit/prompt_converter/__init__.py (import + __all__).
  • Added a usage example to doc/code/converters/1_text_to_text_converters.py and regenerated the paired .ipynb plus the converter modality table in 0_converters.ipynb via JupyText (uv run jupytext --sync).
  • ruff and ty are clean; the converter-documentation conformance test passes.

@romanlutz romanlutz enabled auto-merge June 1, 2026 21:54
romanlutz and others added 2 commits June 1, 2026 16:48
@romanlutz romanlutz added this pull request to the merge queue Jun 2, 2026
Merged via the queue into microsoft:main with commit 376e000 Jun 2, 2026
48 checks passed
romanlutz added a commit to romanlutz/PyRIT that referenced this pull request Jun 2, 2026
Brings in 3 new commits from main:

- 648faa9 FEAT: Backfill class-level metadata for all remote seed datasets (microsoft#1780)
- 092126d MAINT: Migrate AddImage/AddTextImage converter deprecations to print_deprecation_message (microsoft#1875)
- 376e000 FEAT: Add TatweelConverter for Arabic kashida insertion (microsoft#1869)

Conflict resolution (11 files): took main's version everywhere
(`git checkout --theirs`), then re-ran `ruff check --fix` to
re-apply the PEP 604 sweep to main's new code (~36 violations
auto-fixed). Same hand-fix for the runtime `Optional[dict]` in
`pyrit/models/message_piece.py` PlainSerializer `return_type`
that ruff can't auto-rewrite.

Verification:
- ruff check pyrit/ tests/ doc/ - clean
- ruff format --check - clean
- pytest tests/unit -n 4 - 8977 passed, 5 skipped, 0 failures

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

2 participants