Add: Ethereum address detection to CryptoRecognizer #1837

Open
kyoungbinkim wants to merge 6 commits into microsoft:main from kyoungbinkim:crypto/eth

Contributor

@kyoungbinkim kyoungbinkim commented Jan 25, 2026

Add Ethereum address detection to CryptoRecognizer

Summary

Extends CryptoRecognizer to detect and validate Ethereum (ETH) addresses using EIP-55 checksum validation.

Ref

Changes

  • Added ETH address pattern recognition (0x[a-fA-F0-9]{40})
  • Implemented EIP-55 checksum validation using Keccak-256
  • Added eth-hash[pycryptodome] dependency
  • Added comprehensive test cases for ETH addresses
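
A minimal sketch of how the pattern match and the EIP-55 check listed above could fit together. The names `to_eip55` and `is_valid_eth_address` are hypothetical and not taken from the PR. The Keccak-256 function is passed in as a parameter, an assumption made here so the sketch does not depend on any particular hashing package. Accepting all-lowercase or all-uppercase addresses without a checksum is also an assumption of this sketch.

```python
import re
from typing import Callable

# Pattern quoted in the PR description.
ETH_ADDRESS_RE = re.compile(r"0x[a-fA-F0-9]{40}")


def to_eip55(hex_body: str, keccak256: Callable[[bytes], bytes]) -> str:
    """Return the 40 hex characters re-cased according to EIP-55.

    `keccak256` is a caller-supplied function (hypothetical parameter) that
    maps bytes to the 32-byte Keccak-256 digest.
    """
    lowered = hex_body.lower()
    digest_hex = keccak256(lowered.encode("ascii")).hex()
    # A letter is upper-cased when the matching digest nibble is >= 8.
    return "".join(
        ch.upper() if ch.isalpha() and int(digest_hex[i], 16) >= 8 else ch
        for i, ch in enumerate(lowered)
    )


def is_valid_eth_address(candidate: str, keccak256: Callable[[bytes], bytes]) -> bool:
    """Check the address shape first, then the EIP-55 casing if it is mixed-case."""
    if not ETH_ADDRESS_RE.fullmatch(candidate):
        return False
    body = candidate[2:]
    # Assumption: single-case addresses carry no checksum and are accepted.
    if body == body.lower() or body == body.upper():
        return True
    return body == to_eip55(body, keccak256)
```

Any callable that returns a Keccak-256 digest would work here, whether it comes from a third-party package or from a pure-Python implementation.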

Testing

  • All existing tests pass
  • Added 5+ test cases for ETH validation (valid/invalid checksums)
❯ ruff check .
All checks passed!
❯ poetry run pytest tests/test_crypto_recognizer.py
================================ test session starts ================================
platform linux -- Python 3.12.3, pytest-9.0.2, pluggy-1.6.0
rootdir: /home/kbin/workspace/presidio/presidio-analyzer
configfile: pyproject.toml
plugins: anyio-4.12.1, cov-7.0.0, mock-3.15.1
collected 15 items                                                                  

tests/test_crypto_recognizer.py ...............                               [100%]

================================= warnings summary ==================================
../.venv/lib/python3.12/site-packages/torch/jit/_script.py:1480
../.venv/lib/python3.12/site-packages/torch/jit/_script.py:1480
../.venv/lib/python3.12/site-packages/torch/jit/_script.py:1480
../.venv/lib/python3.12/site-packages/torch/jit/_script.py:1480
../.venv/lib/python3.12/site-packages/torch/jit/_script.py:1480
../.venv/lib/python3.12/site-packages/torch/jit/_script.py:1480
../.venv/lib/python3.12/site-packages/torch/jit/_script.py:1480
  /home/kbin/workspace/presidio/.venv/lib/python3.12/site-packages/torch/jit/_script.py:1480: DeprecationWarning: `torch.jit.script` is deprecated. Please switch to `torch.compile` or `torch.export`.
    warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
========================== 15 passed, 7 warnings in 0.13s ===========================

Checklist

  • I have reviewed the contribution guidelines
  • I have signed the CLA (if required)
  • My code includes unit tests
  • All unit tests and lint checks pass locally
  • My PR contains documentation updates / additions if required

@kyoungbinkim kyoungbinkim requested a review from a team as a code owner January 25, 2026 07:32
"pyyaml",
"phonenumbers (>=8.12,<10.0.0)",
"pydantic (>=2.0.0,<3.0.0)",
"eth-hash[pycryptodome]>=0.5.0",
Collaborator

Is there a simple way to add this functionality without relying on this 3rd party package?

Contributor Author

@omri374 Since hashlib does not support Keccak, I used a third-party package instead. Without an external library, I would have to implement Keccak-256 directly in Python, which is likely to be inefficient.

Contributor Author

# keccak256.py
# Pure Python Keccak-256 (NOT SHA3-256)

ROT = [
 [0, 36, 3, 41, 18],
 [1, 44, 10, 45, 2],
 [62, 6, 43, 15, 61],
 [28, 55, 25, 21, 56],
 [27, 20, 39, 8, 14],
]

RC = [
 0x0000000000000001, 0x0000000000008082,
 0x800000000000808A, 0x8000000080008000,
 0x000000000000808B, 0x0000000080000001,
 0x8000000080008081, 0x8000000000008009,
 0x000000000000008A, 0x0000000000000088,
 0x0000000080008009, 0x000000008000000A,
 0x000000008000808B, 0x800000000000008B,
 0x8000000000008089, 0x8000000000008003,
 0x8000000000008002, 0x8000000000000080,
 0x000000000000800A, 0x800000008000000A,
 0x8000000080008081, 0x8000000000008080,
 0x0000000080000001, 0x8000000080008008,
]

def rol(x, n):
    return ((x << n) | (x >> (64 - n))) & 0xFFFFFFFFFFFFFFFF

def keccak_f(state):
    for rnd in range(24):
        # θ
        C = [state[x] ^ state[x+5] ^ state[x+10] ^ state[x+15] ^ state[x+20] for x in range(5)]
        D = [C[(x-1)%5] ^ rol(C[(x+1)%5], 1) for x in range(5)]
        for x in range(5):
            for y in range(5):
                state[x + 5*y] ^= D[x]

        # ρ + π
        B = [0]*25
        for x in range(5):
            for y in range(5):
                B[y + 5*((2*x+3*y)%5)] = rol(state[x + 5*y], ROT[x][y])

        # χ
        for x in range(5):
            for y in range(5):
                state[x + 5*y] = B[x + 5*y] ^ ((~B[(x+1)%5 + 5*y]) & B[(x+2)%5 + 5*y])

        # ι
        state[0] ^= RC[rnd]

    return state


def keccak_256(data: bytes) -> bytes:
    rate = 1088 // 8  # 136 bytes
    state = [0] * 25

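    # Padding (Keccak pad10*1: first pad byte gets 0x01, last pad byte gets 0x80).
    # XOR-ing both markers also covers the one-byte case, where the pad is 0x81.
    pad_len = rate - (len(data) % rate)
    padded = bytearray(data) + bytearray(pad_len)
    padded[len(data)] ^= 0x01
    padded[-1] ^= 0x80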

    # Absorb
    for i in range(0, len(padded), rate):
        block = padded[i:i+rate]
        for j in range(rate // 8):
            state[j] ^= int.from_bytes(block[8*j:8*j+8], "little")
        state = keccak_f(state)

    # Squeeze
    out = bytearray()
    while len(out) < 32:
        for j in range(rate // 8):
            out += state[j].to_bytes(8, "little")
        if len(out) >= 32:
            break
        state = keccak_f(state)

    return bytes(out[:32])

Here is a pure-Python Keccak-256 implementation generated by GPT.

Contributor

@SharonHart SharonHart Feb 16, 2026

@kyoungbinkim
Any chance hashlib's SHA-3 can be used?

Contributor Author

As far as I know, SHA-3 had not yet been standardized when Ethereum was implemented, so ETH adopted the original Keccak-256. hashlib implements only the standardized SHA-3, not Keccak-256, so I don't think Keccak-256 can be computed with hashlib.
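
The difference can be checked with a short sketch. It compares `hashlib.sha3_256` on empty input against the widely published Keccak-256 digest of the empty byte string; the constant name `KECCAK256_EMPTY` is chosen here for illustration. The two digests differ because the standardized SHA-3 uses a different padding suffix than the original Keccak submission.

```python
import hashlib

# Widely published Keccak-256 digest of the empty byte string (the variant Ethereum uses).
KECCAK256_EMPTY = "c5d2460186f7233c927e7db2dcc703c0e500b653ca82273b7bfad8045d85a470"

# hashlib only exposes the standardized SHA3-256.
sha3_empty = hashlib.sha3_256(b"").hexdigest()

# The digests do not match, so sha3_256 cannot stand in for Keccak-256.
print(sha3_empty == KECCAK256_EMPTY)  # prints False
```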

Contributor

I wonder whether the checksum check is worth the extra code, or whether the regex is specific enough.
@omri374 ?

Collaborator

I agree, the regex patterns seem specific enough. If we see it causes many false positives, we can add the validation logic. I'd vote for not adding another dependency just for this at this point.
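
A regex-only detector along these lines might look like the following sketch. The function name `find_eth_addresses` and the added `\b` word boundaries are assumptions, not part of the PR. The boundaries are one way to avoid matching the first 40 characters of a longer hex string, such as a 64-character transaction hash.

```python
import re
from typing import List, Tuple

# PR pattern plus word boundaries (the boundaries are an assumption of this sketch).
ETH_ADDRESS_PATTERN = re.compile(r"\b0x[a-fA-F0-9]{40}\b")


def find_eth_addresses(text: str) -> List[Tuple[int, int, str]]:
    """Return (start, end, match) for every address-shaped token in `text`."""
    return [(m.start(), m.end(), m.group()) for m in ETH_ADDRESS_PATTERN.finditer(text)]


sample = "Send funds to 0x5aAeb6053F3E94C9b9A09f33669435E7Ef1BeAed today."
for start, end, address in find_eth_addresses(sample):
    print(start, end, address)
```

Without checksum validation, any 40-digit hex string with a `0x` prefix is reported, which is the trade-off accepted in this thread.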

Contributor

@SharonHart SharonHart Mar 5, 2026

@omri374, does the same apply to implementing it ourselves, i.e. only if deemed necessary?

Contributor

@kyoungbinkim would you be interested in continuing this, considering the conversation here?

3 participants