I build local-first evaluation infrastructure for coding agents.
Most AI coding demos answer a soft question: can the agent produce something that looks plausible? My work asks the harder engineering question:
Can an agent solve a real task mined from Git history, under hidden tests, with a trace we can inspect and a run ledger we can verify later?
That is the center of this GitHub profile: small, inspectable systems for agentic AI evaluation, RAG stress testing, reproducibility, context engineering, and correctness-focused developer tools.
```mermaid
flowchart LR
    A["Git history"] --> B["PatchGym<br/>mine real coding-agent tasks"]
    B --> C["Hidden tests<br/>oracle patches<br/>validation command"]
    C --> D["Agent run"]
    D --> E["manifest.json<br/>trace.jsonl<br/>report.json"]
    E --> F["TraceWeave<br/>failure forensics"]
    E --> G["SandboxLedger<br/>tamper-evident ledger"]
    H["Context Crucible"] --> D
    I["SpecMutate"] --> C
    J["RAGNeedle"] --> K["retrieval stress tests"]
```
| System | Role | Why It Is Worth Reading |
|---|---|---|
| PatchGym | Local SWE-bench-style task miner and runner | Mines real Git history into hidden-test coding-agent tasks with auditable oracle patches. |
| TraceWeave | Agent trajectory forensics | Reads local traces and finds loops, tool churn, context drift, causal handoffs, and risk signals. |
| SandboxLedger | Reproducibility ledger | Hashes PatchGym run artifacts into an append-only ledger with previous-hash chaining and a Merkle root. |
| Context Crucible | Coding-agent context packer | Scores repository files, budgets context, and guards against hidden-test or oracle leakage. |
| RAGNeedle | Adversarial RAG benchmark generator | Creates deterministic needle-in-corpus retrieval tasks with distractor pressure and citation metrics. |
| SpecMutate | Metamorphic test generator | Turns behavior specs into deterministic test vectors for parsers, CLIs, normalizers, and small tools. |
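The SandboxLedger row mentions previous-hash chaining and a Merkle root. As a rough illustration of that idea only, here is a minimal sketch of an append-only, hash-chained ledger. The field names (`prev_hash`, `artifact_hashes`, `merkle_root`, `entry_hash`), the all-zero genesis hash, and the odd-node duplication rule are assumptions made for this example, not SandboxLedger's actual on-disk format or API.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder for "no previous entry"


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def merkle_root(leaf_hashes: list[str]) -> str:
    """Fold hex leaf hashes pairwise into one root; an odd node is paired with itself."""
    if not leaf_hashes:
        return sha256_hex(b"")
    level = list(leaf_hashes)
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])
        level = [
            sha256_hex((level[i] + level[i + 1]).encode())
            for i in range(0, len(level), 2)
        ]
    return level[0]


def entry_hash(entry: dict) -> str:
    """Hash every field except the stored hash itself, using canonical JSON."""
    body = {k: v for k, v in entry.items() if k != "entry_hash"}
    return sha256_hex(json.dumps(body, sort_keys=True).encode())


def append_entry(ledger: list[dict], artifacts: dict[str, bytes]) -> dict:
    """Hash the run artifacts and chain the new entry to the previous one."""
    artifact_hashes = {name: sha256_hex(data) for name, data in sorted(artifacts.items())}
    entry = {
        "prev_hash": ledger[-1]["entry_hash"] if ledger else GENESIS,
        "artifact_hashes": artifact_hashes,
        "merkle_root": merkle_root(list(artifact_hashes.values())),
    }
    entry["entry_hash"] = entry_hash(entry)
    ledger.append(entry)
    return entry


def verify(ledger: list[dict]) -> bool:
    """Recompute every link; any edited entry breaks its own hash or the next link."""
    prev = GENESIS
    for entry in ledger:
        if entry["prev_hash"] != prev:
            return False
        if entry["entry_hash"] != entry_hash(entry):
            return False
        if entry["merkle_root"] != merkle_root(list(entry["artifact_hashes"].values())):
            return False
        prev = entry["entry_hash"]
    return True
```

Because each entry commits to the hash of the one before it, rewriting an old record requires rewriting every later record as well, which is what makes tampering detectable on verification.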
```bash
git clone https://github.com/nripankadas07/patchgym
cd patchgym
python -m pip install -e ".[dev]"
python -m pip install git+https://github.com/nripankadas07/traceweave
python -m pip install git+https://github.com/nripankadas07/sandboxledger
patchgym demo --keep-dir /tmp/patchgym-proof
traceweave patchgym /tmp/patchgym-proof/runs/oracle --json
sandboxledger ingest-patchgym /tmp/patchgym-proof-ledger.jsonl /tmp/patchgym-proof/runs/oracle
sandboxledger verify /tmp/patchgym-proof-ledger.jsonl
```

That flow produces:
- a real mined coding-agent task;
- hidden-test validation;
- `manifest.json` with commit ids, patch hashes, artifact hashes, return codes, changed files, and totals;
- `trace.jsonl` for forensic analysis;
- a verifiable SandboxLedger record for the run.
This is the profile thesis in executable form: agent evaluation should leave evidence, not just screenshots.
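To make "a trace we can inspect" concrete, the sketch below scans a JSON Lines trace for consecutive repeated tool calls, one simple signal of an agent stuck in a loop. The event schema (a `tool` string and an `args` object per line) is an assumption for illustration. The real `trace.jsonl` layout and TraceWeave's detectors may differ.

```python
import json
from collections import Counter


def load_events(jsonl_text: str) -> list[dict]:
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]


def find_repeated_calls(events: list[dict], window: int = 3) -> list[tuple[str, str]]:
    """Return each (tool, args) pair that occurs `window` times in a row.

    A longer run of the same call is reported once, when it first reaches `window`.
    """
    loops = []
    run_key, run_len = None, 0
    for event in events:
        # Canonical JSON makes argument comparison independent of key order.
        key = (event.get("tool", ""), json.dumps(event.get("args", {}), sort_keys=True))
        if key == run_key:
            run_len += 1
        else:
            run_key, run_len = key, 1
        if run_len == window:
            loops.append(key)
    return loops


def tool_counts(events: list[dict]) -> Counter:
    """Count calls per tool as a crude view of tool churn."""
    return Counter(event.get("tool", "") for event in events)
```

A run of identical calls is only a heuristic. A real analysis would likely also look at near-duplicate arguments and at whether the observed output changed between calls.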
| Time | Read / Run |
|---|---|
| 2 minutes | PatchGym README and `bash scripts/demo.sh` |
| 5 minutes | PatchGym reproducible runs |
| 7 minutes | TraceWeave PatchGym traces |
| 10 minutes | SandboxLedger PatchGym ingestion |
| 15 minutes | Visible Agent Evaluation |
I use AI heavily, but I do not want AI-assisted software to be judged by vibes. The systems here are built around harder boundaries:
- hidden tests instead of self-reported success;
- traces instead of opaque agent transcripts;
- manifests instead of loose claims;
- hash ledgers instead of mutable screenshots;
- local-first demos instead of hosted black boxes;
- small parsers and utilities with adversarial tests instead of broad, untestable abstractions.
The result is a portfolio with one technical identity:
local-first infrastructure for evaluating, debugging, and hardening coding agents.
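The "small utilities with adversarial tests" boundary listed above can be illustrated with a metamorphic check: instead of hard-coding expected outputs, the test asserts relations that must hold across related inputs. This is a generic sketch with a toy whitespace normalizer standing in for the system under test. The function names and the two relations are assumptions for the example, not SpecMutate's interface.

```python
import random


def normalize_ws(text: str) -> str:
    """Toy system under test: collapse whitespace runs and trim both ends."""
    return " ".join(text.split())


def generate_vectors(seed: int, count: int) -> list[str]:
    """Build deterministic whitespace-heavy inputs from a seeded generator."""
    rng = random.Random(seed)
    alphabet = ["a", "b", " ", "\t", "\n", "  "]
    return [
        "".join(rng.choice(alphabet) for _ in range(rng.randint(0, 12)))
        for _ in range(count)
    ]


def check_relations(fn, vectors: list[str]) -> list[str]:
    """Return a description of every metamorphic relation that `fn` violates."""
    failures = []
    for vector in vectors:
        out = fn(vector)
        # Relation 1: normalizing an already normalized value changes nothing.
        if fn(out) != out:
            failures.append(f"idempotence: {vector!r}")
        # Relation 2: extra surrounding whitespace must not change the result.
        if fn("  " + vector + "\n") != out:
            failures.append(f"padding invariance: {vector!r}")
    return failures
```

Seeding the generator keeps the vectors identical across machines and runs, so a failure found in CI can be reproduced locally with the same seed.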
These repositories support the flagship stack without competing with it.
| Area | Projects |
|---|---|
| Agent and eval infrastructure | agent-framework, rag-pipeline, prompt-eval, token-counter, ai-toolkit |
| Correctness substrate | safejson, tomlmini, bencode, csvinfer, urlnorm, jsonptr, jsonpatch-lite |
| TypeScript systems primitives | decimal-ts, lru-ts, task-queue, tokenring-ts, eventbus-ts, decoder-ts |
| Local-first product labs | lanbeam, rssdeck, passhouse, syncplan, readmine, photoflow, dnswarden, medialoom, chatmux, uptimelog |
Every active repository is expected to have tests, CI, license metadata, issue templates, a pull request template, security notes, contribution notes, and a clear docs or examples surface.
Last audited on May 28, 2026 across the live public GitHub profile.
| Signal | Current State |
|---|---|
| Public repositories | 116 total: 115 active, 1 archived scratchpad |
| Active repo hygiene | 115/115 have README, license metadata, license file, CI, issue templates, and PR templates |
| Latest completed CI | 115/115 active repos passing at audit time |
| Docs/examples surface | 115/115 active repos |
| Research launch | 5 new local-first agent/eval projects shipped on May 28, 2026 |
| Flagship integration | PatchGym emits run manifests and traces; TraceWeave analyzes them; SandboxLedger records them |
| Open issue load | 0 open issues across active repositories at audit time |
Audit notes:
I use AI for scaffolding, test generation, edge-case brainstorming, and first-pass documentation. The architecture, project boundaries, quality bar, final review, and public positioning are mine.
AI-assisted output has to survive source-checkout setup, local tests, CI, security notes, limitation notes, and manual review before it becomes part of the public portfolio. That is why the profile emphasizes reproducible demos and auditable artifacts instead of fake adoption badges or inflated benchmark claims.
- PatchGym: Local Coding-Agent Benchmarks From Real Git History
- Visible Agent Evaluation: Testing The Loop, Not The Demo
- Safe Local-First AI Tooling: Small Systems With Hard Boundaries
This GitHub profile is intentionally code-first. Career credentials, product leadership context, and publication context live on LinkedIn.
For bugs, design questions, or focused collaboration, open an issue on the relevant repository. For profile-level context, use nripankadas07/nripankadas07.