feat(skills): add agent reproduction workflows#4118
Conversation
3ebded8 to
184a416
Compare
Code Coverage Summary
CLI Package - Full Text ReportCore Package - Full Text ReportFor detailed HTML reports, please see the 'coverage-reports-22.x-ubuntu-latest' artifact from the main CI run. |
wenshao
left a comment
There was a problem hiding this comment.
Critical No tests exist for any of the 3 Python scripts or 2 Bash scripts introduced by this PR. The agent-reproduce-feature/SKILL.md Done Criteria require "at least one verification path" and "focused tests or a reproducible smoke command." The core pipeline (HTTP capture → trace normalization → trace comparison) is entirely untested — any regression in normalize_trace.py or compare_traces.py will silently produce misleading parity results.
— DeepSeek/deepseek-v4-pro via Qwen Code /review
|
Question: Has this workflow been exercised end-to-end against a real parity case (e.g., reproducing a specific The concern is a tool credibility gap: this skill exists to verify parity, but if its own A concrete suggestion: before merging, include at least one recorded parity run (even a small scenario like a single tool-call comparison) with its expected diff output committed as a fixture. This would serve as both:
Without this, reviewers and future users have no way to distinguish "the agents are aligned" from "the comparison tool says they're aligned." |
|
@pomelo-nwu Yes, #4120 was implemented using this skill. I only ran the skill once and made a small amount of manual fine-tuning afterwards; it was very useful. |
Reviewer wenshao flagged that the documented commands use 'skills/...' while the skill files live under '.qwen/skills/...'. Running the documented examples from the repo root fails with No such file or directory. Update all script invocations in the two agent-reproduce SKILL.md files and capture-workflow.md to use the correct '.qwen/skills/...' prefix. Documentation only; no code changes.
Reviewer flagged that '--set ssl_insecure=true' silently disables upstream TLS verification. Add a 3-line comment above the mitmdump invocation explaining the trade-off and that it is intended for local dev only.
| text = content_sample.decode("utf-8") | ||
| except UnicodeDecodeError: | ||
| if truncated: | ||
| text = content_sample.decode("utf-8", errors="ignore") |
There was a problem hiding this comment.
[Suggestion] When content is non-UTF-8 binary AND truncated, the code decodes with errors="ignore" and returns kind: "text" with garbled data. When the same content is non-truncated, it correctly returns kind: "base64". This inconsistency means truncated binary content gets a misleading kind label, and downstream normalize_trace.py tries to parse garbled text as JSON.
| text = content_sample.decode("utf-8", errors="ignore") | |
| except UnicodeDecodeError: | |
| return { | |
| "kind": "base64", | |
| "base64": base64.b64encode(content_sample).decode("ascii"), | |
| "truncated": truncated, | |
| } |
Remove the if truncated: branch — both truncated and non-truncated binary should use base64 encoding.
— qwen-latest-series-invite-beta-v36 via Qwen Code /review
| trap cleanup EXIT | ||
|
|
||
| sleep "${REPRO_TMUX_SETTLE_SECONDS:-2}" | ||
| tmux capture-pane -t "${session}" -p -S - > "${out_dir}/tmux-pane.txt" |
There was a problem hiding this comment.
[Suggestion] The cleanup trap at line 26 kills the tmux session on EXIT by default, but the script prints attach=tmux attach -t ${session} to stdout and tmux-session.txt. By the time the caller reads these instructions, the session is already dead. Users following the printed attach command will get session not found errors.
| tmux capture-pane -t "${session}" -p -S - > "${out_dir}/tmux-pane.txt" | |
| { | |
| echo "session=${session}" | |
| if [[ "${REPRO_TMUX_KEEP_SESSION:-0}" == "1" ]]; then | |
| echo "attach=tmux attach -t ${session}" | |
| echo "capture=tmux capture-pane -t ${session} -p -S - > ${out_dir}/tmux-pane.txt" | |
| echo "kill=tmux kill-session -t ${session}" | |
| else | |
| echo "attach=(session killed on exit; re-run with REPRO_TMUX_KEEP_SESSION=1 to keep)" | |
| fi | |
| } > "${out_dir}/tmux-session.txt" |
— qwen-latest-series-invite-beta-v36 via Qwen Code /review
| "openai-organization", | ||
| "openai-project", | ||
| } | ||
| SENSITIVE_KEY_RE = re.compile( |
There was a problem hiding this comment.
[Suggestion] SENSITIVE_KEY_RE includes bare terms cookie and session as alternations. Because re.search matches anywhere in the key string, any JSON key containing these substrings — session_id, session_timeout, max_sessions, cookie_name, cookie_policy — is treated as sensitive and its entire value is replaced with "[REDACTED]".
This cascades through _redact_json to destroy tool parameter schemas: a parameter like "session_id": {"type": "string", "description": "..."} has its entire value redacted, making it impossible to compare tool definitions between traces — directly undermining the alignment workflow's purpose.
| SENSITIVE_KEY_RE = re.compile( | |
| SENSITIVE_KEY_RE = re.compile( | |
| r"(?i)(api[-_]?key|authorization|password|secret|token|credential|" | |
| r"access[-_]?token|refresh[-_]?token|client[-_]?secret|" | |
| r"session[-_]?(?:id|token|key|secret)|cookie[-_]?(?:id|name|value|token))" | |
| ) |
Replace bare cookie and session with compound patterns that require a sensitive suffix, so session_id still matches but max_sessions does not.
— qwen-latest-series-invite-beta-v36 via Qwen Code /review
wenshao
left a comment
There was a problem hiding this comment.
Test coverage gaps — The PR adds 4 Python scripts (~1,120 lines) and 3 shell scripts with zero test files. The most critical untested paths are: (1) _decode() branches in llm_dump.py (binary+truncated silently switches to lossy decode instead of base64); (2) all 9 TOKEN_PATTERNS redaction regexes in llm_dump.py; (3) normalize_schema() recursive handling of nested JSON Schema with $ref, anyOf, additionalProperties; (4) summarize_tool() for OpenAI vs Anthropic vs legacy tool formats. These are the core correctness paths that the entire comparison pipeline depends on.
— qwen3.7-max via Qwen Code /review
| return value | ||
|
|
||
|
|
||
| def _redact_url(url: str) -> str: |
There was a problem hiding this comment.
[Suggestion] _redact_url() only redacts sensitive patterns in query parameters. URL path segments and userinfo (embedded credentials) pass through untouched.
- Path tokens: A URL like
/api/token/abc123def/endpointis written verbatim. Some API designs embed tokens in path segments. - Userinfo:
urlparse("https://user:sk-xxx@host/path")preservesuser:sk-xxxin the netloc._replace(query=...)does not touchparsed.username/parsed.password.
| def _redact_url(url: str) -> str: | |
| def _redact_url(url: str) -> str: | |
| parsed = urlparse(url) | |
| if parsed.username or parsed.password: | |
| netloc = parsed.hostname or "" | |
| if parsed.port: | |
| netloc = f"{netloc}:{parsed.port}" | |
| parsed = parsed._replace(netloc=netloc) | |
| query = [] | |
| for key, value in parse_qsl(parsed.query, keep_blank_values=True): | |
| query.append((key, "[REDACTED]" if SENSITIVE_KEY_RE.search(key) else value)) | |
| safe_path = _redact_text(parsed.path) | |
| safe_fragment = _redact_text(parsed.fragment) | |
| return urlunparse(parsed._replace( | |
| query=urlencode(query, doseq=True), | |
| path=safe_path, | |
| fragment=safe_fragment, | |
| )) |
— qwen3.7-max via Qwen Code /review
| r"(?i)(api[-_]?key|authorization|cookie|password|secret|token|credential|" | ||
| r"access[-_]?token|refresh[-_]?token|client[-_]?secret|session)" | ||
| ) | ||
| TOKEN_PATTERNS = ( |
There was a problem hiding this comment.
[Suggestion] TOKEN_PATTERNS requires a keyword prefix (Bearer, Basic, Token) for generic tokens. Bare JWTs like eyJhbGciOiJIUzI1NiJ9.eyJzdWIiOiIxMjM0NTY3ODkwIn0.dozjgNryP4J3jVmNHl0w5N_XgL0n3I9PlFUP0THsR8U are not matched. JWTs appear in LLM tool results (auth context, session tokens) and API responses.
Consider adding a JWT-shaped pattern:
| TOKEN_PATTERNS = ( | |
| TOKEN_PATTERNS = ( | |
| (re.compile(r"(?i)\bbearer\s+[a-z0-9._~+/=-]+"), "Bearer [REDACTED]"), | |
| (re.compile(r"(?i)\bbasic\s+[a-z0-9._~+/=-]+"), "Basic [REDACTED]"), | |
| (re.compile(r"(?i)\btoken\s+[a-z0-9._~+/=-]+"), "Token [REDACTED]"), | |
| (re.compile(r"\bsk-[A-Za-z0-9_-]{12,}\b"), "sk-[REDACTED]"), | |
| (re.compile(r"\bAKIA[0-9A-Z]{16}\b"), "AKIA[REDACTED]"), | |
| (re.compile(r"\bAIza[0-9A-Za-z_-]{20,}\b"), "AIza[REDACTED]"), | |
| (re.compile(r"\b(?:ghp|gho|ghu|ghs)_[A-Za-z0-9_]{20,}\b"), "gh_[REDACTED]"), | |
| (re.compile(r"\bgithub_pat_[A-Za-z0-9_]{20,}\b"), "github_pat_[REDACTED]"), | |
| (re.compile(r"\beyJ[A-Za-z0-9_-]{10,}\.[A-Za-z0-9_-]{10,}\.[A-Za-z0-9_-]{10,}\b"), "eyJ[REDACTED]"), | |
| ( | |
| re.compile( | |
| r"-----BEGIN\s+[\w\s]+PRIVATE\s+KEY-----.*?-----END\s+[\w\s]+PRIVATE\s+KEY-----", | |
| re.DOTALL, | |
| ), | |
| "-----BEGIN PRIVATE KEY-----[REDACTED]-----END PRIVATE KEY-----", | |
| ), | |
| ) |
— qwen3.7-max via Qwen Code /review
| parsed = urlparse(req.get("url", "")) | ||
| url_path = parsed.path | ||
| if parsed.query: | ||
| url_path = f"{url_path}?{parsed.query}" |
There was a problem hiding this comment.
[Suggestion] The normalized url_path now preserves the full query string (fixing the prior "query strings invisible" issue), but this introduces the opposite problem: query parameters commonly contain ephemeral values (deployment versions, request IDs, timestamps, cache-busters) that differ between runs.
compare_traces.py will report spurious request[N].url_path mismatches for otherwise identical calls. The alignment rules explicitly say "Do not chase provider-specific endpoints, model names, IDs, timestamps, token counts, or ephemeral headers."
Since meaningful parameters for LLM endpoints are in the POST body (already compared via body_keys and body_values), consider dropping the query string or making it opt-in:
| url_path = f"{url_path}?{parsed.query}" | |
| url_path = parsed.path |
— qwen3.7-max via Qwen Code /review
| echo "qwen_stdout=${out_dir}/qwen/command.stdout" | ||
| echo "qwen_stderr=${out_dir}/qwen/command.stderr" | ||
|
|
||
| if [[ "${reference_status}" -ne 0 || "${qwen_status}" -ne 0 || "${normalize_ref_status}" -ne 0 || "${normalize_qwen_status}" -ne 0 || "${compare_status}" -ne 0 ]]; then |
There was a problem hiding this comment.
[Suggestion] compare_traces.py uses a clean three-way exit protocol: 0 = no diffs, 1 = diffs found, 2 = load error. But this guard collapses all non-zero sub-statuses — including compare_status=1 (expected diffs during iteration) — into exit 1.
The calling agent cannot distinguish "expected parity gap, keep iterating" from "infrastructure failure" without fragile stdout parsing. Consider distinct exit codes:
| if [[ "${reference_status}" -ne 0 || "${qwen_status}" -ne 0 || "${normalize_ref_status}" -ne 0 || "${normalize_qwen_status}" -ne 0 || "${compare_status}" -ne 0 ]]; then | |
| if [[ "${reference_status}" -ne 0 || "${qwen_status}" -ne 0 ]]; then | |
| echo "FAILED: command execution (ref=${reference_status}, qwen=${qwen_status})" >&2 | |
| exit 2 | |
| elif [[ "${normalize_ref_status}" -ne 0 || "${normalize_qwen_status}" -ne 0 ]]; then | |
| echo "FAILED: trace normalization" >&2 | |
| exit 3 | |
| elif [[ "${compare_status}" -eq 2 ]]; then | |
| echo "FAILED: comparison error" >&2 | |
| exit 4 | |
| elif [[ "${compare_status}" -eq 1 ]]; then | |
| exit 1 | |
| fi |
— qwen3.7-max via Qwen Code /review
`is_sensitive_path()` previously split rel_path on `[/._ -]+`, which both let SSH private keys leak (`id_rsa` -> ["id", "rsa"], no match) and hid benign tooling files from state diffs (`tokenizer.json` -> ["token", ...] matched `token`). Match whole `/`-separated segments and check the basename — and its stem without trailing extension — directly against `SENSITIVE_PATH_PARTS`. Hidden directories like `.ssh` / `.gnupg` keep matching their non-dot equivalents so existing directory-level coverage is preserved.
A JSONL trace line that decodes to a non-object (`[]`, `"hello"`, `42`,
`null`) bypassed the `json.JSONDecodeError` handler and crashed the whole
`normalize()` run with an `AttributeError` on `raw.get("request")`.
Add an `isinstance(raw, dict)` guard alongside the existing decode-error
branch — emit the same `Warning: skipping ...` shape on stderr and
continue with the next line so a single odd entry no longer wipes out
the normalized output.
The sed redaction in run_with_mitm.sh covered sk-*, ghp_*, github_pat_*,
and generic api_key/token/secret/credential= patterns, but did not
include AKIA[0-9A-Z]{16} (AWS access keys) or AIza[0-9A-Za-z_-]{20,}
(Google API keys), even though llm_dump.py and capture_state.py both
already handle them. If those credentials were passed as CLI arguments,
they were written in plaintext to env.txt.
Add the two patterns so all three redaction surfaces stay in sync.
run_with_mitm.sh forced NO_PROXY="" and no_proxy="" so that ALL outbound HTTP/HTTPS traffic (including localhost calls to MCP servers, dev servers, and health checks) was funneled through the mitm proxy. When the wrapped command interacted with a localhost service that did not speak HTTP through the proxy, this surfaced as confusing connection errors that looked like application bugs rather than wrapper-induced. Set NO_PROXY="localhost,127.0.0.1" (and the lowercase variant) so localhost calls bypass the proxy while remote traffic still gets captured. urllib's proxy_bypass() honors the new value (verified locally: localhost and 127.0.0.1 bypassed, example.com still proxied).
wenshao
left a comment
There was a problem hiding this comment.
CI failing: Test (windows-latest, Node 22.x).
Additionally, no test files exist for any of the 4 Python scripts (~1,120 lines) or 3 shell scripts. The security-critical redaction logic (is_sensitive_path, TOKEN_PATTERNS, PEM regex, _redact_json) has zero test coverage. Several findings below (F1, F2, F3) would have been caught by simple unit tests with known inputs and expected outputs.
|
|
||
| def entry_for_symlink(path: Path) -> dict[str, Any]: | ||
| try: | ||
| target = os.readlink(path) |
There was a problem hiding this comment.
[Critical] os.readlink(path) target is stored verbatim with zero redaction — no home path replacement, no is_sensitive_path check.
A symlink at a non-sensitive relative path (e.g. projects/link) pointing to /Users/wenshao/.ssh/id_rsa would leak the full absolute target path including the home directory and the existence/location of credential files into state-manifest.json.
| target = os.readlink(path) | |
| def entry_for_symlink(path: Path) -> dict[str, Any]: | |
| try: | |
| target = os.readlink(path) | |
| except OSError: | |
| target = None | |
| if target is not None: | |
| home = str(Path.home()) | |
| target = target.replace(home, "~") | |
| if is_sensitive_path(target): | |
| return {"kind": "symlink", "target": "<sensitive_path>"} | |
| return {"kind": "symlink", "target": target} |
— qwen3.7-max via Qwen Code /review
| ) | ||
| TOKEN_PATTERNS = ( | ||
| (re.compile(r"(?i)\bbearer\s+[a-z0-9._~+/=-]+"), "Bearer [REDACTED]"), | ||
| (re.compile(r"(?i)\bbasic\s+[a-z0-9._~+/=-]+"), "Basic [REDACTED]"), |
There was a problem hiding this comment.
[Critical] (?i)\bbasic\s+[a-z0-9._~+/=-]+ matches common English phrases — "basic usage", "bearer minimum", "token effort" all match. Since _redact_json applies _redact_text to every JSON string value, user messages, tool descriptions, and assistant responses get corrupted: "Explain the basic usage" → "Explain the Basic [REDACTED]".
Note: capture_state.py:GENERIC_AUTH_RE at line 118 uses {8,} (8-char minimum) — the correct pattern. This should match.
| (re.compile(r"(?i)\bbasic\s+[a-z0-9._~+/=-]+"), "Basic [REDACTED]"), | |
| (re.compile(r"(?i)\bbearer\s+[A-Za-z0-9._~+/=-]{8,}"), "Bearer [REDACTED]"), | |
| (re.compile(r"(?i)\bbasic\s+[A-Za-z0-9._~+/=-]{8,}"), "Basic [REDACTED]"), | |
| (re.compile(r"(?i)\btoken\s+[A-Za-z0-9._~+/=-]{8,}"), "Token [REDACTED]"), |
— qwen3.7-max via Qwen Code /review
| (re.compile(r"\bgithub_pat_[A-Za-z0-9_]{20,}\b"), "github_pat_[REDACTED]"), | ||
| ( | ||
| re.compile( | ||
| r"-----BEGIN\s+[\w\s]+PRIVATE\s+KEY-----.*?-----END\s+[\w\s]+PRIVATE\s+KEY-----", |
There was a problem hiding this comment.
[Critical] The PEM regex [\w\s]+ between BEGIN and PRIVATE\s+KEY greedily consumes PRIVATE KEY itself, so the PKCS#8 unencrypted format -----BEGIN PRIVATE KEY----- (no qualifier word) fails to match. PKCS#8 is the modern standard — default output of openssl pkcs8, used by Java keystores, Go crypto/x509, and GCP service account keys.
| Format | Matches? |
|---|---|
BEGIN RSA PRIVATE KEY |
✅ |
BEGIN EC PRIVATE KEY |
✅ |
BEGIN ENCRYPTED PRIVATE KEY |
✅ |
BEGIN PRIVATE KEY |
❌ |
Make the qualifier optional:
| r"-----BEGIN\s+[\w\s]+PRIVATE\s+KEY-----.*?-----END\s+[\w\s]+PRIVATE\s+KEY-----", | |
| r"-----BEGIN\s+(?:[\w\s]+\s+)?PRIVATE\s+KEY-----.*?-----END\s+(?:[\w\s]+\s+)?PRIVATE\s+KEY-----", |
— qwen3.7-max via Qwen Code /review
| return digest.hexdigest() | ||
|
|
||
|
|
||
| def is_sensitive_path(rel_path: str) -> bool: |
There was a problem hiding this comment.
[Critical] is_sensitive_path misses .env.local, .env.production, .envrc, .credentials.json, .token.json, and .auth.json.
Root cause: rsplit(".", 1)[0] on .env.local yields .env (leading dot kept), which is not in SENSITIVE_PATH_PARTS. The part[1:] check yields env.local (extension not stripped). For .credentials.json, stem is .credentials, part[1:] is credentials.json — neither matches.
These are among the most common files for API keys and secrets (Next.js, Rails, Django, direnv). When not flagged, capture_text reads their full contents with only token-pattern redaction.
| def is_sensitive_path(rel_path: str) -> bool: | |
| def is_sensitive_path(rel_path: str) -> bool: | |
| # ... existing checks ... | |
| stem = basename.rsplit(".", 1)[0] if "." in basename else basename | |
| if stem and stem in SENSITIVE_PATH_PARTS: | |
| return True | |
| # Handle hidden files: .credentials.json → stem is ".credentials" | |
| if stem and stem.startswith(".") and stem[1:] in SENSITIVE_PATH_PARTS: | |
| return True | |
| # Check each dot-separated segment | |
| for segment in basename.split("."): | |
| if segment and segment in SENSITIVE_PATH_PARTS: | |
| return True |
— qwen3.7-max via Qwen Code /review
| SSL_CERT_FILE="${ca_file}" \ | ||
| REQUESTS_CA_BUNDLE="${ca_file}" \ | ||
| REPRO_CAPTURE_OUT="${http_out}" \ | ||
| "$@" >"${out_dir}/command.stdout" 2>"${out_dir}/command.stderr" |
There was a problem hiding this comment.
[Critical] Five output files (env.txt, mitm.log, command.stdout, command.stderr, command.exit) are created with default umask (typically 0o644, world-readable). Only http.jsonl gets chmod 0o600 (in llm_dump.py:141). This is inconsistent — capture_state.py correctly applies 0o600 to all its outputs.
mitm.log may contain proxy debug output with sensitive data. command.stdout/command.stderr capture unfiltered CLI output which may include API keys or tokens printed by the wrapped agent.
| "$@" >"${out_dir}/command.stdout" 2>"${out_dir}/command.stderr" | |
| # Before the main redirections, or after: | |
| for f in "${out_dir}/env.txt" "${out_dir}/mitm.log" \ | |
| "${out_dir}/command.stdout" "${out_dir}/command.stderr" \ | |
| "${out_dir}/command.exit"; do | |
| chmod 600 "${f}" 2>/dev/null || true | |
| done |
— qwen3.7-max via Qwen Code /review
| "created_at": now_iso(), | ||
| "agent": args.agent, | ||
| "root": str(root), | ||
| "root_exists": root.exists(), |
There was a problem hiding this comment.
[Critical] When root does not exist, write_snapshot produces a manifest with "root_exists": false and "entries": {}, writes it to disk, and returns 0 with no stderr warning. A caller checking $? sees success and proceeds as if the snapshot were meaningful.
At 3 AM: capture_state.py snapshot --agent codex on a machine where ~/.codex doesn't exist → empty baseline → all files show as "added" in the diff with no explanation.
| "root_exists": root.exists(), | |
| if not root.exists(): | |
| print(f"WARNING: root does not exist: {root}", file=sys.stderr) |
Consider also returning a non-zero exit code.
— qwen3.7-max via Qwen Code /review
| -s "${script_dir}/llm_dump.py" \ | ||
| >"${mitm_log}" 2>&1 & | ||
|
|
||
| mitm_pid="$!" |
There was a problem hiding this comment.
[Critical] The startup health check (line 57) verifies mitmdump is alive before the command starts, but there's no equivalent check after the command finishes. If mitmdump is OOM-killed or crashes mid-capture, records stop being written to http.jsonl with no diagnostic. The cleanup trap silently reaps the dead process.
The operator sees a shorter-than-expected http.jsonl with no indication that records were lost — indistinguishable from "the agent only made N requests".
| mitm_pid="$!" | |
| "$@" >"${out_dir}/command.stdout" 2>"${out_dir}/command.stderr" | |
| status=$? | |
| if ! kill -0 "${mitm_pid}" >/dev/null 2>&1; then | |
| echo "WARNING: mitmdump exited during capture. Check mitm.log for details." >&2 | |
| echo "mitm_log=${mitm_log}" >&2 | |
| fi |
— qwen3.7-max via Qwen Code /review
| ) | ||
|
|
||
|
|
||
| def response(flow: http.HTTPFlow) -> None: |
There was a problem hiding this comment.
[Suggestion] The mitmproxy addon only implements response(). mitmproxy fires the error() hook (not response()) for connection-level failures: DNS errors, connection refused, TLS handshake failures, upstream timeouts. These failed flows are silently omitted from the JSONL capture.
If a reference agent makes 10 LLM API calls but 3 fail at the connection level, the trace records only 7 requests. The normalized trace reports request_count: 7. Compared against a Qwen Code trace where all 10 succeed, the comparison shows 3 "extra" requests rather than 3 failures — incorrect diagnosis.
Add an error handler:
def error(flow: http.HTTPFlow) -> None:
record = {
"ts": time.time(),
"request": {
"method": flow.request.method,
"url": _redact_url(flow.request.pretty_url),
"headers": _headers(flow.request.headers),
"body": _decode(flow.request.content),
},
"response": None,
"error": str(flow.error) if flow.error else None,
}
_write_record(record)— qwen3.7-max via Qwen Code /review
| return value | ||
|
|
||
|
|
||
| def walk_tools(value: Any) -> list[dict[str, Any]]: |
There was a problem hiding this comment.
[Suggestion] walk_tools() doesn't handle Gemini's functionDeclarations format. Gemini tools use {"tools": [{"functionDeclarations": [{"name": "...", "parameters": {...}}]}]} — summarize_tool() receives {"functionDeclarations": [...]} and finds no function, parameters, or input_schema keys, producing an empty summary.
The capture layer is provider-agnostic (mitmproxy intercepts all HTTP), but normalization is blind to Gemini tool definitions. Trace comparison would report "no differences" for tool sets that are completely different.
| def walk_tools(value: Any) -> list[dict[str, Any]]: | |
| def walk_tools(value: Any) -> list[dict[str, Any]]: | |
| tools: list[dict[str, Any]] = [] | |
| if isinstance(value, dict): | |
| for key in ("tools", "functions"): | |
| if key in value and isinstance(value[key], list): | |
| for item in value[key]: | |
| # Gemini: {"functionDeclarations": [...]} | |
| if "functionDeclarations" in item and isinstance(item["functionDeclarations"], list): | |
| for decl in item["functionDeclarations"]: | |
| tools.append(summarize_tool({"type": "function", "function": decl})) | |
| elif key == "functions": | |
| tools.append(summarize_tool({"type": "function", "function": item})) | |
| else: | |
| tools.append(summarize_tool(item)) | |
| return tools |
— qwen3.7-max via Qwen Code /review
| return hashlib.sha256(value.encode("utf-8")).hexdigest()[:16] | ||
|
|
||
|
|
||
| def json_body(record: dict[str, Any]) -> Any: |
There was a problem hiding this comment.
[Suggestion] json_body() can return a list (e.g., OpenAI batch API requests: [{"method": "POST", ...}, ...]). In normalize(), body.keys() is guarded by isinstance(body, dict) (returns []), and summarize_body_values() also guards with if not isinstance(body, dict): return {}. Batch API requests silently produce empty body_keys and body_values — all body-level comparison is skipped.
Consider adding a body_kind field or content hash for non-dict bodies:
if isinstance(body, list):
requests[-1]["body_kind"] = "list"
requests[-1]["body_len"] = len(body)
requests[-1]["body_hash"] = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()[:16]— qwen3.7-max via Qwen Code /review
The `/I` flag for case-insensitive matching is a GNU sed extension. On macOS with BSD sed (prior to Sequoia), the flag silently fails to match — meaning `API_KEY=...`, `Secret=...`, `Credential=...` in the captured `redacted_command` would not be redacted in `env.txt` despite the pattern intending to catch them. Replace the case-insensitive alternation with explicit per-letter character classes (`[Aa][Pp][Ii][-_]?[Kk][Ee][Yy]`, etc.). Both BSD and GNU sed accept this form, and the redaction now works identically across platforms. Verified locally on BSD sed (macOS): - `API_KEY=plain` → `API_KEY=<redacted>` - `Secret=abc` → `Secret=<redacted>` - `TOKEN=xyz` → `TOKEN=<redacted>` - `Credential=zzz` → `Credential=<redacted>` - `MyApiKey=v` → `MyApiKey=<redacted>` Addresses review feedback on PR #4118.
`summarize_messages` previously checked only `"system"` (Anthropic
Messages API) and `"instructions"` (OpenAI Responses API) as
top-level body keys. Qwen Code talks to Gemini-shaped endpoints
that carry the system prompt under `"systemInstruction"`
(camelCase) — there are 130+ uses of that key inside
packages/core/src alone.
Without this third key, the normalize tool dropped the system
prompt from the summarized message list when the captured trace
came from a Gemini-shaped request, so `compare_traces.py`
reported a spurious `message_roles` mismatch (one side had a
`role: "system"` entry, the other did not) and could not align
the system-prompt content between the two captures.
Add `"systemInstruction"` to the detection tuple and document the
provider conventions inline.
Verified locally with five synthetic bodies:
- Gemini `systemInstruction` (string)
- Gemini `systemInstruction` (`{role, parts}` shape)
- Anthropic `system` + `messages`
- OpenAI Responses `instructions` + `input`
- bare `messages` (no system prompt)
The first two now correctly emit a `role: "system"` summary entry;
the legacy three are byte-identical.
Addresses review feedback on PR #4118.
`compare_request()` in compare_traces.py looped over messages via `zip(left_messages, right_messages)`, which silently truncates to the shorter list. If one capture had an extra trailing message — typically the feature-relevant prompt or a tool-result block — it was never compared and never reported. The top-level `request_count` check in `main()` catches differences in HTTP-request counts; the `message_roles` diff happened to fire when lists differed, but it only highlighted the role sequence, not which side was longer, by how much, or what content was lost off the tail. Add two diagnostics in `compare_request()` after the role check: 1. An explicit `message_count: L != R` line so the discrepancy is visible even when the role prefixes happen to match (e.g. both sides start `[system, user]` but one has an extra `assistant`). 2. A per-message `missing_in_right` / `extra_in_right` dump for the trailing slice, mirroring the existing request-level handling (`request[idx].missing_in_right` / `extra_in_right`) so the surfaced output is symmetric between the request and message layers. The zip-loop itself stays (it's correct for the overlapping prefix), and the empty-list / equal-length / identical / equal-length-with- different-hash paths are byte-identical to before. Verified locally with five synthetic request pairs: - right missing trailing message → 3 diffs (roles, count, missing) - right has extra trailing message → 3 diffs (roles, count, extra) - equal length, different content_hash → only the hash diff - identical → 0 diffs - both empty → 0 diffs Addresses review feedback on PR #4118.
wenshao
left a comment
There was a problem hiding this comment.
Round 7 review. All three R6 findings addressed cleanly in the incremental diff (44 lines, 3 files):
compare_traces.py:51message-count diagnostic now surfaces an explicitmessage_count: L != Rline and reports the actual content of trailing messages that fell offzip()(both the missing-in-right and extra-in-right paths).normalize_trace.py:138addssystemInstructionto the Gemini/Qwen Code system-prompt detection tuple, with a concise provider-convention comment.run_with_mitm.sh:95replaces the GNU-only/Iflag with portable per-letter character classes forapi_key/token/secret/credential, and documents the BSD-sed rationale above the expression.
CI green 10/10. — qwen3.7-max via Qwen Code /review
Local Verification ReportPR: #4118 — 1. Static Validation
2. E2E Functional Tests (tmux)
3. Security Review
4. Architecture Review
5. File Summary
6. SummaryAll 4 Python scripts compile and function correctly with synthetic test data. All 3 shell scripts parse cleanly and show proper usage guards. E2E testing confirms normalize → compare pipeline works, state snapshot/diff produces correct results, tmux capture creates expected output files, and Recommendation: Ready to merge ✅ |
tanzhenxin
left a comment
There was a problem hiding this comment.
Thanks for the careful iteration here — and thanks to @wenshao for the deep review pass and local verification.
I went through the remaining open threads. Approving — these are opt-in, additive skills with no runtime/product code, and their captures stay local and gitignored, so nothing here is blocking.
A few items are worth a follow-up when convenient (none need to gate this PR):
- Redaction over-matching — the one I'd suggest doing first. In
llm_dump.pythebearer/basic/tokenpatterns have no minimum length, so ordinary text like "basic usage" becomes "Basic [REDACTED]"; and baresession/cookiein the sensitive-key regex redact keys such assession_id, which can blank out the tool-schema fields the align workflow is meant to compare. The{8,}-style patterns already incapture_state.pyresolve both. - Redaction completeness (nice-to-have): PKCS#8
-----BEGIN PRIVATE KEY-----(no qualifier word) and bare JWTs aren't matched; and a fewrun_with_mitm.shoutputs are world-readable whilehttp.jsonlis0600. - Normalization edge cases (nice-to-have): the compared path keeps query strings (ephemeral IDs → spurious diffs), Gemini
functionDeclarationstools aren't summarized, and batch (list-shaped) request bodies compare as empty.
Merging as-is and tracking these as follow-ups. Thanks again for building this!
Summary
Validation
Scope / Risk
Testing Matrix
Testing matrix notes:
Linked Issues / Bugs