tsuga-dev · ArthurVerrez · Jun 18, 2026 · Jun 18, 2026 · Jun 18, 2026
diff --git a/.claude-plugin/marketplace.json b/.claude-plugin/marketplace.json
@@ -6,7 +6,7 @@
   },
   "metadata": {
     "description": "Tsuga toolkit for AI coding agents — two plugins: `telemetry` for OpenTelemetry instrumentation/audits across 9 languages, and `tsuga` for live-platform investigation, dashboards, and the tsuga CLI driver.",
-    "version": "0.7.8"
+    "version": "0.7.9"
   },
   "plugins": [
     {

diff --git a/plugins/telemetry/.claude-plugin/plugin.json b/plugins/telemetry/.claude-plugin/plugin.json
@@ -1,7 +1,7 @@
 {
   "name": "telemetry",
   "description": "Instrumentation-quality plugin: OpenTelemetry SDK setup for Python, Go, Node.js, Java, .NET, Ruby, PHP, Rust, and C++; Collector configuration; OTTL transformations; semantic conventions; signal-choice advice; instrumentation audits (metrics/traces/logs); smoke-testing; and telemetry-debugging skills. Pair with the `tsuga` plugin for the `tsuga-cli` reference when running audit/debug/smoke-test workflows.",
-  "version": "0.7.8",
+  "version": "0.7.9",
   "author": {
     "name": "Tsuga Engineering",
     "email": "engineering@tsuga.com"

diff --git a/plugins/tsuga/.claude-plugin/plugin.json b/plugins/tsuga/.claude-plugin/plugin.json
@@ -1,7 +1,7 @@
 {
   "name": "tsuga",
   "description": "Tsuga platform plugin: the `tsuga` CLI driver (commands, TQL syntax, aggregation bodies, counter math, deep links, cloud/k8s translators) with embedded lookup playbooks for service ownership and reliability review; live-platform investigation skills for service health, errors, latency, and monitor coverage; dashboard building; the incident-investigation orchestrator; and meta-skills for building and validating skill bundles.",
-  "version": "0.7.8",
+  "version": "0.7.9",
   "author": {
     "name": "Tsuga Engineering",
     "email": "engineering@tsuga.com"

diff --git a/plugins/tsuga/skills/incident-investigation/SKILL.md b/plugins/tsuga/skills/incident-investigation/SKILL.md
diff --git a/...ins/tsuga/skills/incident-investigation/references/branch-change-correlation.md b/...ins/tsuga/skills/incident-investigation/references/branch-change-correlation.md
@@ -0,0 +1,79 @@
+# Branch: change correlation
+
+You were spawned as a change-correlation subagent — this file is yours. Don't assign verdicts or synthesize root causes — produce the change timeline + candidate classifications (`mechanism_confirmed` / `mechanism_plausible` / `area_only` / `ruled_out`) for the orchestrator.
+
+Answer: what changed, was it deployed, and could it plausibly cause the symptom?
+
+## Inputs
+
+- incident window
+- at least one repo slug or local repo path. The container mounts its codebases at `{{CODEBASES_DIR}}/` — check there first for subdirectories (each is a git repo you can inspect with `git log`, `git diff`, `git blame`).
+
+## Time-bound rule (hard)
+
+**Nothing past `declared_at` is admissible.** Restrict every `git log`, `git show`, and `gh pr` call to strictly-before the incident start. Every shell you run in this branch must include the time bound:
+
+- `git log --until="$DECLARED_AT" ...`
+- `git log --before="$DECLARED_AT" ...`
+- `gh pr list --search "merged:<$DECLARED_AT"`
+
+A post-incident PR titled "Fix <exact symptom>" is the answer key leaking backward — do not use it, do not quote it, do not let it validate your leader. If you see one anyway (because a broad query returned it), drop it and rerun with the `--until` bound. Violating this invalidates the verdict.
+
+## Procedure
+
+1. **Map repos.** Case manifest > local paths > git remotes. Prioritize repos containing the affected service, cluster config, or incident hint.
+2. **Collect the emitting `file:line` pins from codebase-grep.** The orchestrator spawns codebase-grep subagents for every verbatim signal in the telemetry output. Their results (one `file:line` per signal) are your highest-leverage inputs — each tells you exactly which source file a PR would need to touch to be a real candidate.
+3. **Local git first — time-bounded.** Commits strictly before `declared_at`, files changed in config / helm / infra / auth / feature-flag / routing paths. Dirty working tree = not a safe proxy for the incident window. For each `file:line` from step 2, run `git log -L <line>,<line>:<file> --until=<declared_at>` to see which commits modified that specific line before the incident started.
+4. **Then `$gh` — time-bounded.** Workflow runs, merged PRs, releases, commits, deployments with `merged:<<declared_at>` / `created:<<declared_at>` filters. A PR is a candidate only when it merged BEFORE `declared_at` AND a deploy completed between its merge and the incident start.
+5. **Mechanism fit** per candidate — the strict version:
+   - Does the PR's diff touch the `file:line` that emits the observed signal? **If no → not a candidate**, regardless of timing.
+   - If yes: does the diff change the _condition that triggers emission_ or the _value being emitted_? Quote the relevant lines.
+   - Does the timing align (merge → deploy → incident start)?
+   - Is there a faster revert or verification step?
+6. **Classify each candidate** as one of:
+   - `mechanism_confirmed` — diff → emitter → observation traces cleanly.
+   - `mechanism_plausible` — diff touches nearby code that could plausibly affect the observation; not directly on the emitter line.
+   - `area_only` — diff is in the right repo / service but doesn't touch the emitter.
+   - `ruled_out` — wrong surface or wrong timing.
+
+## Branch output
+
+```
+Most relevant changes:
+  - <PR/SHA/tag> [evidence: gh_pr | gh_run | gh_release | gh_api | local_git]
+    emitting signal: "<verbatim error/metric>"
+    emitting file:line: <path>:<line>
+    diff touches emitter line?: yes | no
+    classification: mechanism_confirmed | mechanism_plausible | area_only | ruled_out
+Strongest causal candidate:
+  <change> — timestamp | artifact | surface | deploy status (deployed | merged only) | trace: diff→emitter→observation
+Changes ruled out:
+  - <change> — why it doesn't fit (wrong file, wrong surface, wrong timing, not deployed)
+Best verification or rollback step:
+  <concrete command or action>
+```
+
+Deploy status unknown? Say so explicitly and lower candidate confidence.
+`area_only` classification? Say so explicitly — don't promote to "strongest candidate" without a mechanism trace.
+
+## No-data fallback
+
+If no repos are mounted (`ENABLE_CODEBASES=0`, no paths given) AND `$gh` is unavailable or returns nothing useful:
+
+Return exactly:
+
+```
+Most relevant changes: (none — no repo / gh access)
+Strongest causal candidate: (unavailable)
+Best verification or rollback step: Operator should check deploy timestamps and config-change audit out-of-band.
+```
+
+Do not infer changes from telemetry alone. Do not cite PRs / SHAs / file paths that were not actually retrieved.
+
+## Branch guardrails
+
+- `merged` ≠ `deployed`.
+- No blame without a traced mechanism: `diff → emitter → observation`. "Area match" is not a mechanism.
+- Current checkout ≠ incident-window state.
+- File paths, SHAs, PR numbers, run URLs over vague prose.
+- Never skip codebase-grep. If the orchestrator didn't spawn it, spawn it yourself for the signals you're trying to explain before proposing a PR candidate.
diff --git a/plugins/tsuga/skills/incident-investigation/references/branch-telemetry-sweep.md b/plugins/tsuga/skills/incident-investigation/references/branch-telemetry-sweep.md
@@ -0,0 +1,75 @@
+# Branch: telemetry sweep
+
+You were spawned as a telemetry-sweep subagent — this file is everything you need. You don't synthesize a verdict; you surface facts + verbatim signals for codebase-grep + a completeness-check report.
+
+Read-only evidence gathering from Tsuga. This branch does not declare root cause on its own — it produces facts the orchestrator synthesizes.
+
+## Inputs
+
+Time window, scope hint (service / cluster / env / customer / monitor), reported symptom. Missing scope → return only the discovery steps needed to resolve it.
+
+## Procedure
+
+1. **Monitor anchor (if the case cites a monitor).** `tsuga monitors get <monitor-id>` FIRST. Read the monitor's filter, aggregation, threshold, and groupBy — this IS the exact telemetry shape that crossed. Re-run the same query against the incident window AND a control window (same weekday + hour, 7 days earlier). Record the crossed value, the control value, and the ratio.
+2. **Normalize scope.** `tsuga services list|get`. Capture canonical name, env, team, versions, sources, 24h log/trace counts.
+3. **Normalize session.** Check `tsuga config` (active key, default cluster). Always set explicit `--from`, `--to`, `--max-results`. For `tsuga aggregation`, convert windows to epoch seconds; on multi-cluster orgs include `"clusterId"` in the body.
+4. **Load tech knowledge.** If scope names a known tech (Postgres, Redis, Kafka, …), load the matching `$knowledge-technology` reference to target the sweep.
+5. **Config-threshold preflight (capacity-shaped symptoms).** If the reported symptom is capacity-shaped — queue lag, `CrashLoopBackOff`, `OOMKilled`, throttling, "too many", "insufficient" — spend one probe asking _"is there a single config knob that would fix this?"_ before any elaborate change-correlation. Grep mounted codebases / helm / Pulumi for patterns like `*BatchSize`, `*PoolSize`, `*Concurrency`, `*MaxConnections`, `*FailureThreshold`, `*InFlightBatches`, `*MemPoolSize` scoped to the affected service.
+6. **Evidence sweep.** Prefer in order:
+   - `logs new-error-patterns` / `logs error-pattern-increases` (when team scope exists)
+   - `logs patterns` to cluster failure shapes
+   - `logs search` only after pattern discovery
+   - `traces search` for exact failing spans
+   - `aggregation scalar|timeseries` for counts, rates, comparisons
+   - `monitors list|get` for signal semantics (not live truth)
+   - `dashboards list|get` / `quality-reports list` as supporting context only
+7. **Compare.** Bad window vs good control window. Affected entity vs sibling healthy entity when possible.
+8. **Surface verbatim signals for codebase-grep.** As the sweep produces error strings, log patterns, metric names, and monitor filters, emit them as a distinct list at the end of the output — one per line. The orchestrator will spawn a codebase-grep subagent per entry to pin each signal to its emitting `file:line`. Do not try to explain what a signal _means_ until its emitting code is found.
+9. **Write evidence matrix.** Four columns: symptom evidence | subsystem evidence | mechanism clues | unknowns. If evidence only supports subsystem diagnosis, say so.
+10. **Sweep completeness check.** Before returning, tick these boxes — if any is unchecked and cheap to resolve, do it now:
+    - [ ] Service metadata resolved (canonical name, team, env, 24h counters)
+    - [ ] Monitor's own query pulled + replayed against bad + control windows (if case came from a monitor)
+    - [ ] Config-threshold preflight done (for capacity-shaped symptoms)
+    - [ ] Error-log patterns scanned (`new-error-patterns` OR `patterns`)
+    - [ ] Primary metric aggregated in bad window AND control window
+    - [ ] At least one trace from the failing path inspected (when traces exist)
+    - [ ] Recent-deploy correlation asked (even if the answer is "no data here — defer to change branch")
+    - [ ] Verbatim signals surfaced for the codebase-grep branch
+
+## No-data honesty
+
+A metric or log pattern being absent is NOT equivalent to its value being zero. Causes of absence include receiver scope / permission issues, feature not enabled, instrumentation gap, or scrape failure.
+
+When you checked and found nothing, say which: `(absent)` — did not appear in window `|` `(not instrumented)` — scope lacks the receiver `|` `(denied)` — permission error `|` `(empty)` — query ran, returned 0 rows. Never report silent absence as "metric is zero" in a Validated claim.
+
+## Branch output
+
+```
+Observed symptom: <one sentence>
+Monitor anchor: <monitor id + filter + threshold, or (none) if case didn't cite a monitor>
+Confirmed failing subsystem: <one sentence, or (unknown)>
+Signals that support it:
+  - <fact with exact value> [evidence: tsuga_logs | tsuga_traces | tsuga_aggregation | tsuga_monitors | service_metadata]
+Signals that do not yet support causality:
+  - <what you checked that was silent>
+Control-window comparison: <bad vs good: counts / rates / ratio, or (skipped) with reason>
+Verbatim signals for codebase-grep:
+  - "exact error string 1"
+  - "exact error string 2"
+  - metric.name.to_grep
+  - log pattern
+Config-threshold preflight result: <summary, or (N/A — non-capacity symptom)>
+Best next non-Tsuga check: <one action, or (none)>
+```
+
+Every claim carries `[evidence: …]`. No tag = hypothesis, belongs in the non-causal section.
+
+## Branch guardrails
+
+- `metrics list|get` is metadata. Use `aggregation` for values.
+- Monitor definitions are clues, not live truth.
+- Do not claim deploy or config causality from Tsuga alone.
+- Exact counts and windows > prose summaries.
+- Stop at `symptom diagnosis only` when you can't tie the subsystem to a trigger.
+
+More detail on `tsuga` command patterns: [tsuga-rules.md](./tsuga-rules.md).
diff --git a/plugins/tsuga/skills/incident-investigation/references/investigation-record.md b/plugins/tsuga/skills/incident-investigation/references/investigation-record.md
@@ -62,9 +62,13 @@ Numbered (plural — avoid single-root-cause framing). Each carries:
 file:line pins · *Verified:* how it was confirmed [evidence: tag].
 
 ## Mitigation & action items
-One line for what was done during the incident (or "none required").
-| # | Action | Owner | Ticket | Status |
-Owner = team, not person. Ticket "—" until filed.
+**Mitigation:** what stopped the bleeding during the incident — the action, when it
+took effect, and whether it's temporary (cause still live) or also the durable fix.
+"None required" if impact self-resolved or was never user-visible.
+
+Then the durable corrective + preventive follow-ups:
+| # | Action | Type | Owner | Ticket | Status |
+Type = root-fix / preventive. Owner = team, not person. Ticket "—" until filed.
 End with **Verify fixes:** the observable signal that proves each fix worked.
 
 ## Open questions