Merged
2 changes: 1 addition & 1 deletion .claude-plugin/marketplace.json
@@ -6,7 +6,7 @@
},
"metadata": {
"description": "Tsuga toolkit for AI coding agents — two plugins: `telemetry` for OpenTelemetry instrumentation/audits across 9 languages, and `tsuga` for live-platform investigation, dashboards, and the tsuga CLI driver.",
- "version": "0.7.8"
+ "version": "0.7.9"
},
"plugins": [
{
2 changes: 1 addition & 1 deletion plugins/telemetry/.claude-plugin/plugin.json
@@ -1,7 +1,7 @@
{
"name": "telemetry",
"description": "Instrumentation-quality plugin: OpenTelemetry SDK setup for Python, Go, Node.js, Java, .NET, Ruby, PHP, Rust, and C++; Collector configuration; OTTL transformations; semantic conventions; signal-choice advice; instrumentation audits (metrics/traces/logs); smoke-testing; and telemetry-debugging skills. Pair with the `tsuga` plugin for the `tsuga-cli` reference when running audit/debug/smoke-test workflows.",
- "version": "0.7.8",
+ "version": "0.7.9",
"author": {
"name": "Tsuga Engineering",
"email": "engineering@tsuga.com"
2 changes: 1 addition & 1 deletion plugins/tsuga/.claude-plugin/plugin.json
@@ -1,7 +1,7 @@
{
"name": "tsuga",
"description": "Tsuga platform plugin: the `tsuga` CLI driver (commands, TQL syntax, aggregation bodies, counter math, deep links, cloud/k8s translators) with embedded lookup playbooks for service ownership and reliability review; live-platform investigation skills for service health, errors, latency, and monitor coverage; dashboard building; the incident-investigation orchestrator; and meta-skills for building and validating skill bundles.",
- "version": "0.7.8",
+ "version": "0.7.9",
"author": {
"name": "Tsuga Engineering",
"email": "engineering@tsuga.com"
195 changes: 27 additions & 168 deletions plugins/tsuga/skills/incident-investigation/SKILL.md

Large diffs are not rendered by default.

@@ -0,0 +1,79 @@
# Branch: change correlation

You were spawned as a change-correlation subagent — this file is yours. Don't assign verdicts or synthesize root causes — produce the change timeline + candidate classifications (`mechanism_confirmed` / `mechanism_plausible` / `area_only` / `ruled_out`) for the orchestrator.

Answer: what changed, was it deployed, and could it plausibly cause the symptom?

## Inputs

- incident window
- at least one repo slug or local repo path. The container mounts its codebases at `{{CODEBASES_DIR}}/` — check there first for subdirectories (each is a git repo you can inspect with `git log`, `git diff`, `git blame`).

## Time-bound rule (hard)

**Nothing past `declared_at` is admissible.** Restrict every `git log`, `git show`, and `gh pr` call to changes strictly before the incident start. Every shell command you run in this branch must include the time bound:

- `git log --until="$DECLARED_AT" ...`
- `git log --before="$DECLARED_AT" ...`
- `gh pr list --state merged --search "merged:<$DECLARED_AT"`

A post-incident PR titled "Fix <exact symptom>" is the answer key leaking backward: do not use it, do not quote it, and do not let it validate your leading candidate. If one shows up anyway (because a broad query returned it), drop it and rerun with the `--until` bound. Violating this rule invalidates the verdict.
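
The time bound can also be enforced in code before any result is used. The following is a minimal sketch, assuming timestamps arrive as ISO 8601 strings that carry a UTC offset (or a trailing `Z`); `parse_ts`, `is_admissible`, and `git_log_args` are hypothetical helper names, not part of the Tsuga tooling.

```python
from datetime import datetime


def parse_ts(value: str) -> datetime:
    """Parse an ISO 8601 timestamp; a trailing 'Z' is treated as UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def is_admissible(event_at: str, declared_at: str) -> bool:
    """True only when the event happened strictly before the incident start."""
    return parse_ts(event_at) < parse_ts(declared_at)


def git_log_args(declared_at: str, *extra: str) -> list[str]:
    """Build a git log argument list that always carries the time bound."""
    return ["git", "log", f"--until={declared_at}", *extra]
```

Filtering with `is_admissible` after a broad query is a backstop, not a substitute for passing the bound to `git` and `gh` in the first place.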

## Procedure

1. **Map repos.** Case manifest > local paths > git remotes. Prioritize repos containing the affected service, cluster config, or incident hint.
2. **Collect the emitting `file:line` pins from codebase-grep.** The orchestrator spawns codebase-grep subagents for every verbatim signal in the telemetry output. Their results (one `file:line` per signal) are your highest-leverage inputs — each tells you exactly which source file a PR would need to touch to be a real candidate.
3. **Local git first — time-bounded.** Commits strictly before `declared_at`, files changed in config / helm / infra / auth / feature-flag / routing paths. Dirty working tree = not a safe proxy for the incident window. For each `file:line` from step 2, run `git log -L <line>,<line>:<file> --until=<declared_at>` to see which commits modified that specific line before the incident started.
4. **Then `$gh` — time-bounded.** Workflow runs, merged PRs, releases, commits, deployments with `merged:<<declared_at>` / `created:<<declared_at>` filters. A PR is a candidate only when it merged BEFORE `declared_at` AND a deploy completed between its merge and the incident start.
5. **Mechanism fit** per candidate — the strict version:
- Does the PR's diff touch the `file:line` that emits the observed signal? **If no → not a candidate**, regardless of timing.
- If yes: does the diff change the _condition that triggers emission_ or the _value being emitted_? Quote the relevant lines.
- Does the timing align (merge → deploy → incident start)?
- Is there a faster revert or verification step?
6. **Classify each candidate** as one of:
- `mechanism_confirmed` — diff → emitter → observation traces cleanly.
- `mechanism_plausible` — diff touches nearby code that could plausibly affect the observation; not directly on the emitter line.
- `area_only` — diff is in the right repo / service but doesn't touch the emitter.
- `ruled_out` — wrong surface or wrong timing.
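
Steps 4 to 6 can be read as a small decision function. This is a sketch of one possible reading, not the skill's own implementation: the `Candidate` fields are hypothetical names, a diff that touches the emitter line without changing the trigger or the emitted value is assumed to count as `mechanism_plausible`, and an unknown deploy status is not modeled.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    merged_before_incident: bool    # merge timestamp is before declared_at
    deployed_before_incident: bool  # a deploy completed between merge and incident start
    touches_emitter_line: bool      # diff modifies the pinned file:line
    changes_trigger_or_value: bool  # diff alters the emission condition or emitted value
    touches_nearby_code: bool       # diff is close to the emitter but not on it
    in_affected_area: bool          # same repo / service as the emitter


def classify(c: Candidate) -> str:
    """Map a candidate change to one of the four classification labels."""
    # Wrong timing (or never deployed) rules a change out regardless of its diff.
    if not (c.merged_before_incident and c.deployed_before_incident):
        return "ruled_out"
    if c.touches_emitter_line and c.changes_trigger_or_value:
        return "mechanism_confirmed"
    if c.touches_emitter_line or c.touches_nearby_code:
        return "mechanism_plausible"
    if c.in_affected_area:
        return "area_only"
    return "ruled_out"
```

Checking timing first keeps a well-matched but undeployed diff from being promoted on code evidence alone.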

## Branch output

```
Most relevant changes:
- <PR/SHA/tag> [evidence: gh_pr | gh_run | gh_release | gh_api | local_git]
emitting signal: "<verbatim error/metric>"
emitting file:line: <path>:<line>
diff touches emitter line?: yes | no
classification: mechanism_confirmed | mechanism_plausible | area_only | ruled_out
Strongest causal candidate:
<change> — timestamp | artifact | surface | deploy status (deployed | merged only) | trace: diff→emitter→observation
Changes ruled out:
- <change> — why it doesn't fit (wrong file, wrong surface, wrong timing, not deployed)
Best verification or rollback step:
<concrete command or action>
```

Deploy status unknown? Say so explicitly and lower candidate confidence.
`area_only` classification? Say so explicitly — don't promote to "strongest candidate" without a mechanism trace.

## No-data fallback

If no repos are mounted (`ENABLE_CODEBASES=0`, no paths given) AND `$gh` is unavailable or returns nothing useful:

Return exactly:

```
Most relevant changes: (none — no repo / gh access)
Strongest causal candidate: (unavailable)
Best verification or rollback step: Operator should check deploy timestamps and config-change audit out-of-band.
```

Do not infer changes from telemetry alone. Do not cite PRs / SHAs / file paths that were not actually retrieved.

## Branch guardrails

- `merged` ≠ `deployed`.
- No blame without a traced mechanism: `diff → emitter → observation`. "Area match" is not a mechanism.
- Current checkout ≠ incident-window state.
- File paths, SHAs, PR numbers, run URLs over vague prose.
- Never skip codebase-grep. If the orchestrator didn't spawn it, spawn it yourself for the signals you're trying to explain before proposing a PR candidate.
@@ -0,0 +1,75 @@
# Branch: telemetry sweep

You were spawned as a telemetry-sweep subagent — this file is everything you need. You don't synthesize a verdict; you surface facts + verbatim signals for codebase-grep + a completeness-check report.

Read-only evidence gathering from Tsuga. This branch does not declare root cause on its own — it produces facts the orchestrator synthesizes.

## Inputs

Time window, scope hint (service / cluster / env / customer / monitor), reported symptom. Missing scope → return only the discovery steps needed to resolve it.

## Procedure

1. **Monitor anchor (if the case cites a monitor).** `tsuga monitors get <monitor-id>` FIRST. Read the monitor's filter, aggregation, threshold, and groupBy — this IS the exact telemetry shape that crossed. Re-run the same query against the incident window AND a control window (same weekday + hour, 7 days earlier). Record the crossed value, the control value, and the ratio.
2. **Normalize scope.** `tsuga services list|get`. Capture canonical name, env, team, versions, sources, 24h log/trace counts.
3. **Normalize session.** Check `tsuga config` (active key, default cluster). Always set explicit `--from`, `--to`, `--max-results`. For `tsuga aggregation`, convert windows to epoch seconds; on multi-cluster orgs include `"clusterId"` in the body.
4. **Load tech knowledge.** If scope names a known tech (Postgres, Redis, Kafka, …), load the matching `$knowledge-technology` reference to target the sweep.
5. **Config-threshold preflight (capacity-shaped symptoms).** If the reported symptom is capacity-shaped — queue lag, `CrashLoopBackOff`, `OOMKilled`, throttling, "too many", "insufficient" — spend one probe asking _"is there a single config knob that would fix this?"_ before any elaborate change-correlation. Grep mounted codebases / helm / Pulumi for patterns like `*BatchSize`, `*PoolSize`, `*Concurrency`, `*MaxConnections`, `*FailureThreshold`, `*InFlightBatches`, `*MemPoolSize` scoped to the affected service.
6. **Evidence sweep.** Prefer in order:
- `logs new-error-patterns` / `logs error-pattern-increases` (when team scope exists)
- `logs patterns` to cluster failure shapes
- `logs search` only after pattern discovery
- `traces search` for exact failing spans
- `aggregation scalar|timeseries` for counts, rates, comparisons
- `monitors list|get` for signal semantics (not live truth)
- `dashboards list|get` / `quality-reports list` as supporting context only
7. **Compare.** Bad window vs good control window. Affected entity vs sibling healthy entity when possible.
8. **Surface verbatim signals for codebase-grep.** As the sweep produces error strings, log patterns, metric names, and monitor filters, emit them as a distinct list at the end of the output — one per line. The orchestrator will spawn a codebase-grep subagent per entry to pin each signal to its emitting `file:line`. Do not try to explain what a signal _means_ until its emitting code is found.
9. **Write evidence matrix.** Four columns: symptom evidence | subsystem evidence | mechanism clues | unknowns. If evidence only supports subsystem diagnosis, say so.
10. **Sweep completeness check.** Before returning, tick these boxes — if any is unchecked and cheap to resolve, do it now:
- [ ] Service metadata resolved (canonical name, team, env, 24h counters)
- [ ] Monitor's own query pulled + replayed against bad + control windows (if case came from a monitor)
- [ ] Config-threshold preflight done (for capacity-shaped symptoms)
- [ ] Error-log patterns scanned (`new-error-patterns` OR `patterns`)
- [ ] Primary metric aggregated in bad window AND control window
- [ ] At least one trace from the failing path inspected (when traces exist)
- [ ] Recent-deploy correlation asked (even if the answer is "no data here — defer to change branch")
- [ ] Verbatim signals surfaced for the codebase-grep branch
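
The window arithmetic in steps 1 and 3 can be sketched as follows, assuming timezone-aware UTC datetimes (in local time a daylight-saving change could shift the hour). The function names are hypothetical, and nothing here calls the `tsuga` CLI.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def control_window(start: datetime, end: datetime,
                   days_back: int = 7) -> tuple[datetime, datetime]:
    """Shift the incident window back by whole days; 7 keeps weekday and hour."""
    shift = timedelta(days=days_back)
    return start - shift, end - shift


def to_epoch_seconds(ts: datetime) -> int:
    """Convert an aware datetime to whole epoch seconds."""
    return int(ts.timestamp())


def bad_to_control_ratio(bad: float, control: float) -> Optional[float]:
    """Return bad / control, or None when the control value is zero."""
    return None if control == 0 else bad / control


incident_start = datetime(2024, 5, 1, 10, 0, tzinfo=timezone.utc)
incident_end = datetime(2024, 5, 1, 11, 0, tzinfo=timezone.utc)
ctrl_start, ctrl_end = control_window(incident_start, incident_end)
```

Returning `None` for a zero control value avoids reporting an infinite or misleading ratio; the raw bad and control values should still be recorded.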

## No-data honesty

A metric or log pattern being absent is NOT equivalent to its value being zero. Causes of absence include receiver scope / permission issues, feature not enabled, instrumentation gap, or scrape failure.

When you checked and found nothing, say which:

- `(absent)`: did not appear in window
- `(not instrumented)`: scope lacks the receiver
- `(denied)`: permission error
- `(empty)`: query ran, returned 0 rows

Never report silent absence as "metric is zero" in a Validated claim.
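
One way to keep a missing value from collapsing into zero is to make the absence label mandatory whenever no value was observed. This is a minimal sketch; the `Absence` enum and `report_value` are hypothetical names, and only the four labels come from the text above.

```python
from enum import Enum
from typing import Optional


class Absence(Enum):
    ABSENT = "(absent)"                      # did not appear in window
    NOT_INSTRUMENTED = "(not instrumented)"  # scope lacks the receiver
    DENIED = "(denied)"                      # permission error
    EMPTY = "(empty)"                        # query ran, returned 0 rows


def report_value(value: Optional[float], absence: Optional[Absence]) -> str:
    """Render a measured value, or the explicit reason there is none."""
    if value is not None:
        return str(value)
    if absence is None:
        # Refuse to guess: a missing value must say why it is missing.
        raise ValueError("missing value needs an explicit absence label")
    return absence.value
```

A measured `0.0` and a `(denied)` query then render differently, so a reader cannot confuse the two.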

## Branch output

```
Observed symptom: <one sentence>
Monitor anchor: <monitor id + filter + threshold, or (none) if case didn't cite a monitor>
Confirmed failing subsystem: <one sentence, or (unknown)>
Signals that support it:
- <fact with exact value> [evidence: tsuga_logs | tsuga_traces | tsuga_aggregation | tsuga_monitors | service_metadata]
Signals that do not yet support causality:
- <what you checked that was silent>
Control-window comparison: <bad vs good: counts / rates / ratio, or (skipped) with reason>
Verbatim signals for codebase-grep:
- "exact error string 1"
- "exact error string 2"
- metric.name.to_grep
- log pattern
Config-threshold preflight result: <summary, or (N/A — non-capacity symptom)>
Best next non-Tsuga check: <one action, or (none)>
```

Every claim carries `[evidence: …]`. No tag = hypothesis, belongs in the non-causal section.
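
The tagging rule lends itself to a mechanical check. This sketch assumes each claim sits on a single line and that any non-empty `[evidence: ...]` tag counts; `split_claims` is a hypothetical helper, and it does not validate the tag names themselves.

```python
import re

# Matches a non-empty evidence tag such as "[evidence: tsuga_logs]".
EVIDENCE_TAG = re.compile(r"\[evidence:\s*[^\]\s][^\]]*\]")


def split_claims(lines: list[str]) -> tuple[list[str], list[str]]:
    """Separate tagged claims from untagged ones (which are hypotheses)."""
    tagged, untagged = [], []
    for line in lines:
        (tagged if EVIDENCE_TAG.search(line) else untagged).append(line)
    return tagged, untagged
```

Anything returned in the second list belongs in the "Signals that do not yet support causality" section.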

## Branch guardrails

- `metrics list|get` is metadata. Use `aggregation` for values.
- Monitor definitions are clues, not live truth.
- Do not claim deploy or config causality from Tsuga alone.
- Exact counts and windows > prose summaries.
- Stop at `symptom diagnosis only` when you can't tie the subsystem to a trigger.

More detail on `tsuga` command patterns: [tsuga-rules.md](./tsuga-rules.md).
@@ -62,9 +62,13 @@ Numbered (plural — avoid single-root-cause framing). Each carries:
file:line pins · *Verified:* how it was confirmed [evidence: tag].

## Mitigation & action items
- One line for what was done during the incident (or "none required").
- | # | Action | Owner | Ticket | Status |
- Owner = team, not person. Ticket "—" until filed.
+ **Mitigation:** what stopped the bleeding during the incident — the action, when it
+ took effect, and whether it's temporary (cause still live) or also the durable fix.
+ "None required" if impact self-resolved or was never user-visible.
+
+ Then the durable corrective + preventive follow-ups:
+ | # | Action | Type | Owner | Ticket | Status |
+ Type = root-fix / preventive. Owner = team, not person. Ticket "—" until filed.
End with **Verify fixes:** the observable signal that proves each fix worked.

## Open questions