Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 4 additions & 2 deletions AGENTS.md
Original file line number Diff line number Diff line change
Expand Up @@ -15,6 +15,7 @@ plugins/telemetry/
```

Notes:

- **No top-level `skills/` directory.** Every skill lives under its plugin's subdir.
- **No `.codex-plugin/` directory.** Codex consumes the same `.claude-plugin/marketplace.json` (verified: `~/.codex/.tmp/marketplaces/<name>/.claude-plugin/marketplace.json` is what `codex plugin list` reads).
- **No `skills[]` array in marketplace.json.** Skills are auto-discovered from each plugin's `skills/` directory. The `skills[]` filter is only meaningful in single-plugin marketplaces and is silently ignored when multiple plugins share a `source`.
Expand All @@ -31,6 +32,7 @@ Notes:
- Patch (`0.6.1 → 0.6.2`): content tweaks, doc fixes, clarifications.
- Minor (`0.6.x → 0.7.0`): new skill, new reference doc, new behavior.
- Major (`0.x → 1.0`): breaking change to a SKILL.md contract (rare).

3. Open the PR as a draft. Use the `open-pr` skill if available.

## When adding a new skill
Expand Down Expand Up @@ -164,7 +166,7 @@ Code-reading skills (`signal-choice-advisor`, the `tsuga-audit-*` and `tsuga-smo

## Runtime behavior rules

These govern how skills *behave when executing* (distinct from the authoring/repo conventions above). They apply to all Tsuga skills; each skill's SKILL.md inlines the rules critical to its workflow. This section is the canonical source.
These govern how skills _behave when executing_ (distinct from the authoring/repo conventions above). They apply to all Tsuga skills; each skill's SKILL.md inlines the rules critical to its workflow. This section is the canonical source.

1. **CLI-first, no external calls.** Only `tsuga` commands. No curl, no direct API calls, no web browsing during skill execution.

Expand All @@ -184,7 +186,7 @@ These govern how skills *behave when executing* (distinct from the authoring/rep

9. **Scope containment.** If a request spans multiple skills, address your skill's scope, state what evidence you found, then direct the user to the other skill for the rest.

10. **Stale data acknowledgment.** `monitors list`, `services list`, and `quality-reports list` return config/snapshot state, not live state. State the query time in output. For quality reports, flag if `generatedAt` > 48h ago.
10. **Stale data acknowledgment.** `monitors list`, `services list`, and `quality-reports list` return config/snapshot state, not live state. State the query time in output. For quality reports, derive the report timestamp as `min(rows.createdAt)` and flag if > 48h ago.

### Addendum: instrumentation-quality skills

Expand Down
25 changes: 15 additions & 10 deletions plugins/telemetry/skills/otel-go/references/audit-checklist.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,7 @@
# Audit Checklist — Go OTel

> This file covers two audit modes:
>
> 1. **Code anti-patterns** (static) — section "Anti-Patterns to Flag"
> 2. **Cross-signal live audit** (tsuga CLI) — section "Cross-Signal Audit Workflow"
>
Expand Down Expand Up @@ -28,12 +29,12 @@ go list -m go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc

Expected minimum versions:

| Module | Minimum |
|---|---|
| `go.opentelemetry.io/otel` | v1.x (latest stable) |
| `go.opentelemetry.io/otel/sdk` | v1.x (latest stable) |
| `go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc` | v1.x |
| `go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc` | v1.x |
| Module | Minimum |
| ------------------------------------------------------------------- | -------------------- |
| `go.opentelemetry.io/otel` | v1.x (latest stable) |
| `go.opentelemetry.io/otel/sdk` | v1.x (latest stable) |
| `go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc` | v1.x |
| `go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc` | v1.x |

Check for module version consistency — `otel`, `otel/sdk`, and `otel/exporters` packages should use the same major version.

Expand Down Expand Up @@ -156,6 +157,7 @@ tsuga metrics list --filter "service.name=<your-service>"
```

If no spans appear:

1. Check `OTEL_EXPORTER_OTLP_ENDPOINT` — for gRPC, do not include `http://`
2. Verify the collector is reachable: `grpc_cli ls localhost:4317`
3. Confirm `otel.SetTracerProvider(tp)` is called before any `otel.Tracer()` calls
Expand All @@ -179,6 +181,7 @@ tsuga services get <id>
Capture `sources[]`, `logsCount24h`, `tracesCount24h`, `errorLogsCount24h`, `errorTracesCount24h`.

Classify:

- No signals at all → stop; direct to `otel-instrumentation`
- Logs only / Traces only / Metrics only → note what's missing; continue what's present
- All 3 present → continue full audit
Expand Down Expand Up @@ -230,20 +233,22 @@ In sampled metric attributes: flag `user.id`, `request.id`, raw URL paths, trace
### Step 9 — Quality report

```bash
tsuga quality-reports list
tsuga quality-reports list # add `--cluster <id>` in multi-cluster orgs
```

Note `generatedAt` — flag as stale if > 48h ago. Note rule failures attributed to this service.
Returns a flat array of rule-evaluation rows. Filter `.[] | select(.status == "failed")` for rule failures attributed to this service. Report timestamp = `min(.[].createdAt)`; flag as stale if > 48h ago.

### Step 10 — Source code check (if path provided)

Read SDK init file. Check for:

- `TracerProvider`, `MeterProvider`, `LoggerProvider` initialization
- Log bridge setup (`otelslog.NewHandler` or `otelzap.NewCore`)
- Resource attributes set at init (`service.name`, `service.version`, `deployment.environment.name`)
- Exporter endpoint: **fail** if hardcoded (`WithEndpoint("http://localhost:4317")` in code); **pass** if zero-arg constructor AND `OTEL_EXPORTER_OTLP_ENDPOINT` is in deployment config

If metric creation code is available, check instrument type:

- Description containing "current", "active", "open", "in-flight", or "pending" → must use `UpDownCounter` (not `Counter`)
- Description containing "duration", "latency", "time", or "size" → must use `Histogram` (not `Counter` or `Gauge`)

Expand Down Expand Up @@ -300,7 +305,7 @@ service.name: <present/MISSING> | service.version: <present/MISSING>
Endpoint hardcoded: ✅ no / ❌ yes (file:line) | OTEL_EXPORTER_OTLP_ENDPOINT configured: ✅ / ❌ / ⚠️ not checked

## Quality Report
<N rule failures / No failures / ⚠️ Report stale — generated: <generatedAt>>
<N rule failures / No failures / ⚠️ Report stale — generated: <min(rows.createdAt)>>

## Recommended Actions (in order of impact)
1. <highest-impact fix> — run `<specific skill>` for detailed guidance
Expand All @@ -310,7 +315,7 @@ Endpoint hardcoded: ✅ no / ❌ yes (file:line) | OTEL_EXPORTER_OTLP_ENDPOINT c
- Condensed audit: 5-record samples per signal; use tsuga-audit-metrics / tsuga-audit-logs / tsuga-audit-traces for full per-signal analysis
- Metric check covers spot-check of up to 3 metric names only
- Source code findings (if any) are from the provided path only
- Quality report reflects state at generatedAt, not live state
- Quality report reflects state at the row `createdAt` timestamps, not live state
```

---
Expand Down
42 changes: 23 additions & 19 deletions plugins/telemetry/skills/tsuga-audit-metrics/SKILL.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
---
name: tsuga-audit-metrics
description: "Use when asked to review metric design, check metric naming, audit metric quality, or validate a custom metric."
description: 'Use when asked to review metric design, check metric naming, audit metric quality, or validate a custom metric.'
---

# Tsuga: Audit Metrics
Expand Down Expand Up @@ -29,7 +29,7 @@ description: "Use when asked to review metric design, check metric naming, audit

5. For code-side audit (instrument type correctness, OTel SDK init, metric view configuration) → `otel-<lang>/references/audit-checklist.md`

6. `tsuga quality-reports list` — check if any rule failures relate to metric naming or instrumentation for this service. Note `generatedAt`; flag if > 48h ago.
6. `tsuga quality-reports list` (pass `--cluster <id>` in multi-cluster orgs) — flat array of rule-evaluation rows. Filter `.[] | select(.status == "failed")` for failures related to metric naming or instrumentation for this service. Report timestamp = `min(.[].createdAt)`; flag if > 48h ago.

7. **Stop and validate with user.** Before presenting final findings, share preliminary observations: "Here is what I found so far — [summary of naming issues, instrument type concerns, cardinality risks]. Does this match your understanding of how metrics are instrumented in this service?" Adjust findings based on user context before concluding.

Expand All @@ -47,15 +47,15 @@ Source: OTel specification (https://opentelemetry.io/docs/specs/otel/metrics/sup
| `{request}` | Request counts |
Never encode units in the metric name — set the `unit` field instead.

| Rule | Bad example | Good example |
|---|---|---|
| No service name in metric name | `web_backend_request_count` | `http.server.request.count` (service identity in `context.service.name`) |
| No environment/version in metric name | `prod_latency_ms`, `v2_errors` | `http.server.request.duration` |
| No units in metric name | `latency_ms`, `memory_bytes` | `request.duration` with `unit: ms` |
| OTel dot notation (not underscores) | `http_server_request_duration` | `http.server.request.duration` |
| Verb-object pattern | `duration` (missing namespace) | `http.server.request.duration` |
| Counter must not describe current state | `active_connections` as Counter | `active_connections` as UpDownCounter |
| Histogram for distributions | `request_duration` as Counter | `request.duration` as Histogram |
| Rule | Bad example | Good example |
| --------------------------------------- | ------------------------------- | ------------------------------------------------------------------------ |
| No service name in metric name | `web_backend_request_count` | `http.server.request.count` (service identity in `context.service.name`) |
| No environment/version in metric name | `prod_latency_ms`, `v2_errors` | `http.server.request.duration` |
| No units in metric name | `latency_ms`, `memory_bytes` | `request.duration` with `unit: ms` |
| OTel dot notation (not underscores) | `http_server_request_duration` | `http.server.request.duration` |
| Verb-object pattern | `duration` (missing namespace) | `http.server.request.duration` |
| Counter must not describe current state | `active_connections` as Counter | `active_connections` as UpDownCounter |
| Histogram for distributions | `request_duration` as Counter | `request.duration` as Histogram |

## GOOD/BAD Examples

Expand All @@ -80,11 +80,14 @@ Never encode units in the metric name — set the `unit` field instead.
### Cardinality Examples

**BAD:** Metric dimension `user_id` (or `user.id`)

```
histogram.record(duration, {"user.id": userId, "http.route": "/api/orders"})
```

**Why wrong:** `user_id` has unbounded cardinality — one unique value per user → millions of time series.
**GOOD:** Remove `user.id` from metric dimensions; use it as a span attribute instead:

```
histogram.record(duration, {"http.route": "/api/orders"}) // Low-cardinality dims only
span.setAttribute("user.id", userId) // High-cardinality → spans
Expand All @@ -98,12 +101,12 @@ span.setAttribute("user.id", userId) // High-cardinality

## Instrument Type Rules

| Instrument | Correct use |
|---|---|
| Counter | Monotonically increasing totals (requests processed, errors, retries) |
| UpDownCounter | Fluctuating current values (queue depth, active connections, cache size) |
| Histogram | Distributions — latency, payload size, queue wait time |
| Observable Gauge | Values sampled at collection time (CPU %, memory usage, thread count) |
| Instrument | Correct use |
| ---------------- | ------------------------------------------------------------------------ |
| Counter | Monotonically increasing totals (requests processed, errors, retries) |
| UpDownCounter | Fluctuating current values (queue depth, active connections, cache size) |
| Histogram | Distributions — latency, payload size, queue wait time |
| Observable Gauge | Values sampled at collection time (CPU %, memory usage, thread count) |

**Detection heuristic (source: CLI):** If a metric's description contains "current", "active", "open", "in-flight", or "pending" AND its type is Counter: likely wrong instrument type.

Expand Down Expand Up @@ -141,8 +144,8 @@ Metrics audited: <N> | Naming issues: <N> | Instrument type issues: <N> | Cardin
| <metric.name> | <attr> | >100 (limit reached) | High — possible per-request identifier | tsuga CLI |

## Quality Report Correlation
<N> quality rule failures relate to metric instrumentation (report generated: <generatedAt>)
[If generatedAt > 48h ago: ⚠️ Quality report is stale — results may not reflect current state]
<N> quality rule failures relate to metric instrumentation (report generated: <min(rows.createdAt)>)
[If derived timestamp > 48h ago: ⚠️ Quality report is stale — results may not reflect current state]

## Recommended Actions
1. Rename <metric.name> to <corrected.name> — eliminates service-identity-in-name violation
Expand All @@ -157,6 +160,7 @@ Metrics audited: <N> | Naming issues: <N> | Instrument type issues: <N> | Cardin
```

## Related Skills / Next Steps

- `tsuga-metric-naming-fix` — apply renames found in this audit
- `tsuga-smoke-test` — verify metrics after fixes
- `otel-<lang>/references/audit-checklist.md` — code-side audit (instrument type, SDK init, metric views)
Expand Down
35 changes: 18 additions & 17 deletions plugins/tsuga/skills/tsuga-cli/SKILL.md
Original file line number Diff line number Diff line change
Expand Up @@ -101,12 +101,13 @@ tsuga investigations create -d '{
## Quality Reports

```bash
tsuga quality-reports list # get the latest quality report
tsuga quality-reports list | jq '.report.overallScore'
tsuga quality-reports list | jq '.report.teamResults[].teamName'
tsuga quality-reports list # rows for the cluster's latest report
tsuga quality-reports list --cluster <id> # multi-cluster orgs MUST pass --cluster (else 400)
tsuga quality-reports list | jq '.[] | select(.status=="failed") | {ruleId, owner}'
tsuga quality-reports list | jq 'group_by(.owner) | map({owner: .[0].owner, score: .[0].reportOverallScore})'
```

Key fields: `report.generatedAt`, `report.overallScore`, `report.teamResults[]{teamId, teamName, score, weightedScore, ruleResults[]}`. Generated at least once per 24h. Only supports `list`.
Returns a flat list of rows (one per persisted rule evaluation). Key fields per row: `id`, `reportId`, `ruleId`, `owner` (team id or absent for global), `status` (`passed`/`failed`/`ignored`), `score` (0..1), `weight`, `reportOverallScore` and `reportTotalWeight` (per-owner aggregates, identical across rows of the same `reportId`), `createdAt`, optional `recommendation` and `examples`. Generated at least once per 24h. Only supports `list`.

## Services

Expand Down Expand Up @@ -163,18 +164,18 @@ Output: `{"nodes": [{"serviceName": ...}], "edges": [{"from": <svc>, "to": <svc>

Applies to all `--query` flags and aggregation `filter` fields.

| Syntax | Meaning | Example |
| ---------------------- | ------------------------------- | ---------------------------------------------------------------- |
| `field:value` | Exact match | `level:ERROR` |
| `term1 term2` | AND (default) | `level:ERROR context.service.name:api` |
| `term1 OR term2` | OR (must be uppercase) | `level:ERROR OR level:WARN` |
| `NOT term` | Negation (uppercase) | `level:ERROR NOT context.env:staging` |
| `(...)` | Grouping | `(level:ERROR OR level:WARN) context.service.name:api` |
| `(field:a OR field:b)` | Multiple values for one field | `(context.service.name:web-backend OR context.service.name:api)` |
| `field:*` | Field exists | `trace_id:*` |
| `field:[A TO B]` | Range, inclusive | `duration:[100 TO 500]` |
| `field:>N` | Numeric (also `>=`, `<`) | `duration:>100` |
| `<bare token>` | Free-text token in log message | `refused`, `connection refused`, `(timeout OR refused)` |
| Syntax | Meaning | Example |
| ---------------------- | ------------------------------ | ---------------------------------------------------------------- |
| `field:value` | Exact match | `level:ERROR` |
| `term1 term2` | AND (default) | `level:ERROR context.service.name:api` |
| `term1 OR term2` | OR (must be uppercase) | `level:ERROR OR level:WARN` |
| `NOT term` | Negation (uppercase) | `level:ERROR NOT context.env:staging` |
| `(...)` | Grouping | `(level:ERROR OR level:WARN) context.service.name:api` |
| `(field:a OR field:b)` | Multiple values for one field | `(context.service.name:web-backend OR context.service.name:api)` |
| `field:*` | Field exists | `trace_id:*` |
| `field:[A TO B]` | Range, inclusive | `duration:[100 TO 500]` |
| `field:>N` | Numeric (also `>=`, `<`) | `duration:>100` |
| `<bare token>` | Free-text token in log message | `refused`, `connection refused`, `(timeout OR refused)` |

AND is the default (space-separated terms); OR binds looser than AND. `OR` and `NOT` must be uppercase — lowercase `or`/`not` become literal search terms.

Expand Down Expand Up @@ -272,7 +273,7 @@ Example — per-second rate of a counter:
```

- `timeRange` uses **Unix seconds** (not relative strings like `"-1h"`)
- On a **multi-cluster tenant**, `aggregation scalar|timeseries` needs a top-level `"clusterId": "<id>"` **in the body** — the global `--cluster` flag and `defaults.cluster` are *ignored* for `aggregation` (every other command honors them). Without it: `400 clusterId must be provided when the organization has multiple clusters`.
- On a **multi-cluster tenant**, `aggregation scalar|timeseries` needs a top-level `"clusterId": "<id>"` **in the body** — the global `--cluster` flag and `defaults.cluster` are _ignored_ for `aggregation` (every other command honors them). Without it: `400 clusterId must be provided when the organization has multiple clusters`.
- `dataSource` and `formula` are **body-level** fields; query items do not have `id` or `dataSource`
- `formula` references queries by position: `"q1"` = first query, `"q2"` = second, etc.
- `groupBy` is at **body level** (not inside query items): `[{"fields": ["field.name"], "limit": N}]`
Expand Down
Loading
Loading