Agent Performance Report - Week of 2026-02-26

### Performance Summary

- **Agents analyzed:** 16 (from 31 total runs sampled, past 2 days)
- **Total tokens (sample):** ~165M (includes Codex high-parallelism runs)
- **Total cost (today):** ~$5.94 | yesterday: ~$6.14
- **Average quality score:** 86/100 (↓ 3 from 89)
- **Average effectiveness score:** 87/100 (↓ 1 from 88)
- **Top performers:** The Great Escapi, Contribution Check, Daily Safe Outputs Conformance Checker
- **Needs attention:** AI Moderator (missing tool regression), Chroma Issue Indexer (extreme token usage), Semantic Function Refactoring (elevated cost)

### Critical Findings

**❌ P0 Ongoing: Lockdown Token Failures (3+ weeks)**

Four workflows remain locked out: Issue Monster, PR Triage Agent, Daily Issues Report, and Org Health Report. All fix paths are closed (`#17414` and `#17807` were both rejected as "not_planned"), so manual intervention by a repo admin is required. These failures continue to skew ecosystem quality metrics.

**⚠️ AI Moderator GitHub MCP Missing Tool — Regression Detected**

1 of 3 runs today (run [§22453521501](https://github.com/github/gh-aw/actions/runs/22453521501)) reported the `GitHub MCP (read issue/comment content)` tool as missing. This matches the Docker MCP intermittency pattern last seen on 2026-02-24, which was believed resolved by switching to `mode: remote`. With `mode: remote` now also showing intermittency, the root cause may be upstream GitHub MCP availability rather than anything Docker-specific. The other 2 runs succeeded but had very low turn counts (1–2 turns), which may indicate noop runs rather than full processing.

**⚠️ Chroma Issue Indexer — Extreme Token Usage**

Today's run consumed **3.6M tokens** in 10.5 minutes with **102 blocked firewall requests**, the highest blocked count of any workflow today. If the issue index is growing, this trend will worsen. This workflow and Semantic Function Refactoring (72 blocked) are the two largest contributors to the ecosystem's 47% firewall block rate (439/926 requests blocked); together they account for 174 of the 439 blocked requests (about 40%).

<details>
<summary>View Detailed Quality Analysis</summary>

#### Agent Quality Scores (Today)

| Agent | Engine | Quality | Duration | Tokens | Cost | Notes |
|-------|--------|---------|----------|--------|------|-------|
| The Great Escapi | copilot | 94/100 | 3.5m | 74k | — | Ultra-efficient |
| Contribution Check | copilot | 93/100 | 2.8m | 181k | — | Fast, clean |
| Daily Safe Outputs Conformance Checker | claude | 92/100 | 3.1m | 134k | $0.33 | Efficient |
| Auto-Triage Issues | copilot | 90/100 | 3.5m | 136k | — | Success |
| Agent Container Smoke Test | copilot | 90/100 | 4.4m | 174k | — | Clean |
| Smoke Copilot | copilot | 90/100 | 6.7m | — | — | 49 turns, passing |
| Smoke Claude | claude | 87/100 | 12.9m | 991k | $1.47 | 42 turns, long |
| Lockfile Statistics Analysis Agent | claude | 87/100 | 5.0m | 456k | $0.82 | 14 turns, normal |
| AI Moderator (×3) | codex | 82/100 | 7.5–8.9m | 210–372k | — | 1/3 missing tool |
| Scout | claude | 80/100 | 4.9m | 613k | $0.81 | 19 turns |
| Smoke Codex | codex | 80/100 | 6.8m | 32M | — | 17 turns, Codex tokens |
| Slide Deck Maintainer | copilot | 78/100 | 6.7m | 1.5M | — | High tokens |
| Changeset Generator | codex | 75/100 | 8.2m | 123M | — | Codex parallelism |
| Semantic Function Refactoring | claude | 72/100 | 9.1m | 295k | $3.97 | High cost, 12 turns |
| Chroma Issue Indexer | copilot | 68/100 | 10.5m | 3.6M | — | Extreme tokens |

#### Cancelled Runs Analysis

14 runs were cancelled in a batch (runs 22450833xxx–22450834xxx). This is expected behavior from a Release workflow trigger — these represent staggered workflow starts that were cancelled before the new release artifacts were ready. Not a quality issue.

</details>

<details>
<summary>View Effectiveness Metrics</summary>

#### Task Completion Rates (Sampled Agent Runs)
- **High completion (>80%):** 13/15 agent workflows (87%)
- **Partial/Degraded:** AI Moderator (1/3 runs degraded), Chroma Issue Indexer (functional but inefficient)
- **Infrastructure failures (not quality):** Issue Monster, PR Triage Agent, Daily Issues Report, Org Health Report (lockdown)

#### Cost Efficiency Trends

| Agent | Today | Yesterday | Δ |
|-------|-------|-----------|---|
| Semantic Function Refactoring | $3.97 | $4.82 | ↓ $0.85 ✅ |
| Scout | $0.81 | — | New data point |
| Daily Safe Outputs Conformance Checker | $0.33 | — | Consistent |
| Lockfile Statistics Analysis Agent | $0.82 | — | Consistent |
| Smoke Claude | $1.47 | — | Long duration |
| **Total (metered)** | **$5.94** | **$6.14** | ↓ $0.20 ✅ |

#### Firewall Request Analysis

Total 926 requests across all workflows: 487 allowed (53%), 439 blocked (47%).

Top blocked workflows:
1. Chroma Issue Indexer: 102 blocked — likely local socket connections (Serena MCP pattern)
2. Semantic Function Refactoring: 72 blocked — consistent with `"-"` domain pattern
3. Changeset Generator: 61 blocked — Codex parallelism reaching out broadly
4. Slide Deck Maintainer: 43 blocked — investigating
5. Smoke Codex: 38 blocked — expected for engine behavior

The `"-"` domain appearing in blocked list is a known Serena MCP local socket artifact (see issue #18388).
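The percentages in this section can be rechecked with a few lines of arithmetic. The sketch below uses only the counts reported above; the variable and function names are illustrative and not part of any gh-aw tooling.

```python
# Figures copied from the firewall section of this report.
total_requests = 926
blocked_total = 439

# Blocked requests for the five workflows listed above.
blocked_by_workflow = {
    "Chroma Issue Indexer": 102,
    "Semantic Function Refactoring": 72,
    "Changeset Generator": 61,
    "Slide Deck Maintainer": 43,
    "Smoke Codex": 38,
}


def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


block_rate = share(blocked_total, total_requests)
top_five_blocked = sum(blocked_by_workflow.values())
top_five_share = share(top_five_blocked, blocked_total)

print(f"Ecosystem block rate: {block_rate}%")
print(f"Top five workflows: {top_five_blocked} blocked ({top_five_share}% of all blocked)")
for name, count in blocked_by_workflow.items():
    print(f"  {name}: {share(count, blocked_total)}% of blocked requests")
```

On these numbers the five listed workflows account for 316 of the 439 blocked requests, so the remaining blocked requests are spread across the other workflows.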

</details>

<details>
<summary>View Behavioral Patterns</summary>

#### Productive Patterns ✅
- **Release → Smoke cancellation → Re-run:** Expected orchestration behavior, not a failure
- **Daily Safe Outputs Conformance Checker:** Continues to be highly efficient (3 turns, $0.33)
- **The Great Escapi:** Maintaining minimal footprint, high reliability across 2+ weeks

#### Problematic Patterns ⚠️
- **AI Moderator GitHub MCP intermittency:** 3rd occurrence of the missing-tool issue. Switching to `mode: remote` on 2026-02-24 was supposed to fix it, but 1 of 3 runs today was missing GitHub MCP again. The failure is silent: the moderation trigger runs but does nothing. Impact: ~33% of moderation events missed today.
- **Semantic Function Refactoring high cost:** 12th consecutive day of elevated cost. Despite a slight improvement ($4.82→$3.97), it is still about 12× the cheapest claude workflow (Daily Safe Outputs Conformance Checker, $0.33) and roughly 5× Scout and Lockfile Statistics Analysis Agent ($0.81–$0.82). Root cause under investigation via issue #18388.
- **Chroma Issue Indexer token growth:** 3.6M tokens is abnormally high for an issue indexer. If the issue backlog is growing, token usage will likely keep growing with it. No tracking issue has been created yet.
- **Codex extreme token counts:** Changeset Generator (123M) and Smoke Codex (32M) show Codex engine's parallel-context behavior. Not quality issues but skew overall token metrics significantly.
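
To put the Semantic Function Refactoring cost in context, the sketch below computes how its $3.97 compares with each of the other metered claude workflows in today's quality table. The costs are copied from this report; the helper name `cost_ratios` is illustrative only.

```python
# Metered (claude) costs for today in USD, copied from the quality table.
costs_usd = {
    "Semantic Function Refactoring": 3.97,
    "Smoke Claude": 1.47,
    "Lockfile Statistics Analysis Agent": 0.82,
    "Scout": 0.81,
    "Daily Safe Outputs Conformance Checker": 0.33,
}


def cost_ratios(costs: dict, target: str) -> dict:
    """Return how many times more expensive `target` is than each other workflow."""
    target_cost = costs[target]
    return {
        name: round(target_cost / cost, 1)
        for name, cost in costs.items()
        if name != target
    }


ratios = cost_ratios(costs_usd, "Semantic Function Refactoring")
for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{name}: {ratio}x")
```

The multiple depends heavily on the baseline: it is about 12x against the cheapest workflow and between roughly 3x and 5x against the others.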

#### Ecosystem Coverage Assessment
- ✅ Security: The Great Escapi active and efficient
- ✅ Code quality: Smoke tests (Copilot/Claude/Codex) passing on main
- ✅ Documentation: Slide Deck Maintainer running (high tokens, worth monitoring)
- ✅ Release: Workflow completed successfully today
- ⚠️ Issue triage: AI Moderator intermittent (33% miss rate today)
- ❌ Issue monitoring: Issue Monster, Daily Issues Report locked out

</details>

### Recommendations

#### High Priority

1. **Investigate AI Moderator GitHub MCP reliability** — 3rd incident in a week
   - The 1/3 miss rate today suggests `mode: remote` is not a reliable fix
   - Consider: adding retry logic, fallback to `mode: local` if remote unavailable, or alert on noop runs
   - Affected run: [§22453521501](https://github.com/github/gh-aw/actions/runs/22453521501)

2. **Chroma Issue Indexer token usage investigation** — 3.6M tokens is a new high
   - Determine if issue backlog growth is expected or indicates runaway indexing
   - Its 102 blocked firewall requests are also the highest in the ecosystem; determine what it is attempting to reach
   - Consider creating an issue to track and cap maximum tokens per run
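
The retry-and-fallback idea in recommendation 1 could look roughly like the sketch below. It is a generic pattern under stated assumptions, not how gh-aw or the GitHub MCP server is configured: the `probe` callable (a check that the required tool is reachable in a given mode) and the function name are hypothetical, and the mode names simply reuse the `remote` and `local` labels from this report.

```python
import time
from typing import Callable, Optional, Sequence


def first_working_mode(
    probe: Callable[[str], bool],
    modes: Sequence[str] = ("remote", "local"),
    attempts_per_mode: int = 3,
    delay_seconds: float = 0.0,
) -> Optional[str]:
    """Try each mode in order, retrying a few times, and return the first that works.

    Returns None when every mode fails, so the caller can raise an alert
    instead of letting the run finish silently as a noop.
    """
    for mode in modes:
        for _ in range(attempts_per_mode):
            if probe(mode):
                return mode
            if delay_seconds:
                time.sleep(delay_seconds)
    return None


# Hypothetical probe for illustration: "remote" always fails,
# "local" succeeds on its second attempt.
calls = {"remote": 0, "local": 0}


def fake_probe(mode: str) -> bool:
    calls[mode] += 1
    return mode == "local" and calls[mode] >= 2


selected = first_working_mode(fake_probe)
print(selected, calls)
```

Returning `None` rather than a default mode keeps total failure visible, which addresses the silent-failure concern raised for the AI Moderator runs.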

#### Medium Priority

3. **Semantic Function Refactoring cost** — Slight improvement ($3.97) but still high
   - Issue #18388 exists; check if any action has been taken
   - 72 blocked requests suggest scope creep beyond allowed network

4. **Lockdown P0 escalation** — All programmatic fix paths closed (#17414, #17807 both "not_planned")
   - 4 workflows generating failure noise daily
   - Recommend direct escalation to repository maintainers (not via issue)

#### Low Priority

5. **Smoke Claude duration** — 12.9m and 42 turns is the longest smoke test
   - All other smokes complete in <7m; investigate whether Smoke Claude is testing more or is stuck in retry loops

### Trends (7-day)

- Agent quality: 86/100 (↓ from 89 — AI Moderator regression and Chroma concern)
- Total metered cost: $5.94 (↓ from $6.14 — small improvement)
- Firewall block rate: 47% (stable/elevated — "-" domain artifacts persist)
- Smoke test health: ✅ All passing on main
- Lockdown failures: 4 workflows (→ unchanged, 3+ weeks)

### Actions Taken This Run

- Updated `agent-performance-latest.md` in shared repo memory
- Updated `shared-alerts.md` with AI Moderator regression and Chroma concern
- Generated this performance report discussion

---
> Analysis period: 2026-02-25 → 2026-02-26
> Next report: 2026-02-27
> **References:** [§22453850435](https://github.com/github/gh-aw/actions/runs/22453850435) | [§22408567616](https://github.com/github/gh-aw/actions/runs/22408567616) | [§22453521501](https://github.com/github/gh-aw/actions/runs/22453521501)

---

> [!WARNING]
> This was intended to be a discussion, but discussions could not be created due to permissions issues. This issue was created as a fallback.
>
> Discussion creation may fail if the specified category is not announcement-capable. Consider using the "Announcements" category or another announcement-capable category in your workflow configuration.




> Generated by [Agent Performance Analyzer - Meta-Orchestrator](https://github.com/github/gh-aw/actions/runs/22453850435)
> - [x] expires on Feb 27, 2026, 5:48 PM UTC
