Add A/B experiment for issue-arborist prompt_style (#30015) by Copilot · Pull Request #30016 · github/gh-aw

Copilot · 2026-05-03T21:42:02Z

Sets up a 50/50 A/B experiment on issue-arborist to determine whether the current verbose 5-step prompt (detailed) outperforms a concise 4-step variant by ≥15% on links_created per run.

Frontmatter changes

Added experiments.prompt_style block with concise/detailed variants, Mann-Whitney analysis, 30-sample minimum, empty_output_rate == 0 guardrail, and issue: 30015
Fixed guardrail_metrics.threshold to use expression syntax "==0" (schema requires operator-prefixed string, not bare float)

Prompt changes

Wrapped the existing prompt body in {{#if experiments.prompt_style == "detailed"}} / {{else}} / {{/if}}:

detailed — original prompt preserved verbatim (5 steps, jq examples, output template)
concise — new 4-bullet variant directing the model to self-direct via jq, create parent issues, link pairs, and post a create_discussion summary

experiments:
  prompt_style:
    variants: [concise, detailed]
    metric: links_created
    secondary_metrics: [run_duration_ms, discussion_created]
    guardrail_metrics:
      - name: empty_output_rate
        threshold: "==0"
    min_samples: 30
    weight: [50, 50]
    analysis_type: mann_whitney
    start_date: "2026-05-05"
    issue: 30015

Compiler validation

Added validateNoPreExpandedExperimentPlaceholders() in template_validation.go which raises a compile-time error if the internal __GH_AW_EXPERIMENTS_*__ placeholder form is used directly in a {{#if ...}} condition. Authors must use the experiments.<name> dot-notation form (e.g. experiments.prompt_style == "detailed").

The validation uses the existing TemplateIfPattern from expression_patterns.go, which correctly handles conditions containing embedded ${{ ... }} blocks without false negatives.

Lock file regenerated via gh aw compile issue-arborist.

Agent-Logs-Url: https://github.com/github/gh-aw/sessions/8d84e107-4daa-480c-b595-fde7c844eddf Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>

pelikhan · 2026-05-03T21:58:23Z

+      - name: empty_output_rate
+        threshold: "==0"
+    min_samples: 30
+    owner: "@team-agents"


@copilot remove this line

Done in 85d4daa — removed the owner line.

pelikhan · 2026-05-03T21:59:42Z


 ---

+{{#if __GH_AW_EXPERIMENTS__PROMPT_STYLE___detailed}}


@copilot this conditions is pre expanded and should be experiments... instead. Update compiler to raise an error on this kind of syntactic errors.

Fixed in 85d4daa — changed the condition to {{#if experiments.prompt_style == "detailed"}} (the experiments.<name> dot-notation form).

Also added validateNoPreExpandedExperimentPlaceholders() in template_validation.go which scans all {{#if ...}} condition expressions and raises an error if any __GH_AW_EXPERIMENTS_*__ placeholder is found, with a hint to use experiments.<name> instead. Wired into the compiler in compiler_orchestrator_frontmatter.go.

…tion for pre-expanded placeholders Agent-Logs-Url: https://github.com/github/gh-aw/sessions/86f11b7c-590c-4f98-b723-e57b6871ff4f Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>

github-actions · 2026-05-03T22:24:58Z

⚠️ Design Decision Gate — Implementation Diverges from ADR

This PR implements an A/B experiment for issue-arborist prompt_style, which falls under the A/B experiment framework documented in the linked ADR.

ADR reviewed: ADR-29534: Frontmatter A/B Experiments with Balanced Variant Selection

Verification Summary

The PR is broadly aligned with the A/B experiment design — it uses the experiments frontmatter block, exposes GH_AW_EXPERIMENTS_PROMPT_STYLE env var, threads experiment outputs through the activation job, and adds a helpful compiler validation guard (validateNoPreExpandedExperimentPlaceholders). However, three normative requirements from Part 2 of the ADR are not met by this implementation.

Divergences Found

1. State persistence mechanism (ADR Requirement #7)

The ADR (Requirement #7) requires:

Variant counts MUST be persisted between workflow runs using the GitHub Actions cache.

This PR persists state to a dedicated git branch (experiments/issuearborist) via a new push_experiments_state job using push_experiment_state.cjs and load_experiment_state_from_repo.cjs. This is a different persistence mechanism than specified.

2. Artifact naming (ADR Requirement #9)

The ADR (Requirement #9) requires:

Implementations MUST upload the experiment state directory as an artifact named experiment.

This PR uploads and downloads the artifact as issuearborist-experiment — not experiment.

3. Activation job structure (ADR Requirement #15)

The ADR (Requirement #15) specifies the activation job MUST contain, in order: cache restore, variant selection, cache save (if: always()), and artifact upload. This PR omits the cache save step entirely and replaces it with a separate push_experiments_state job that runs after the agent job.

What to do next

Either:

Update the code to align with ADR-29534 (use GitHub Actions cache for persistence, name artifact experiment, add cache save step in activation job), OR
Update ADR-29534 to reflect the revised decision to use a git branch for state persistence and per-workflow artifact naming (and document why the approach changed from the original design)

The ADR and implementation must be in sync before this PR can merge.

📝 Context: What the ADR covers

ADR-29534 documents the first-class A/B experiment mechanism for gh-aw workflows, including frontmatter schema, variant selection algorithm, state persistence, expression/template integration, and activation job structure. The PR adds the first concrete use of this mechanism (prompt_style on issue-arborist) and also adds a compiler guard to prevent misuse of the internal placeholder form.

The compiler validation addition (validateNoPreExpandedExperimentPlaceholders) is not explicitly covered in ADR-29534 but is consistent with Requirement #14 (placeholders must not reach Handlebars before substitution) — this is a welcome defensive addition and does not constitute a divergence.

🔒 This PR cannot merge until the ADR and implementation are reconciled.

🏗️ ADR gate enforced by Design Decision Gate 🏗️ · ● 117.2K · ◷

github-actions · 2026-05-03T22:25:28Z

🧪 Test Quality Sentinel Report

Test Quality Score: 100/100

✅ Excellent test quality

Metric	Value
New/modified tests analyzed	1 (table-driven, 3 scenarios)
✅ Design tests (behavioral contracts)	1 (100%)
⚠️ Implementation tests (low value)	0 (0%)
Tests with error/edge cases	1 (100%)
Duplicate test clusters	0
Test inflation detected	No (ratio: 1.98:1)
🚨 Coding-guideline violations	None

Test Classification Details

Test	File	Classification	Issues Detected
`TestCompilerRejectsPreExpandedExperimentPlaceholders`	`pkg/workflow/compiler_template_validation_test.go:133`	✅ Design	None

Analysis

TestCompilerRejectsPreExpandedExperimentPlaceholders is a well-structured table-driven test with 3 scenarios:

✅ Happy path — valid experiments.prompt_style dot-notation compiles without error
✅ Error case — __GH_AW_EXPERIMENTS_PROMPT_STYLE__ pre-expanded placeholder is rejected with the correct error message
✅ Error case — __GH_AW_EXPERIMENTS__PROMPT_STYLE___detailed variant form is rejected with the correct guidance message

Design invariant enforced: The compiler must reject workflow markdown files that contain pre-expanded internal experiment placeholder tokens (double-underscore form), directing authors to use the correct experiments.<name> syntax instead. This is a behavioral contract between the compiler and workflow authors.

Build tag: ✅ //go:build !integration present on line 1

No mock libraries used — real compiler instantiation via NewCompiler() ✅

Error assertions include the test case name for context (t.Errorf("Expected error for %s", tt.name)), which is acceptable ✅

Inflation ratio: 85 test lines / 43 production lines = 1.98:1 — just under the 2:1 threshold ✅

Language Support

Tests analyzed:

🐹 Go (*_test.go): 1 test function (3 table-driven scenarios) — unit (//go:build !integration)
🟨 JavaScript (*.test.cjs, *.test.js): 0 tests

Verdict

✅ Check passed. 0% of new tests are implementation tests (threshold: 30%). The test covers real behavioral contracts — both happy-path and two distinct error rejection paths — using real component interactions with no mocking.

📖 Understanding Test Classifications

Design Tests (High Value) verify what the system does:

Assert on observable outputs, return values, or state changes
Cover error paths and boundary conditions
Would catch a behavioral regression if deleted
Remain valid even after internal refactoring

Implementation Tests (Low Value) verify how the system does it:

Assert on internal function calls (mocking internals)
Only test the happy path with typical inputs
Break during legitimate refactoring even when behavior is correct
Give false assurance: they pass even when the system is wrong

Goal: Shift toward tests that describe the system's behavioral contract — the promises it makes to its users and collaborators.

References: §25292452480

🧪 Test quality analysis by Test Quality Sentinel · ● 548.6K · ◷

github-actions

✅ Test Quality Sentinel: 100/100. Test quality is excellent — 0% of new tests are implementation tests (threshold: 30%). The single new table-driven test covers 3 scenarios including 2 error cases, enforces a real behavioral contract, uses no mocks, and has the required build tag.

Copilot

Pull request overview

Adds a 50/50 A/B experiment to the issue-arborist workflow to compare a concise vs detailed prompt, and introduces compiler-time validation to prevent authors from using internal pre-expanded experiment placeholders in {{#if ...}} conditions.

Changes:

Adds experiments.prompt_style configuration to issue-arborist frontmatter and wraps the prompt body in a template conditional to select between detailed and concise variants.
Adds validateNoPreExpandedExperimentPlaceholders() and wires it into frontmatter parsing validation.
Regenerates the compiled lock workflow to include experiment selection/state steps and propagated GH_AW_EXPERIMENTS_* env vars.

Show a summary per file

File	Description
pkg/workflow/template_validation.go	Adds a new validation pass intended to reject internal `__GH_AW_EXPERIMENTS_*__` placeholders in `{{#if ...}}` conditions.
pkg/workflow/compiler_template_validation_test.go	Adds compiler tests asserting placeholder-based `{{#if ...}}` conditions are rejected.
pkg/workflow/compiler_orchestrator_frontmatter.go	Invokes the new experiment-placeholder validation during frontmatter/markdown parsing.
.github/workflows/issue-arborist.md	Declares the `prompt_style` experiment and adds concise-vs-detailed prompt branching.
.github/workflows/issue-arborist.lock.yml	Regenerated compiled workflow including experiment state restore/pick/push and env propagation.

Copilot's findings

Tip

Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comments suppressed due to low confidence (1)

pkg/workflow/template_validation.go:57

preExpandedExperimentPattern only matches the __GH_AW_EXPERIMENTS_<NAME>__ form. It will not match other placeholder-like forms that start with __GH_AW_EXPERIMENTS (e.g. the __GH_AW_EXPERIMENTS__PROMPT_STYLE___detailed string used in the new test), so the validator won’t reject them. Consider broadening this pattern to catch any __GH_AW_EXPERIMENTS... placeholder token(s) that could appear in a condition (or explicitly remove/adjust the test case if that format is not actually produced).

	// preExpandedExperimentPattern matches the internal __GH_AW_EXPERIMENTS_*__ placeholder form
	// that is produced by the runtime and must never be written manually in workflow markdown.
	// Authors should use the experiments.<name> form (e.g. experiments.prompt_style == "detailed").
	preExpandedExperimentPattern = regexp.MustCompile(`__GH_AW_EXPERIMENTS_[A-Z0-9_]+__`)

Files reviewed: 5/5 changed files
Comments generated: 5

+	// templateIfConditionPattern matches the condition expression inside {{#if ...}} tags.
+	// Captures the condition string between #if and the closing }}.
+	templateIfConditionPattern = regexp.MustCompile(`\{\{#if\s+(.*?)\s*\}\}`)
+
+	// preExpandedExperimentPattern matches the internal __GH_AW_EXPERIMENTS_*__ placeholder form
+	// that is produced by the runtime and must never be written manually in workflow markdown.
+	// Authors should use the experiments.<name> form (e.g. experiments.prompt_style == "detailed").
+	preExpandedExperimentPattern = regexp.MustCompile(`__GH_AW_EXPERIMENTS_[A-Z0-9_]+__`)


 // # Validation Functions
 //
 //   - validateNoIncludesInTemplateRegions() - Validates that imports are not inside template blocks
+//   - validateNoPreExpandedExperimentPlaceholders() - Validates that pre-expanded __GH_AW_EXPERIMENTS__*__ placeholders are not used in template conditions


+		{
+			name: "invalid workflow with double-underscore variant form",
+			content: `---
+on: issues
+permissions:
+  issues: read
+strict: false
+---
+
+{{#if __GH_AW_EXPERIMENTS__PROMPT_STYLE___detailed}}
+detailed prompt
+{{/if}}`,
+			shouldError: true,
+			errContains: "use experiments.<name> instead",
+		},


+	// Validate that pre-expanded __GH_AW_EXPERIMENTS_*__ placeholders are not used in template conditions
+	if err := validateNoPreExpandedExperimentPlaceholders(result.Markdown); err != nil {
+		orchestratorFrontmatterLog.Printf("Pre-expanded experiment placeholder validation failed: %v", err)
+		return nil, fmt.Errorf("template condition validation failed: %w", err)
+	}


+1. Use `jq` to identify clusters of 5+ related issues that share a theme but lack a parent.
+2. Create a parent issue (title prefix `[Parent] `) for each cluster and link its members as sub-issues.
+3. Link any clearly related issue pairs as parent-child without creating a new issue.
+4. Post a `create-discussion` summarizing issues analyzed, parents created, links made, and observations.


pelikhan · 2026-05-03T22:28:26Z

@copilot Review all comments

… create_discussion, update tests Agent-Logs-Url: https://github.com/github/gh-aw/sessions/1e53a068-92aa-4e70-a871-d2d02a3318a3 Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>

Copilot · 2026-05-03T22:36:36Z

@copilot Review all comments

Addressed all actionable items in 655f3c3:

template_validation.go:50-57 — replaced the custom templateIfConditionPattern with the existing TemplateIfPattern from expression_patterns.go, which correctly handles embedded ${{ ... }} blocks in conditions.
template_validation.go:11 — fixed docstring: __GH_AW_EXPERIMENTS__*__ → __GH_AW_EXPERIMENTS_*__.
compiler_template_validation_test.go:175-189 — replaced the contrived double-underscore test case with two clearer cases: one testing a trailing-space variant and one verifying that conditions containing ${{ ... }} are not incorrectly flagged.
issue-arborist.md:238 — fixed create-discussion → create_discussion to match the safe-output tool identifier.
Imports scope — the reviewer's note that imported files aren't checked is accurate; that's a known limitation of the current compile-time validation scope (it validates the main markdown only). Extending it to transitive imports would require a larger refactor and is out of scope for this PR.

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>

Initial plan

1b8d2df

Copilot AI assigned Copilot and pelikhan May 3, 2026

Copilot started work on behalf of pelikhan May 3, 2026 21:42 View session

Copilot AI linked an issue May 3, 2026 that may be closed by this pull request

[ab-advisor] Experiment campaign for issue-arborist: A/B test prompt_style #30015

Closed

7 tasks

Add A/B experiment for issue-arborist prompt_style (#30015)

f30ca31

Agent-Logs-Url: https://github.com/github/gh-aw/sessions/8d84e107-4daa-480c-b595-fde7c844eddf Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>

Copilot AI changed the title ~~[WIP] Add A/B test for prompt style in issue arborist~~ Add A/B experiment for issue-arborist prompt_style (#30015) May 3, 2026

Copilot finished work on behalf of pelikhan May 3, 2026 21:57

Copilot AI requested a review from pelikhan May 3, 2026 21:57

pelikhan reviewed May 3, 2026

View reviewed changes

Copilot started work on behalf of pelikhan May 3, 2026 22:07 View session

Fix owner field, experiment condition syntax, and add compiler valida…

85d4daa

…tion for pre-expanded placeholders Agent-Logs-Url: https://github.com/github/gh-aw/sessions/86f11b7c-590c-4f98-b723-e57b6871ff4f Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>

Copilot finished work on behalf of pelikhan May 3, 2026 22:17

Copilot AI requested a review from pelikhan May 3, 2026 22:17

pelikhan marked this pull request as ready for review May 3, 2026 22:19

Copilot AI review requested due to automatic review settings May 3, 2026 22:19

Copilot started reviewing on behalf of pelikhan May 3, 2026 22:20 View session

github-actions Bot approved these changes May 3, 2026

View reviewed changes

Copilot AI reviewed May 3, 2026

View reviewed changes

Copilot started work on behalf of pelikhan May 3, 2026 22:28 View session

Address PR review feedback: use TemplateIfPattern, fix docstring, fix…

655f3c3

… create_discussion, update tests Agent-Logs-Url: https://github.com/github/gh-aw/sessions/1e53a068-92aa-4e70-a871-d2d02a3318a3 Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>

Copilot finished work on behalf of pelikhan May 3, 2026 22:37

pelikhan merged commit dc922ac into main May 3, 2026

pelikhan deleted the copilot/ab-advisor-experiment-prompt-style branch May 3, 2026 22:39

Copilot AI added a commit that referenced this pull request May 3, 2026

Add A/B experiment for issue-arborist prompt_style (#30015) (#30016)

1c78fba

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add A/B experiment for issue-arborist prompt_style (#30015)#30016

Add A/B experiment for issue-arborist prompt_style (#30015)#30016
pelikhan merged 4 commits into
mainfrom
copilot/ab-advisor-experiment-prompt-style

Copilot AI commented May 3, 2026 •

edited

Loading

Uh oh!

pelikhan May 3, 2026

Uh oh!

Copilot AI May 3, 2026

Uh oh!

pelikhan May 3, 2026

Uh oh!

Copilot AI May 3, 2026

Uh oh!

github-actions Bot commented May 3, 2026

Uh oh!

github-actions Bot commented May 3, 2026

Uh oh!

github-actions Bot left a comment

Uh oh!

Copilot AI left a comment

Uh oh!

pelikhan commented May 3, 2026

Uh oh!

Copilot AI commented May 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

Copilot AI commented May 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Frontmatter changes

Prompt changes

Compiler validation

Uh oh!

pelikhan May 3, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI May 3, 2026

Choose a reason for hiding this comment

Uh oh!

pelikhan May 3, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI May 3, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented May 3, 2026

⚠️ Design Decision Gate — Implementation Diverges from ADR

Verification Summary

Divergences Found

What to do next

Uh oh!

github-actions Bot commented May 3, 2026

🧪 Test Quality Sentinel Report

Test Quality Score: 100/100

Test Classification Details

Analysis

Language Support

Verdict

Uh oh!

github-actions Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Copilot's findings

Uh oh!

pelikhan commented May 3, 2026

Uh oh!

Copilot AI commented May 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Copilot AI commented May 3, 2026 •

edited

Loading