Skip to content

Add A/B experiment for issue-arborist prompt_style (#30015)#30016

Merged
pelikhan merged 4 commits into
mainfrom
copilot/ab-advisor-experiment-prompt-style
May 3, 2026
Merged

Add A/B experiment for issue-arborist prompt_style (#30015)#30016
pelikhan merged 4 commits into
mainfrom
copilot/ab-advisor-experiment-prompt-style

Conversation

Copy link
Copy Markdown
Contributor

Copilot AI commented May 3, 2026

Sets up a 50/50 A/B experiment on issue-arborist to determine whether the current verbose 5-step prompt (detailed) outperforms a concise 4-step variant by ≥15% on links_created per run.

Frontmatter changes

  • Added experiments.prompt_style block with concise/detailed variants, Mann-Whitney analysis, 30-sample minimum, empty_output_rate == 0 guardrail, and issue: 30015
  • Fixed guardrail_metrics.threshold to use expression syntax "==0" (schema requires operator-prefixed string, not bare float)

Prompt changes

Wrapped the existing prompt body in {{#if experiments.prompt_style == "detailed"}} / {{else}} / {{/if}}:

  • detailed — original prompt preserved verbatim (5 steps, jq examples, output template)
  • concise — new 4-bullet variant directing the model to self-direct via jq, create parent issues, link pairs, and post a create_discussion summary
experiments:
  prompt_style:
    variants: [concise, detailed]
    metric: links_created
    secondary_metrics: [run_duration_ms, discussion_created]
    guardrail_metrics:
      - name: empty_output_rate
        threshold: "==0"
    min_samples: 30
    weight: [50, 50]
    analysis_type: mann_whitney
    start_date: "2026-05-05"
    issue: 30015

Compiler validation

Added validateNoPreExpandedExperimentPlaceholders() in template_validation.go which raises a compile-time error if the internal __GH_AW_EXPERIMENTS_*__ placeholder form is used directly in a {{#if ...}} condition. Authors must use the experiments.<name> dot-notation form (e.g. experiments.prompt_style == "detailed").

The validation uses the existing TemplateIfPattern from expression_patterns.go, which correctly handles conditions containing embedded ${{ ... }} blocks without false negatives.

Lock file regenerated via gh aw compile issue-arborist.

Copilot AI changed the title [WIP] Add A/B test for prompt style in issue arborist Add A/B experiment for issue-arborist prompt_style (#30015) May 3, 2026
Copilot AI requested a review from pelikhan May 3, 2026 21:57
Comment thread .github/workflows/issue-arborist.md Outdated
- name: empty_output_rate
threshold: "==0"
min_samples: 30
owner: "@team-agents"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot remove this line

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 85d4daa — removed the owner line.

Comment thread .github/workflows/issue-arborist.md Outdated

---

{{#if __GH_AW_EXPERIMENTS__PROMPT_STYLE___detailed}}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot this conditions is pre expanded and should be experiments... instead. Update compiler to raise an error on this kind of syntactic errors.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 85d4daa — changed the condition to {{#if experiments.prompt_style == "detailed"}} (the experiments.<name> dot-notation form).

Also added validateNoPreExpandedExperimentPlaceholders() in template_validation.go which scans all {{#if ...}} condition expressions and raises an error if any __GH_AW_EXPERIMENTS_*__ placeholder is found, with a hint to use experiments.<name> instead. Wired into the compiler in compiler_orchestrator_frontmatter.go.

…tion for pre-expanded placeholders

Agent-Logs-Url: https://github.com/github/gh-aw/sessions/86f11b7c-590c-4f98-b723-e57b6871ff4f

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Copilot AI requested a review from pelikhan May 3, 2026 22:17
@pelikhan pelikhan marked this pull request as ready for review May 3, 2026 22:19
Copilot AI review requested due to automatic review settings May 3, 2026 22:19
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 3, 2026

⚠️ Design Decision Gate — Implementation Diverges from ADR

This PR implements an A/B experiment for issue-arborist prompt_style, which falls under the A/B experiment framework documented in the linked ADR.

ADR reviewed: ADR-29534: Frontmatter A/B Experiments with Balanced Variant Selection

Verification Summary

The PR is broadly aligned with the A/B experiment design — it uses the experiments frontmatter block, exposes GH_AW_EXPERIMENTS_PROMPT_STYLE env var, threads experiment outputs through the activation job, and adds a helpful compiler validation guard (validateNoPreExpandedExperimentPlaceholders). However, three normative requirements from Part 2 of the ADR are not met by this implementation.

Divergences Found

1. State persistence mechanism (ADR Requirement #7)

The ADR (Requirement #7) requires:

Variant counts MUST be persisted between workflow runs using the GitHub Actions cache.

This PR persists state to a dedicated git branch (experiments/issuearborist) via a new push_experiments_state job using push_experiment_state.cjs and load_experiment_state_from_repo.cjs. This is a different persistence mechanism than specified.

2. Artifact naming (ADR Requirement #9)

The ADR (Requirement #9) requires:

Implementations MUST upload the experiment state directory as an artifact named experiment.

This PR uploads and downloads the artifact as issuearborist-experiment — not experiment.

3. Activation job structure (ADR Requirement #15)

The ADR (Requirement #15) specifies the activation job MUST contain, in order: cache restore, variant selection, cache save (if: always()), and artifact upload. This PR omits the cache save step entirely and replaces it with a separate push_experiments_state job that runs after the agent job.

What to do next

Either:

  1. Update the code to align with ADR-29534 (use GitHub Actions cache for persistence, name artifact experiment, add cache save step in activation job), OR
  2. Update ADR-29534 to reflect the revised decision to use a git branch for state persistence and per-workflow artifact naming (and document why the approach changed from the original design)

The ADR and implementation must be in sync before this PR can merge.


📝 Context: What the ADR covers

ADR-29534 documents the first-class A/B experiment mechanism for gh-aw workflows, including frontmatter schema, variant selection algorithm, state persistence, expression/template integration, and activation job structure. The PR adds the first concrete use of this mechanism (prompt_style on issue-arborist) and also adds a compiler guard to prevent misuse of the internal placeholder form.

The compiler validation addition (validateNoPreExpandedExperimentPlaceholders) is not explicitly covered in ADR-29534 but is consistent with Requirement #14 (placeholders must not reach Handlebars before substitution) — this is a welcome defensive addition and does not constitute a divergence.

🔒 This PR cannot merge until the ADR and implementation are reconciled.

🏗️ ADR gate enforced by Design Decision Gate 🏗️ · ● 117.2K ·

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 3, 2026

🧪 Test Quality Sentinel Report

Test Quality Score: 100/100

Excellent test quality

Metric Value
New/modified tests analyzed 1 (table-driven, 3 scenarios)
✅ Design tests (behavioral contracts) 1 (100%)
⚠️ Implementation tests (low value) 0 (0%)
Tests with error/edge cases 1 (100%)
Duplicate test clusters 0
Test inflation detected No (ratio: 1.98:1)
🚨 Coding-guideline violations None

Test Classification Details

Test File Classification Issues Detected
TestCompilerRejectsPreExpandedExperimentPlaceholders pkg/workflow/compiler_template_validation_test.go:133 ✅ Design None

Analysis

TestCompilerRejectsPreExpandedExperimentPlaceholders is a well-structured table-driven test with 3 scenarios:

  1. Happy path — valid experiments.prompt_style dot-notation compiles without error
  2. Error case__GH_AW_EXPERIMENTS_PROMPT_STYLE__ pre-expanded placeholder is rejected with the correct error message
  3. Error case__GH_AW_EXPERIMENTS__PROMPT_STYLE___detailed variant form is rejected with the correct guidance message

Design invariant enforced: The compiler must reject workflow markdown files that contain pre-expanded internal experiment placeholder tokens (double-underscore form), directing authors to use the correct experiments.<name> syntax instead. This is a behavioral contract between the compiler and workflow authors.

Build tag: ✅ //go:build !integration present on line 1

No mock libraries used — real compiler instantiation via NewCompiler()

Error assertions include the test case name for context (t.Errorf("Expected error for %s", tt.name)), which is acceptable ✅

Inflation ratio: 85 test lines / 43 production lines = 1.98:1 — just under the 2:1 threshold ✅


Language Support

Tests analyzed:

  • 🐹 Go (*_test.go): 1 test function (3 table-driven scenarios) — unit (//go:build !integration)
  • 🟨 JavaScript (*.test.cjs, *.test.js): 0 tests

Verdict

Check passed. 0% of new tests are implementation tests (threshold: 30%). The test covers real behavioral contracts — both happy-path and two distinct error rejection paths — using real component interactions with no mocking.

📖 Understanding Test Classifications

Design Tests (High Value) verify what the system does:

  • Assert on observable outputs, return values, or state changes
  • Cover error paths and boundary conditions
  • Would catch a behavioral regression if deleted
  • Remain valid even after internal refactoring

Implementation Tests (Low Value) verify how the system does it:

  • Assert on internal function calls (mocking internals)
  • Only test the happy path with typical inputs
  • Break during legitimate refactoring even when behavior is correct
  • Give false assurance: they pass even when the system is wrong

Goal: Shift toward tests that describe the system's behavioral contract — the promises it makes to its users and collaborators.

References: §25292452480

🧪 Test quality analysis by Test Quality Sentinel · ● 548.6K ·

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Test Quality Sentinel: 100/100. Test quality is excellent — 0% of new tests are implementation tests (threshold: 30%). The single new table-driven test covers 3 scenarios including 2 error cases, enforces a real behavioral contract, uses no mocks, and has the required build tag.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a 50/50 A/B experiment to the issue-arborist workflow to compare a concise vs detailed prompt, and introduces compiler-time validation to prevent authors from using internal pre-expanded experiment placeholders in {{#if ...}} conditions.

Changes:

  • Adds experiments.prompt_style configuration to issue-arborist frontmatter and wraps the prompt body in a template conditional to select between detailed and concise variants.
  • Adds validateNoPreExpandedExperimentPlaceholders() and wires it into frontmatter parsing validation.
  • Regenerates the compiled lock workflow to include experiment selection/state steps and propagated GH_AW_EXPERIMENTS_* env vars.
Show a summary per file
File Description
pkg/workflow/template_validation.go Adds a new validation pass intended to reject internal __GH_AW_EXPERIMENTS_*__ placeholders in {{#if ...}} conditions.
pkg/workflow/compiler_template_validation_test.go Adds compiler tests asserting placeholder-based {{#if ...}} conditions are rejected.
pkg/workflow/compiler_orchestrator_frontmatter.go Invokes the new experiment-placeholder validation during frontmatter/markdown parsing.
.github/workflows/issue-arborist.md Declares the prompt_style experiment and adds concise-vs-detailed prompt branching.
.github/workflows/issue-arborist.lock.yml Regenerated compiled workflow including experiment state restore/pick/push and env propagation.

Copilot's findings

Tip

Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comments suppressed due to low confidence (1)

pkg/workflow/template_validation.go:57

  • preExpandedExperimentPattern only matches the __GH_AW_EXPERIMENTS_<NAME>__ form. It will not match other placeholder-like forms that start with __GH_AW_EXPERIMENTS (e.g. the __GH_AW_EXPERIMENTS__PROMPT_STYLE___detailed string used in the new test), so the validator won’t reject them. Consider broadening this pattern to catch any __GH_AW_EXPERIMENTS... placeholder token(s) that could appear in a condition (or explicitly remove/adjust the test case if that format is not actually produced).
	// preExpandedExperimentPattern matches the internal __GH_AW_EXPERIMENTS_*__ placeholder form
	// that is produced by the runtime and must never be written manually in workflow markdown.
	// Authors should use the experiments.<name> form (e.g. experiments.prompt_style == "detailed").
	preExpandedExperimentPattern = regexp.MustCompile(`__GH_AW_EXPERIMENTS_[A-Z0-9_]+__`)
  • Files reviewed: 5/5 changed files
  • Comments generated: 5

Comment thread pkg/workflow/template_validation.go Outdated
Comment on lines +50 to +57
// templateIfConditionPattern matches the condition expression inside {{#if ...}} tags.
// Captures the condition string between #if and the closing }}.
templateIfConditionPattern = regexp.MustCompile(`\{\{#if\s+(.*?)\s*\}\}`)

// preExpandedExperimentPattern matches the internal __GH_AW_EXPERIMENTS_*__ placeholder form
// that is produced by the runtime and must never be written manually in workflow markdown.
// Authors should use the experiments.<name> form (e.g. experiments.prompt_style == "detailed").
preExpandedExperimentPattern = regexp.MustCompile(`__GH_AW_EXPERIMENTS_[A-Z0-9_]+__`)
Comment thread pkg/workflow/template_validation.go Outdated
// # Validation Functions
//
// - validateNoIncludesInTemplateRegions() - Validates that imports are not inside template blocks
// - validateNoPreExpandedExperimentPlaceholders() - Validates that pre-expanded __GH_AW_EXPERIMENTS__*__ placeholders are not used in template conditions
Comment on lines +175 to +189
{
name: "invalid workflow with double-underscore variant form",
content: `---
on: issues
permissions:
issues: read
strict: false
---

{{#if __GH_AW_EXPERIMENTS__PROMPT_STYLE___detailed}}
detailed prompt
{{/if}}`,
shouldError: true,
errContains: "use experiments.<name> instead",
},
Comment on lines +137 to +141
// Validate that pre-expanded __GH_AW_EXPERIMENTS_*__ placeholders are not used in template conditions
if err := validateNoPreExpandedExperimentPlaceholders(result.Markdown); err != nil {
orchestratorFrontmatterLog.Printf("Pre-expanded experiment placeholder validation failed: %v", err)
return nil, fmt.Errorf("template condition validation failed: %w", err)
}
Comment thread .github/workflows/issue-arborist.md Outdated
1. Use `jq` to identify clusters of 5+ related issues that share a theme but lack a parent.
2. Create a parent issue (title prefix `[Parent] `) for each cluster and link its members as sub-issues.
3. Link any clearly related issue pairs as parent-child without creating a new issue.
4. Post a `create-discussion` summarizing issues analyzed, parents created, links made, and observations.
@pelikhan
Copy link
Copy Markdown
Collaborator

pelikhan commented May 3, 2026

@copilot Review all comments

… create_discussion, update tests

Agent-Logs-Url: https://github.com/github/gh-aw/sessions/1e53a068-92aa-4e70-a871-d2d02a3318a3

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Copy link
Copy Markdown
Contributor Author

Copilot AI commented May 3, 2026

@copilot Review all comments

Addressed all actionable items in 655f3c3:

  • template_validation.go:50-57 — replaced the custom templateIfConditionPattern with the existing TemplateIfPattern from expression_patterns.go, which correctly handles embedded ${{ ... }} blocks in conditions.
  • template_validation.go:11 — fixed docstring: __GH_AW_EXPERIMENTS__*____GH_AW_EXPERIMENTS_*__.
  • compiler_template_validation_test.go:175-189 — replaced the contrived double-underscore test case with two clearer cases: one testing a trailing-space variant and one verifying that conditions containing ${{ ... }} are not incorrectly flagged.
  • issue-arborist.md:238 — fixed create-discussioncreate_discussion to match the safe-output tool identifier.
  • Imports scope — the reviewer's note that imported files aren't checked is accurate; that's a known limitation of the current compile-time validation scope (it validates the main markdown only). Extending it to transitive imports would require a larger refactor and is out of scope for this PR.

@pelikhan pelikhan merged commit dc922ac into main May 3, 2026
@pelikhan pelikhan deleted the copilot/ab-advisor-experiment-prompt-style branch May 3, 2026 22:39
Copilot AI added a commit that referenced this pull request May 3, 2026
Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[ab-advisor] Experiment campaign for issue-arborist: A/B test prompt_style

3 participants