Skip to content

Combine #28758 + #28117 + #28147 — Agent Shin auto-close + review-gate label lifecycle#28759

Closed
mateo-berri wants to merge 36 commits into
litellm_internal_stagingfrom
combine-28758-28117-28147
Closed

Combine #28758 + #28117 + #28147 — Agent Shin auto-close + review-gate label lifecycle#28759
mateo-berri wants to merge 36 commits into
litellm_internal_stagingfrom
combine-28758-28117-28147

Conversation

@mateo-berri

@mateo-berri mateo-berri commented May 25, 2026

Copy link
Copy Markdown
Collaborator

Summary

Combines three open PRs into a single branch targeting litellm_internal_staging:

How it was assembled

  1. New branch off litellm_internal_staging (the base of feat(triage): Agent Shin — LLM-judge + Greptile auto-close (any age, any draft state) + @agent-shin reconsider flow #28117).
  2. Merged litellm_fix-agent-shin-reconsider-safety-1af4 (fix(triage): bugbot on #28117 — reconsider safety + 1-day grace + @greptileai post-close + SwiftWinds dogfood #28147) — this brings feat(triage): Agent Shin — LLM-judge + Greptile auto-close (any age, any draft state) + @agent-shin reconsider flow #28117 along since fix(triage): bugbot on #28117 — reconsider safety + 1-day grace + @greptileai post-close + SwiftWinds dogfood #28147 is stacked on it.
  3. Merged claude/agent-shin-pr-demo-UhrOx (Agent Shin: "ready for review" label lifecycle + end-to-end demo #28758). Eight files conflicted: Agent Shin: "ready for review" label lifecycle + end-to-end demo #28758 was built on an older snapshot of feat(triage): Agent Shin — LLM-judge + Greptile auto-close (any age, any draft state) + @agent-shin reconsider flow #28117 (its 0ad6836e31 commit literally vendored an old version), so the constants/imports in triage_with_llm.py and close_low_quality_prs.py diverged. Resolved by taking fix(triage): bugbot on #28117 — reconsider safety + 1-day grace + @greptileai post-close + SwiftWinds dogfood #28147's (newer, more complete) versions of the duplicated files and keeping Agent Shin: "ready for review" label lifecycle + end-to-end demo #28758's unique additions (review_gate.yml, test_github_review_gate.py, the proxy/UI changes, issue templates).
  4. Manually ported the actual review-gate feature commit (b4d970371e) onto fix(triage): bugbot on #28117 — reconsider safety + 1-day grace + @greptileai post-close + SwiftWinds dogfood #28147's structure since a cherry-pick conflicted: added the review_gate() state machine, the --review-gate / --grace-days / --min-greptile-score CLI flags, new constants (READY_FOR_REVIEW_LABEL, READY/REGRESSED/WITHIN_GRACE markers, AGENT_SHIN_AUTO_CLOSE_MARKER, GREPTILE_BOT_LOGINS, SCORE_PATTERN), and new helpers (add_label, remove_label, extract_greptile_score, parse_iso8601, format_*_comment).

Test plan

  • tests/test_litellm/test_github_review_gate.py — 18 tests, all pass
  • tests/test_litellm/test_github_triage_with_llm.py — passes
  • tests/test_litellm/test_github_close_low_quality_prs.py — passes
  • tests/test_litellm/test_github_triage_workflows.py — passes
  • python3 .github/scripts/triage_with_llm.py --help shows --review-gate, --grace-days, --min-greptile-score
  • uv run black --check clean on edited files

162 GitHub-triage tests pass total.

Reviewer notes

  • The three source PRs should stay open until this combined PR merges; closing them prematurely would lose review history.
  • .github/workflows/review_gate.yml runs in dry-run mode unless the repo variable AGENT_SHIN_ENABLED is set to true.

🤖 Generated with Claude Code


Note

Medium Risk
Automated closing, labeling, and commenting on external PRs/issues can misfire or upset contributors; mitigated by dry-run defaults, grace periods, internal exemptions, and explicit enablement.

Overview
Introduces Agent Shin, an OSS triage stack under .github/scripts/ and matching workflows, gated by repo variable AGENT_SHIN_ENABLED (dry-run until set to true).

Contributor-facing templates now spell out auto-triage rules: bug/feature issue forms require repro, expected vs actual, and proof; the PR template adds problem/expected-actual/QA sections and documents Greptile ≥4/5 plus @agent-shin reconsider.

Core automation: triage_with_llm.py LLM-judges external PRs/issues (linked issue or rubric + QA proof), 24h grace before close, @agent-shin reconsider with bot-only reopen guards and rate limits. close_low_quality_prs.py sweeps open external PRs below Greptile 4/5 (drafts included). review_gate() adds a ready for review label lifecycle (LLM + Greptile, regression handling, auto-close after grace). triage_rollout_heads_up.py posts a one-shot 7-day rollout warning. Shared helpers live in agent_shin_shared.py; _agent_shin_actions.py centralizes dry-run vs real GitHub writes.

Workflows: PR/issue triage, reconsider on comment, review gate (PR events + daily sweep), daily low-quality closer, rollout heads-up on staging merge. Tests cover review gate, Greptile closer, and related triage logic under tests/test_litellm/.

Reviewed by Cursor Bugbot for commit dd02b76. Bugbot is set up for automated code reviews on this repo. Configure here.

cursoragent and others added 22 commits May 17, 2026 09:24
Adds .github/scripts/close_low_quality_prs.py and a daily workflow that
closes PRs which:
  - are open for at least 7 days, and
  - carry a most-recent greptile-apps review with Confidence Score <4/5,
  - and are not drafts or opt-out-labeled ('do not close', 'wip', etc.).

Each closure posts an explanatory comment telling the contributor how to
bring the PR back (rebase, re-request greptile, reopen at 4+/5). The
4/5 bar is already documented in the PR template
(.github/pull_request_template.md), so this just enforces it.

Tested with a dry run against the live BerriAI/litellm backlog of 1000
open PRs: 100 candidates identified, 598 PRs pass the bar (4+/5), 186
are too young, 97 are drafts, 19 lack any Greptile review and are left
alone.

Workflow defaults to closing 25 PRs/run as a safety net and supports
workflow_dispatch with overrides (close=false for a dry run, custom
min_age_days/min_score/limit).

18 unit tests cover score extraction (HTML/markdown/plain text, login
variants, multi-review picks latest) and per-PR evaluation (drafts,
opt-out labels, age, missing/passing/failing scores).

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
…ributions

PR template:
- Make the rubric explicit at the top: link an issue, OR provide a clear
  problem description + expected vs. actual + visual QA proof.
- Add dedicated sections for each piece so the bot has a deterministic
  shape to read.
- Keep the existing 'Linear ticket' section for internal contributors
  (they're exempt from the auto-triage rubric).

Bug report template:
- Split 'What happened?' into 'Actual behavior' + 'Expected behavior'.
- Make logs/screenshot a required textarea.
- Warning banner at the top tells external contributors that incomplete
  reports will be auto-closed (with re-evaluation on reopen).

Feature request template:
- Require a concrete use case + example in the motivation field, not just
  a one-liner pitch.
- Same auto-triage warning banner.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
Adds a new triage flow that evaluates external pull requests and issues
against the project's contribution rubric and, when configured to do so,
auto-closes non-conforming ones with an explanatory comment. Contributors
can update + reopen to be re-evaluated.

Scope:
- Internal BerriAI contributors (author_association OWNER/MEMBER/COLLABORATOR)
  and bot accounts are skipped entirely.
- 'Fixes #1234' / 'Resolves https://github.com/.../issues/N' in the PR body
  short-circuits to PASS without burning LLM tokens.
- LLM judge returns structured JSON (verdict, missing[], explanation);
  parser tolerates markdown fences and embedded JSON.
- LLM errors NEVER close PRs/issues — failure surfaces as 'skip-llm-error'.

Safety:
- pull_request_target / issues triggers are FORCED dry-run in the workflow;
  only manual workflow_dispatch with close=true (and AGENT_SHIN_ENABLED=true)
  takes destructive action.
- Default mode writes verdicts to GITHUB_STEP_SUMMARY only — no public
  comments until the team flips the AGENT_SHIN_ENABLED repo variable.
- LLM uses an OpenAI-compatible endpoint (model and base URL configurable
  via repo variables; key via OPENAI_API_KEY secret).

Files:
- .github/scripts/triage_with_llm.py   - judge orchestrator + CLI
- .github/workflows/triage_pr_with_llm.yml
- .github/workflows/triage_issue_with_llm.yml
- tests/test_litellm/test_github_triage_with_llm.py - 33 unit tests

End-to-end validated against four real PRs (#28117 internal collaborator,
#28108 bot, #28129 'Fixes #28128', #28116 no linked issue) and issue
#28132 with a stubbed LLM judge: each path produces the expected action.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
…ry-run by default

- close_low_quality_prs.py now filters by GitHub author_association via
  the REST API: PRs from OWNER / MEMBER / COLLABORATOR (and bot accounts)
  are skipped with a new 'skip-internal' summary bucket.
- close_low_quality_prs.yml now defaults workflow_dispatch close=false,
  and ignores 'close=true' unless the new repo variable
  AGENT_SHIN_ENABLED is set to 'true'. Scheduled runs are dry-run only
  until the team flips that switch.
- Updated unit tests: one new test asserting internal authors are
  skipped, and an autouse fixture treats unspecified test PRs as
  external so the rest of the suite still exercises the close path.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
…rpolation

- close_low_quality_prs.yml: only workflow_dispatch with close=true (and
  AGENT_SHIN_ENABLED=true) actually closes PRs. Scheduled runs are always
  dry-run, matching the safety invariant documented for triage_pr/issue.
- triage_with_llm.py: textwrap.dedent on an f-string with multi-line
  interpolated bodies fails because the body's 2nd+ lines start at column 0,
  making the common-indent zero. Dedent the static template first, then
  .format() the title/body in.

Co-authored-by: Yassin Kortam <yassin@berri.ai>
- close_low_quality_prs.py: Treat author_association API lookup failures
  as internal (fail-safe) so transient errors don't cause internal
  contributors' PRs to be auto-closed.
- triage_with_llm.py: Update summary heading from 'Would post comment:'
  to 'Posted comment:' since this branch only runs after the comment
  has already been posted.

Co-authored-by: Yassin Kortam <yassin@berri.ai>
…t=none

- Bump DEFAULT_MODEL from gpt-4o-mini to gpt-5.4-mini (more modern;
  4M total context window per OpenAI catalog, JSON-schema response
  format, function calling all supported).
- For gpt-5.x family models, pass reasoning_effort="none" via
  extra_body. gpt-5.x rejects temperature != 1 unless reasoning_effort
  is explicitly "none"; setting it lets us keep temperature=0 for
  deterministic JSON rubric judgments. extra_body works across openai
  SDK versions regardless of whether they natively type the kwarg.
- For non-gpt5 overrides (TRIAGE_MODEL=gpt-4o-mini etc.), reasoning_effort
  is not sent.
- 4 new unit tests cover: gpt-5.4-mini -> reasoning_effort=none,
  capitalized/dated gpt-5 variants -> reasoning_effort=none,
  gpt-4o-mini -> no extra_body, base_url passthrough.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
…-with-default

- Removed the unused gh_json helper (bugbot low-severity dead code).
- Replaced argparse `action="append", default=[...]` with default=None
  + DEFAULT_OPTOUT_LABELS fallback. The mutable-default + append combo
  silently APPENDS to the canonical defaults instead of replacing them,
  so --optout-label could not actually scope the opt-out list.
- Added tests covering both the canonical default and the
  flag-replaces-defaults behavior.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
…sociation, fix empty TRIAGE_MODEL

Three independent bugbot findings against triage_with_llm.py:

1. LINKED_ISSUE_PATTERN included weak keywords (`see`, `ref`,
   `addresses`) so casual mentions like "See #1234 for context" were
   short-circuited to pass-linked-issue without ever calling the LLM —
   contradicting the prompt's own "a bare issue number without a closing
   keyword counts only if it's clearly the related issue (not a passing
   mention)" rubric. Limit the regex to GitHub's documented PR-closing
   keywords (fixes/fix/fixed/closes/close/closed/resolves/resolve/resolved).

2. is_internal_contributor() treated an empty/missing author_association
   as external (eligible for the destructive close path), while the sibling
   is_external_pr_author() in close_low_quality_prs.py fail-safes the same
   case as internal. Align the two so a partial/unknown GitHub response can
   never make a PR eligible for auto-close.

3. argparse `default=os.environ.get("TRIAGE_MODEL", DEFAULT_MODEL)` returns
   the empty string when GitHub Actions exposes an unset repo variable as
   an empty-string env var (the optional vars.TRIAGE_MODEL case in the
   workflow). Use `os.environ.get(...) or DEFAULT_MODEL` so empty -> default,
   matching the existing OPENAI_BASE_URL pattern.

Tests:
- Casual mentions now must fall through to the LLM (parametrized);
  added an orchestration test ensuring "See #1234" reaches the judge.
- Empty/missing author_association now fails safe (parametrized).
- Empty TRIAGE_MODEL env var falls back to DEFAULT_MODEL; explicit
  TRIAGE_MODEL is still honored.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
…false'

The PR and issue Agent Shin workflows gated the destructive --close
flag with [ "${DISPATCH_CLOSE:-false}" != "false" ]. That pattern
treats anything other than the literal string "false" as enabling
closure — "True", "yes", "1", typos, accidental whitespace, etc.
The workflow_dispatch input UI is a 'true'/'false' choice dropdown so
the form is constrained, but the API (`gh workflow run -f close=...`)
accepts any string, and a CI cron / external invoker passing a
non-canonical truthy value would have silently enabled real
contributor PR closures.

Mirror the sibling Greptile closer's [ "${CLOSE_FLAG}" = "true" ]
pattern: only the EXACT string "true" enables --close; every other
value (including the unset/empty default) resolves to dry-run. This is
the fail-safe philosophy applied everywhere else in this PR.

Added tests/test_litellm/test_github_triage_workflows.py with two
parametrized invariants:
  1. The destructive gate uses '= "true"' for its env-var
     comparison (either bare '${ENV}' or '${ENV:-false}' form
     accepted), and never the fail-open '!= "false"' pattern.
  2. Every destructive gate is also gated on AGENT_SHIN_ENABLED being
     "true" — either by entering the close branch on '=' or by
     bailing out early on '!=' — so flipping the repo variable off is
     a true kill switch regardless of per-run inputs.

Manually verified the test fails on the buggy '!= "false"' pattern and
passes on the fix, so it would have caught the regression at PR time.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
…econsider flow

Follow-up to PR #28117. Three behavior changes + one new workflow,
addressing the team's concerns on the original review:

1) Apply auto-close to ALL open PRs, not just those over a week old.

   - close_low_quality_prs.py: --min-age-days default flipped from 7 to
     0. The flag is preserved as an opt-in safety net for one-off
     backfill runs that want to spare very-young PRs, but the daily
     scheduled sweep now closes external-author PRs as soon as Greptile
     scores them <4/5.
   - close_low_quality_prs.yml: workflow_dispatch input default also
     flipped to 0; doc comments updated.

2) Apply auto-close to draft PRs too.

   - close_low_quality_prs.py: removed the skip-draft branch in
     evaluate_pr. Drafts are NOT a free pass — the team's intent is
     'open PR count == PRs internal collaborators need to action on',
     so a draft Greptile scored 2/5 still belongs in the closed bucket.
     Authors who genuinely need a long-lived draft can attach the 'wip'
     opt-out label, which is unchanged.
   - The 'skip-draft' action is gone; the 'wip' label still skips.

3) Address the 'OSS contributors cannot reopen a bot-closed PR' wrinkle.

   GitHub does NOT let an external (non-write-access) contributor
   reopen a PR that was closed by a bot or maintainer (long-standing
   limitation). The original PR's close-comments told contributors to
   'Reopen the PR — I'll re-evaluate automatically', which is broken
   for the very audience this triage targets. Two changes:

   a) Reword every close-comment (Greptile sweep + Agent Shin PR
      close + Agent Shin issue close + PR template) to recommend:
        - Open a new PR with the updated branch (primary path).
        - Or comment '@agent-shin reconsider' on the closed PR for a
          re-evaluation that, on pass, reopens the PR via the bot's
          GH_TOKEN write access.

   b) Add the @agent-shin reconsider workflow:
        - .github/workflows/triage_reconsider.yml: new
          'issue_comment'-triggered workflow. Authorizes only the
          PR/issue author or an internal collaborator
          (OWNER/MEMBER/COLLABORATOR), gated via a step output so
          unauthorized commenters never reach the destructive steps.
          Globally gated on AGENT_SHIN_ENABLED='true' (positive form,
          matching the test_github_triage_workflows guardrail
          patterns).
        - triage_with_llm.py: --reconsider mode. On a closed PR/issue,
          re-runs the LLM judge (or linked-issue regex short-circuit)
          and:
            - on pass: reopens via reopen_pr/reopen_issue + posts a
              'Re-evaluated and reopened' comment.
            - on fail: leaves closed and posts a 'still missing X'
              comment so the contributor can iterate again.
          Reconsider-on-open is a no-op ('skip-not-closed').
          Internal-author + bot-account skips still take priority over
          reconsider.

4) Greptile-on-closed-PRs question: the team asked whether Greptile can
   re-review a closed PR. Greptile's docs don't address this and we
   shouldn't promise behavior we can't verify, so the new close-comment
   wording does NOT instruct contributors to 're-request greptile on
   the closed PR'. Instead it points them at the new-PR path (which
   Greptile definitely reviews) or the @agent-shin reconsider trigger
   (which re-runs the LiteLLM-side rubric judge, not Greptile).

Tests: 93 passing (was 59).

  - test_github_close_low_quality_prs.py: replaced 'skip drafts' test
    with 'closes drafts when score is low' + 'closes brand-new PR when
    min_age=0' + 'no skip when min_age=0'. The 'skip too young'
    assertion is preserved as opt-in.
  - test_github_triage_with_llm.py: 6 new TestTriageOrchestration cases
    for reconsider mode (skip-not-closed on open, reopen on pass,
    still-failing comment on fail, linked-issue short-circuit reopen,
    skip internal author in reconsider, reopen-issue on pass) + a new
    TestCloseCommentText class that pins the user-facing 'open a new
    PR' + '@agent-shin reconsider' wording.
  - test_github_triage_workflows.py: added triage_reconsider.yml to
    the destructive-gate guardrail table; AGENT_SHIN_ENABLED is its
    own destructive gate (no separate per-run flag needed).

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
Adds regression tests covering the bugbot high-severity finding that
str.format() would crash on user-supplied content containing { or }.
Empirically str.format() does NOT re-parse interpolated values — only
the template literal is scanned for replacement fields — so the bug
does not exist in the current code, but pinning the safe behavior
prevents a future templating change from silently reintroducing it.

Also pins the dedented prompt shape (no leading 8-space indentation on
template lines) so a future change to the build_*_prompt functions can't
silently regress the LLM judge prompt format on multi-line bodies.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
Address three Greptile/veria-ai concerns on the @agent-shin reconsider
flow:

1. **Reconsider had no dry-run path.** The previous reconsider mode
   ignored `--close` and always posted comments + reopened on a pass.
   A local operator running
   `python triage_with_llm.py --reconsider --pr N` would silently
   take destructive GitHub actions with no way to preview. Reconsider
   now honors `close=False` the same way regular triage does and
   returns `would-reopen` / `would-reconsider-still-failing` for
   step-summary rendering.

2. **Reconsider could reopen maintainer-closed PRs/issues** (Medium
   security finding from veria-ai). The workflow only checked that the
   commenter was authorized — it did NOT check that the most recent
   close was performed by Agent Shin. A contributor could comment
   `@agent-shin reconsider` on a PR a maintainer closed for non-rubric
   reasons (duplicate, security report, design rejection) and have the
   bot reopen it. Add `was_closed_by_agent_shin()` which inspects the
   issue events API for the most recent `closed` actor and only
   permits reopen when that actor matches the configured bot login
   (default `github-actions[bot]`, overridable via env). Fail-closed
   on missing events.

3. **No rate-limiting on the reconsider trigger.** Every
   `@agent-shin reconsider` comment burns CI minutes + an OpenAI API
   call. Add a 10-minute cooldown via
   `seconds_since_last_reconsider_verdict()` which greps the issue's
   comment list for the bot's own verdict marker
   (`<!-- agent-shin:reconsider-verdict -->`). Inside the window the
   triage returns `skip-rate-limited` and the LLM never runs.

Workflow update:
- `triage_reconsider.yml` now passes `--close` only when
  `AGENT_SHIN_ENABLED=true`, matching the pattern of
  `triage_pr_with_llm.yml`. The script runs in both states so the
  verdict still appears in the step summary for QA.

Tests:
- Add 5 reconsider safety tests: dry-run for pass / fail / linked-issue
  short-circuit, bot-closed-guard refusal on maintainer close,
  rate-limit refusal inside the cooldown window, and cooldown-elapsed
  acceptance.
- Add unit tests for `was_closed_by_agent_shin` (bot / maintainer /
  missing actor / env-override) and
  `seconds_since_last_reconsider_verdict` (no marker / multiple
  markers / non-bot comment with marker / bot comment without marker).
- Pin the `<!-- agent-shin:reconsider-verdict -->` marker in both
  reopen and still-failing comments — dropping it would silently
  break the cooldown.

Existing reconsider tests updated to pass `close=True` (the
production path now) + stub the new guards via
`_stub_reconsider_guards`. 112 tests pass (was 93).

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
…close bypass

- Add a 24-hour grace window between the first low-quality detection
  and the actual auto-close. The first detection posts a warning
  comment that explicitly says "You have 1 day to address this before
  this PR is auto-closed" and points the contributor at:
    * `@agent-shin reconsider` to request another look (and re-open)
    * `@greptileai` to request a fresh Greptile review — works
      even after the PR is closed
- Both `triage_with_llm.py` (LLM judge) and `close_low_quality_prs.py`
  (Greptile-score closer) share the same `<!-- agent-shin:grace-warning -->`
  HTML marker so a warning posted by either path is recognized by both.
- Add IMMEDIATE_CLOSE_LOGINS = {swiftwinds} to bypass BOTH the grace
  period AND the dry-run / AGENT_SHIN_ENABLED gating. SwiftWinds is the
  user's personal account (no push permissions to litellm) used to
  dogfood the bot; user explicitly asked: "For SwiftWinds, just close
  immediately. Faster iteration that way."
- Update the standard close comments to mention that `@greptileai`
  works even after the PR is closed.
- Add 23 new tests covering: warn-grace on first detection, skip during
  grace window, close after grace expires, SwiftWinds bypass (case
  insensitive, with close=False, no random-login false positives), the
  grace-warning text invariants, and the SwiftWinds entry in the
  IMMEDIATE_CLOSE_LOGINS constant.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
For PRs from IMMEDIATE_CLOSE_LOGINS (e.g. swiftwinds), evaluate_pr
returns 'close' immediately without ever posting a grace warning, so
the close comment should not reference a 1-day grace period.

Make close_pr take a grace_period_elapsed flag, default True, and
pass False from the main loop when the close path was the
immediate-close branch.

Co-authored-by: Yassin Kortam <yassin@berri.ai>
IMMEDIATE_CLOSE_LOGINS PRs are closed even when the global --close flag is
not set, but the summary used the global dry-run flag to choose between
'would close' and 'closed'. Split the count so operators can see both
actual closures and dry-run would-be closures.

Co-authored-by: Yassin Kortam <yassin@berri.ai>
Brings the Agent Shin OSS-triage scripts, workflows, issue/PR templates, and
tests from PR #28117 onto this branch so the new review-gate feature and its
end-to-end demo are self-contained and runnable in CI.

https://claude.ai/code/session_01XyyWa8t2VYmoGd6mKMEqkZ
Adds review_gate(), a state machine that keeps a `ready for review` label in
sync with whether an external PR clears BOTH gates — the LLM rubric and
Greptile's most recent confidence score:

- pass (untagged)            -> add label + "ready for review" / "all clear" comment
- pass (already tagged)      -> no-op (idempotent across re-runs)
- regress (Greptile < 4/5 or QA proof removed) -> remove label + "what's missing"
  comment, PR stays open
- recover after a regression -> "all clear again" comment + re-add the label
- fail & untagged, < 24h old -> one-time "what's missing" notice (grace window)
- fail & untagged, > 24h old -> close + comment (reopen via @agent-shin reconsider)

The label itself is the persisted state, so comments fire only on transitions
(never on every scheduled run). All side effects are gated behind --close, so
the dry-run contract matches the existing triage flow. Lifecycle comments use
hidden HTML markers and deliberately avoid the auto-close marker so they never
trip the reconsider provenance check.

Relocates the shared Greptile helpers (extract_greptile_score, SCORE_PATTERN,
GREPTILE_BOT_LOGINS, parse_iso8601) into triage_with_llm.py so the daily sweep
and the review gate read the score through one implementation, and adds the
review_gate.yml workflow (dry-run unless AGENT_SHIN_ENABLED=true) plus 18 unit
tests covering every branch and a full pass->regress->recover cycle.

https://claude.ai/code/session_01XyyWa8t2VYmoGd6mKMEqkZ
…eview_gate feature applied in next commit
Adds the "ready for review" label lifecycle (originally PR #28758) on top
of #28147's refactored triage_with_llm.py. The original commit was
authored against an older snapshot of #28117 and could not be applied
cleanly, so the additions were re-applied surgically:

- New constants: READY_FOR_REVIEW_LABEL, DEFAULT_GRACE_DAYS,
  DEFAULT_MIN_GREPTILE_SCORE, READY/REGRESSED/WITHIN_GRACE markers,
  GREPTILE_BOT_LOGINS, SCORE_PATTERN, AGENT_SHIN_AUTO_CLOSE_MARKER.
- New helpers: add_label, remove_label, extract_greptile_score,
  parse_iso8601 (the latter two mirrored from close_low_quality_prs.py
  so the daily sweep and the review gate read the score through the
  same logic).
- New comment formatters: format_ready_for_review_comment,
  format_all_clear_comment, format_regression_comment,
  format_within_grace_comment.
- New entry point: review_gate() implementing the pass/regress/recover
  state machine, with the label itself acting as persisted state so
  transition comments fire only on actual transitions.
- main() learns --review-gate, --grace-days, --min-greptile-score and
  dispatches to review_gate() when the flag is set.

Verified via tests/test_litellm/test_github_review_gate.py (18 tests)
and the existing triage suites (144 more) — all 162 pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@mateo-berri mateo-berri requested a review from a team May 25, 2026 01:01
@CLAassistant

CLAassistant commented May 25, 2026

Copy link
Copy Markdown

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 4 committers have signed the CLA.

✅ mateo-berri
❌ claude
❌ cursoragent
❌ Mateo


Mateo seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

@codecov

codecov Bot commented May 25, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@greptile-apps

greptile-apps Bot commented May 25, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR combines three branches into a single Agent Shin automation: an LLM-as-judge triage bot that auto-closes low-quality external PRs, a @agent-shin reconsider comment-trigger reopen flow, and a ready for review label state machine (review_gate) that tags passing PRs and removes the tag on regression.

  • triage_with_llm.py is the core orchestration script (~1 500 lines) implementing triage(), review_gate(), reconsider logic, grace periods, and all GitHub write helpers; most new workflows invoke it with different flag combinations.
  • close_low_quality_prs.py is a separate daily cron script that sweeps all open PRs and closes those where Greptile scored below the threshold, with its own grace-warning flow.
  • Five new workflows wire up the triggers (pull_request_target, issue_comment, daily cron) with explicit AGENT_SHIN_ENABLED dry-run gating; 162 mock-only tests are added in tests/test_litellm/.

Confidence Score: 3/5

The workflows contain two issues that can produce real GitHub write operations (PR closures, comments) before the team intends to enable the bot.

In triage_pr_with_llm.yml, OPENAI_API_KEY is always injected into the step regardless of AGENT_SHIN_ENABLED, causing live LLM calls for every inbound external PR the moment the secret is configured. More critically, the IMMEDIATE_CLOSE_LOGINS bypass in triage() ignores the --close flag entirely — the workflow strips --close for pull_request_target events and prints 'No PRs will be closed', but PRs from swiftwinds are still commented on and closed even when the bot is nominally disabled. The review_gate.yml handles the key guard correctly and is otherwise sound. The test suite is thorough and all mock-based.

.github/workflows/triage_pr_with_llm.yml (unconditional API key) and .github/scripts/triage_with_llm.py lines 1383–1406 (IMMEDIATE_CLOSE_LOGINS bypass ignores --close).

Important Files Changed

Filename Overview
.github/scripts/triage_with_llm.py New 1545-line Agent Shin orchestration script: LLM-judge triage, reconsider flow, and review_gate() state machine. Contains the IMMEDIATE_CLOSE_LOGINS bypass that ignores --close for swiftwinds, a dead AGENT_SHIN_AUTO_CLOSE_MARKER constant, unfiltered _has_marker() author checks, and a grace-period approach in review_gate() that differs from triage() in ways that close PRs without prior warning on first deployment.
.github/workflows/triage_pr_with_llm.yml New PR triage workflow using pull_request_target. Exposes OPENAI_API_KEY unconditionally (unlike review_gate.yml which guards the key behind AGENT_SHIN_ENABLED), causing LLM calls even when the bot is supposed to be in dry-run mode.
.github/workflows/review_gate.yml New review gate workflow (pull_request_target + daily cron). Correctly guards OPENAI_API_KEY behind AGENT_SHIN_ENABLED. DO_CLOSE logic is sound. No significant issues.
.github/scripts/close_low_quality_prs.py New 645-line daily cron closer. Uses time-since-warning approach for grace period (always warns before closing). IMMEDIATE_CLOSE_LOGINS bypass is in evaluate_pr() but does not bypass dry_run flag at the caller level, so behavior is consistent.
.github/workflows/triage_reconsider.yml New reconsider workflow triggered by issue_comment. Authorization check (commenter is author or internal collaborator) is correctly gated before any destructive steps.
tests/test_litellm/test_github_review_gate.py 18 new unit tests for review_gate() state machine. All use mocks; no real network calls. Coverage is thorough for labeled/unlabeled/within-grace/past-grace paths.
tests/test_litellm/test_github_triage_with_llm.py Large test file for triage(), reconsider, grace period, and IMMEDIATE_CLOSE_LOGINS behavior. All mock-based. Explicitly validates swiftwinds bypass as intentional design.
tests/test_litellm/test_github_close_low_quality_prs.py New mock-only tests for the daily closer. No real network calls. Coverage looks solid for grace period, immediate close, and internal author exemption paths.

Reviews (1): Last reviewed commit: "Port review-gate feature from #28758 ont..." | Re-trigger Greptile

Comment on lines +1383 to +1388
# accounts (`is_immediate`) ignore `--close` so a workflow that
# forces dry-run for the global population can still take real
# action on those test accounts.
if not close and not is_immediate:
return {**base_result, "action": "would-close", "verdict": verdict}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 IMMEDIATE_CLOSE_LOGINS bypass fires even when AGENT_SHIN_ENABLED is not set

triage() closes and comments on swiftwinds PRs regardless of whether --close was passed (if not close and not is_immediate: falls through when is_immediate=True). The triage_pr_with_llm.yml workflow always strips --close for pull_request_target events and prints ::notice:: No comments will be posted; no PRs will be closed — but that guarantee is violated for any login in IMMEDIATE_CLOSE_LOGINS. An operator who has not yet set AGENT_SHIN_ENABLED=true and believes the bot is fully inert will still see live PR closures and comment posts for those accounts.

Comment on lines +63 to +64
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
OPENAI_BASE_URL: ${{ vars.OPENAI_BASE_URL }}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 OPENAI_API_KEY exposed unconditionally, burning API budget in dry-run mode

OPENAI_API_KEY is always injected into the step, so the LLM judge is called for every external PR that lacks a linked-issue short-circuit, even when AGENT_SHIN_ENABLED is not set. The sister workflow review_gate.yml guards this correctly with (vars.AGENT_SHIN_ENABLED == 'true' || github.event_name == 'workflow_dispatch') && secrets.OPENAI_API_KEY || '', passing an empty string when the bot is disabled. Without a similar guard here, a high-volume repo in dry-run mode will silently consume LLM quota on every opened/reopened PR.

Suggested change
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
OPENAI_BASE_URL: ${{ vars.OPENAI_BASE_URL }}
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
OPENAI_API_KEY: ${{ (vars.AGENT_SHIN_ENABLED == 'true' || github.event_name == 'workflow_dispatch') && secrets.OPENAI_API_KEY || '' }}

Comment on lines +119 to +127

# Marker phrase Agent Shin always includes in its auto-close comments
# (see `format_pr_close_comment` / `format_issue_close_comment`). The
# review-gate close path stamps the same marker so reconsider can recognize
# the closure as a bot close. Keep the marker in sync with the literal text
# in those formatter functions.
AGENT_SHIN_AUTO_CLOSE_MARKER = "I'm **Agent Shin**"

# Model families that require `reasoning_effort` to be set, and that reject

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 AGENT_SHIN_AUTO_CLOSE_MARKER is defined but never used in code logic

The constant AGENT_SHIN_AUTO_CLOSE_MARKER = "I'm **Agent Shin**" is defined with a comment saying "reconsider can recognize the closure as a bot close". However, was_closed_by_agent_shin() uses fetch_last_close_actor() (GitHub close-event actor) to detect bot closures — it never inspects comment text for this marker. The format functions also hardcode the string literal instead of referencing the constant. This creates a false sense that keeping the string in sync matters for correctness, and the constant will diverge silently from the formatter functions over time.

Comment thread .github/scripts/triage_with_llm.py Outdated
Comment on lines +890 to +891
def _has_marker(comments: Iterable[dict], marker: str) -> bool:
return any(marker in (comment.get("body") or "") for comment in comments)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 _has_marker does not filter by comment author — user-injected HTML can suppress bot notifications

_has_marker(comments, marker) iterates all comments regardless of who wrote them. A contributor can paste <!-- agent-shin:within-grace --> or <!-- agent-shin:regressed --> in their own PR description or a comment, causing review_gate() to either skip the within-grace notification entirely (so the user misses the courtesy warning before closure) or always post "all clear again" instead of "ready for review" on the first pass. Both effects are benign in practice, but deduplicating on bot-authored comments would make the state machine robust to accidental or deliberate injection.

Comment on lines +1124 to +1131
# Not passing and not tagged: close if past the grace window, else notify once.
if age_days is not None and age_days >= grace_days:
comment = format_pr_close_comment({**verdict, "missing": missing})
if not close:
return {**base_result, "action": "would-close", "comment": comment}
post_comment(repo, number, comment)
close_pr(repo, number)
return {**base_result, "action": "closed", "comment": comment}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 review_gate() closes PRs older than grace_days without any prior warning on first encounter

Unlike triage(), which always posts a GRACE_COMMENT_MARKER warning on the first low-quality detection and only closes on a subsequent run after GRACE_PERIOD_SECONDS has elapsed, review_gate() closes any PR whose creation age exceeds grace_days (1 day) immediately — regardless of whether a warning was ever posted. On initial workflow deployment, the daily cron sweep will encounter all existing open external PRs that are older than 1 day and close them without any advance notice. close_low_quality_prs.py uses the time-since-warning approach and avoids this problem.

Comment thread .github/scripts/triage_with_llm.py Outdated
Comment thread tests/test_litellm/test_github_triage_workflows.py
Comment thread .github/workflows/triage_pr_with_llm.yml Outdated
@veria-ai

veria-ai Bot commented May 25, 2026

Copy link
Copy Markdown
Contributor

PR overview

The PR has worked down several findings, with 3 issues already addressed, but a few low-impact workflow abuse paths remain open. The remaining risks are operational rather than data-compromise risks: external users may be able to trigger unnecessary LLM/API spend, and a reconsider command may reopen items closed by other GitHub Actions because closures share the same bot actor. Overall security posture is improving, but these workflow guards should be tightened before merging.

Open issues (3)

Fixed/addressed: 3 · PR risk: 5/10

Comment thread .github/scripts/triage_with_llm.py
Comment thread .github/scripts/triage_with_llm.py Outdated
…ve label match

Co-authored-by: Yassin Kortam <yassin@berri.ai>

on:
pull_request_target:
types: [opened, reopened]

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR triage workflow duplicates review gate LLM calls

Low Severity

Both triage_pr_with_llm.yml and review_gate.yml fire on the same pull_request_target events (opened, reopened). When AGENT_SHIN_ENABLED=true, both workflows call the LLM judge for the same PR on the same event. The triage workflow always strips --close on pull_request_target, making its automatic trigger purely observational — its entire function is already performed by the review gate. This results in paying for two independent LLM evaluations of every opened/reopened PR while only one produces any action.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 19dfdd4. Configure here.

Comment thread .github/scripts/triage_with_llm.py
return {**base_result, "action": "would-close", "comment": comment}
post_comment(repo, number, comment)
close_pr(repo, number)
return {**base_result, "action": "closed", "comment": comment}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review gate omits IMMEDIATE_CLOSE_LOGINS grace bypass

Low Severity

The review_gate() function's close path uses only PR age (age_days >= grace_days) to decide whether to close, without checking IMMEDIATE_CLOSE_LOGINS. In contrast, triage() explicitly checks login.lower() in IMMEDIATE_CLOSE_LOGINS to bypass grace entirely. The PR documentation states these logins "bypass BOTH the grace period AND the dry-run gating," but the review gate still subjects them to the full age-based grace window. A dogfood account opening a fresh PR would receive a "within-grace" notice from the review gate instead of being closed immediately.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 19dfdd4. Configure here.

# the closure as a bot close. Keep the marker in sync with the literal text
# in those formatter functions.
AGENT_SHIN_AUTO_CLOSE_MARKER = "I'm **Agent Shin**"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused constant misleadingly documented for reconsider detection

Low Severity

AGENT_SHIN_AUTO_CLOSE_MARKER is documented as enabling reconsider to "recognize the closure as a bot close," but was_closed_by_agent_shin() actually uses GitHub event-actor identity checking (not comment-body matching) for that purpose. The constant is never used for operational detection anywhere — only in a test assertion to pin comment text. The misleading documentation could cause a future developer to rely on comment-body matching that doesn't exist.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 19dfdd4. Configure here.

Adds a rollout-day workflow that comments on every open external PR/issue
that the new triage bot WOULD auto-close, giving contributors 7 days to
fix their description before any destructive action runs.

Why now: merging this PR enables Agent Shin in dry-run. The follow-up
"enact" PR (next Monday) flips the destructive paths on. Without this
heads-up, contributors would get a close-comment on day 8 with no prior
warning. The heads-up names the cutoff date, lists the rubric, calls out
each PR/issue's specific missing pieces, and explains the recovery paths
(@agent-shin reconsider for PRs, edit + reopen for issues).

Files
- .github/scripts/_agent_shin_actions.py — thin maybe_post_comment /
  maybe_close_* / maybe_add_label / etc. wrappers. Each is a single
  `if dry_run: log; return; else: call_through()` so a dry-run preview
  differs from the real run in exactly one call site per mutation. The
  call-through goes via `triage_with_llm.<name>` (module-qualified) so
  monkeypatching the underlying function in tests is reflected here.
- .github/scripts/triage_rollout_heads_up.py — the sweep. Iterates every
  open PR + issue via `gh pr list` / `gh issue list`, runs the future
  rubric (review_gate for PRs, triage(kind="issue") for issues), and
  posts the heads-up on any item that would be auto-closed. Idempotent
  via a `<!-- agent-shin:rollout-heads-up -->` marker. Defaults to dry-
  run; --close opts in to real posts. --close-on overrides the cutoff
  date (defaults to today + 7 days).
- .github/workflows/triage_rollout_heads_up.yml — one-shot workflow.
  Triggers on push to litellm_internal_staging filtered to the script
  path (fires on rollout merge) plus workflow_dispatch with a dry_run
  input that defaults to "true" for safe manual re-runs.
- tests/test_litellm/test_triage_rollout_heads_up.py — 28 unit tests
  covering: the dry-run wrappers (each maybe_* gates correctly), the
  _would_be_closed predicate for PR vs. issue results, the comment
  formatter (cutoff/rubric/marker/recovery wording), per-item dispatch
  (skip-not-open, skip-internal-author, skip-already-notified,
  skip-passing, would-post/posted), and the sweep loop end-to-end.

Local preview (no GitHub mutations):
    python3 .github/scripts/triage_rollout_heads_up.py --repo BerriAI/litellm

Real run (what the workflow does):
    python3 .github/scripts/triage_rollout_heads_up.py --repo BerriAI/litellm --close

TODO: replace the placeholder ROLLOUT_BLOG_URL with the canonical
docs URL once the litellm-docs PR ships.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Comment thread .github/scripts/_agent_shin_actions.py Outdated
Comment thread .github/workflows/triage_reconsider.yml Outdated
…appers

- Mirror sibling Agent Shin workflows by only exposing OPENAI_API_KEY in
  triage_reconsider.yml when vars.AGENT_SHIN_ENABLED == 'true'. Previously
  the secret was unconditionally exposed, so any PR/issue author could
  trigger paid LLM calls by commenting '@agent-shin reconsider' even while
  the bot was supposed to be in dry-run.
- Remove the six unused dry-run wrappers (maybe_close_pr, maybe_close_issue,
  maybe_reopen_pr, maybe_reopen_issue, maybe_add_label, maybe_remove_label)
  from _agent_shin_actions.py — only maybe_post_comment is used by rollout
  scripts. Drop the associated tests that exercised the now-removed
  functions.

Co-authored-by: Yassin Kortam <yassin@berri.ai>
Comment thread .github/scripts/triage_rollout_heads_up.py Outdated
Comment thread .github/scripts/triage_with_llm.py
Comment thread .github/scripts/close_low_quality_prs.py Outdated
actor = fetch_last_close_actor(repo, number)
if not actor:
return False
return actor.lower() == expected

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Low: Reconsider can reopen items closed by other GitHub Actions

An external issue/PR author can comment @agent-shin reconsider and reopen an item closed by another workflow, such as stale or duplicate handling, because closures performed with GITHUB_TOKEN all have the same github-actions[bot] actor. Stamp Agent Shin close comments with a dedicated hidden marker and require the latest close to be accompanied by that marker before allowing the reopen path.

- triage_rollout_heads_up.py: replace %-d strftime specifier (GNU-only)
  with portable day formatting so the script doesn't crash on Windows.
- close_low_quality_prs.py: skip malformed JSON lines in fetch_pr_comments
  instead of letting one bad line abort the daily sweep, matching the
  pattern in triage_with_llm._iter_paginated_json.
- triage_with_llm.py: move has_linked_issue short-circuit before
  build_pr_prompt to avoid unnecessary prompt construction on PRs that
  link an issue.

Co-authored-by: Yassin Kortam <yassin@berri.ai>
Comment thread .github/scripts/close_low_quality_prs.py Outdated
Comment thread .github/scripts/close_low_quality_prs.py Outdated
…e_low_quality_prs

- Wrap per-PR processing in try/except so a transient GitHub API failure
  on one PR no longer aborts the entire daily sweep (mirrors the pattern
  already used in triage_rollout_heads_up.py).
- Have --limit bound *all* destructive write actions (closures and grace
  warnings combined), not just closures. Prevents a backlog of newly
  failing PRs from flooding contributors with comments in a single run.

Co-authored-by: Yassin Kortam <yassin@berri.ai>
passing = result.get("passing")
if passing is None:
return False # skipped — nothing for the heads-up to warn about
return passing is False

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heads-up script misidentifies regressed PRs as closable

Low Severity

_would_be_closed returns True for any PR result where passing is False, including the regressed-already-notified and would-remove-label actions from review_gate. Those PRs are explicitly promised to stay open (the regression comment says "The PR stays open"), yet the heads-up script would post a warning claiming they'll be auto-closed. The function conflates "not currently passing" with "will be auto-closed."

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 09720a7. Configure here.

mateo-berri and others added 3 commits May 28, 2026 03:04
Both bulk-sweep scripts hardcoded `gh {pr,issue} list --limit 1000`, and gh
lists newest-first — so the OLDEST ~900 PRs and ~380 issues were silently
dropped. That's exactly the stale backlog the daily closer and one-shot
rollout heads-up exist to catch.

Extract a single `list_open_items(kind, *, repo, fields)` helper into
`agent_shin_shared.py` with `GH_LIST_ALL_LIMIT = 100_000` — a ceiling far
above any realistic open backlog so gh paginates until the queue is
exhausted. `fetch_open_prs` and `_list_open_numbers` both delegate to it,
so the limit lives in exactly one place going forward.

Verified live against BerriAI/litellm:
- `fetch_open_prs` -> 1981 PRs (was 1000)
- `_list_open_numbers(issue)` -> 1382 issues (was 1000)
- `_list_open_numbers(pr)` -> 1981 PRs (was 1000)

Adds 7 regression tests asserting the new limit is passed, the dedicated
`gh {pr,issue} list` command + fields are used per kind, bad kind raises
ValueError, and both callers delegate to the shared helper.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The PR rubric previously passed any PR with a linked issue, regardless
of whether it showed the fix actually working. Sample spot-check found
21/25 recent external PRs passing, including ones that linked an issue
but provided zero QA evidence.

Tighten the rubric so a pass now requires BOTH:

  (1) CONTEXT — a linked issue OR a clear problem description with
      expected-vs-actual behavior.
  (2) END-TO-END QA PROOF — at least one of:
      (a) screenshot(s) of the fix working,
      (b) screen recording / video,
      (c) specific commands actually run, paired with their real
          output, against the real system.

Mocked unit tests, generic 'I tested it' claims, 'all tests pass'
without output, and the linked issue itself are explicitly excluded
from QA proof.

Also add 'qa_proof_type' to the JSON schema so the per-PR report
surfaces which kind of proof (or 'none') the judge saw.

Re-sample on the same 25 recent external PRs shifts the verdict
distribution from 21 pass / 4 fail to 4 pass / 21 fail, with zero
prior-fails now passing — the stricter rule catches PRs that ship
only with unit-test claims and no real integration evidence.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…comment

Adds "What's this and why am I getting it?" links to docs.litellm.ai/blog/
agent-shin-triage from the four comments contributors actually read when
something went wrong: PR close, PR grace warning, issue close, issue grace
warning. PR comments also link the rubric section directly from the
QA-proof bullet so contributors can self-serve "what counts as proof"
without pinging a maintainer.

Pins the new guarantees in tests: blog link must appear in all four
comments, and the PR close comment must continue to flag mocked-dependency
unit tests as insufficient proof.

The linked blog post is in BerriAI/litellm-docs PR #240; the URL will 404
until that lands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread .github/workflows/review_gate.yml Outdated
…IST_ALL_LIMIT

gh lists newest-first, so capping at 1000 silently drops the oldest open
PRs — exactly the stale ones the daily sweep is meant to reconcile. Use
the same ceiling as agent_shin_shared.GH_LIST_ALL_LIMIT so the workflow
sees the entire backlog.

Co-authored-by: Yassin Kortam <yassin@berri.ai>
Comment thread .github/scripts/triage_with_llm.py Outdated
Comment thread .github/scripts/triage_with_llm.py
Comment thread .github/scripts/triage_rollout_heads_up.py
- review_gate: expire the regression-marker short-circuit after grace_days
  so PRs that were regressed and then abandoned can eventually be closed.
- review_gate: when the rubric short-circuits to pass via the linked-issue
  regex but Greptile drags the PR below the bar, replace the synthetic
  'LLM was not called' explanation with the real Greptile shortfall so
  regression / close comments are not misleading.
- triage_rollout_heads_up._comments_have_marker: drop the unused 'kind'
  parameter and filter by bot author so a contributor quoting the
  heads-up via 'Quote reply' cannot trick the idempotency check, matching
  the pattern in triage_with_llm._has_marker.

Co-authored-by: Yassin Kortam <yassin@berri.ai>
# Manual dry-run: gh workflow run "Agent Shin — review gate" -f close=false
#
# We use `pull_request_target` so the workflow can read repo secrets and run
# against fork PRs. Fork code is never checked out — only PR metadata is read

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Low: Repeated LLM spend from fork PR updates

When AGENT_SHIN_ENABLED is true, every synchronize event on a fork PR gets OPENAI_API_KEY and runs the review gate, so an external PR author can burn LLM/API budget by repeatedly pushing commits. Consider removing synchronize here and relying on the scheduled/manual runs for re-checks, or add a per-PR cooldown before exposing the LLM key.

Suggested change
# against fork PRs. Fork code is never checked out — only PR metadata is read
types: [opened, reopened, ready_for_review]

Comment thread .github/scripts/triage_with_llm.py Outdated
Co-authored-by: Yassin Kortam <yassin@berri.ai>
# The Python script calls the LLM whenever this var is set
# (regardless of `--close`); stripping `--close` doesn't suppress
# the API call, only the destructive side effects.
OPENAI_API_KEY: ${{ (vars.AGENT_SHIN_ENABLED == 'true' || github.event_name == 'workflow_dispatch') && secrets.OPENAI_API_KEY || '' }}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Low: LLM spend on automatic dry-run events

When AGENT_SHIN_ENABLED is true, automatic issues events receive OPENAI_API_KEY, then lines 84-96 force dry-run but still execute the Python script; that script calls the LLM whenever the key is set. An external GitHub user can open or reopen issues to trigger paid LLM calls without any bot action. The same pattern exists in .github/workflows/triage_pr_with_llm.yml; keep the key empty for automatic dry-run triggers, or only call the LLM on workflows that can actually act.

Suggested change
OPENAI_API_KEY: ${{ (vars.AGENT_SHIN_ENABLED == 'true' || github.event_name == 'workflow_dispatch') && secrets.OPENAI_API_KEY || '' }}
OPENAI_API_KEY: ${{ github.event_name == 'workflow_dispatch' && secrets.OPENAI_API_KEY || '' }}

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes using high effort and found 1 potential issue.

There are 5 total unresolved issues (including 4 from previous reviews).

Fix All in Cursor

Reviewed by Cursor Bugbot for commit dd02b76. Configure here.

"action": "reopened",
"comment": reopen_body,
}
return base

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reconsider reopens PRs the review gate immediately re-closes

Medium Severity

The triage function's linked-issue short-circuit in reconsider mode (has_linked_issue → pass → reopen) never checks the Greptile confidence score, while review_gate requires both a rubric pass and Greptile ≥ 4/5. When the review gate closes a PR because Greptile scored it below the threshold (even though it has a linked issue), the reconsider flow reopens it based solely on the linked issue. The reopened event then re-triggers the review gate, which immediately closes the PR again — producing contradictory comments and a frustrating reopen-close loop bounded only by the 10-minute rate limiter.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit dd02b76. Configure here.

…you got right' section, softer 'park this for later' framing

User feedback on the auto-triage comments contributors will see:

1. Tone — the previous 'You have 1 day to address this before this PR is
   auto-closed' framing reads as an ultimatum. Replace with: 'If the
   description isn't updated in the next 1 day, I'll auto-close this PR.
   That's not us saying we don't care about the change — we want the
   open-PR list to mirror what a maintainer can act on right now, so
   contributors don't get lost in a backlog. A closed PR is a soft "park
   this for later," not a rejection. Take your time.'

2. Positive feedback — the previous comments only listed what was missing.
   Now every close + grace-warning comment opens with a 'What you got
   right:' section rendered from the judge's per-field flags. Contributors
   see a checkmark for everything they got right (linked issue, problem
   description, expected/actual, QA proof for PRs; runnable repro,
   screenshot/log, expected/actual, motivation+example for issues) before
   the gaps. The block is omitted entirely when nothing is present so
   we never render 'What you got right: (nothing).'

3. Reconsider trigger — the previous grace warning told contributors to
   comment '@agent-shin reconsider' during the grace window. They don't
   need to — the bot re-checks on every sweep. The new copy says 'just
   update the description, no need to ping me' for the grace path, and
   reserves '@agent-shin reconsider' for the post-close recovery path.

4. Bullet-train emoji — replace 👋 with 🚄 (Shinkansen, the symbol of
   Agent Shin) across every action-required comment: PR close, PR grace
   warning, issue close, issue grace warning, within-grace, Greptile-
   closer grace warning, rollout heads-up. Pinned in tests so a future
   refactor can't silently revert.

5. Greptile-post-close — the @greptileai bullet now explicitly says 'a
   low Greptile score isn't a blocker either,' since the previous copy
   buried the fact that @greptileai works after auto-close.

Comment templates updated: format_pr_close_comment,
format_issue_close_comment, format_grace_warning_pr_comment,
format_grace_warning_issue_comment, format_within_grace_comment
(triage_with_llm.py); format_grace_warning_comment
(close_low_quality_prs.py); format_heads_up_comment header
(triage_rollout_heads_up.py).

New helpers: _format_present_for_pr / _format_present_for_issue /
_format_present_block, driven off the existing per-field flags the
LLM judge already emits — no prompt change needed.

New tests pin: bullet-train emoji in every action-required comment;
'What you got right' appears with ✅ bullets when fields are present;
the block is omitted when no fields are present; 'park this for
later' / 'not a rejection' softer framing; grace warnings tell the
contributor 'no need to ping' during the grace window (reconsider is
the post-close path only).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mateo-berri

Copy link
Copy Markdown
Collaborator Author

Superseded by #30433 — recreated on the litellm_-prefixed branch litellm_agent_shin_review_gate with a clearer title and description. Same content, same base (litellm_internal_staging).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants