[litellm-agent] Staging → litellm_internal_staging (5/17/2026) #28108
oss-pr-review-agent-shin[bot] wants to merge 3 commits into
Conversation
[Infra] Promote internal staging to main
@greptile please review
Greptile Summary
This PR fixes issue #28084 by removing a spurious
Confidence Score: 5/5
Safe to merge: the change removes dead code that was incorrectly gating a network-only path, and the new tests mock all I/O correctly with no real network calls.
The diff is small and surgical: it deletes an import check that was never needed by the actual token-counting implementation and adds two well-structured regression tests that fully stub the auth and HTTP layers. The handler itself is unchanged, the rest of
No files require special attention.

| Filename | Overview |
|---|---|
| litellm/llms/vertex_ai/vertex_ai_partner_models/main.py | Removes the now-unnecessary vertexai (Gemini SDK) import gate from count_tokens; replaces it with an explanatory comment. The actual token-counting path uses VertexAIPartnerModelsTokenCounter over plain httpx and never needed that import. |
| tests/test_litellm/llms/vertex_ai/vertex_ai_partner_models/count_tokens/test_count_tokens_no_vertexai_sdk.py | New regression test file with two fully-mocked tests: one verifies count_tokens proceeds past the old import gate when vertexai is unimportable; the other asserts the handler module itself never loads the Gemini SDK. All network calls are stubbed. |
Reviews (1): Last reviewed commit: "fix(vertex_ai/partner_models): drop unus..."
Codecov Report
✅ All modified and coverable lines are covered by tests.
Adds a new triage flow that evaluates external pull requests and issues against the project's contribution rubric and, when configured to do so, auto-closes non-conforming ones with an explanatory comment. Contributors can update + reopen to be re-evaluated.

Scope:
- Internal BerriAI contributors (author_association OWNER/MEMBER/COLLABORATOR) and bot accounts are skipped entirely.
- 'Fixes #1234' / 'Resolves https://github.com/.../issues/N' in the PR body short-circuits to PASS without burning LLM tokens.
- LLM judge returns structured JSON (verdict, missing[], explanation); parser tolerates markdown fences and embedded JSON.
- LLM errors NEVER close PRs/issues; failure surfaces as 'skip-llm-error'.

Safety:
- pull_request_target / issues triggers are FORCED dry-run in the workflow; only manual workflow_dispatch with close=true (and AGENT_SHIN_ENABLED=true) takes destructive action.
- Default mode writes verdicts to GITHUB_STEP_SUMMARY only; no public comments until the team flips the AGENT_SHIN_ENABLED repo variable.
- LLM uses an OpenAI-compatible endpoint (model and base URL configurable via repo variables; key via OPENAI_API_KEY secret).

Files:
- .github/scripts/triage_with_llm.py - judge orchestrator + CLI
- .github/workflows/triage_pr_with_llm.yml
- .github/workflows/triage_issue_with_llm.yml
- tests/test_litellm/test_github_triage_with_llm.py - 33 unit tests

End-to-end validated against four real PRs (#28117 internal collaborator, #28108 bot, #28129 'Fixes #28128', #28116 no linked issue) and issue #28132 with a stubbed LLM judge: each path produces the expected action.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
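The "parser tolerates markdown fences and embedded JSON" behavior described above could look roughly like the sketch below. This is a minimal illustration, not the code in `triage_with_llm.py`: the name `parse_verdict`, the fence regex, and the assumption that `verdict` is either `"pass"` or `"fail"` are all hypothetical.

```python
import json
import re
from typing import Optional

# Hypothetical helper; the real parser in triage_with_llm.py may differ.
FENCE = "`" * 3  # a markdown code fence, built indirectly to keep this block readable
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL | re.IGNORECASE)

# Assumption: the judge's verdict field is one of these two strings.
VALID_VERDICTS = {"pass", "fail"}


def parse_verdict(raw: str) -> Optional[dict]:
    """Return the judge's verdict dict, or None when nothing usable is found."""
    candidates = [raw.strip()]

    # Case 1: the JSON is wrapped in a markdown code fence.
    fenced = FENCE_RE.search(raw)
    if fenced:
        candidates.append(fenced.group(1).strip())

    # Case 2: the JSON object is embedded in surrounding prose.
    start, end = raw.find("{"), raw.rfind("}")
    if 0 <= start < end:
        candidates.append(raw[start : end + 1])

    for text in candidates:
        try:
            data = json.loads(text)
        except json.JSONDecodeError:
            continue
        if isinstance(data, dict) and data.get("verdict") in VALID_VERDICTS:
            return data
    # Callers should treat None as an LLM error and never close on it.
    return None
```

Returning `None` instead of raising keeps the "LLM errors never close PRs/issues" rule easy to enforce at the call site: an unparseable reply maps to a skip, not a verdict.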
…and review-gate label lifecycle (#30433)

* feat(triage): auto-close stale PRs with Greptile score <4/5

Adds .github/scripts/close_low_quality_prs.py and a daily workflow that closes PRs which:
- are open for at least 7 days, and
- carry a most-recent greptile-apps review with Confidence Score <4/5,
- and are not drafts or opt-out-labeled ('do not close', 'wip', etc.).

Each closure posts an explanatory comment telling the contributor how to bring the PR back (rebase, re-request greptile, reopen at 4+/5). The 4/5 bar is already documented in the PR template (.github/pull_request_template.md), so this just enforces it.

Tested with a dry run against the live BerriAI/litellm backlog of 1000 open PRs: 100 candidates identified, 598 PRs pass the bar (4+/5), 186 are too young, 97 are drafts, 19 lack any Greptile review and are left alone.

Workflow defaults to closing 25 PRs/run as a safety net and supports workflow_dispatch with overrides (close=false for a dry run, custom min_age_days/min_score/limit).

18 unit tests cover score extraction (HTML/markdown/plain text, login variants, multi-review picks latest) and per-PR evaluation (drafts, opt-out labels, age, missing/passing/failing scores).

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* docs(templates): require expected/actual + QA proof for external contributions

PR template:
- Make the rubric explicit at the top: link an issue, OR provide a clear problem description + expected vs. actual + visual QA proof.
- Add dedicated sections for each piece so the bot has a deterministic shape to read.
- Keep the existing 'Linear ticket' section for internal contributors (they're exempt from the auto-triage rubric).

Bug report template:
- Split 'What happened?' into 'Actual behavior' + 'Expected behavior'.
- Make logs/screenshot a required textarea.
- Warning banner at the top tells external contributors that incomplete reports will be auto-closed (with re-evaluation on reopen).
Feature request template:
- Require a concrete use case + example in the motivation field, not just a one-liner pitch.
- Same auto-triage warning banner.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* feat(triage): Agent Shin LLM-as-judge for external PRs and issues

Adds a new triage flow that evaluates external pull requests and issues against the project's contribution rubric and, when configured to do so, auto-closes non-conforming ones with an explanatory comment. Contributors can update + reopen to be re-evaluated.

Scope:
- Internal BerriAI contributors (author_association OWNER/MEMBER/COLLABORATOR) and bot accounts are skipped entirely.
- 'Fixes #1234' / 'Resolves https://github.com/.../issues/N' in the PR body short-circuits to PASS without burning LLM tokens.
- LLM judge returns structured JSON (verdict, missing[], explanation); parser tolerates markdown fences and embedded JSON.
- LLM errors NEVER close PRs/issues; failure surfaces as 'skip-llm-error'.

Safety:
- pull_request_target / issues triggers are FORCED dry-run in the workflow; only manual workflow_dispatch with close=true (and AGENT_SHIN_ENABLED=true) takes destructive action.
- Default mode writes verdicts to GITHUB_STEP_SUMMARY only; no public comments until the team flips the AGENT_SHIN_ENABLED repo variable.
- LLM uses an OpenAI-compatible endpoint (model and base URL configurable via repo variables; key via OPENAI_API_KEY secret).

Files:
- .github/scripts/triage_with_llm.py - judge orchestrator + CLI
- .github/workflows/triage_pr_with_llm.yml
- .github/workflows/triage_issue_with_llm.yml
- tests/test_litellm/test_github_triage_with_llm.py - 33 unit tests

End-to-end validated against four real PRs (#28117 internal collaborator, #28108 bot, #28129 'Fixes #28128', #28116 no linked issue) and issue #28132 with a stubbed LLM judge: each path produces the expected action.
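The linked-issue short-circuit in the Scope list above can be sketched as a single regex check that runs before any LLM call. The pattern below is an assumption built from GitHub's documented closing keywords (fix/fixes/fixed, close/closes/closed, resolve/resolves/resolved); the project's actual `LINKED_ISSUE_PATTERN` is not shown on this page and may differ, and `has_linked_issue` here is only an illustrative stand-in.

```python
import re
from typing import Optional

# Assumption: keyword list limited to GitHub's documented closing keywords.
# The real LINKED_ISSUE_PATTERN in triage_with_llm.py may be written differently.
LINKED_ISSUE_RE = re.compile(
    r"\b(?:fix(?:e[sd])?|close[sd]?|resolve[sd]?)\b[:\s]+"
    r"(?:#\d+|https://github\.com/[\w.-]+/[\w.-]+/issues/\d+)",
    re.IGNORECASE,
)


def has_linked_issue(body: Optional[str]) -> bool:
    """True when the PR body links an issue with a closing keyword."""
    return bool(body and LINKED_ISSUE_RE.search(body))


def triage_action(body: Optional[str]) -> str:
    # Short-circuit: a linked issue passes without spending LLM tokens.
    if has_linked_issue(body):
        return "pass-linked-issue"
    # Otherwise the body would be handed to the LLM judge (not sketched here).
    return "needs-llm-judge"
```

A casual mention such as "See #1234 for context" deliberately does not match, so it falls through to the judge.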
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* feat(triage): scope Greptile auto-closer to external contributors + dry-run by default

- close_low_quality_prs.py now filters by GitHub author_association via the REST API: PRs from OWNER / MEMBER / COLLABORATOR (and bot accounts) are skipped with a new 'skip-internal' summary bucket.
- close_low_quality_prs.yml now defaults workflow_dispatch close=false, and ignores 'close=true' unless the new repo variable AGENT_SHIN_ENABLED is set to 'true'. Scheduled runs are dry-run only until the team flips that switch.
- Updated unit tests: one new test asserting internal authors are skipped, and an autouse fixture treats unspecified test PRs as external so the rest of the suite still exercises the close path.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* fix(workflows): scheduled cron closes PRs; safe --close strip in triage

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(triage): scheduled cron stays dry-run; dedent prompts before interpolation

- close_low_quality_prs.yml: only workflow_dispatch with close=true (and AGENT_SHIN_ENABLED=true) actually closes PRs. Scheduled runs are always dry-run, matching the safety invariant documented for triage_pr/issue.
- triage_with_llm.py: textwrap.dedent on an f-string with multi-line interpolated bodies fails because the body's 2nd+ lines start at column 0, making the common-indent zero. Dedent the static template first, then .format() the title/body in.

* Fix bugs in auto-close PR triage scripts

- close_low_quality_prs.py: Treat author_association API lookup failures as internal (fail-safe) so transient errors don't cause internal contributors' PRs to be auto-closed.
- triage_with_llm.py: Update summary heading from 'Would post comment:' to 'Posted comment:' since this branch only runs after the comment has already been posted.
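The dedent-before-interpolation fix described above is easy to reproduce in isolation. The sketch below is illustrative only: the prompt text and the names `PROMPT_TEMPLATE`, `build_prompt`, and `build_prompt_buggy` are hypothetical and are not the project's real prompt builders.

```python
import textwrap

# Dedent the static template once, BEFORE any user content is interpolated.
PROMPT_TEMPLATE = textwrap.dedent(
    """\
    You are reviewing a pull request.
    Title: {title}
    Body:
    {body}
    """
)


def build_prompt(title: str, body: str) -> str:
    """Safe order: the template is already dedented, so the body cannot affect it."""
    return PROMPT_TEMPLATE.format(title=title, body=body)


def build_prompt_buggy(title: str, body: str) -> str:
    """Buggy order: interpolate first, dedent second.

    A multi-line body puts its 2nd+ lines at column 0, so the common leading
    whitespace is empty and dedent() strips nothing from the template lines.
    """
    return textwrap.dedent(
        f"""\
        You are reviewing a pull request.
        Title: {title}
        Body:
        {body}
        """
    )
```

With a single-line body both functions behave the same, which is likely why this kind of bug only shows up on real, multi-line PR descriptions.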
Co-authored-by: Yassin Kortam <yassin@berri.ai>

* feat(triage): default Agent Shin to gpt-5.4-mini with reasoning_effort=none

- Bump DEFAULT_MODEL from gpt-4o-mini to gpt-5.4-mini (more modern; 4M total context window per OpenAI catalog, JSON-schema response format, function calling all supported).
- For gpt-5.x family models, pass reasoning_effort="none" via extra_body. gpt-5.x rejects temperature != 1 unless reasoning_effort is explicitly "none"; setting it lets us keep temperature=0 for deterministic JSON rubric judgments. extra_body works across openai SDK versions regardless of whether they natively type the kwarg.
- For non-gpt5 overrides (TRIAGE_MODEL=gpt-4o-mini etc.), reasoning_effort is not sent.
- 4 new unit tests cover: gpt-5.4-mini -> reasoning_effort=none, capitalized/dated gpt-5 variants -> reasoning_effort=none, gpt-4o-mini -> no extra_body, base_url passthrough.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* fix(triage): bugbot - drop dead gh_json and fix --optout-label append-with-default

- Removed the unused gh_json helper (bugbot low-severity dead code).
- Replaced argparse `action="append", default=[...]` with default=None + DEFAULT_OPTOUT_LABELS fallback. The mutable-default + append combo silently APPENDS to the canonical defaults instead of replacing them, so --optout-label could not actually scope the opt-out list.
- Added tests covering both the canonical default and the flag-replaces-defaults behavior.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* fix(triage): bugbot - tighten linked-issue regex, fail-safe author_association, fix empty TRIAGE_MODEL

Three independent bugbot findings against triage_with_llm.py:

1.
LINKED_ISSUE_PATTERN included weak keywords (`see`, `ref`, `addresses`) so casual mentions like "See #1234 for context" were short-circuited to pass-linked-issue without ever calling the LLM, contradicting the prompt's own "a bare issue number without a closing keyword counts only if it's clearly the related issue (not a passing mention)" rubric. Limit the regex to GitHub's documented PR-closing keywords (fixes/fix/fixed/closes/close/closed/resolves/resolve/resolved).

2. is_internal_contributor() treated an empty/missing author_association as external (eligible for the destructive close path), while the sibling is_external_pr_author() in close_low_quality_prs.py fail-safes the same case as internal. Align the two so a partial/unknown GitHub response can never make a PR eligible for auto-close.

3. argparse `default=os.environ.get("TRIAGE_MODEL", DEFAULT_MODEL)` returns the empty string when GitHub Actions exposes an unset repo variable as an empty-string env var (the optional vars.TRIAGE_MODEL case in the workflow). Use `os.environ.get(...) or DEFAULT_MODEL` so empty -> default, matching the existing OPENAI_BASE_URL pattern.

Tests:
- Casual mentions now must fall through to the LLM (parametrized); added an orchestration test ensuring "See #1234" reaches the judge.
- Empty/missing author_association now fails safe (parametrized).
- Empty TRIAGE_MODEL env var falls back to DEFAULT_MODEL; explicit TRIAGE_MODEL is still honored.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* fix(workflows): bugbot - gate Agent Shin --close on '= true' not '!= false'

The PR and issue Agent Shin workflows gated the destructive --close flag with [ "${DISPATCH_CLOSE:-false}" != "false" ]. That pattern treats anything other than the literal string "false" as enabling closure: "True", "yes", "1", typos, accidental whitespace, etc.
The workflow_dispatch input UI is a 'true'/'false' choice dropdown so the form is constrained, but the API (`gh workflow run -f close=...`) accepts any string, and a CI cron / external invoker passing a non-canonical truthy value would have silently enabled real contributor PR closures.

Mirror the sibling Greptile closer's [ "${CLOSE_FLAG}" = "true" ] pattern: only the EXACT string "true" enables --close; every other value (including the unset/empty default) resolves to dry-run. This is the fail-safe philosophy applied everywhere else in this PR.

Added tests/test_litellm/test_github_triage_workflows.py with two parametrized invariants:

1. The destructive gate uses '= "true"' for its env-var comparison (either bare '${ENV}' or '${ENV:-false}' form accepted), and never the fail-open '!= "false"' pattern.
2. Every destructive gate is also gated on AGENT_SHIN_ENABLED being "true" (either by entering the close branch on '=' or by bailing out early on '!=') so flipping the repo variable off is a true kill switch regardless of per-run inputs.

Manually verified the test fails on the buggy '!= "false"' pattern and passes on the fix, so it would have caught the regression at PR time.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* feat(triage): close any PR (incl. drafts, any age); add @agent-shin reconsider flow

Follow-up to PR #28117. Three behavior changes + one new workflow, addressing the team's concerns on the original review:

1) Apply auto-close to ALL open PRs, not just those over a week old.
- close_low_quality_prs.py: --min-age-days default flipped from 7 to 0. The flag is preserved as an opt-in safety net for one-off backfill runs that want to spare very-young PRs, but the daily scheduled sweep now closes external-author PRs as soon as Greptile scores them <4/5.
- close_low_quality_prs.yml: workflow_dispatch input default also flipped to 0; doc comments updated.

2) Apply auto-close to draft PRs too.
- close_low_quality_prs.py: removed the skip-draft branch in evaluate_pr. Drafts are NOT a free pass: the team's intent is 'open PR count == PRs internal collaborators need to action on', so a draft Greptile scored 2/5 still belongs in the closed bucket. Authors who genuinely need a long-lived draft can attach the 'wip' opt-out label, which is unchanged.
- The 'skip-draft' action is gone; the 'wip' label still skips.

3) Address the 'OSS contributors cannot reopen a bot-closed PR' wrinkle.

GitHub does NOT let an external (non-write-access) contributor reopen a PR that was closed by a bot or maintainer (long-standing limitation). The original PR's close-comments told contributors to 'Reopen the PR, I'll re-evaluate automatically', which is broken for the very audience this triage targets. Two changes:

a) Reword every close-comment (Greptile sweep + Agent Shin PR close + Agent Shin issue close + PR template) to recommend:
- Open a new PR with the updated branch (primary path).
- Or comment '@agent-shin reconsider' on the closed PR for a re-evaluation that, on pass, reopens the PR via the bot's GH_TOKEN write access.

b) Add the @agent-shin reconsider workflow:
- .github/workflows/triage_reconsider.yml: new 'issue_comment'-triggered workflow. Authorizes only the PR/issue author or an internal collaborator (OWNER/MEMBER/COLLABORATOR), gated via a step output so unauthorized commenters never reach the destructive steps. Globally gated on AGENT_SHIN_ENABLED='true' (positive form, matching the test_github_triage_workflows guardrail patterns).
- triage_with_llm.py: --reconsider mode. On a closed PR/issue, re-runs the LLM judge (or linked-issue regex short-circuit) and:
    - on pass: reopens via reopen_pr/reopen_issue + posts a 'Re-evaluated and reopened' comment.
    - on fail: leaves closed and posts a 'still missing X' comment so the contributor can iterate again.
  Reconsider-on-open is a no-op ('skip-not-closed').
  Internal-author + bot-account skips still take priority over reconsider.

4) Greptile-on-closed-PRs question: the team asked whether Greptile can re-review a closed PR. Greptile's docs don't address this and we shouldn't promise behavior we can't verify, so the new close-comment wording does NOT instruct contributors to 're-request greptile on the closed PR'. Instead it points them at the new-PR path (which Greptile definitely reviews) or the @agent-shin reconsider trigger (which re-runs the LiteLLM-side rubric judge, not Greptile).

Tests: 93 passing (was 59).
- test_github_close_low_quality_prs.py: replaced 'skip drafts' test with 'closes drafts when score is low' + 'closes brand-new PR when min_age=0' + 'no skip when min_age=0'. The 'skip too young' assertion is preserved as opt-in.
- test_github_triage_with_llm.py: 6 new TestTriageOrchestration cases for reconsider mode (skip-not-closed on open, reopen on pass, still-failing comment on fail, linked-issue short-circuit reopen, skip internal author in reconsider, reopen-issue on pass) + a new TestCloseCommentText class that pins the user-facing 'open a new PR' + '@agent-shin reconsider' wording.
- test_github_triage_workflows.py: added triage_reconsider.yml to the destructive-gate guardrail table; AGENT_SHIN_ENABLED is its own destructive gate (no separate per-run flag needed).

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* test(triage): pin safe behavior for curly braces in PR/issue title+body

Adds regression tests covering the bugbot high-severity finding that str.format() would crash on user-supplied content containing { or }. Empirically str.format() does NOT re-parse interpolated values (only the template literal is scanned for replacement fields), so the bug does not exist in the current code, but pinning the safe behavior prevents a future templating change from silently reintroducing it.
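The `str.format()` behavior that this regression test pins can be checked with a few lines. The template and the `render` helper below are hypothetical stand-ins for the real prompt builders; the point is only that braces in the interpolated values pass through untouched, while stray braces in the template itself would raise.

```python
# Hypothetical template; only this literal is scanned for replacement fields.
TEMPLATE = "Title: {title}\nBody:\n{body}\n"


def render(title: str, body: str) -> str:
    """Interpolated values are inserted verbatim and never re-parsed."""
    return TEMPLATE.format(title=title, body=body)


def template_is_valid(template: str) -> bool:
    """A lone brace in the TEMPLATE (not in the values) is what actually fails."""
    try:
        template.format(title="", body="")
    except (ValueError, KeyError, IndexError):
        return False
    return True
```

So user-supplied content such as `{"a": 1}` or an unbalanced `{` in a PR body is harmless as long as it is only ever passed as a value, never concatenated into the template.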
Also pins the dedented prompt shape (no leading 8-space indentation on template lines) so a future change to the build_*_prompt functions can't silently regress the LLM judge prompt format on multi-line bodies.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* fix(triage): bugbot - reconsider dry-run + bot-closed guard + rate limit

Address three Greptile/veria-ai concerns on the @agent-shin reconsider flow:

1. **Reconsider had no dry-run path.** The previous reconsider mode ignored `--close` and always posted comments + reopened on a pass. A local operator running `python triage_with_llm.py --reconsider --pr N` would silently take destructive GitHub actions with no way to preview. Reconsider now honors `close=False` the same way regular triage does and returns `would-reopen` / `would-reconsider-still-failing` for step-summary rendering.

2. **Reconsider could reopen maintainer-closed PRs/issues** (Medium security finding from veria-ai). The workflow only checked that the commenter was authorized; it did NOT check that the most recent close was performed by Agent Shin. A contributor could comment `@agent-shin reconsider` on a PR a maintainer closed for non-rubric reasons (duplicate, security report, design rejection) and have the bot reopen it. Add `was_closed_by_agent_shin()` which inspects the issue events API for the most recent `closed` actor and only permits reopen when that actor matches the configured bot login (default `github-actions[bot]`, overridable via env). Fail-closed on missing events.

3. **No rate-limiting on the reconsider trigger.** Every `@agent-shin reconsider` comment burns CI minutes + an OpenAI API call. Add a 10-minute cooldown via `seconds_since_last_reconsider_verdict()` which greps the issue's comment list for the bot's own verdict marker (`<!-- agent-shin:reconsider-verdict -->`). Inside the window the triage returns `skip-rate-limited` and the LLM never runs.
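The two reconsider guards from findings 2 and 3 can be sketched as pure functions over already-fetched GitHub API payloads. This is a hedged illustration: `closed_by_bot`, `seconds_since_last_verdict`, and `may_reconsider` are hypothetical names (the commit's own helpers are `was_closed_by_agent_shin()` and `seconds_since_last_reconsider_verdict()`, whose real signatures are not shown here), and the sketch assumes the usual REST shapes where events carry `event`, `actor.login`, `created_at` and comments carry `user.login`, `body`, `created_at`.

```python
from datetime import datetime, timezone
from typing import Optional

VERDICT_MARKER = "<!-- agent-shin:reconsider-verdict -->"
BOT_LOGIN = "github-actions[bot]"  # default named in the commit message
COOLDOWN_SECONDS = 10 * 60  # the 10-minute cooldown


def _parse(ts: str) -> datetime:
    # GitHub timestamps look like 2026-05-17T12:00:00Z.
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def closed_by_bot(events: list, bot_login: str = BOT_LOGIN) -> bool:
    """True only when the MOST RECENT 'closed' event was performed by the bot."""
    closes = [e for e in events if e.get("event") == "closed"]
    if not closes:
        return False  # fail closed on missing events
    latest = max(closes, key=lambda e: e.get("created_at") or "")
    return (latest.get("actor") or {}).get("login") == bot_login


def seconds_since_last_verdict(
    comments: list, now: datetime, bot_login: str = BOT_LOGIN
) -> Optional[float]:
    """Seconds since the bot's last verdict comment, or None if there is none."""
    stamps = [
        _parse(c["created_at"])
        for c in comments
        if (c.get("user") or {}).get("login") == bot_login
        and VERDICT_MARKER in (c.get("body") or "")
    ]
    return (now - max(stamps)).total_seconds() if stamps else None


def may_reconsider(events: list, comments: list, now: datetime) -> bool:
    """Combine both guards: bot-closed provenance first, then the cooldown."""
    if not closed_by_bot(events):
        return False
    elapsed = seconds_since_last_verdict(comments, now)
    return elapsed is None or elapsed >= COOLDOWN_SECONDS
```

Filtering the marker by comment author matters: without it, a contributor who quotes a bot comment could reset or spoof the cooldown.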
Workflow update:
- `triage_reconsider.yml` now passes `--close` only when `AGENT_SHIN_ENABLED=true`, matching the pattern of `triage_pr_with_llm.yml`. The script runs in both states so the verdict still appears in the step summary for QA.

Tests:
- Add 5 reconsider safety tests: dry-run for pass / fail / linked-issue short-circuit, bot-closed-guard refusal on maintainer close, rate-limit refusal inside the cooldown window, and cooldown-elapsed acceptance.
- Add unit tests for `was_closed_by_agent_shin` (bot / maintainer / missing actor / env-override) and `seconds_since_last_reconsider_verdict` (no marker / multiple markers / non-bot comment with marker / bot comment without marker).
- Pin the `<!-- agent-shin:reconsider-verdict -->` marker in both reopen and still-failing comments; dropping it would silently break the cooldown.

Existing reconsider tests updated to pass `close=True` (the production path now) + stub the new guards via `_stub_reconsider_guards`.

112 tests pass (was 93).

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* feat(triage): 1-day grace period before close + SwiftWinds immediate-close bypass

- Add a 24-hour grace window between the first low-quality detection and the actual auto-close. The first detection posts a warning comment that explicitly says "You have 1 day to address this before this PR is auto-closed" and points the contributor at:
    * `@agent-shin reconsider` to request another look (and re-open)
    * `@greptileai` to request a fresh Greptile review (works even after the PR is closed)
- Both `triage_with_llm.py` (LLM judge) and `close_low_quality_prs.py` (Greptile-score closer) share the same `<!-- agent-shin:grace-warning -->` HTML marker so a warning posted by either path is recognized by both.
- Add IMMEDIATE_CLOSE_LOGINS = {swiftwinds} to bypass BOTH the grace period AND the dry-run / AGENT_SHIN_ENABLED gating.
  SwiftWinds is the user's personal account (no push permissions to litellm) used to dogfood the bot; user explicitly asked: "For SwiftWinds, just close immediately. Faster iteration that way."
- Update the standard close comments to mention that `@greptileai` works even after the PR is closed.
- Add 23 new tests covering: warn-grace on first detection, skip during grace window, close after grace expires, SwiftWinds bypass (case insensitive, with close=False, no random-login false positives), the grace-warning text invariants, and the SwiftWinds entry in the IMMEDIATE_CLOSE_LOGINS constant.

Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>

* fix: skip grace-period text in close comment for IMMEDIATE_CLOSE_LOGINS

For PRs from IMMEDIATE_CLOSE_LOGINS (e.g. swiftwinds), evaluate_pr returns 'close' immediately without ever posting a grace warning, so the close comment should not reference a 1-day grace period. Make close_pr take a grace_period_elapsed flag, default True, and pass False from the main loop when the close path was the immediate-close branch.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(close-low-quality-prs): report actual closes in dry-run summary

IMMEDIATE_CLOSE_LOGINS PRs are closed even when the global --close flag is not set, but the summary used the global dry-run flag to choose between 'would close' and 'closed'. Split the count so operators can see both actual closures and dry-run would-be closures.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* chore(triage): vendor Agent Shin (#28117) onto demo branch

Brings the Agent Shin OSS-triage scripts, workflows, issue/PR templates, and tests from PR #28117 onto this branch so the new review-gate feature and its end-to-end demo are self-contained and runnable in CI.
https://claude.ai/code/session_01XyyWa8t2VYmoGd6mKMEqkZ

* feat(triage): add "ready for review" label lifecycle to Agent Shin

Adds review_gate(), a state machine that keeps a `ready for review` label in sync with whether an external PR clears BOTH gates (the LLM rubric and Greptile's most recent confidence score):
- pass (untagged) -> add label + "ready for review" / "all clear" comment
- pass (already tagged) -> no-op (idempotent across re-runs)
- regress (Greptile < 4/5 or QA proof removed) -> remove label + "what's missing" comment, PR stays open
- recover after a regression -> "all clear again" comment + re-add the label
- fail & untagged, < 24h old -> one-time "what's missing" notice (grace window)
- fail & untagged, > 24h old -> close + comment (reopen via @agent-shin reconsider)

The label itself is the persisted state, so comments fire only on transitions (never on every scheduled run). All side effects are gated behind --close, so the dry-run contract matches the existing triage flow. Lifecycle comments use hidden HTML markers and deliberately avoid the auto-close marker so they never trip the reconsider provenance check.

Relocates the shared Greptile helpers (extract_greptile_score, SCORE_PATTERN, GREPTILE_BOT_LOGINS, parse_iso8601) into triage_with_llm.py so the daily sweep and the review gate read the score through one implementation, and adds the review_gate.yml workflow (dry-run unless AGENT_SHIN_ENABLED=true) plus 18 unit tests covering every branch and a full pass->regress->recover cycle.

https://claude.ai/code/session_01XyyWa8t2VYmoGd6mKMEqkZ

* Port review-gate feature from #28758 onto #28147 triage scripts

Adds the "ready for review" label lifecycle (originally PR #28758) on top of #28147's refactored triage_with_llm.py.
The original commit was authored against an older snapshot of #28117 and could not be applied cleanly, so the additions were re-applied surgically:
- New constants: READY_FOR_REVIEW_LABEL, DEFAULT_GRACE_DAYS, DEFAULT_MIN_GREPTILE_SCORE, READY/REGRESSED/WITHIN_GRACE markers, GREPTILE_BOT_LOGINS, SCORE_PATTERN, AGENT_SHIN_AUTO_CLOSE_MARKER.
- New helpers: add_label, remove_label, extract_greptile_score, parse_iso8601 (the latter two mirrored from close_low_quality_prs.py so the daily sweep and the review gate read the score through the same logic).
- New comment formatters: format_ready_for_review_comment, format_all_clear_comment, format_regression_comment, format_within_grace_comment.
- New entry point: review_gate() implementing the pass/regress/recover state machine, with the label itself acting as persisted state so transition comments fire only on actual transitions.
- main() learns --review-gate, --grace-days, --min-greptile-score and dispatches to review_gate() when the flag is set.

Verified via tests/test_litellm/test_github_review_gate.py (18 tests) and the existing triage suites (144 more); all 162 pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* agent_shin: extract shared constants/helpers; cover review_gate.yml in guardrail tests

Bug 1: `triage_with_llm.py` and `close_low_quality_prs.py` each defined their own copies of `extract_greptile_score`, `parse_iso8601`, `GREPTILE_BOT_LOGINS`, `SCORE_PATTERN`, `GRACE_COMMENT_MARKER`, `GRACE_PERIOD_SECONDS`, `IMMEDIATE_CLOSE_LOGINS`, and `AGENT_SHIN_DEFAULT_BOT_LOGIN`. The comments explicitly said the two copies had to stay in sync, but nothing enforced it. A future change to one (e.g. extending `SCORE_PATTERN` for a new Greptile output format) would silently diverge from the other and the daily sweep and the LLM judge would disagree on which PRs have low scores.
Extract these to `.github/scripts/agent_shin_shared.py` and re-export them from each script so the existing test attribute access (`triage_module.GRACE_COMMENT_MARKER`, etc.) keeps working without any test changes.

Bug 2: `review_gate.yml` is a destructive workflow (close PRs, add/remove labels, post comments) with the same gating philosophy as the others (`AGENT_SHIN_ENABLED = "true"` + a per-run `CLOSE_FLAG = "true"`), but it was missing from `DESTRUCTIVE_GATE_ENV` in the guardrail tests. Add it so a future regression (e.g. flipping to `!= "false"`) is caught by the same parameterized invariants as every other workflow.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* agent_shin: fix bug bundle (gated LLM key, author-filtered marker dedup, dedup gh/grace helpers)

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* agent_shin: fix review_gate close-after-regression and case-insensitive label match

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* feat(triage): add one-shot 7-day heads-up sweep for Agent Shin rollout

Adds a rollout-day workflow that comments on every open external PR/issue that the new triage bot WOULD auto-close, giving contributors 7 days to fix their description before any destructive action runs.

Why now: merging this PR enables Agent Shin in dry-run. The follow-up "enact" PR (next Monday) flips the destructive paths on. Without this heads-up, contributors would get a close-comment on day 8 with no prior warning. The heads-up names the cutoff date, lists the rubric, calls out each PR/issue's specific missing pieces, and explains the recovery paths (@agent-shin reconsider for PRs, edit + reopen for issues).

Files
- .github/scripts/_agent_shin_actions.py: thin maybe_post_comment / maybe_close_* / maybe_add_label / etc. wrappers. Each is a single `if dry_run: log; return; else: call_through()` so a dry-run preview differs from the real run in exactly one call site per mutation.
  The call-through goes via `triage_with_llm.<name>` (module-qualified) so monkeypatching the underlying function in tests is reflected here.
- .github/scripts/triage_rollout_heads_up.py: the sweep. Iterates every open PR + issue via `gh pr list` / `gh issue list`, runs the future rubric (review_gate for PRs, triage(kind="issue") for issues), and posts the heads-up on any item that would be auto-closed. Idempotent via a `<!-- agent-shin:rollout-heads-up -->` marker. Defaults to dry-run; --close opts in to real posts. --close-on overrides the cutoff date (defaults to today + 7 days).
- .github/workflows/triage_rollout_heads_up.yml: one-shot workflow. Triggers on push to litellm_internal_staging filtered to the script path (fires on rollout merge) plus workflow_dispatch with a dry_run input that defaults to "true" for safe manual re-runs.
- tests/test_litellm/test_triage_rollout_heads_up.py: 28 unit tests covering: the dry-run wrappers (each maybe_* gates correctly), the _would_be_closed predicate for PR vs. issue results, the comment formatter (cutoff/rubric/marker/recovery wording), per-item dispatch (skip-not-open, skip-internal-author, skip-already-notified, skip-passing, would-post/posted), and the sweep loop end-to-end.

Local preview (no GitHub mutations):

    python3 .github/scripts/triage_rollout_heads_up.py --repo BerriAI/litellm

Real run (what the workflow does):

    python3 .github/scripts/triage_rollout_heads_up.py --repo BerriAI/litellm --close

TODO: replace the placeholder ROLLOUT_BLOG_URL with the canonical docs URL once the litellm-docs PR ships.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix: gate reconsider workflow OPENAI_API_KEY + remove dead actions wrappers

- Mirror sibling Agent Shin workflows by only exposing OPENAI_API_KEY in triage_reconsider.yml when vars.AGENT_SHIN_ENABLED == 'true'.
  Previously the secret was unconditionally exposed, so any PR/issue author could trigger paid LLM calls by commenting '@agent-shin reconsider' even while the bot was supposed to be in dry-run.
- Remove the six unused dry-run wrappers (maybe_close_pr, maybe_close_issue, maybe_reopen_pr, maybe_reopen_issue, maybe_add_label, maybe_remove_label) from _agent_shin_actions.py; only maybe_post_comment is used by rollout scripts. Drop the associated tests that exercised the now-removed functions.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix: address triage script edge cases

- triage_rollout_heads_up.py: replace %-d strftime specifier (GNU-only) with portable day formatting so the script doesn't crash on Windows.
- close_low_quality_prs.py: skip malformed JSON lines in fetch_pr_comments instead of letting one bad line abort the daily sweep, matching the pattern in triage_with_llm._iter_paginated_json.
- triage_with_llm.py: move has_linked_issue short-circuit before build_pr_prompt to avoid unnecessary prompt construction on PRs that link an issue.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(scripts): per-PR error isolation and limit grace warnings in close_low_quality_prs

- Wrap per-PR processing in try/except so a transient GitHub API failure on one PR no longer aborts the entire daily sweep (mirrors the pattern already used in triage_rollout_heads_up.py).
- Have --limit bound *all* destructive write actions (closures and grace warnings combined), not just closures. Prevents a backlog of newly failing PRs from flooding contributors with comments in a single run.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix(agent-shin): remove 1000-PR cap on bulk sweeps; sweep entire backlog

Both bulk-sweep scripts hardcoded `gh {pr,issue} list --limit 1000`, and gh lists newest-first, so the OLDEST ~900 PRs and ~380 issues were silently dropped. That's exactly the stale backlog the daily closer and one-shot rollout heads-up exist to catch.
Extract a single `list_open_items(kind, *, repo, fields)` helper into `agent_shin_shared.py` with `GH_LIST_ALL_LIMIT = 100_000`, a ceiling far above any realistic open backlog so gh paginates until the queue is exhausted. `fetch_open_prs` and `_list_open_numbers` both delegate to it, so the limit lives in exactly one place going forward.

Verified live against BerriAI/litellm:
- `fetch_open_prs` -> 1981 PRs (was 1000)
- `_list_open_numbers(issue)` -> 1382 issues (was 1000)
- `_list_open_numbers(pr)` -> 1981 PRs (was 1000)

Adds 7 regression tests asserting the new limit is passed, the dedicated `gh {pr,issue} list` command + fields are used per kind, bad kind raises ValueError, and both callers delegate to the shared helper.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(agent-shin): require non-mocked end-to-end QA proof for PR pass

The PR rubric previously passed any PR with a linked issue, regardless of whether it showed the fix actually working. Sample spot-check found 21/25 recent external PRs passing, including ones that linked an issue but provided zero QA evidence.

Tighten the rubric so a pass now requires BOTH:

(1) CONTEXT: a linked issue OR a clear problem description with expected-vs-actual behavior.
(2) END-TO-END QA PROOF: at least one of:
    (a) screenshot(s) of the fix working,
    (b) screen recording / video,
    (c) specific commands actually run, paired with their real output, against the real system.

Mocked unit tests, generic 'I tested it' claims, 'all tests pass' without output, and the linked issue itself are explicitly excluded from QA proof.

Also add 'qa_proof_type' to the JSON schema so the per-PR report surfaces which kind of proof (or 'none') the judge saw.

Re-sample on the same 25 recent external PRs shifts the verdict distribution from 21 pass / 4 fail to 4 pass / 21 fail, with zero prior-fails now passing: the stricter rule catches PRs that ship only with unit-test claims and no real integration evidence.
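The two-part pass rule above can be sketched as a small predicate. This is only an illustration of how the conditions combine; in the actual scripts the LLM judge makes the call. The `Verdict` class, its field names, and the proof-type labels are hypothetical (the commit message only says `qa_proof_type` reports the kind of proof or 'none').

```python
from dataclasses import dataclass

# Hypothetical labels for accepted proof kinds; the source only names
# screenshots, video, and commands paired with real output.
ACCEPTED_PROOF_TYPES = {"screenshot", "video", "commands_with_output"}


@dataclass
class Verdict:
    """Hypothetical shape of the judge's per-PR flags."""

    has_linked_issue: bool
    has_problem_description: bool
    qa_proof_type: str  # one of ACCEPTED_PROOF_TYPES, or "none"


def rubric_passes(v: Verdict) -> bool:
    # (1) CONTEXT: a linked issue OR a clear problem description.
    has_context = v.has_linked_issue or v.has_problem_description
    # (2) END-TO-END QA PROOF: mocked tests and bare claims map to "none".
    has_proof = v.qa_proof_type in ACCEPTED_PROOF_TYPES
    # Both halves are required; a linked issue alone no longer passes.
    return has_context and has_proof
```

Under this sketch a PR that links an issue but carries `qa_proof_type == "none"` fails, which is the behavior change the commit describes.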
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(agent-shin): link blog explainer from every action-required bot comment

Adds "What's this and why am I getting it?" links to docs.litellm.ai/blog/agent-shin-triage from the four comments contributors actually read when something went wrong: PR close, PR grace warning, issue close, issue grace warning. PR comments also link the rubric section directly from the QA-proof bullet so contributors can self-serve "what counts as proof" without pinging a maintainer.

Pins the new guarantees in tests: blog link must appear in all four comments, and the PR close comment must continue to flag mocked-dependency unit tests as insufficient proof.

The linked blog post is in BerriAI/litellm-docs PR #240; the URL will 404 until that lands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(review_gate): raise sweep limit from 1000 to 100000 to match GH_LIST_ALL_LIMIT

gh lists newest-first, so capping at 1000 silently drops the oldest open PRs, exactly the stale ones the daily sweep is meant to reconcile. Use the same ceiling as agent_shin_shared.GH_LIST_ALL_LIMIT so the workflow sees the entire backlog.

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* Fix three Agent Shin triage edge cases

- review_gate: expire the regression-marker short-circuit after grace_days so PRs that were regressed and then abandoned can eventually be closed.
- review_gate: when the rubric short-circuits to pass via the linked-issue regex but Greptile drags the PR below the bar, replace the synthetic 'LLM was not called' explanation with the real Greptile shortfall so regression / close comments are not misleading.
- triage_rollout_heads_up._comments_have_marker: drop the unused 'kind' parameter and filter by bot author so a contributor quoting the heads-up via 'Quote reply' cannot trick the idempotency check, matching the pattern in triage_with_llm._has_marker.
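The author-filtered idempotency check from the last bullet might look roughly like the following. This is a sketch, not the code from the PR: the comment shape (`author.login`, `body`) and the default bot login are assumptions, while the marker string is the one quoted earlier in this log.

```python
HEADS_UP_MARKER = "<!-- agent-shin:rollout-heads-up -->"


def comments_have_marker(comments, bot_login="github-actions[bot]",
                         marker=HEADS_UP_MARKER):
    """Return True only if the bot itself already posted the marker.

    `comments` is assumed to be a list of dicts shaped like
    {"author": {"login": ...}, "body": ...}; the real payload may differ.
    """
    for comment in comments:
        author = (comment.get("author") or {}).get("login", "")
        body = comment.get("body") or ""
        # A contributor quoting the heads-up carries the marker text too,
        # so the author check is what keeps the guard honest.
        if author == bot_login and marker in body:
            return True
    return False
```

Filtering on the author means a quoted copy of the heads-up from a contributor does not suppress the bot's own notice.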
Co-authored-by: Yassin Kortam <yassin@berri.ai>

* fix: pass min_greptile_score through to ready-for-review comment text

Co-authored-by: Yassin Kortam <yassin@berri.ai>

* feat(agent-shin): warmer triage comments (bullet-train emoji, 'what you got right' section, softer 'park this for later' framing)

User feedback on the auto-triage comments contributors will see:

1. Tone: the previous 'You have 1 day to address this before this PR is auto-closed' framing reads as an ultimatum. Replace with: 'If the description isn't updated in the next 1 day, I'll auto-close this PR. That's not us saying we don't care about the change; we want the open-PR list to mirror what a maintainer can act on right now, so contributors don't get lost in a backlog. A closed PR is a soft "park this for later," not a rejection. Take your time.'

2. Positive feedback: the previous comments only listed what was missing. Now every close + grace-warning comment opens with a 'What you got right:' section rendered from the judge's per-field flags. Contributors see a checkmark for everything they got right (linked issue, problem description, expected/actual, QA proof for PRs; runnable repro, screenshot/log, expected/actual, motivation+example for issues) before the gaps. The block is omitted entirely when nothing is present so we never render 'What you got right: (nothing).'

3. Reconsider trigger: the previous grace warning told contributors to comment '@agent-shin reconsider' during the grace window. They don't need to, because the bot re-checks on every sweep. The new copy says 'just update the description, no need to ping me' for the grace path, and reserves '@agent-shin reconsider' for the post-close recovery path.

4. Bullet-train emoji: replace 👋 with 🚄 (Shinkansen, the symbol of Agent Shin) across every action-required comment: PR close, PR grace warning, issue close, issue grace warning, within-grace, Greptile-closer grace warning, rollout heads-up.
   Pinned in tests so a future refactor can't silently revert.

5. Greptile-post-close: the @greptileai bullet now explicitly says 'a low Greptile score isn't a blocker either,' since the previous copy buried the fact that @greptileai works after auto-close.

Comment templates updated: format_pr_close_comment, format_issue_close_comment, format_grace_warning_pr_comment, format_grace_warning_issue_comment, format_within_grace_comment (triage_with_llm.py); format_grace_warning_comment (close_low_quality_prs.py); format_heads_up_comment header (triage_rollout_heads_up.py).

New helpers: _format_present_for_pr / _format_present_for_issue / _format_present_block, driven off the existing per-field flags the LLM judge already emits; no prompt change needed.

New tests pin: bullet-train emoji in every action-required comment; 'What you got right' appears with ✅ bullets when fields are present; the block is omitted when no fields are present; 'park this for later' / 'not a rejection' softer framing; grace warnings tell the contributor 'no need to ping' during the grace window (reconsider is the post-close path only).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(agent-shin): gate triage on a dogfood allowlist

Add ALLOWLIST_LOGINS to agent_shin_shared so Agent Shin only acts on the named accounts while the set is non-empty. mateo-berri and SwiftWinds are allowlisted for the dogfood rollout; everyone else is skipped with skip-not-allowlisted across all four entrypoints (triage, review gate, the daily low-quality sweep, and the rollout heads-up). For an allowlisted author the usual internal/external classification is bypassed, so a maintainer's own org account still gets triaged during testing. Emptying the set lifts the restriction and restores full triage for the public rollout.

The gate is dependency-injected via an `allowlist` parameter defaulting to the constant, so the internal/external-skip paths stay testable.
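A minimal sketch of the allowlist gate described above, assuming a function name (`author_gate`) and a return-string convention that are illustrative only. The logins, the `skip-not-allowlisted` and `skip-internal-author` action names, and the injected `allowlist` parameter come from this log; the internal associations are the ones listed in the PR description.

```python
ALLOWLIST_LOGINS = frozenset({"mateo-berri", "SwiftWinds"})
INTERNAL_ASSOCIATIONS = frozenset({"OWNER", "MEMBER", "COLLABORATOR"})


def author_gate(login, author_association, allowlist=ALLOWLIST_LOGINS):
    """Decide whether an item's author should be triaged (hypothetical helper)."""
    if allowlist:
        # While the set is non-empty, only the named accounts are acted on.
        if login not in allowlist:
            return "skip-not-allowlisted"
        # Allowlisted authors bypass the internal/external classification,
        # so a maintainer's own account is still triaged during dogfooding.
        return "triage"
    # Empty allowlist: fall back to the usual internal/external split.
    if author_association in INTERNAL_ASSOCIATIONS:
        return "skip-internal-author"
    return "triage"
```

Passing `allowlist=frozenset()` in a test exercises the internal/external path without touching the module-level constant, which is the point of injecting the parameter.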
* feat(agent-shin): tighten QA-proof and issue rubrics, ack reconsider with reactions

Reorder the end-to-end QA proof options to video, then screenshots, then exact commands with their real output across the PR template, the LLM judge prompts, and every contributor-facing comment, and spell out that mocked or stubbed runs (including pytest on the repo's own unit tests, which mock the provider, DB, and network) never count as proof. QA proof is now required of all contributors, not just external ones.

Tighten the issue bug-report rubric to require end-to-end evidence of the bug (the "before" half: a video, screenshot, or command paired with real output) plus expected vs. actual behavior, drop the bias toward PASS, and collapse the separate has_repro/has_proof flags into a single has_repro signal.

Standardize the bullet-train emoji and strip em dashes from the bot's public-facing messages, and route issue recovery through @agent-shin reconsider since GitHub doesn't let OSS authors reopen an issue a bot closed.

Acknowledge an @agent-shin reconsider the moment it's accepted with an eyes reaction and a thumbs-up once the run finishes, both gated on AGENT_SHIN_ENABLED so dry-run leaves no trace.

* fix(agent-shin): shorten auto-close grace to 2 hours and drop the instant-close bypass

Two dogfooding changes to the Agent Shin grace window.

First, the warn-then-close grace (GRACE_PERIOD_SECONDS) drops from a day to 2 hours so the "fix it before it closes" loop can be exercised in one sitting; the constant carries a note to bump it back up for the public rollout.

Second, remove IMMEDIATE_CLOSE_LOGINS entirely. SwiftWinds (the external dogfood account) used to skip the grace window and close on first detection, which also meant closing real PRs even during a scheduled dry run because the per-PR override flipped dry_run off. It now follows the same warn-then-close path as every other author, so a low-quality PR is warned first and only closed once the 2-hour window elapses.
This also closes the Greptile finding that the sweep could mutate real PRs while AGENT_SHIN_ENABLED was still off. The review gate's separate age-based grace (DEFAULT_GRACE_DAYS) is left unchanged.

Regression tests pin that SwiftWinds now warns-grace instead of closing instantly, and that a dry-run sweep over a closeable PR reports "would close" without making any GitHub mutation.

* fix(agent-shin): gate reconsider reopen on an Agent Shin close marker

was_closed_by_agent_shin only checked that the most recent close actor was the bot identity. That identity defaults to github-actions[bot], which is shared by every workflow in the repo (stale/duplicate sweeps included), so a contributor could @agent-shin reconsider an item another workflow closed and, if the description passed the rubric, get it reopened even though Agent Shin was never the closer.

Require a second, Agent-Shin-specific signal alongside the actor check: an auto-close comment stamped with a hidden AGENT_SHIN_CLOSE_MARKER. Both close paths (the grace-period close and the review-gate close) flow through format_pr_close_comment / format_issue_close_comment, so stamping the marker there covers every real close while leaving the grace warnings unmarked. The guard stays fail-closed: no marker, no reopen.

This also replaces the unused AGENT_SHIN_AUTO_CLOSE_MARKER constant (a visible phrase the guard never consulted) with the hidden marker the guard now relies on.

* fix(agent-shin): stamp close marker on sweep closes and disclose regression deadline

The daily Greptile sweep's close comment advertised `@agent-shin reconsider` but never stamped AGENT_SHIN_CLOSE_MARKER, so the reconsider reopen guard (was_closed_by_agent_shin), which now also requires that marker, silently rejected every sweep-closed PR with `skip-not-bot-closed`.
Move the marker into agent_shin_shared so both close paths share one source of truth, extract format_close_comment so the sweep close comment is unit-testable, and stamp the marker there.

Also disclose the grace_days deadline in the review-gate regression comment; it promised "the PR stays open" without mentioning that a still-failing PR is auto-closed grace_days after the notice, which would surprise contributors with a close they were never warned about.

* fix(triage): tighten Agent Shin reconsider reopen guards

The bot-closed guard accepted any historical Agent Shin marker comment on the thread as proof that Agent Shin owned the latest close, so a post-reopen close by another workflow under the shared `github-actions[bot]` identity could still satisfy the gate and let `@agent-shin reconsider` reopen a PR that Agent Shin did not close this cycle. `fetch_last_close_event` now also returns the latest `closed` event timestamp, and `was_closed_by_agent_shin` requires the most recent Agent Shin marker comment to sit at (or just before) that timestamp, with a small skew window for clock drift between the events and comments APIs.

In the same path the LLM verdict check used `decision != "fail"` to choose the reopen branch, which treated a missing, empty, or typo verdict as a pass. Reopen is destructive, so the check now requires an explicit `decision == "pass"` and ambiguous verdicts fall through to the "still failing" branch instead.

* style(agent-shin): black-format reconsider guard hardening

* docs(agent-shin): scope dry-run wrapper docstring to the single existing helper

The module docstring claimed it wrapped every Agent Shin mutation and referenced post_comment/close_pr/etc., but only maybe_post_comment exists. Describe the single helper accurately while keeping the dry-run pattern guidance for any future wrapper.
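The reconsider guard hardening in the `fix(triage)` entry above can be sketched as follows. The function names `was_closed_by_agent_shin` and the `decision == "pass"` check come from the commit message; the marker value, the skew size, the argument list, and the comment shape are all assumptions made for illustration, and the symmetric tolerance is a simplification of "at (or just before)".

```python
from datetime import datetime, timedelta

AGENT_SHIN_CLOSE_MARKER = "<!-- agent-shin:close -->"  # hypothetical value
SKEW = timedelta(seconds=60)  # hypothetical tolerance for API clock drift


def _parse(ts):
    # GitHub timestamps look like 2026-05-17T10:00:00Z.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def was_closed_by_agent_shin(close_actor, closed_at, comments,
                             bot_login="github-actions[bot]"):
    """Fail-closed: require bot actor AND a marker comment near the close."""
    if close_actor != bot_login or not closed_at:
        return False
    marker_times = [
        _parse(c["created_at"])
        for c in comments
        if c.get("author") == bot_login
        and AGENT_SHIN_CLOSE_MARKER in (c.get("body") or "")
    ]
    if not marker_times:
        return False  # no marker, no reopen
    # Only the most recent marker counts, and it must line up with the
    # latest close event rather than some earlier close cycle.
    return abs(_parse(closed_at) - max(marker_times)) <= SKEW


def should_reopen(decision):
    # Reopen is destructive: only an explicit "pass" qualifies.
    return decision == "pass"
```

With this shape, a marker left over from an earlier close cycle does not authorize reopening an item that another workflow closed later under the same bot identity.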
* chore(agent-shin): defer issue/PR template changes to the rollout PR

The triage and review-gate automation is gated to the allowlisted authors (mateo-berri, SwiftWinds) and AGENT_SHIN_ENABLED, so during this rollout it only acts on internal PRs/issues. The issue and PR templates have no such gate; they change for every contributor on merge and advertise that an LLM bot auto-closes external submissions, which won't happen while the allowlist is the sole author gate.

Revert bug_report.yml, feature_request.yml, and pull_request_template.md to base so the public-facing messaging lands with the rollout flip instead of ahead of it. The scripts embed their own rubric and never read these files, so triage behavior is unchanged.

* ci(agent-shin): hash-pin the openai install in privileged triage workflows

The triage workflows install the OpenAI client with `pip install "openai>=1.40.0"`, a floating lower bound that resolves openai and its whole transitive tree to whatever PyPI serves at run time. These jobs run under pull_request_target with a write-scoped GITHUB_TOKEN, and the install plus the triage run happen on every PR open regardless of the AGENT_SHIN_ENABLED dry-run gate (that gate only withholds the LLM key and the destructive --close path), so a compromised release would execute during install or import while the token is in scope.

Install instead from a new .github/scripts/triage-requirements.txt that pins openai==2.33.0 and every transitive dependency to an exact version with sha256 hashes, via pip --require-hashes. The workflows already sparse-checkout .github/scripts from the base repo (never fork code), so the pinned file is trusted.

Add static guardrails to test_github_triage_workflows.py that fail if any installer workflow reverts to a floating openai install or if the requirements file loses its exact pins or hashes.
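A static guardrail of the kind described above could be approximated like this. It is a sketch, not the test that ships in test_github_triage_workflows.py: the function name and regex are hypothetical, and it assumes the usual hash-pinned requirements layout (`name==version \` followed by indented `--hash=sha256:...` continuation lines) without handling inline comments.

```python
import re

# Hypothetical check: a requirement must start with `name==version`.
EXACT_PIN = re.compile(r"^[A-Za-z0-9._-]+(\[[^\]]+\])?==\S+")


def unpinned_requirements(text):
    """Return requirement specs lacking an exact pin or a sha256 hash."""
    # Join backslash continuations so each requirement is one logical line.
    logical = text.replace("\\\n", " ")
    problems = []
    for raw in logical.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comment lines
        if line.startswith("-"):
            continue  # global options such as --index-url
        if not EXACT_PIN.match(line) or "--hash=sha256:" not in line:
            problems.append(line.split()[0])
    return problems
```

A test would read the requirements file and assert that the returned list is empty, so a revert to a floating spec such as `openai>=1.40.0` fails CI before it reaches a privileged workflow.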
* ci(agent-shin): gate rollout heads-up real run behind manual dispatch

The rollout heads-up workflow fired its real `--close` sweep on every push to litellm_internal_staging that touched the script, and exposed OPENAI_API_KEY unconditionally, unlike every sibling triage workflow which only exposes the key on an enabled or dispatched run. That made merging the script post real heads-up comments (bounded only by the dogfood allowlist), which contradicts the inert-by-default safety invariant; once the allowlist is cleared for the public rollout, any later edit to the file would sweep the whole open backlog with real writes.

The heads-up cannot be gated on AGENT_SHIN_ENABLED: its whole job is to warn contributors before that flag flips on, so it has to run while the flag is still off. Instead the automatic push trigger now stays dry-run, and the real one-shot sweep is a deliberate manual workflow_dispatch with dry_run=false, the sole path that adds `--close`. OPENAI_API_KEY is exposed only on that dispatch, matching the sibling workflows.

Add static guardrails that fail if the push path regains a `--close`, if the dispatch gate stops fail-closing on the exact string "false", or if the key is exposed unconditionally again.

---------

Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: Mateo Wang <mateo-berri@users.noreply.github.com>
Co-authored-by: Yassin Kortam <yassin@berri.ai>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Mateo <mateo@Mateos-MacBook-Pro.local>
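The fail-closed dispatch gate from the last commit above lives in workflow YAML, but its logic can be mirrored in Python to show what "fail-closing on the exact string "false"" means. The `sweep_args` helper is hypothetical; the script path, `--repo`, and `--close` flags are the ones quoted earlier in this log.

```python
def sweep_args(event_name, dry_run_input, repo="BerriAI/litellm"):
    """Build the heads-up command line; only one path may add --close."""
    args = [
        "python3",
        ".github/scripts/triage_rollout_heads_up.py",
        "--repo",
        repo,
    ]
    # Real writes require BOTH a manual dispatch and the exact string
    # "false". Anything else (push events, "False", empty, None) stays
    # on the dry-run path.
    if event_name == "workflow_dispatch" and dry_run_input == "false":
        args.append("--close")
    return args
```

Comparing against the exact string, rather than treating any non-"true" value as permission, means a typo or a missing input cannot trigger a real sweep.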
Automated staging PR created by litellm-agent.
This branch collects PRs approved by the agent on 5/17/2026.