Gemini CLI should utilize 'pipelining' of write -> validate inner loop #20093

@gundermanc

Analysis of the Terminal Bench suite of evals yielded #19574, which uses a 'pipelining' strategy: optimizing the outputs of specific sequences of tool calls to eliminate turns. That strategy reduced turn count by at least 10% on SWEBench with Gemini 3 flash, and the savings may be significantly greater with Gemini 3.1 pro preview.

This issue tracks another pipelining change: optimizing the write_file -> validate sequence the agent performs during its inner-loop validation.

Currently the agent writes the file and then has to explicitly remember to build and/or validate after each edit. This is time-consuming, takes up the agent's attention, and is error-prone.

Instead, I propose dynamic validation hooks that the agent can register on the fly during a session. Each hook takes a file path pattern and a validation to run. After every write_file to a matching path, the validation runs and any non-success exit codes are returned in the write_file response.
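
To make the proposal concrete, here is a minimal sketch of how such a hook registry could behave. It is not the Gemini CLI implementation: `ValidationHook`, `HookRegistry`, the `{path}` placeholder, and the shape of the returned dictionary are all hypothetical names chosen for illustration, and glob matching via `fnmatch` is an assumption about what "file path pattern" means.

```python
import fnmatch
import subprocess
from dataclasses import dataclass, field


@dataclass
class ValidationHook:
    """Hypothetical hook: a glob pattern plus a command to run on matching writes."""
    pattern: str        # glob matched against the written path (assumption)
    command: list[str]  # argv; the literal "{path}" is replaced with the written path


@dataclass
class HookRegistry:
    """Hypothetical session-scoped registry the agent adds hooks to on the fly."""
    hooks: list[ValidationHook] = field(default_factory=list)

    def register(self, pattern: str, command: list[str]) -> None:
        self.hooks.append(ValidationHook(pattern, command))

    def write_file(self, path: str, content: str) -> dict:
        """Write the file, then run every matching hook and report failures."""
        with open(path, "w", encoding="utf-8") as f:
            f.write(content)

        failures = []
        for hook in self.hooks:
            if not fnmatch.fnmatch(path, hook.pattern):
                continue
            argv = [arg.replace("{path}", path) for arg in hook.command]
            result = subprocess.run(argv, capture_output=True, text=True)
            # Only non-success exit codes are surfaced in the write response.
            if result.returncode != 0:
                failures.append({
                    "command": argv,
                    "exit_code": result.returncode,
                    "output": (result.stdout + result.stderr).strip(),
                })
        return {"path": path, "validation_failures": failures}
```

In this sketch the agent would call something like `registry.register("*.py", [sys.executable, "-m", "py_compile", "{path}"])` once, and every later write to a `.py` file would carry compile errors back in the same response, removing the separate validation turn.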

The benefit is an estimated 5-9% reduction in turns, faster scenario completion, and more systematic and thorough validation and feedback, particularly across long-running sessions.

 Revised Methodology
  A turn ($T_{i+1}$) is classified as a "validation" step if it immediately follows a turn ($T_i$) where the agent called write_file, and meets one of the following criteria:
   1. Shell Validation: $T_{i+1}$ executes a shell command containing explicit build, test, or execution keywords (e.g., python, make, gcc, pytest).
   2. Read Verification: $T_{i+1}$ calls `read_file` on the exact file path that was just written in $T_i$, indicating a content integrity check.
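
The two criteria above can be sketched as a small classifier over a turn log. This is an illustration only, not the script used for the analysis: the `Turn` record, the `SHELL_TOOL` name, and the substring keyword match are assumptions, and the keyword tuple contains just the examples listed above.

```python
from dataclasses import dataclass
from typing import Optional

# Example keywords from the methodology; the full list used in the analysis is not given.
VALIDATION_KEYWORDS = ("python", "make", "gcc", "pytest")
SHELL_TOOL = "shell"  # placeholder tool name (assumption)


@dataclass
class Turn:
    """Hypothetical record of one agent turn."""
    tool: str
    path: Optional[str] = None     # set for write_file / read_file
    command: Optional[str] = None  # set for shell turns


def is_validation_turn(prev: Turn, cur: Turn) -> bool:
    """True if `cur` validates a file written in the immediately preceding turn."""
    if prev.tool != "write_file":
        return False
    # Criterion 1: shell command containing a build, test, or execution keyword.
    if cur.tool == SHELL_TOOL and cur.command:
        return any(keyword in cur.command for keyword in VALIDATION_KEYWORDS)
    # Criterion 2: read_file on the exact path that was just written.
    if cur.tool == "read_file":
        return cur.path is not None and cur.path == prev.path
    return False


def validation_share(turns: list[Turn]) -> tuple[int, float]:
    """Return (validation turn count, fraction of all turns)."""
    count = sum(is_validation_turn(a, b) for a, b in zip(turns, turns[1:]))
    return count, (count / len(turns) if turns else 0.0)
```

Note that a generic command such as `mkdir` following a write_file matches neither criterion, so it is not counted.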

  Refined Results (Baseline)

  ┌────────────────────────────┬───────┐
  │ Metric                     │ Value │
  ├────────────────────────────┼───────┤
  │ Total Agent Turns          │ 2,735 │
  │ Confirmed Validation Turns │ 264   │
  │ Refined Percentage         │ 9.65% │
  └────────────────────────────┴───────┘

  Breakdown:
   - Shell Validation (Build/Test/Run): 259 turns.
   - Read Verification (Content Check): 5 turns.
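
  As a quick consistency check on the reported figures, the breakdown sums to the confirmed total and the percentage follows directly from the two counts:

  $$259 + 5 = 264, \qquad \frac{264}{2735} \approx 0.0965 = 9.65\%$$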


  This shows that nearly 1 in 10 turns taken by the agent in the baseline run is a direct validation of code it just wrote. The slight decrease from my previous estimate (10.82% to 9.65%) comes from excluding generic shell commands (like mkdir or cd) that followed a write_file but did not actually perform a validation.

Metadata

Labels

- area/agent: Issues related to Core Agent, Tools, Memory, Sub-Agents, Hooks, Agent Quality
- kind/customer-issue: Issues that were reported by customers
- priority/p2: Important but can be addressed in a future release.
- status/bot-triaged
- 🔒 maintainer only: ⛔ Do not contribute. Internal roadmap item.
