fix(visualize): unblock Gemini 2.5+ and harden Visualize pipeline #490

Open: skinred78 wants to merge 1 commit into HKUDS:dev from skinred78:fix/gemini-2.5-thinking-tokens

@skinred78

Fixes #489.

Summary

Six bugs that combined to break the Visualize capability on Gemini 2.5 Flash (and similar thinking-by-default models). Each is independently useful, but the user-visible symptom — Visualize output randomly truncated at ~370 chars — needs all of them.

1. Gemini 2.5/3.x reasoning-tokens default (root cause)

Gemini 2.5+ models burn most of max_tokens on internal "thinking" tokens by default. With max_tokens=4096, ~3900 tokens went to reasoning and ~160 came out as visible content, causing finish_reason=length on every multi-step pipeline (Visualize codegen + review, Deep Solve writing, etc.). See #489 for the reproducer with curl directly against Gemini's OpenAI-compat endpoint.

Default reasoning_effort="none" for Gemini 2.5/3.x when the caller doesn't specify, in all three execution paths:

  • provider_core/openai_compat_provider.py:_build_kwargs (live path)
  • executors.py:sdk_complete / sdk_stream (SDK path)
  • cloud_provider.py:_openai_complete / _openai_stream (aiohttp fallback)

Callers that want thinking can still opt in via explicit reasoning_effort.
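
The defaulting rule described above can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: the helper name `apply_reasoning_default` and the model-name pattern are assumptions made for the example, and the real `_build_kwargs` may detect Gemini models differently.

```python
import re
from typing import Any

# Hypothetical pattern (an assumption for this sketch): matches model names
# such as "gemini-2.5-flash" or "gemini-3-pro", but not "gemini-2.0-flash".
_THINKING_GEMINI = re.compile(r"gemini-(2\.5|3)\b", re.IGNORECASE)


def apply_reasoning_default(model: str, kwargs: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of kwargs with reasoning_effort defaulted for Gemini 2.5/3.x.

    An explicit reasoning_effort supplied by the caller is never overwritten.
    """
    out = dict(kwargs)
    if _THINKING_GEMINI.search(model) and "reasoning_effort" not in out:
        out["reasoning_effort"] = "none"
    return out
```

Returning a copy keeps the caller's dict untouched, so the same helper could be reused across the three execution paths without one path's default leaking into another.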

2. visualize capability had no agents.yaml entry

get_agent_params("visualize") silently fell through to the 4096-token default because there was no entry in DEFAULT_AGENTS_SETTINGS or loader.py:section_map. Added both, with a 16384-token budget appropriate for full HTML pages.

3. Review stage crashed hard on JSON parse failure

ReviewAgent.process does ReviewResult.model_validate(extract_json_object(response)). When the model returned prose instead of JSON (common with large SVGs that don't escape cleanly into a JSON string field), the parse raised and killed the entire Visualize turn — the user saw the streamed SVG draft and then a traceback. Wrapped pipeline.run_review() in try/except so a parse failure falls back to the unreviewed draft and the user still gets a rendered result.
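
The fallback can be sketched as below. This is an illustration under assumptions, not the PR's code: `parse_review` is a stand-in for the extract-and-validate step, the `"code"` field name is hypothetical, and the real pipeline may catch a different set of exceptions.

```python
import json
import logging

logger = logging.getLogger("visualize_sketch")


def parse_review(response: str) -> dict:
    """Stand-in for JSON extraction plus validation; raises ValueError on prose."""
    data = json.loads(response)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict) or "code" not in data:
        raise ValueError("review response has no 'code' field")
    return data


def review_or_fallback(draft: str, response: str) -> str:
    """Return the reviewed code, or the unreviewed draft if parsing fails."""
    try:
        return parse_review(response)["code"]
    except ValueError as exc:
        # Degrade gracefully: the user still gets a rendered result.
        logger.warning("Review parse failed, using unreviewed draft: %s", exc)
        return draft
```

Logging the failure rather than swallowing it silently keeps the degraded path visible when debugging.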

4. Codegen output not trimmed to the root tag

Models often wrap SVG/HTML in prose ("Here you go: … Enjoy!") or emit a closing code fence on the same line as </html>, which extract_code_block's regex (requiring a leading \n before the fence) doesn't strip. Added defensive root-tag trimming for render_type=="svg" and render_type=="html" so the renderer always sees a clean payload.
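
A root-tag trim along these lines might look like the sketch below. The function name `trim_to_root` and the regular expressions are assumptions made for illustration; the actual implementation in the PR may differ, for example in how it treats a leading doctype.

```python
import re

# Greedy patterns: from the first opening root tag to the last closing one.
_ROOT_PATTERNS = {
    "svg": re.compile(r"<svg\b.*</svg>", re.IGNORECASE | re.DOTALL),
    "html": re.compile(
        r"(?:<!doctype html[^>]*>\s*)?<html\b.*</html>",
        re.IGNORECASE | re.DOTALL,
    ),
}


def trim_to_root(code: str, render_type: str) -> str:
    """Strip prose or stray fences around the root element.

    Returns the input unchanged for other render types or when no root
    element is found, so the trim is a no-op on already-clean payloads.
    """
    pattern = _ROOT_PATTERNS.get(render_type)
    if pattern is None:
        return code
    match = pattern.search(code)
    return match.group(0) if match else code
```

For example, `trim_to_root("Here you go: <svg><circle/></svg> Enjoy!", "svg")` yields `<svg><circle/></svg>`.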

Test plan

Verified end-to-end on Gemini 2.5 Flash via the CLI and via headless Playwright:

  • Long-division HTML test prompt: 22 KB self-contained interactive page (was: 0 bytes / crash on dev).

  • All step prompts read correctly (e.g. "How many whole times does 6 go into 7?" for the first digit).

  • Walking the full 7852 ÷ 6 algorithm step by step terminates with Quotient: 1308, Remainder: 4.

  • Wrong-answer retry preserves prior progress; three wrong attempts reveal the answer and auto-advance (graceful-fallback path exercised).

  • Simple SVG (e.g. "draw 8 cookies") still renders unchanged — no regression on the happy path.

  • The defensive root-tag trim is a no-op when codegen already returns a clean tag.

  • Maintainer: please run pre-commit run --all-files and CI as a sanity check; my local environment doesn't have all the lint deps configured.

Six bugs that combined to break the Visualize capability on Gemini 2.5
Flash (and similar thinking-by-default models). Each is independently
useful, but the user-visible symptom — Visualize "kind of works but
output is randomly truncated at ~370 chars" — needs all of them.

1. Gemini 2.5/3.x reasoning-tokens default (root cause)

   Gemini 2.5+ models burn most of `max_tokens` on internal "thinking"
   tokens by default. With `max_tokens=4096`, ~3900 went to reasoning
   and only ~160 came out as actual content, causing finish_reason=length
   on every multi-step pipeline (Visualize codegen + review, Deep Solve,
   anything that asks for a structured output beyond a sentence).

   Default `reasoning_effort="none"` for Gemini 2.5/3.x models when the
   caller doesn't specify, in all three execution paths:
   - provider_core/openai_compat_provider.py:_build_kwargs (live path)
   - executors.py:sdk_complete / sdk_stream (legacy SDK path)
   - cloud_provider.py:_openai_complete / _openai_stream (aiohttp fallback)

2. visualize capability had no agents.yaml entry

   `get_agent_params("visualize")` silently fell through to the 4096
   default because there was no section_map entry and no
   DEFAULT_AGENTS_SETTINGS entry. Added both, with a 16384-token budget
   appropriate for full HTML pages.

3. Review stage crashed hard on JSON parse failure

   `ReviewAgent.process` does `ReviewResult.model_validate(extract_json_object(response))`.
   When the model returned prose instead of JSON (common with large SVGs
   that the model can't escape into a JSON string), the parse raised and
   killed the entire turn. Wrapped pipeline.run_review() in try/except
   so review failure falls back to the unreviewed draft and the user
   still gets a rendered result.

4. Codegen output not trimmed to the root tag

   Models often wrap SVG/HTML in prose ("Here you go: <svg>…</svg>
   Enjoy!") or emit a closing code fence on the same line as `</html>`,
   which `extract_code_block`'s regex (requiring a leading \n before
   the fence) doesn't strip. Added defensive root-tag trimming for
   render_type=="svg" and render_type=="html".

Verified end-to-end on Gemini 2.5 Flash via the CLI and headless
Playwright: full 22 KB long-division HTML page, no truncation, all
interactive elements present, multi-step walkthrough completes
correctly (7852 ÷ 6 → 1308 R 4).