fix(visualize): unblock Gemini 2.5+ and harden Visualize pipeline #490
Open
skinred78 wants to merge 1 commit into
Six bugs that combined to break the Visualize capability on Gemini 2.5
Flash (and similar thinking-by-default models). Each is independently
useful, but the user-visible symptom — Visualize "kind of works but
output is randomly truncated at ~370 chars" — needs all of them.
1. Gemini 2.5/3.x reasoning-tokens default (root cause)
Gemini 2.5+ models burn most of `max_tokens` on internal "thinking"
tokens by default. With `max_tokens=4096`, ~3900 went to reasoning
and only ~160 came out as actual content, causing finish_reason=length
on every multi-step pipeline (Visualize codegen + review, Deep Solve,
anything that asks for a structured output beyond a sentence).
Fix: default to `reasoning_effort="none"` for Gemini 2.5/3.x models when the
caller doesn't specify one, in all three execution paths:
- provider_core/openai_compat_provider.py:_build_kwargs (live path)
- executors.py:sdk_complete / sdk_stream (legacy SDK path)
- cloud_provider.py:_openai_complete / _openai_stream (aiohttp fallback)
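A minimal sketch of the defaulting rule described in item 1, assuming the request arguments are held in a plain dict. The helper name `apply_reasoning_default` and the model-name regex are hypothetical; the PR does not show how the three code paths actually detect Gemini 2.5/3.x models.

```python
import re

# Hypothetical pattern (an assumption, not taken from the PR): matches names
# such as "gemini-2.5-flash" or "models/gemini-3-pro", but not "gemini-2.0-flash".
_GEMINI_THINKING = re.compile(r"(?:^|/)gemini-(?:2\.5|3)(?:\b|[.-])")


def apply_reasoning_default(model: str, kwargs: dict) -> dict:
    """Return a copy of kwargs with reasoning_effort defaulted to "none"
    for Gemini 2.5/3.x, unless the caller already set it."""
    out = dict(kwargs)
    if "reasoning_effort" not in out and _GEMINI_THINKING.search(model.lower()):
        out["reasoning_effort"] = "none"
    return out


if __name__ == "__main__":
    # No explicit setting: the default is injected.
    print(apply_reasoning_default("gemini-2.5-flash", {"max_tokens": 4096}))
    # An explicit opt-in is left untouched.
    print(apply_reasoning_default("gemini-2.5-flash", {"reasoning_effort": "high"}))
```

Checking for the key rather than its value keeps an explicit caller choice intact, which matches the opt-in behaviour the PR describes.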
2. visualize capability had no agents.yaml entry
`get_agent_params("visualize")` silently fell through to the 4096
default because there was no section_map entry and no
DEFAULT_AGENTS_SETTINGS entry. Added both, with a 16384-token budget
appropriate for full HTML pages.
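The silent fall-through in item 2 can be illustrated with a toy lookup. This is only a sketch: `lookup_agent_params` stands in for the project's `get_agent_params`, and the settings layout (including the `solve` entry and its value) is assumed for illustration, not copied from the repository.

```python
FALLBACK_MAX_TOKENS = 4096  # the default that "visualize" used to inherit


def lookup_agent_params(name: str, settings: dict) -> dict:
    """Toy stand-in for get_agent_params: unknown names get the fallback budget."""
    return dict(settings.get(name, {"max_tokens": FALLBACK_MAX_TOKENS}))


# Assumed layout of the defaults before the fix (the "solve" entry is made up).
before = {"solve": {"max_tokens": 8192}}
# After the fix: an explicit "visualize" entry with the 16384-token budget.
after = {**before, "visualize": {"max_tokens": 16384}}

if __name__ == "__main__":
    print(lookup_agent_params("visualize", before)["max_tokens"])
    print(lookup_agent_params("visualize", after)["max_tokens"])
```

Because a missing entry raises no error, the only symptom of the old behaviour was the smaller budget.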
3. Review stage crashed hard on JSON parse failure
`ReviewAgent.process` does `ReviewResult.model_validate(extract_json_object(response))`.
When the model returned prose instead of JSON (common with large SVGs
that the model can't escape into a JSON string), the parse raised and
killed the entire turn. Wrapped pipeline.run_review() in try/except
so review failure falls back to the unreviewed draft and the user
still gets a rendered result.
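The fallback in item 3 might look roughly like the following. It is a synchronous sketch under stated assumptions: `review_or_fallback` and the `run_review` callable are hypothetical, the real `pipeline.run_review()` may be async, and catching broad `Exception` is a guess because the PR does not say which exceptions are handled.

```python
import json
import logging

logger = logging.getLogger(__name__)


def review_or_fallback(draft: str, run_review) -> str:
    """Return the reviewed code, or the unreviewed draft if the review fails."""
    try:
        return run_review(draft)
    except Exception:  # assumption: any review failure falls back to the draft
        logger.warning("review failed; using unreviewed draft", exc_info=True)
        return draft


def prose_review(draft: str) -> str:
    # Simulates a model that answers in prose instead of JSON.
    return json.loads("Looks good to me!")["code"]


if __name__ == "__main__":
    print(review_or_fallback("<svg></svg>", prose_review))
```

Logging the exception keeps the parse failure visible to maintainers while the user still sees a rendered result.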
4. Codegen output not trimmed to the root tag
Models often wrap SVG/HTML in prose ("Here you go: <svg>…</svg>
Enjoy!") or emit a closing code fence on the same line as `</html>`,
which `extract_code_block`'s regex (requiring a leading \n before
the fence) doesn't strip. Added defensive root-tag trimming for
render_type=="svg" and render_type=="html".
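One way to implement the root-tag trimming from item 4 is sketched below. The function name `trim_to_root_tag` is hypothetical and the PR's actual implementation is not shown; keeping a leading `<!DOCTYPE ...>` for HTML is an extra assumption made here so that trimmed pages still render in standards mode.

```python
import re


def trim_to_root_tag(text: str, render_type: str) -> str:
    """Cut text down to <svg>...</svg> or <html>...</html>; no-op otherwise."""
    if render_type not in ("svg", "html"):
        return text
    tag = render_type
    opening = re.search(rf"<{tag}\b", text, re.IGNORECASE)
    closings = list(re.finditer(rf"</{tag}\s*>", text, re.IGNORECASE))
    if not opening or not closings or closings[-1].end() <= opening.start():
        return text  # no complete root element found; leave the text alone
    start = opening.start()
    if tag == "html":
        # Assumption: keep a doctype that precedes the root tag.
        doctype = re.search(r"<!doctype\b", text[:start], re.IGNORECASE)
        if doctype:
            start = doctype.start()
    return text[start:closings[-1].end()]


if __name__ == "__main__":
    wrapped = 'Here you go: <svg width="8"><circle/></svg> Enjoy!'
    print(trim_to_root_tag(wrapped, "svg"))
```

Using the last closing tag rather than the first keeps nested or repeated elements inside the payload, and already-clean input passes through unchanged.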
Verified end-to-end on Gemini 2.5 Flash via the CLI and headless
Playwright: full 22 KB long-division HTML page, no truncation, all
interactive elements present, multi-step walkthrough completes
correctly (7852 ÷ 6 → 1308 R 4).
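The walkthrough result quoted above can be checked independently of the generated page. The sketch below is not part of the PR; `long_division_steps` is a hypothetical helper that performs digit-by-digit long division, the procedure the interactive page steps through.

```python
def long_division_steps(dividend: int, divisor: int):
    """Return (partial dividend, quotient digit, remainder) for each digit."""
    steps = []
    remainder = 0
    for digit in str(dividend):
        current = remainder * 10 + int(digit)  # bring down the next digit
        q, remainder = divmod(current, divisor)
        steps.append((current, q, remainder))
    return steps


if __name__ == "__main__":
    steps = long_division_steps(7852, 6)
    quotient = int("".join(str(q) for _, q, _ in steps))
    print(steps)                    # [(7, 1, 1), (18, 3, 0), (5, 0, 5), (52, 8, 4)]
    print(quotient, steps[-1][2])   # 1308 4
```

The quotient digits 1, 3, 0, 8 and final remainder 4 agree with the 1308 R 4 reported for the end-to-end run.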
Fixes #489.
Summary
Six bugs that combined to break the Visualize capability on Gemini 2.5 Flash (and similar thinking-by-default models). Each is independently useful, but the user-visible symptom — Visualize output randomly truncated at ~370 chars — needs all of them.
1. Gemini 2.5/3.x reasoning-tokens default (root cause)
Gemini 2.5+ models burn most of `max_tokens` on internal "thinking" tokens by default. With `max_tokens=4096`, ~3900 tokens went to reasoning and ~160 came out as visible content, causing `finish_reason=length` on every multi-step pipeline (Visualize codegen + review, Deep Solve writing, etc.). See #489 for the reproducer with `curl` directly against Gemini's OpenAI-compat endpoint.
Default `reasoning_effort="none"` for Gemini 2.5/3.x when the caller doesn't specify, in all three execution paths:
- `provider_core/openai_compat_provider.py:_build_kwargs` (live path)
- `executors.py:sdk_complete` / `sdk_stream` (SDK path)
- `cloud_provider.py:_openai_complete` / `_openai_stream` (aiohttp fallback)
Callers that want thinking can still opt in via explicit `reasoning_effort`.
2. `visualize` capability had no `agents.yaml` entry
`get_agent_params("visualize")` silently fell through to the 4096-token default because there was no entry in `DEFAULT_AGENTS_SETTINGS` or `loader.py:section_map`. Added both, with a 16384-token budget appropriate for full HTML pages.
3. Review stage crashed hard on JSON parse failure
`ReviewAgent.process` does `ReviewResult.model_validate(extract_json_object(response))`. When the model returned prose instead of JSON (common with large SVGs that don't escape cleanly into a JSON string field), the parse raised and killed the entire Visualize turn: the user saw the streamed SVG draft and then a traceback. Wrapped `pipeline.run_review()` in try/except so a parse failure falls back to the unreviewed draft and the user still gets a rendered result.
4. Codegen output not trimmed to the root tag
Models often wrap SVG/HTML in prose ("Here you go: <svg>…</svg> Enjoy!") or emit a closing code fence on the same line as `</html>`, which `extract_code_block`'s regex (requiring a leading `\n` before the fence) doesn't strip. Added defensive root-tag trimming for `render_type=="svg"` and `render_type=="html"` so the renderer always sees a clean payload.
Test plan
Verified end-to-end on Gemini 2.5 Flash via the CLI and via headless Playwright:
- Long-division HTML test prompt: 22 KB self-contained interactive page (was: 0 bytes / crash on `dev`).
- All step prompts read correctly (e.g. "How many whole times does 6 go into 7?" for the first digit).
- Walking the full 7852 ÷ 6 algorithm step by step terminates with `Quotient: 1308, Remainder: 4`.
- Wrong-answer retry preserves prior progress; 3 wrong attempts reveal the answer and auto-advance (graceful-fallback path exercised).
- Simple SVG (e.g. "draw 8 cookies") still renders unchanged, so there is no regression on the happy path.
- The defensive root-tag trim is a no-op when codegen already returns a clean tag.
Maintainer: please run `pre-commit run --all-files` and CI as a sanity check; my local environment doesn't have all the lint deps configured.