Scope
Both image.generate and vision.analyze default to https://api.openai.com/v1 and have no provider fallback. If OPENAI_API_KEY is absent, they fail silently, returning a simulated: true metadata flag rather than trying available alternatives. For self-hosted or non-OpenAI deployments this is a hard blocker.
Current state
packages/tools/src/image-gen.ts lines 73–74:
const baseUrl = ... ?? options?.providerBaseUrl ?? 'https://api.openai.com/v1';
const model = ... ?? options?.model ?? 'dall-e-3';
packages/tools/src/vision.ts lines 108–109:
const baseUrl = ... ?? options?.providerBaseUrl ?? 'https://api.openai.com/v1';
const model = ... ?? options?.model ?? 'gpt-4o';
Both accept a providerBaseUrl override, which means they are already compatible with any OpenAI-compatible endpoint (Together.ai, Groq, Mistral for vision; Stability's OpenAI-compat endpoint for image gen). The gap is that the tool silently returns simulated: true rather than trying the next configured provider.
The voice.tts tool (packages/tools/src/voice.ts lines 41–87) already implements a two-backend cascade (edge-tts CLI → OpenAI API). The same pattern should apply here.
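A minimal sketch of what such a cascade could look like, for discussion only. This is not the code in voice.ts: the Backend and CascadeResult types and the runCascade helper are hypothetical names, and falling through to the next backend on a runtime error (not just on a missing key) is an assumption.

```typescript
// Hypothetical shapes for illustration; not the actual types in packages/tools.
interface Backend<T> {
  name: string;
  isAvailable: () => boolean; // e.g. "is the API key for this provider set?"
  run: () => Promise<T>;
}

type CascadeResult<T> =
  | { ok: true; output: T; metadata: { provider: string } }
  | { ok: false; error: string };

// Tries each backend in order and returns the first success.
// Assumption: a backend that throws is skipped, like one that is not configured.
async function runCascade<T>(backends: Backend<T>[]): Promise<CascadeResult<T>> {
  const errors: string[] = [];
  for (const backend of backends) {
    if (!backend.isAvailable()) {
      errors.push(`${backend.name}: not configured`);
      continue;
    }
    try {
      const output = await backend.run();
      return { ok: true, output, metadata: { provider: backend.name } };
    } catch (err) {
      errors.push(`${backend.name}: ${err instanceof Error ? err.message : String(err)}`);
    }
  }
  return { ok: false, error: `No backend succeeded (${errors.join('; ')})` };
}
```

Keeping the backend list as data means adding a provider is one more array entry, and the provider name for metadata comes for free.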
Proposed
- For vision.analyze: when OPENAI_API_KEY is absent, check GOOGLE_API_KEY and call Gemini Flash (https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-flash:generateContent). OpenAI and Gemini both support the same base64-image-in-request pattern.
- For image.generate: when OPENAI_API_KEY is absent, check REPLICATE_API_TOKEN and call stability-ai/sdxl via Replicate's REST API. This avoids a forced OpenAI dependency.
- Expose the active provider in metadata.provider for observability.
- Do NOT introduce new required dependencies: both the Gemini and Replicate calls use plain fetch.
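The vision.analyze half of the proposal could be sketched roughly as follows. The helper names (pickVisionProvider, buildGeminiRequest) are hypothetical; only the endpoint URL and the env var names come from the list above. The request body shape (contents, parts, inline_data) and the x-goog-api-key header follow the public Gemini REST API as I understand it and should be checked against the current docs before implementing. The Replicate call for image.generate would follow the same pattern and is not sketched here.

```typescript
// Hypothetical helper names; the URL and env var names are taken from the proposal.
type VisionProvider = 'openai' | 'gemini';

const GEMINI_URL =
  'https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-flash:generateContent';

// OpenAI stays first so existing deployments keep their current behaviour.
function pickVisionProvider(env: Record<string, string | undefined>): VisionProvider | null {
  if (env['OPENAI_API_KEY']) return 'openai';
  if (env['GOOGLE_API_KEY']) return 'gemini';
  return null; // caller should return ok: false, not simulated: true
}

// Builds plain-fetch arguments for one prompt plus one base64-encoded image.
function buildGeminiRequest(
  apiKey: string,
  prompt: string,
  imageBase64: string,
  mimeType: string,
) {
  return {
    url: GEMINI_URL,
    init: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json', 'x-goog-api-key': apiKey },
      body: JSON.stringify({
        contents: [
          {
            parts: [
              { text: prompt },
              { inline_data: { mime_type: mimeType, data: imageBase64 } },
            ],
          },
        ],
      }),
    },
  };
}
```

A caller would then do something like `const { url, init } = buildGeminiRequest(key, prompt, data, 'image/png'); const res = await fetch(url, init);`, so no new dependency is needed.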
Cost justification: DALL-E 3 at $0.040/image vs SDXL via Replicate at ~$0.003/image is roughly a 13x cost difference for bulk generation, so the fallback has real operator value.
Acceptance
- With OPENAI_API_KEY absent and GOOGLE_API_KEY set, vision.analyze calls Gemini, not OpenAI.
- Metadata includes { provider: 'openai' | 'gemini' | 'replicate' }.
- With no keys configured, the tool returns ok: false with a clear message listing which env vars to set (not simulated: true with ok: true).
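To make the last criterion concrete, here is one possible shape for the no-keys failure. The ToolResult interface, the REQUIRED_KEYS table, and the exact message wording are assumptions for illustration; only ok, metadata.provider, and the env var names come from this issue.

```typescript
// Hypothetical result envelope; the real tool result type may differ.
type Provider = 'openai' | 'gemini' | 'replicate';

interface ToolResult {
  ok: boolean;
  error?: string;
  metadata?: { provider: Provider };
}

// Env vars each tool can use, in cascade order (taken from the Proposed section).
const REQUIRED_KEYS: Record<string, string[]> = {
  'vision.analyze': ['OPENAI_API_KEY', 'GOOGLE_API_KEY'],
  'image.generate': ['OPENAI_API_KEY', 'REPLICATE_API_TOKEN'],
};

// Returned when no provider key is configured: an explicit failure, never simulated output.
function noProviderResult(tool: string): ToolResult {
  const keys = REQUIRED_KEYS[tool] ?? [];
  return {
    ok: false,
    error: `${tool}: no provider configured. Set one of: ${keys.join(', ')}.`,
  };
}
```

An acceptance test can then assert on ok === false and on the env var names appearing in the message, rather than on the exact wording.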
References
- Internal: packages/tools/src/image-gen.ts lines 72–84
- Internal: packages/tools/src/vision.ts lines 107–125
- Internal: packages/tools/src/voice.ts lines 41–87 (cascade pattern to follow)