feat(tools): image.generate and vision.analyze add multi-provider fallback (Replicate, Gemini) #288

Description

@subinium

Scope

Both image.generate and vision.analyze default to https://api.openai.com/v1 and have no provider fallback. If OPENAI_API_KEY is absent, they fail silently, returning a simulated: true metadata flag rather than trying an available alternative. For self-hosted or non-OpenAI deployments this is a hard blocker.

Current state

packages/tools/src/image-gen.ts lines 73–74:

const baseUrl = ... ?? options?.providerBaseUrl ?? 'https://api.openai.com/v1';
const model = ... ?? options?.model ?? 'dall-e-3';

packages/tools/src/vision.ts lines 108–109:

const baseUrl = ... ?? options?.providerBaseUrl ?? 'https://api.openai.com/v1';
const model = ... ?? options?.model ?? 'gpt-4o';

Both accept a providerBaseUrl override, which means they are already compatible with any OpenAI-compatible endpoint (Together.ai, Groq, Mistral for vision; Stability's OpenAI-compat endpoint for image gen). The gap is that the tool silently returns simulated: true rather than trying the next configured provider.

The voice.tts tool (packages/tools/src/voice.ts lines 41–87) already implements a two-backend cascade (edge-tts CLI → OpenAI API). The same pattern should apply here.
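As a rough illustration of the cascade idea, the sketch below walks an ordered list of backends and runs the first one whose availability check passes. It does not reproduce voice.ts: `Backend`, `firstAvailable`, `runCascade`, and the stub backends are hypothetical names, and the sketch assumes a backend is skipped only when unavailable (it does not retry the next backend after a runtime failure).

```typescript
// Hypothetical sketch of an ordered backend cascade; names are illustrative,
// not taken from packages/tools/src/voice.ts.
interface Backend<In, Out> {
  name: string;
  // Cheap synchronous check, e.g. "is the API key set?" or "is the CLI installed?"
  isAvailable(): boolean;
  run(input: In): Promise<Out>;
}

// Return the first backend whose availability check passes, or undefined.
function firstAvailable<In, Out>(
  backends: Backend<In, Out>[],
): Backend<In, Out> | undefined {
  return backends.find((b) => b.isAvailable());
}

type CascadeResult<Out> =
  | { ok: true; output: Out; metadata: { provider: string } }
  | { ok: false; error: string };

// Run the first available backend and record which one served the request.
async function runCascade<In, Out>(
  backends: Backend<In, Out>[],
  input: In,
): Promise<CascadeResult<Out>> {
  const backend = firstAvailable(backends);
  if (!backend) {
    return { ok: false, error: "no backend available" };
  }
  const output = await backend.run(input);
  return { ok: true, output, metadata: { provider: backend.name } };
}

// Stub backends for illustration: the primary is unavailable, so the
// secondary is the one selected.
const stubBackends: Backend<string, string>[] = [
  { name: "primary", isAvailable: () => false, run: async (s) => `primary:${s}` },
  { name: "secondary", isAvailable: () => true, run: async (s) => `secondary:${s}` },
];
```

Keeping the availability check separate from `run` means the selection logic can be unit-tested without any network access.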

Proposed

  1. For vision.analyze: when OPENAI_API_KEY is absent, check GOOGLE_API_KEY and call Gemini Flash (https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-flash:generateContent). Both OpenAI and Gemini accept a base64-encoded image inline in the request, although the JSON field layout differs.
  2. For image.generate: when OPENAI_API_KEY is absent, check REPLICATE_API_TOKEN and call stability-ai/sdxl via Replicate's REST API. This removes the forced OpenAI dependency.
  3. Expose the active provider in metadata.provider for observability.
  4. Do NOT introduce new required dependencies — both Gemini and Replicate calls use plain fetch.
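The two fallback calls in the list above might be built roughly as follows, using nothing beyond plain JSON and fetch. This is a sketch under assumptions: `HttpRequest`, `buildGeminiVisionRequest`, and `buildReplicateImageRequest` are hypothetical helper names, and the Replicate model version ID is left as a parameter because the issue does not specify one.

```typescript
// Sketch only: the helper names below are hypothetical and not part of the
// repository. Each helper returns a plain description of a fetch request.
interface HttpRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Gemini generateContent: the image travels inline as base64 next to the prompt.
function buildGeminiVisionRequest(
  apiKey: string,
  prompt: string,
  imageBase64: string,
  mimeType: string,
): HttpRequest {
  return {
    url: "https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-flash:generateContent",
    method: "POST",
    headers: { "Content-Type": "application/json", "x-goog-api-key": apiKey },
    body: JSON.stringify({
      contents: [
        {
          parts: [
            { text: prompt },
            { inline_data: { mime_type: mimeType, data: imageBase64 } },
          ],
        },
      ],
    }),
  };
}

// Replicate predictions: `modelVersion` is the version ID of the model to run
// (for example a version of stability-ai/sdxl). It is a caller-supplied
// assumption here, not a value taken from the issue.
function buildReplicateImageRequest(
  apiToken: string,
  modelVersion: string,
  prompt: string,
): HttpRequest {
  return {
    url: "https://api.replicate.com/v1/predictions",
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiToken}`,
    },
    body: JSON.stringify({ version: modelVersion, input: { prompt } }),
  };
}
```

Either result can be passed to fetch as `fetch(req.url, { method: req.method, headers: req.headers, body: req.body })`. Replicate predictions complete asynchronously, so a real implementation would also need to wait for or poll the prediction before reading the image URL.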

Cost justification: DALL-E 3 at $0.040/image vs SDXL via Replicate at ~$0.003/image is roughly a 13x cost difference for bulk generation, so the fallback has real operator value.
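To make the arithmetic concrete, the snippet below computes the ratio and the cost of a batch from the per-image prices quoted above. The batch size of 10,000 images and the `batchCost` helper are illustrative assumptions, not figures from the issue.

```typescript
// Per-image prices quoted in the issue (USD); the Replicate figure is approximate.
const DALLE3_PER_IMAGE = 0.04;
const SDXL_REPLICATE_PER_IMAGE = 0.003;

// Cost of generating `count` images at a given per-image price.
function batchCost(count: number, perImage: number): number {
  return count * perImage;
}

const ratio = DALLE3_PER_IMAGE / SDXL_REPLICATE_PER_IMAGE;
const images = 10_000; // hypothetical batch size

console.log(`ratio: ${ratio.toFixed(1)}x`); // ratio: 13.3x
console.log(`DALL-E 3: $${batchCost(images, DALLE3_PER_IMAGE).toFixed(2)}`); // $400.00
console.log(`SDXL: $${batchCost(images, SDXL_REPLICATE_PER_IMAGE).toFixed(2)}`); // $30.00
```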

Acceptance

  • With OPENAI_API_KEY absent and GOOGLE_API_KEY set, vision.analyze calls Gemini, not OpenAI.
  • Metadata includes { provider: 'openai' | 'gemini' | 'replicate' }.
  • With no keys configured, each tool returns ok: false with a clear message listing which env vars to set (instead of ok: true with simulated: true).
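One way to picture these criteria is a small resolver that maps the environment to a provider or to an explicit error. This is a sketch, not the proposed implementation: `resolveProvider`, `CANDIDATES`, and the result shape are hypothetical, and only the env var names and provider labels come from the issue.

```typescript
// Hypothetical names throughout; only the env var names and the provider
// labels come from the issue.
type Env = Record<string, string | undefined>;
type ProviderName = "openai" | "gemini" | "replicate";
type Tool = "vision.analyze" | "image.generate";

// Ordered candidates per tool: OpenAI first, then the proposed fallback.
const CANDIDATES: Record<Tool, { provider: ProviderName; envVar: string }[]> = {
  "vision.analyze": [
    { provider: "openai", envVar: "OPENAI_API_KEY" },
    { provider: "gemini", envVar: "GOOGLE_API_KEY" },
  ],
  "image.generate": [
    { provider: "openai", envVar: "OPENAI_API_KEY" },
    { provider: "replicate", envVar: "REPLICATE_API_TOKEN" },
  ],
};

type Resolution =
  | { ok: true; metadata: { provider: ProviderName } }
  | { ok: false; error: string };

// Pick the first candidate whose env var is set; otherwise report which
// variables would enable the tool instead of pretending to succeed.
function resolveProvider(tool: Tool, env: Env): Resolution {
  const candidates = CANDIDATES[tool];
  const match = candidates.find((c) => Boolean(env[c.envVar]));
  if (match) {
    return { ok: true, metadata: { provider: match.provider } };
  }
  const names = candidates.map((c) => c.envVar).join(" or ");
  return { ok: false, error: `${tool}: no provider configured; set ${names}` };
}
```

With only GOOGLE_API_KEY set, `resolveProvider("vision.analyze", env)` yields provider 'gemini'. With an empty environment, `resolveProvider("image.generate", {})` yields ok: false and an error naming OPENAI_API_KEY and REPLICATE_API_TOKEN.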

References

  • Internal: packages/tools/src/image-gen.ts lines 72–84
  • Internal: packages/tools/src/vision.ts lines 107–125
  • Internal: packages/tools/src/voice.ts lines 41–87 (cascade pattern to follow)

Metadata

Labels

  • enhancement: New feature or request
  • priority/nit: Low priority polish
  • reliability: Reliability / correctness fix
  • source/audit: Internal audit finding
