Skip to content

[Enhancement] Adding support for gpt-4o-audio#61

Merged
amitsnow merged 12 commits intomainfrom
scratch/gpt-4o-audio
Nov 14, 2025
Merged

[Enhancement] Adding support for gpt-4o-audio#61
amitsnow merged 12 commits intomainfrom
scratch/gpt-4o-audio

Conversation

@amitsnow
Copy link
Collaborator

@amitsnow amitsnow commented Nov 11, 2025

Summary

This PR implements support for OpenAI's gpt-4o-audio multimodal model in SyGra. Unlike traditional TTS models (tts-1) that use the audio.speech.create API, gpt-4o-audio uses the chat.completions.create API and supports bidirectional audio/text transformations with context-aware conversational responses.

Features implemented:

  • Audio Format Conversion: Automatically converts SyGra's audio_url format to OpenAI's input_audio format with proper data and format fields
  • Dynamic Modalities Management: Intelligently sets modalities (text, audio, or both) based on input_type, output_type, and actual audio content in messages
  • Consistency: Maintains consistency with other multimodal features (TTS, image generation, etc.)

Performance impact (if any):

N/A - No significant performance impact. The implementation adds a lightweight routing check and uses existing client infrastructure instead of manual processing.

How to Test

Prerequisites

  1. Set up OpenAI API key: export OPENAI_API_KEY=your_key_here
  2. Ensure SyGra is installed with all dependencies

Test Case 1: Text-to-Audio (TTS via Chat Completions)

model:
  name: gpt4o_audio
  model: gpt-4o-audio-preview
  output_type: audio
  parameters:
    voice: alloy
    response_format: wav

prompts:
  - user: "Please read this: Hello, this is a test of the GPT-4o audio model."

Steps:

  1. Run pipeline
  2. Check that audio file is created in multimodal_output/audio/
  3. Verify the audio file plays back the text content

Test Case 2: Audio-to-Text (Transcription)

model:
  name: gpt4o_audio
  model: gpt-4o-audio-preview
  
prompts:
  - user:
      - type: audio_url
        audio_url: "{<audio_field>}"
      - type: text
        text: "Please transcribe this audio."

Steps:

  1. Run pipeline
  2. Verify output contains transcribed text
  3. Check logs for proper audio input detection

Test Case 3: Audio-to-Audio (Translation/Transformation)

model:
  name: gpt4o_audio
  model: gpt-4o-audio-preview
  output_type: audio
  parameters:
    voice: nova
    response_format: mp3

Steps:

  1. Provide audio input with transformation instruction
  2. Run the pipeline
  3. Verify audio output is generated with specified voice and format

Test Case 4: Run Unit Tests

Expected Result: All tests should pass

Screenshots (if applicable)

N/A

Example Configuration:

model:
  name: gpt4o_audio
  model: gpt-4o-audio-preview
  output_type: audio
  parameters:
    voice: alloy  # Options: alloy, echo, fable, onyx, nova, shimmer
    response_format: wav  # Options: wav, mp3, opus, aac, flac, pcm

Example Output:

{
  "id": "record_0",
  "response": "file:multimodal_output/audio/record_0_response_0.wav"
}

Checklist

  • Lint fixes and unit testing done
  • End to end task testing
  • Documentation updated

@amitsnow amitsnow marked this pull request as ready for review November 12, 2025 19:11
@amitsnow amitsnow requested a review from a team as a code owner November 12, 2025 19:11
@amitsnow amitsnow self-assigned this Nov 12, 2025
@amitsnow amitsnow added the enhancement New feature or request label Nov 12, 2025
@amitsnow amitsnow changed the title Adding support for gpt-4o-audio [Enhancement] Adding support for gpt-4o-audio Nov 12, 2025
Copy link
Collaborator

@psriramsnc psriramsnc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🚀

psriramsnc

This comment was marked as duplicate.

@amitsnow amitsnow requested a review from a team November 14, 2025 04:24
@amitsnow amitsnow merged commit 6f8937f into main Nov 14, 2025
6 checks passed
@amitsnow amitsnow deleted the scratch/gpt-4o-audio branch November 14, 2025 05:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants