A containerized, microservices-based video processing pipeline that extracts audio from videos, transcribes it with Whisper, and summarizes the transcript with a Groq chat model. Services are orchestrated via Docker Compose, with N8N handling workflow automation.
Google Drive (video)
→ N8N: write to /tmp
→ Converter (FFmpeg) → .mp3 → /tmp
→ Transcript (Whisper) → JSON segments
→ N8N: AI Agent (Groq) → Markdown summary
→ Google Drive (Markdown file)
The pipeline runs as an N8N workflow with the following steps:
| # | Step | Description |
|---|---|---|
| 1 | Download file | Downloads the source video from Google Drive |
| 2 | Write to disk | Saves the binary to /tmp for shared volume access |
| 3 | Convert video to mp3 | Calls the Converter service to extract audio |
| 4 | Transcribe video | Calls the Transcript service to produce a text transcript |
| 5 | Edit Fields | Maps and reshapes the transcript response fields |
| 6 | AI Agent (Groq) | Sends the transcript to a Groq Chat Model to generate a structured summary |
| 7 | Create Markdown in Drive | Writes the AI-generated summary as a Markdown file to Google Drive |
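Steps 3 to 5 can also be exercised outside N8N. The following is a minimal Python sketch, assuming the Converter and Transcript services are reachable on localhost at ports 5001 and 5002 and return the JSON shapes documented in this README. The function names (`convert_url`, `transcribe_url`, `run_pipeline`), the injectable `fetch` parameter, and the choice of fields kept in step 5 are illustrative and not part of the project.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumed base URLs: localhost plus the ports documented for each service.
CONVERTER = "http://localhost:5001"
TRANSCRIPT = "http://localhost:5002"


def convert_url(filename, base=CONVERTER):
    """Build the Converter URL for a file that already sits in /tmp."""
    return f"{base}/convert/{quote(filename)}"


def transcribe_url(filename, base=TRANSCRIPT):
    """Build the Transcript URL for an MP3 that already sits in /tmp."""
    return f"{base}/transcribe/{quote(filename)}"


def http_get_json(url):
    """Perform a GET request and decode the JSON body."""
    with urlopen(url) as resp:
        return json.load(resp)


def run_pipeline(video_name, fetch=http_get_json):
    """Chain steps 3 to 5: convert, transcribe, then reshape the fields."""
    converted = fetch(convert_url(video_name))            # step 3
    result = fetch(transcribe_url(converted["output"]))   # step 4
    # Step 5: keep only what a summarizer would need (hypothetical selection).
    return {"language": result["language"], "transcript": result["transcript"]}
```

Passing `fetch` as a parameter keeps the chaining logic testable without running the containers; calling `run_pipeline("video.mp4")` with the default uses real HTTP requests.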
| Service | Port | Technology | Role |
|---|---|---|---|
| N8N | 5678 | N8N | Workflow orchestration |
| Converter | 5001 | Go 1.25, Gin, FFmpeg | Extracts mono MP3 audio from video |
| Transcript | 5002 | Python 3.11, FastAPI, faster-whisper | Transcribes audio to text |
- Docker and Docker Compose
- A HuggingFace account and access token (for Whisper model downloads)
- A Groq account and API key (for the AI summarization step)
- A Google account with Google Drive access (configured as N8N credentials)
- Clone the repository

      git clone <repository-url>
      cd video-summarizer

- Set up environment variables

  Create a .env file in the project root:

      HF_TOKEN=your_huggingface_token_here

- Start the services

      docker compose up --build

- Access N8N at http://localhost:5678 to configure your workflow.
Reads a video file from /tmp, extracts audio, and writes a mono MP3 at 16kHz/32kbps back to /tmp.
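One way to get this output from FFmpeg is sketched here in Python. The Converter itself is written in Go and its actual FFmpeg arguments are not shown in this README, so the flag selection and the `ffmpeg_args` and `convert` helpers are assumptions made for illustration.

```python
import subprocess
from pathlib import PurePosixPath


def ffmpeg_args(video, workdir="/tmp"):
    """Return an FFmpeg command line for a mono 16 kHz, 32 kbps MP3."""
    src = PurePosixPath(workdir) / video
    dst = src.with_suffix(".mp3")
    return [
        "ffmpeg", "-y",      # overwrite an existing output file
        "-i", str(src),      # input video
        "-vn",               # drop the video stream
        "-ac", "1",          # one audio channel (mono)
        "-ar", "16000",      # 16 kHz sample rate
        "-b:a", "32k",       # 32 kbps audio bitrate
        str(dst),
    ]


def convert(video, workdir="/tmp"):
    """Run FFmpeg and return a dict shaped like the service response."""
    args = ffmpeg_args(video, workdir)
    subprocess.run(args, check=True)  # raises CalledProcessError on failure
    return {"output": PurePosixPath(args[-1]).name, "path": args[-1]}
```

`convert` requires the `ffmpeg` binary on the PATH; `ffmpeg_args` can be inspected on its own to check the flags.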
Example:
GET http://localhost:5001/convert/video.mp4
Response:
{
"output": "video.mp3",
"path": "/tmp/video.mp3"
}

Reads an MP3 from /tmp and returns a timestamped transcription with language detection.
Example:
GET http://localhost:5002/transcribe/video.mp3
Response:
{
"filename": "video.mp3",
"language": "Portuguese",
"language_iso": "pt_BR",
"language_probability": 0.9981,
"transcript": "Segment one text.\nSegment two text."
}

Supported languages: Portuguese, Spanish, English, French.
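A response of this shape could be assembled from faster-whisper output roughly as follows. This is a sketch under stated assumptions, not the service's actual code: `build_response`, `transcribe_file`, and the `LANGUAGES` mapping are hypothetical names, and `language_iso` is left out because the README does not document how that value is derived.

```python
# Whisper language codes for the four supported languages.
LANGUAGES = {"pt": "Portuguese", "es": "Spanish", "en": "English", "fr": "French"}


def build_response(filename, segments, language, probability):
    """Join segment texts with newlines and attach the detected language."""
    text = "\n".join(seg.text.strip() for seg in segments)
    return {
        "filename": filename,
        # Fall back to the raw code for languages outside the mapping (assumption).
        "language": LANGUAGES.get(language, language),
        "language_probability": round(probability, 4),
        "transcript": text,
    }


def transcribe_file(path, model_size="medium"):
    """Transcribe one file; needs the faster-whisper package installed."""
    from faster_whisper import WhisperModel  # imported lazily on purpose

    model = WhisperModel(model_size)
    segments, info = model.transcribe(path)
    name = path.rsplit("/", 1)[-1]
    return build_response(name, segments, info.language, info.language_probability)
```

`build_response` only needs objects with a `text` attribute, so it can be checked without loading a model.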
| Variable | Service | Default | Description |
|---|---|---|---|
| `HF_TOKEN` | Transcript | (none) | HuggingFace token for model download |
| `WHISPER_MODEL` | Transcript | `medium` | Whisper model size |
| `API_PORT` | Converter | `5001` | HTTP port override |
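As an illustration of how these variables and defaults might be read, here is a short Python sketch. The helper names `load_transcript_settings` and `converter_port` are hypothetical, treating a missing `HF_TOKEN` as an error is an assumption, and the Converter is a Go service, so its port lookup is mirrored in Python here only to show the default.

```python
import os


def load_transcript_settings(env=None):
    """Read the Transcript service variables, applying the documented default."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN")
    if not token:
        # Assumption: the token is required because it has no default.
        raise RuntimeError("HF_TOKEN is not set; add it to the .env file")
    return {
        "hf_token": token,
        "whisper_model": env.get("WHISPER_MODEL", "medium"),
    }


def converter_port(env=None):
    """Return the Converter HTTP port, falling back to 5001."""
    env = os.environ if env is None else env
    return int(env.get("API_PORT", "5001"))
```

Both helpers accept a plain dict, which makes it easy to try out values before writing them to `.env`.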
video-summarizer/
├── converter/ # Go-based video-to-audio converter
│ ├── main.go
│ ├── go.mod
│ └── Dockerfile
├── transcript/ # Python-based audio transcription
│ ├── app.py
│ ├── requirements.txt
│ └── Dockerfile
├── videos/ # Local video file storage
├── docker-compose.yml
└── .env