This repository contains an end-to-end pipeline to:
- Train a convolutional neural network (CNN) for environmental sound classification on ESC-50.
- Serve the model for inference using Modal + a FastAPI endpoint.
- Visualize model outputs (top predictions, input mel-spectrogram, waveform, and CNN feature maps).
- A Next.js UI lives in `audio-cnn-visualization/`.
- A Streamlit UI lives at the repo root as `app.py`.
If you are reading this for the first time, start with:
- `train.py` (training)
- `model.py` (architecture)
- `main.py` (Modal inference endpoint)
- `audio-cnn-visualization/src/app/page.tsx` (Next.js visualization client)
- `app.py` (Streamlit visualization client)
```mermaid
flowchart TD
    A["ESC-50 dataset<br/>Modal image downloads ZIP"] --> B["Training on Modal GPU<br/>train.py: train"]
    B --> C["Checkpoint artifact<br/>/models/best_model.pth<br/>Modal Volume: esc-model"]
    D["Client uploads WAV<br/>Next.js or Streamlit"] --> E["POST /inference<br/>main.py: AudioClassifier.inference"]
    E --> F["Preprocess audio<br/>mono + resample + mel-spectrogram"]
    F --> G["Forward pass<br/>model.py: AudioCNN.forward(return_feature_maps=True)"]
    G --> H["Response JSON<br/>predictions + waveform + spectrogram + feature maps"]
    H --> I["UI renders<br/>Top-3 + spectrogram + waveform + conv outputs"]
    C --> E
```
- `ResidualBlock`
  - Implements a basic residual block:
    `conv1 -> bn1 -> relu -> conv2 -> bn2 -> add shortcut -> relu`
  - When feature-map capture is enabled, it saves:
    - `f"{prefix}.conv"` (pre-ReLU residual sum)
    - `f"{prefix}.relu"` (post-ReLU)
- `AudioCNN`
  - A ResNet-like CNN adapted for 2D audio features.
  - Major stages:
    - `conv1`: initial conv/bn/relu/maxpool
    - `layer1..layer4`: stacks of `ResidualBlock` with downsampling at the first block of each stage
    - `avgpool -> dropout -> fc`
  - `AudioCNN.forward(x, return_feature_maps=False)`
    - Normal training/inference: returns logits of shape `[batch, num_classes]`.
    - Visualization mode (`return_feature_maps=True`): returns `(logits, feature_maps_dict)`, where `feature_maps_dict` contains tensors like `"conv1"` and `"layer1".."layer4"`, plus internal block activations such as `"layer2.block0.conv"` and `"layer2.block0.relu"`.
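The feature-map capture described above can be sketched as follows. This is a minimal PyTorch sketch with assumed channel sizes and a simplified signature; the real `model.py` may differ in details:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Sketch of the residual block: conv1 -> bn1 -> relu -> conv2 -> bn2
    -> add shortcut -> relu, optionally recording intermediate tensors."""

    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, 1, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        # 1x1 projection when the shapes of input and output differ
        self.shortcut = nn.Sequential()
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )

    def forward(self, x, feature_maps=None, prefix=""):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + self.shortcut(x)  # pre-ReLU residual sum
        if feature_maps is not None:
            feature_maps[f"{prefix}.conv"] = out
        out = torch.relu(out)
        if feature_maps is not None:
            feature_maps[f"{prefix}.relu"] = out
        return out
```

`AudioCNN.forward` would then thread one shared dict through every block, keyed by stage and block index, and return it alongside the logits when `return_feature_maps=True`.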
Training runs on Modal (GPU) and:
- Downloads ESC-50 into the Modal image at build time.
- Loads audio clips and converts them into mel-spectrograms.
- Trains `AudioCNN` with augmentation.
- Tracks metrics in TensorBoard.
- Saves the best checkpoint to a Modal Volume.
- `ESC50Dataset`
  - Reads `esc50.csv` metadata.
  - Uses folds for splitting:
    - `split='train'`: all folds except `fold == 5`
    - `split='test'`: only `fold == 5`
  - Returns `(spectrogram, label)`.
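The fold-based split can be illustrated in plain Python over the parsed CSV rows (a sketch; the `filename` and `fold` columns come from the ESC-50 metadata, and the helper name is hypothetical):

```python
def split_rows(rows, split="train", test_fold=5):
    """Partition ESC-50 metadata rows by fold.

    rows: list of dicts as produced by csv.DictReader over esc50.csv,
    each with at least 'filename' and 'fold' keys.
    """
    if split == "train":
        return [r for r in rows if int(r["fold"]) != test_fold]
    return [r for r in rows if int(r["fold"]) == test_fold]

# Toy metadata: folds cycle 1..5, so fold 5 holds 1/5 of the clips.
rows = [{"filename": f"clip{i}.wav", "fold": str(i % 5 + 1)} for i in range(10)]
train_rows = split_rows(rows, "train")
test_rows = split_rows(rows, "test")
```

Holding out a whole fold (rather than random clips) follows the ESC-50 convention, since clips from the same source recording stay in the same fold.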
- `mixup_data(x, y)` and `mixup_criterion(criterion, pred, y_a, y_b, lam)`
  - Implement MixUp augmentation, applied probabilistically during training.
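MixUp blends pairs of examples and their labels. A minimal NumPy sketch of the idea (the repo's version operates on torch tensors; `alpha=0.2` is an assumed hyperparameter):

```python
import numpy as np

def mixup_data(x, y, alpha=0.2, rng=np.random.default_rng(0)):
    """Blend each example with a randomly chosen partner.

    Returns mixed inputs plus both label sets and the mixing weight,
    so the loss is computed as lam*loss(y_a) + (1-lam)*loss(y_b).
    """
    lam = rng.beta(alpha, alpha)      # mixing weight in [0, 1]
    index = rng.permutation(len(x))   # random partner for each example
    mixed_x = lam * x + (1 - lam) * x[index]
    return mixed_x, y, y[index], lam

x = np.ones((4, 8)) * np.arange(4)[:, None]  # 4 toy "spectrograms"
y = np.array([0, 1, 2, 3])
mixed_x, y_a, y_b, lam = mixup_data(x, y)
```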
- `train()`
  - Defines transforms:
    - Train: `MelSpectrogram -> AmplitudeToDB -> FrequencyMasking -> TimeMasking`
    - Val: `MelSpectrogram -> AmplitudeToDB`
  - Optimization: `AdamW`, `OneCycleLR`, `CrossEntropyLoss(label_smoothing=0.1)`
  - Checkpoint saved to Modal volume path: `/models/best_model.pth`
  - Saved checkpoint contains: `model_state_dict`, `accuracy`, `epoch`, `classes` (class names in sorted order)
Training is authored as a Modal app:

```shell
modal run train.py
```

This triggers `train.remote()` from `@app.local_entrypoint()`.
`main.py` defines a Modal app (`audio-cnn-inference`) that exposes a FastAPI endpoint.
- `AudioProcessor`
  - Holds the preprocessing transform:
    - `MelSpectrogram(sample_rate=22050, n_fft=1024, hop_length=512, n_mels=128, f_max=11025)`
    - `AmplitudeToDB()`
  - `process_audio_chunk(audio_data)`:
    - Converts the 1D numpy waveform to a torch tensor
    - Produces a spectrogram tensor shaped like `[1, 1, n_mels, time]`
- `AudioClassifier` (Modal class)
  - `load_model()` (runs on container start via `@modal.enter()`):
    - Loads `/models/best_model.pth` from Modal volume `esc-model`
    - Initializes `AudioCNN(num_classes=len(classes))`
    - Loads weights and sets eval mode
- `inference(request: InferenceRequest)` (exposed via `@modal.fastapi_endpoint(method="POST")`):
  - Decode input: base64 -> WAV bytes
  - Read audio: `soundfile.read(...)` -> `audio_data`, `sample_rate`
  - Convert to mono: mean over channels if needed
  - Resample to 44100 if needed (via `librosa.resample`)
  - Compute mel-spectrogram (via `AudioProcessor`)
  - Forward pass with feature maps:
    `output, feature_maps = self.model(spectrogram, return_feature_maps=True)`
  - Top-3 predictions: `softmax -> torch.topk(k=3)`
  - Prepare visualization tensors:
    - For each feature map `[1, C, H, W]`: channel-average to `[H, W]` and convert to JSON-serializable lists
  - Waveform downsampling (for UI): keeps at most `max_samples = 8000` values
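The postprocessing steps above can be sketched in NumPy (a sketch of the idea; the endpoint itself works on torch tensors, and the class names below are placeholders):

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

def top_k(probs, classes, k=3):
    """Indices of the k largest probabilities, as JSON-friendly dicts."""
    idx = np.argsort(probs)[::-1][:k]
    return [{"class": classes[i], "confidence": float(probs[i])} for i in idx]

def channel_average(feature_map):
    """[1, C, H, W] -> [H, W] list-of-lists for JSON serialization."""
    return feature_map.mean(axis=1)[0].tolist()

def downsample_waveform(samples, max_samples=8000):
    """Keep at most max_samples values by striding through the waveform."""
    step = max(1, len(samples) // max_samples)
    return samples[::step][:max_samples]

classes = ["chirping_birds", "crickets", "clock_alarm"]  # placeholder names
preds = top_k(softmax(np.array([3.0, 0.5, 0.2])), classes)
```

Channel averaging throws away per-filter detail but keeps the spatial (mel x time) structure, which is what the UI heatmaps display.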
- Method: `POST`
- Content-Type: `application/json`
- Body: `{ "audio_data": "<base64 WAV bytes>" }`
- Response:

```json
{
  "predictions": [
    { "class": "chirping_birds", "confidence": 0.9339 },
    { "class": "crickets", "confidence": 0.003 },
    { "class": "clock_alarm", "confidence": 0.003 }
  ],
  "input_spectrogram": { "shape": [128, 431], "values": [[...], ...] },
  "waveform": { "values": [...], "sample_rate": 44100, "duration": 5.0 },
  "visualization": {
    "conv1": { "shape": [32, 108], "values": [[...], ...] },
    "layer1": { "shape": [32, 108], "values": [[...], ...] },
    "layer1.block0.conv": { "shape": [32, 108], "values": [[...], ...] }
  }
}
```

Note: `values` arrays may contain normalized/log-scaled values depending on the tensor.
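A minimal Python client for this contract might look like the following (a sketch using only the standard library; `ENDPOINT_URL` is a placeholder you must replace with your deployed Modal URL):

```python
import base64
import json
import urllib.request

ENDPOINT_URL = "https://example.modal.run/inference"  # placeholder

def classify_wav(path, endpoint_url=ENDPOINT_URL):
    """Base64-encode a WAV file and POST it to the inference endpoint."""
    with open(path, "rb") as f:
        audio_b64 = base64.b64encode(f.read()).decode("ascii")
    body = json.dumps({"audio_data": audio_b64}).encode("utf-8")
    req = urllib.request.Request(
        endpoint_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

The returned dict then has the `predictions`, `input_spectrogram`, `waveform`, and `visualization` keys shown above.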
To run the Modal app:

```shell
modal run main.py
```

To get a stable web endpoint you typically deploy on Modal (the exact workflow depends on your Modal setup); the Next.js UI in this repo currently points at a hosted endpoint URL.
You have two clients that call the same endpoint and render the returned fields.
- Entry point: `audio-cnn-visualization/src/app/page.tsx`

What it does:
- Reads a `.wav` file in the browser (FileReader -> ArrayBuffer)
- Base64-encodes the bytes
- Calls the inference endpoint via `fetch()`
- Renders:
  - Top-3 predictions with progress bars
  - Input spectrogram heatmap (SVG)
  - Waveform (SVG)
  - Convolutional outputs grid (SVG)
Important functions/components:
- `handleFileChange(...)` (in `page.tsx`)
  - The main orchestration function: upload -> base64 -> POST -> set state.
- `splitLayers(visualization)` (in `page.tsx`)
  - Splits `visualization` keys into:
    - main layers (no `.`)
    - internal layers grouped under their parent
- `FeatureMap` (`src/components/FeatureMap.tsx`)
  - Renders a 2D `values` array as an SVG grid using `getColor()`.
- `Waveform` (`src/components/Waveform.tsx`)
  - Renders waveform samples as an SVG polyline.
Run it (from the `audio-cnn-visualization/` directory):

```shell
npm install
npm run dev
```

This repo also includes a standalone Streamlit UI that mimics the layout of the Next.js page.
- Entry point: `app.py`

Key functions:
- `_call_inference_endpoint(file_path, endpoint_url)`
  - Reads WAV bytes from disk, base64-encodes them, and POSTs JSON.
- `run_visualization(audio_file, endpoint_url)`
  - Parses response fields and builds:
    - HTML progress bars for predictions
    - PIL image for spectrogram heatmap
    - Matplotlib plot for waveform
    - HTML grid for conv outputs
Run it:

```shell
python app.py
```

Override endpoint (optional):

```shell
export AUDIO_CNN_ENDPOINT="https://your-endpoint"
python app.py
```

```mermaid
sequenceDiagram
    autonumber
    participant U as User
    participant UI as UI (Next.js or Streamlit)
    participant API as Modal FastAPI Endpoint
    participant P as Preprocess (AudioProcessor)
    participant M as Model (AudioCNN)
    U->>UI: Upload .wav
    UI->>UI: Read bytes + base64 encode
    UI->>API: POST {audio_data: base64}
    API->>API: Decode base64
    API->>P: mono + resample + mel-spectrogram
    P-->>API: spectrogram tensor
    API->>M: forward(return_feature_maps=True)
    M-->>API: logits + feature_maps
    API->>API: softmax + topk(3)
    API-->>UI: JSON (predictions, waveform, spectrogram, visualization)
    UI-->>U: Render charts + feature maps
```
- `model.py` - CNN model definition used by training and inference.
- `train.py` - ESC-50 dataset loading + training loop.
  - Runs on Modal GPU.
  - Saves `best_model.pth` to a Modal volume.
- `main.py` - Modal app exposing a FastAPI POST endpoint for inference.
  - Loads `best_model.pth` from the same Modal volume.
  - Returns predictions + visualization tensors as JSON.
- `app.py` - A Streamlit UI that calls the same inference endpoint as the Next.js UI.
- `audio-cnn-visualization/` - Next.js/Tailwind UI for visualizing the response.
```shell
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# Run Streamlit UI
python app.py
```

```shell
cd audio-cnn-visualization
npm install
npm run dev
```
- Streamlit shows an API error
  - Check the endpoint URL and that it is reachable.
  - Some hosted Modal endpoints can change unless deployed/stabilized.
- Slow response
  - Feature-map extraction + serialization can be heavy.
  - Large audio clips increase preprocessing time.
- Audio not 44.1 kHz
  - `main.py` resamples to 44100 Hz before generating the spectrogram, so this is handled automatically.
- Start here: `main.py` to understand the live request pipeline.
- Then: `model.py` to understand what the feature maps represent.
- Then: `train.py` to understand how the checkpoint is produced.
- Finally: `page.tsx` and `app.py` to see how the visualization is built from the response.
