
Deep Convolutional Neural Networks for Audio Signal Classification (ESC-50)

This repository contains an end-to-end pipeline to:

  • Train a convolutional neural network (CNN) for environmental sound classification on ESC-50.
  • Serve the model for inference using Modal + a FastAPI endpoint.
  • Visualize model outputs (top predictions, input mel-spectrogram, waveform, and CNN feature maps).
    • A Next.js UI lives in audio-cnn-visualization/
    • A Streamlit UI lives at repo root as app.py

If you are reading this for the first time, start with:

  • train.py (training)
  • model.py (architecture)
  • main.py (Modal inference endpoint)
  • audio-cnn-visualization/src/app/page.tsx (Next.js visualization client)
  • app.py (Streamlit visualization client)

Interface

Streamlit Dashboard


High-level architecture

flowchart TD
  A[ESC-50 dataset
Modal image downloads ZIP] --> B[Training on Modal GPU
train.py: train]
  B --> C[Checkpoint artifact
/models/best_model.pth
Modal Volume: esc-model]

  D[Client uploads WAV
Next.js or Streamlit] --> E[POST /inference
main.py: AudioClassifier.inference]
  E --> F[Preprocess audio
mono + resample + mel-spectrogram]
  F --> G[Forward pass
model.py: AudioCNN.forward(return_feature_maps=True)]
  G --> H[Response JSON
predictions + waveform + spectrogram + feature maps]
  H --> I[UI renders
Top-3 + spectrogram + waveform + conv outputs]

  C --> E

1) Model architecture (model.py)

Key classes

  • ResidualBlock

    • Implements a basic residual block:
      • conv1 -> bn1 -> relu -> conv2 -> bn2 -> add shortcut -> relu
    • When feature map capture is enabled, it saves:
      • f"{prefix}.conv" (pre-ReLU residual sum)
      • f"{prefix}.relu" (post-ReLU)
  • AudioCNN

    • A ResNet-like CNN adapted for 2D audio features.
    • Major stages:
      • conv1: initial conv/bn/relu/maxpool
      • layer1..layer4: stacks of ResidualBlock with downsampling at the first block of each stage
      • avgpool -> dropout -> fc
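For orientation, here is a minimal PyTorch sketch of the ResidualBlock pattern described above. Layer names mirror model.py, but the kernel sizes, the projection shortcut, and the feature-map capture mechanism are assumptions, not the repo's exact implementation:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block: conv1 -> bn1 -> relu -> conv2 -> bn2 -> add shortcut -> relu."""

    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        # Projection shortcut when spatial size or channel count changes
        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )

    def forward(self, x, feature_maps=None, prefix=""):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out)) + self.shortcut(x)  # pre-ReLU residual sum
        if feature_maps is not None:
            feature_maps[f"{prefix}.conv"] = out
        out = torch.relu(out)
        if feature_maps is not None:
            feature_maps[f"{prefix}.relu"] = out
        return out
```

Note how the two captured tensors correspond exactly to the `f"{prefix}.conv"` / `f"{prefix}.relu"` keys listed above.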

Important method

  • AudioCNN.forward(x, return_feature_maps=False)
    • Normal training/inference logits: returns logits of shape [batch, num_classes].
    • Visualization mode (return_feature_maps=True):
      • Returns (logits, feature_maps_dict)
      • feature_maps_dict contains tensors like:
        • "conv1", "layer1".."layer4"
        • internal block activations: "layer2.block0.conv", "layer2.block0.relu", etc.
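The dual return contract can be illustrated with a toy stand-in (the real AudioCNN is far deeper; this only shows the shape of the `(logits, feature_maps_dict)` API, with illustrative layer sizes):

```python
import torch
import torch.nn as nn

class TinyAudioCNN(nn.Module):
    """Toy stand-in demonstrating the forward(return_feature_maps=...) contract."""

    def __init__(self, num_classes=50):
        super().__init__()
        self.conv1 = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU())
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(8, num_classes)

    def forward(self, x, return_feature_maps=False):
        feature_maps = {}
        x = self.conv1(x)
        feature_maps["conv1"] = x          # captured activation, keyed by stage name
        logits = self.fc(self.pool(x).flatten(1))
        if return_feature_maps:
            return logits, feature_maps    # visualization mode
        return logits                      # normal training/inference mode
```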

2) Training pipeline (train.py)

What training does

Training runs on Modal (GPU) and:

  • Downloads ESC-50 into the Modal image at build time.
  • Loads audio clips and converts them into mel-spectrograms.
  • Trains AudioCNN with augmentation.
  • Tracks metrics in TensorBoard.
  • Saves the best checkpoint to a Modal Volume.

Key components and functions

  • ESC50Dataset

    • Reads esc50.csv metadata.
    • Uses folds for splitting:
      • split='train': all folds except fold == 5
      • split='test': only fold == 5
    • Returns (spectrogram, label).
  • mixup_data(x, y) and mixup_criterion(criterion, pred, y_a, y_b, lam)

    • Implements MixUp augmentation probabilistically during training.
  • train()

    • Defines transforms:
      • Train: MelSpectrogram -> AmplitudeToDB -> FrequencyMasking -> TimeMasking
      • Val: MelSpectrogram -> AmplitudeToDB
    • Optimization:
      • AdamW
      • OneCycleLR
      • CrossEntropyLoss(label_smoothing=0.1)
    • Checkpoint saved to Modal volume path:
      • /models/best_model.pth
    • Saved checkpoint contains:
      • model_state_dict
      • accuracy
      • epoch
      • classes (class names in sorted order)
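A common implementation of the MixUp helpers named above looks like the following; the `alpha` value and Beta-distribution sampling are conventional choices, not read from train.py:

```python
import numpy as np
import torch

def mixup_data(x, y, alpha=0.2):
    """Blend a batch with a shuffled copy of itself using a Beta-sampled weight."""
    lam = float(np.random.beta(alpha, alpha))
    index = torch.randperm(x.size(0))
    mixed_x = lam * x + (1 - lam) * x[index]
    return mixed_x, y, y[index], lam

def mixup_criterion(criterion, pred, y_a, y_b, lam):
    """Apply the same convex combination to the losses of the two label sets."""
    return lam * criterion(pred, y_a) + (1 - lam) * criterion(pred, y_b)
```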

How to run training

Training is authored as a Modal app:

modal run train.py

This triggers train.remote() from @app.local_entrypoint().


3) Inference service (main.py)

main.py defines a Modal app (audio-cnn-inference) that exposes a FastAPI endpoint.

Main classes and methods

  • AudioProcessor

    • Holds the preprocessing transform:
      • MelSpectrogram(sample_rate=22050, n_fft=1024, hop_length=512, n_mels=128, f_max=11025)
      • AmplitudeToDB()
    • process_audio_chunk(audio_data):
      • Converts the 1D numpy waveform to a torch tensor
      • Produces a spectrogram tensor shaped like [1, 1, n_mels, time].
  • AudioClassifier (Modal class)

    • load_model() (runs on container start via @modal.enter()):

      • Loads /models/best_model.pth from Modal volume esc-model
      • Initializes AudioCNN(num_classes=len(classes))
      • Loads weights and sets eval mode
    • inference(request: InferenceRequest) (exposed via @modal.fastapi_endpoint(method="POST")):

      1. Decode input: base64 -> WAV bytes
      2. Read audio: soundfile.read(...) -> audio_data, sample_rate
      3. Convert to mono: mean over channels if needed
      4. Resample to 44100 if needed (via librosa.resample)
      5. Compute mel-spectrogram (via AudioProcessor)
      6. Forward pass with feature maps:
        • output, feature_maps = self.model(spectrogram, return_feature_maps=True)
      7. Top-3 predictions:
        • softmax -> torch.topk(k=3)
      8. Prepare visualization tensors:
        • For each feature map [1, C, H, W]:
          • channel-average to [H, W]
          • convert to JSON serializable lists
      9. Waveform downsampling (for UI):
        • keeps at most max_samples = 8000 values
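Steps 7-9 above (top-k extraction and making tensors JSON-serializable) can be sketched as follows; helper names are illustrative, not the actual function names in main.py:

```python
import torch

def top_k_predictions(logits, classes, k=3):
    """Softmax over logits, then take the top-k (class name, confidence) pairs."""
    probs = torch.softmax(logits, dim=1)
    conf, idx = torch.topk(probs, k)
    return [{"class": classes[i], "confidence": float(c)}
            for c, i in zip(conf[0], idx[0])]

def feature_map_to_list(fmap):
    """Collapse a [1, C, H, W] activation to [H, W] by channel-averaging,
    then convert to nested Python lists for the JSON response."""
    avg = fmap.squeeze(0).mean(dim=0)
    return {"shape": list(avg.shape), "values": avg.tolist()}

def downsample_waveform(audio, max_samples=8000):
    """Stride through the waveform so the UI receives at most max_samples points."""
    if len(audio) > max_samples:
        step = len(audio) // max_samples
        audio = audio[::step][:max_samples]
    return audio.tolist()
```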

Endpoint contract (what the UIs consume)

  • Method: POST
  • Content-Type: application/json
  • Body:
{ "audio_data": "<base64 WAV bytes>" }
  • Response:
{
  "predictions": [
    { "class": "chirping_birds", "confidence": 0.9339 },
    { "class": "crickets", "confidence": 0.003 },
    { "class": "clock_alarm", "confidence": 0.003 }
  ],
  "input_spectrogram": { "shape": [128, 431], "values": [[...], ...] },
  "waveform": { "values": [...], "sample_rate": 44100, "duration": 5.0 },
  "visualization": {
    "conv1": { "shape": [32, 108], "values": [[...], ...] },
    "layer1": { "shape": [32, 108], "values": [[...], ...] },
    "layer1.block0.conv": { "shape": [32, 108], "values": [[...], ...] }
  }
}

Note: values arrays may contain normalized/log-scaled values depending on the tensor.
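A minimal Python client for this contract, using only the standard library (the endpoint URL is a placeholder you must replace with your deployed URL):

```python
import base64
import json
import urllib.request

def build_request_body(wav_bytes: bytes) -> bytes:
    """Wrap raw WAV bytes in the {"audio_data": <base64>} JSON body."""
    payload = {"audio_data": base64.b64encode(wav_bytes).decode("utf-8")}
    return json.dumps(payload).encode("utf-8")

def classify_wav(path: str, endpoint_url: str) -> dict:
    """POST a WAV file to the inference endpoint and return the parsed JSON."""
    with open(path, "rb") as f:
        body = build_request_body(f.read())
    req = urllib.request.Request(
        endpoint_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```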

How to run inference

To run the Modal app:

modal run main.py

For a stable web endpoint, deploy the app on Modal (typically modal deploy main.py; the exact workflow depends on your Modal setup). The Next.js UI in this repo currently points at a hosted endpoint URL.


4) Visualization UIs

You have two clients that call the same endpoint and render the returned fields.

A) Next.js UI (audio-cnn-visualization/)

  • Entry point: audio-cnn-visualization/src/app/page.tsx

What it does:

  1. Reads a .wav file in the browser (FileReader -> ArrayBuffer)
  2. Base64 encodes bytes
  3. Calls the inference endpoint via fetch()
  4. Renders:
    • Top-3 predictions with progress bars
    • Input spectrogram heatmap (SVG)
    • Waveform (SVG)
    • Convolutional outputs grid (SVG)

Important functions/components:

  • handleFileChange(...) (in page.tsx)
    • The main orchestration function: upload -> base64 -> POST -> set state.
  • splitLayers(visualization) (in page.tsx)
    • Splits visualization keys into:
      • main layers (no .)
      • internal layers grouped under their parent
  • FeatureMap (src/components/FeatureMap.tsx)
    • Renders a 2D values array as an SVG grid using getColor().
  • Waveform (src/components/Waveform.tsx)
    • Renders waveform samples as an SVG polyline.

Run it (from the audio-cnn-visualization/ directory):

npm install
npm run dev

B) Streamlit UI (app.py)

This repo also includes a standalone Streamlit UI that mimics the layout of the Next.js page.

  • Entry point: app.py

Key functions:

  • _call_inference_endpoint(file_path, endpoint_url)
    • Reads WAV bytes from disk, base64 encodes, POSTs JSON.
  • run_visualization(audio_file, endpoint_url)
    • Parses response fields and builds:
      • HTML progress bars for predictions
      • PIL image for spectrogram heatmap
      • Matplotlib plot for waveform
      • HTML grid for conv outputs

Run it:

streamlit run app.py

Override endpoint (optional):

export AUDIO_CNN_ENDPOINT="https://your-endpoint"
streamlit run app.py

Request/Response flow (detailed)

sequenceDiagram
  autonumber
  participant U as User
  participant UI as UI (Next.js or Streamlit)
  participant API as Modal FastAPI Endpoint
  participant P as Preprocess (AudioProcessor)
  participant M as Model (AudioCNN)

  U->>UI: Upload .wav
  UI->>UI: Read bytes + base64 encode
  UI->>API: POST {audio_data: base64}
  API->>API: Decode base64
  API->>P: mono + resample + mel-spectrogram
  P-->>API: spectrogram tensor
  API->>M: forward(return_feature_maps=True)
  M-->>API: logits + feature_maps
  API->>API: softmax + topk(3)
  API-->>UI: JSON (predictions, waveform, spectrogram, visualization)
  UI-->>U: Render charts + feature maps

Repository layout

  • model.py
    • CNN model definition used by training and inference.
  • train.py
    • ESC-50 dataset loading + training loop.
    • Runs on Modal GPU.
    • Saves best_model.pth to a Modal volume.
  • main.py
    • Modal app exposing a FastAPI POST endpoint for inference.
    • Loads best_model.pth from the same Modal volume.
    • Returns predictions + visualization tensors as JSON.
  • app.py
    • A Streamlit UI that calls the same inference endpoint as the Next.js UI.
  • audio-cnn-visualization/
    • Next.js/Tailwind UI for visualizing the response.


Local development quickstart

Python (training, inference code, Streamlit)

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# Run Streamlit UI
streamlit run app.py

Next.js UI

cd audio-cnn-visualization
npm install
npm run dev

Troubleshooting

  • Streamlit shows an API error

    • Check the endpoint URL and that it is reachable.
    • Ephemeral endpoints from modal run get a fresh URL each run; use modal deploy for a stable URL.
  • Slow response

    • Feature-map extraction + serialization can be heavy.
    • Large audio clips can increase preprocessing time.
  • Audio not 44.1kHz

    • main.py resamples to 44100 before generating the spectrogram.

What to read next (recommended)

  • Start here: main.py to understand the live request pipeline.
  • Then: model.py to understand what the feature maps represent.
  • Then: train.py to understand how the checkpoint is produced.
  • Finally: page.tsx and app.py to see how the visualization is built from the response.
