This repository contains an end-to-end pipeline to:
- Train a convolutional neural network (CNN) for environmental sound classification on ESC-50.
- Serve the model for inference using Modal + a FastAPI endpoint.
- Visualize model outputs (top predictions, input mel-spectrogram, waveform, and CNN feature maps).
- A Next.js UI lives in `audio-cnn-visualization/`.
- A Streamlit UI lives at the repo root as `app.py`.
If you are reading this for the first time, start with:
- `train.py` (training)
- `model.py` (architecture)
- `main.py` (Modal inference endpoint)
- `audio-cnn-visualization/src/app/page.tsx` (Next.js visualization client)
- `app.py` (Streamlit visualization client)
```mermaid
flowchart TD
    A["ESC-50 dataset<br/>Modal image downloads ZIP"] --> B["Training on Modal GPU<br/>train.py: train"]
    B --> C["Checkpoint artifact<br/>/models/best_model.pth<br/>Modal Volume: esc-model"]
    D["Client uploads WAV<br/>Next.js or Streamlit"] --> E["POST /inference<br/>main.py: AudioClassifier.inference"]
    E --> F["Preprocess audio<br/>mono + resample + mel-spectrogram"]
    F --> G["Forward pass<br/>model.py: AudioCNN.forward(return_feature_maps=True)"]
    G --> H["Response JSON<br/>predictions + waveform + spectrogram + feature maps"]
    H --> I["UI renders<br/>Top-3 + spectrogram + waveform + conv outputs"]
    C --> E
```
- `ResidualBlock`
  - Implements a basic residual block:
    `conv1 -> bn1 -> relu -> conv2 -> bn2 -> add shortcut -> relu`
  - When feature-map capture is enabled, it saves:
    - `f"{prefix}.conv"` (pre-ReLU residual sum)
    - `f"{prefix}.relu"` (post-ReLU)
- `AudioCNN`
  - A ResNet-like CNN adapted for 2D audio features.
  - Major stages:
    - `conv1`: initial conv/bn/relu/maxpool
    - `layer1..layer4`: stacks of `ResidualBlock` with downsampling at the first block of each stage
    - `avgpool -> dropout -> fc`
  - `AudioCNN.forward(x, return_feature_maps=False)`
    - Normal training/inference: returns logits of shape `[batch, num_classes]`.
    - Visualization mode (`return_feature_maps=True`): returns `(logits, feature_maps_dict)`, where `feature_maps_dict` contains tensors like `"conv1"` and `"layer1".."layer4"`, plus internal block activations such as `"layer2.block0.conv"` and `"layer2.block0.relu"`.
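The feature-map capture described above can be sketched as follows. This is a minimal PyTorch sketch with assumed channel sizes and a simplified signature; the real `model.py` may differ in details:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Sketch of the residual block: conv1 -> bn1 -> relu -> conv2 -> bn2
    -> add shortcut -> relu, optionally recording intermediate tensors."""

    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, 1, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        # 1x1 projection when the shapes of input and output differ
        self.shortcut = nn.Sequential()
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )

    def forward(self, x, feature_maps=None, prefix=""):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + self.shortcut(x)  # pre-ReLU residual sum
        if feature_maps is not None:
            feature_maps[f"{prefix}.conv"] = out
        out = torch.relu(out)
        if feature_maps is not None:
            feature_maps[f"{prefix}.relu"] = out
        return out
```

`AudioCNN.forward` would then thread one shared dict through every block, keyed by stage and block index, and return it alongside the logits when `return_feature_maps=True`.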
Training runs on Modal (GPU) and:
- Downloads ESC-50 into the Modal image at build time.
- Loads audio clips and converts them into mel-spectrograms.
- Trains `AudioCNN` with augmentation.
- Tracks metrics in TensorBoard.
- Saves the best checkpoint to a Modal Volume.
- `ESC50Dataset`
  - Reads `esc50.csv` metadata.
  - Uses folds for splitting:
    - `split='train'`: all folds except `fold == 5`
    - `split='test'`: only `fold == 5`
  - Returns `(spectrogram, label)`.
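The fold-based split can be illustrated in plain Python over the parsed CSV rows (a sketch; the `filename` and `fold` columns come from the ESC-50 metadata, and the helper name is hypothetical):

```python
def split_rows(rows, split="train", test_fold=5):
    """Partition ESC-50 metadata rows by fold.

    rows: list of dicts as produced by csv.DictReader over esc50.csv,
    each with at least 'filename' and 'fold' keys.
    """
    if split == "train":
        return [r for r in rows if int(r["fold"]) != test_fold]
    return [r for r in rows if int(r["fold"]) == test_fold]

# Toy metadata: folds cycle 1..5, so fold 5 holds 1/5 of the clips.
rows = [{"filename": f"clip{i}.wav", "fold": str(i % 5 + 1)} for i in range(10)]
train_rows = split_rows(rows, "train")
test_rows = split_rows(rows, "test")
```

Holding out a whole fold (rather than random clips) follows the ESC-50 convention, since clips from the same source recording stay in the same fold.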
- `mixup_data(x, y)` and `mixup_criterion(criterion, pred, y_a, y_b, lam)`
  - Implement MixUp augmentation, applied probabilistically during training.
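MixUp blends pairs of examples and their labels. A minimal NumPy sketch of the idea (the repo's version operates on torch tensors; `alpha=0.2` is an assumed hyperparameter):

```python
import numpy as np

def mixup_data(x, y, alpha=0.2, rng=np.random.default_rng(0)):
    """Blend each example with a randomly chosen partner.

    Returns mixed inputs plus both label sets and the mixing weight,
    so the loss is computed as lam*loss(y_a) + (1-lam)*loss(y_b).
    """
    lam = rng.beta(alpha, alpha)      # mixing weight in [0, 1]
    index = rng.permutation(len(x))   # random partner for each example
    mixed_x = lam * x + (1 - lam) * x[index]
    return mixed_x, y, y[index], lam

x = np.ones((4, 8)) * np.arange(4)[:, None]  # 4 toy "spectrograms"
y = np.array([0, 1, 2, 3])
mixed_x, y_a, y_b, lam = mixup_data(x, y)
```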
- `train()`
  - Defines transforms:
    - Train: `MelSpectrogram -> AmplitudeToDB -> FrequencyMasking -> TimeMasking`
    - Val: `MelSpectrogram -> AmplitudeToDB`
  - Optimization: `AdamW`, `OneCycleLR`, `CrossEntropyLoss(label_smoothing=0.1)`
  - Checkpoint saved to Modal volume path: `/models/best_model.pth`
  - Saved checkpoint contains: `model_state_dict`, `accuracy`, `epoch`, `classes` (class names in sorted order)
Training is authored as a Modal app:

```shell
modal run train.py
```

This triggers `train.remote()` from `@app.local_entrypoint()`.
`main.py` defines a Modal app (`audio-cnn-inference`) that exposes a FastAPI endpoint.
- `AudioProcessor`
  - Holds the preprocessing transform:
    - `MelSpectrogram(sample_rate=22050, n_fft=1024, hop_length=512, n_mels=128, f_max=11025)`
    - `AmplitudeToDB()`
  - `process_audio_chunk(audio_data)`:
    - Converts the 1D numpy waveform to a torch tensor
    - Produces a spectrogram tensor shaped like `[1, 1, n_mels, time]`
- `AudioClassifier` (Modal class)
  - `load_model()` (runs on container start via `@modal.enter()`):
    - Loads `/models/best_model.pth` from Modal volume `esc-model`
    - Initializes `AudioCNN(num_classes=len(classes))`
    - Loads weights and sets eval mode
- `inference(request: InferenceRequest)` (exposed via `@modal.fastapi_endpoint(method="POST")`):
  - Decode input: base64 -> WAV bytes
  - Read audio: `soundfile.read(...)` -> `audio_data`, `sample_rate`
  - Convert to mono: mean over channels if needed
  - Resample to 44100 if needed (via `librosa.resample`)
  - Compute mel-spectrogram (via `AudioProcessor`)
  - Forward pass with feature maps:
    `output, feature_maps = self.model(spectrogram, return_feature_maps=True)`
  - Top-3 predictions: `softmax -> torch.topk(k=3)`
  - Prepare visualization tensors:
    - For each feature map `[1, C, H, W]`: channel-average to `[H, W]` and convert to JSON-serializable lists
  - Waveform downsampling (for UI): keeps at most `max_samples = 8000` values
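The postprocessing steps above can be sketched in NumPy (a sketch of the idea; the endpoint itself works on torch tensors, and the class names below are placeholders):

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

def top_k(probs, classes, k=3):
    """Indices of the k largest probabilities, as JSON-friendly dicts."""
    idx = np.argsort(probs)[::-1][:k]
    return [{"class": classes[i], "confidence": float(probs[i])} for i in idx]

def channel_average(feature_map):
    """[1, C, H, W] -> [H, W] list-of-lists for JSON serialization."""
    return feature_map.mean(axis=1)[0].tolist()

def downsample_waveform(samples, max_samples=8000):
    """Keep at most max_samples values by striding through the waveform."""
    step = max(1, len(samples) // max_samples)
    return samples[::step][:max_samples]

classes = ["chirping_birds", "crickets", "clock_alarm"]  # placeholder names
preds = top_k(softmax(np.array([3.0, 0.5, 0.2])), classes)
```

Channel averaging throws away per-filter detail but keeps the spatial (mel x time) structure, which is what the UI heatmaps display.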
- Method: `POST`
- Content-Type: `application/json`
- Body: `{ "audio_data": "<base64 WAV bytes>" }`
- Response:

```json
{
  "predictions": [
    { "class": "chirping_birds", "confidence": 0.9339 },
    { "class": "crickets", "confidence": 0.003 },
    { "class": "clock_alarm", "confidence": 0.003 }
  ],
  "input_spectrogram": { "shape": [128, 431], "values": [[...], ...] },
  "waveform": { "values": [...], "sample_rate": 44100, "duration": 5.0 },
  "visualization": {
    "conv1": { "shape": [32, 108], "values": [[...], ...] },
    "layer1": { "shape": [32, 108], "values": [[...], ...] },
    "layer1.block0.conv": { "shape": [32, 108], "values": [[...], ...] }
  }
}
```

Note: `values` arrays may contain normalized/log-scaled values depending on the tensor.
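A minimal Python client for this contract might look like the following (a sketch using only the standard library; `ENDPOINT_URL` is a placeholder you must replace with your deployed Modal URL):

```python
import base64
import json
import urllib.request

ENDPOINT_URL = "https://example.modal.run/inference"  # placeholder

def classify_wav(path, endpoint_url=ENDPOINT_URL):
    """Base64-encode a WAV file and POST it to the inference endpoint."""
    with open(path, "rb") as f:
        audio_b64 = base64.b64encode(f.read()).decode("ascii")
    body = json.dumps({"audio_data": audio_b64}).encode("utf-8")
    req = urllib.request.Request(
        endpoint_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

The returned dict then has the `predictions`, `input_spectrogram`, `waveform`, and `visualization` keys shown above.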
To run the Modal app:

```shell
modal run main.py
```

To get a stable web endpoint you typically deploy on Modal (the exact workflow depends on your Modal setup); the Next.js UI in this repo currently points at a hosted endpoint URL.
You have two clients that call the same endpoint and render the returned fields.
- Entry point: `audio-cnn-visualization/src/app/page.tsx`

What it does:
- Reads a `.wav` file in the browser (FileReader -> ArrayBuffer)
- Base64-encodes the bytes
- Calls the inference endpoint via `fetch()`
- Renders:
  - Top-3 predictions with progress bars
  - Input spectrogram heatmap (SVG)
  - Waveform (SVG)
  - Convolutional outputs grid (SVG)
Important functions/components:
- `handleFileChange(...)` (in `page.tsx`)
  - The main orchestration function: upload -> base64 -> POST -> set state.
- `splitLayers(visualization)` (in `page.tsx`)
  - Splits `visualization` keys into:
    - main layers (no `.`)
    - internal layers grouped under their parent
- `FeatureMap` (`src/components/FeatureMap.tsx`)
  - Renders a 2D `values` array as an SVG grid using `getColor()`.
- `Waveform` (`src/components/Waveform.tsx`)
  - Renders waveform samples as an SVG polyline.
Run it (from the `audio-cnn-visualization/` directory):

```shell
npm install
npm run dev
```

This repo also includes a standalone Streamlit UI that mimics the layout of the Next.js page.
- Entry point: `app.py`

Key functions:
- `_call_inference_endpoint(file_path, endpoint_url)`
  - Reads WAV bytes from disk, base64-encodes them, and POSTs JSON.
- `run_visualization(audio_file, endpoint_url)`
  - Parses response fields and builds:
    - HTML progress bars for predictions
    - PIL image for spectrogram heatmap
    - Matplotlib plot for waveform
    - HTML grid for conv outputs
Run it:

```shell
python app.py
```

Override endpoint (optional):

```shell
export AUDIO_CNN_ENDPOINT="https://your-endpoint"
python app.py
```

```mermaid
sequenceDiagram
    autonumber
    participant U as User
    participant UI as UI (Next.js or Streamlit)
    participant API as Modal FastAPI Endpoint
    participant P as Preprocess (AudioProcessor)
    participant M as Model (AudioCNN)
    U->>UI: Upload .wav
    UI->>UI: Read bytes + base64 encode
    UI->>API: POST {audio_data: base64}
    API->>API: Decode base64
    API->>P: mono + resample + mel-spectrogram
    P-->>API: spectrogram tensor
    API->>M: forward(return_feature_maps=True)
    M-->>API: logits + feature_maps
    API->>API: softmax + topk(3)
    API-->>UI: JSON (predictions, waveform, spectrogram, visualization)
    UI-->>U: Render charts + feature maps
```
- `model.py` - CNN model definition used by training and inference.
- `train.py` - ESC-50 dataset loading + training loop.
  - Runs on Modal GPU.
  - Saves `best_model.pth` to a Modal volume.
- `main.py` - Modal app exposing a FastAPI POST endpoint for inference.
  - Loads `best_model.pth` from the same Modal volume.
  - Returns predictions + visualization tensors as JSON.
- `app.py` - A Streamlit UI that calls the same inference endpoint as the Next.js UI.
- `audio-cnn-visualization/` - Next.js/Tailwind UI for visualizing the response.
```shell
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# Run Streamlit UI
python app.py
```

```shell
cd audio-cnn-visualization
npm install
npm run dev
```
- Streamlit shows an API error
  - Check the endpoint URL and that it is reachable.
  - Some hosted Modal endpoints can change unless deployed/stabilized.
- Slow response
  - Feature-map extraction + serialization can be heavy.
  - Large audio clips increase preprocessing time.
- Audio not 44.1 kHz
  - `main.py` resamples to 44100 Hz before generating the spectrogram, so this is handled automatically.
- Start here: `main.py` to understand the live request pipeline.
- Then: `model.py` to understand what the feature maps represent.
- Then: `train.py` to understand how the checkpoint is produced.
- Finally: `page.tsx` and `app.py` to see how the visualization is built from the response.
