Advanced unsupervised learning and audio representation project

Environmental Audio Clustering

Unsupervised audio pipeline with handcrafted features and Transformer embeddings

This project explores how far unsupervised learning can go on environmental sound data. It starts with acoustic feature engineering, tests multiple clustering algorithms, and then compares them with Transformer-based audio representations to quantify the gain in semantic grouping.

Audio clips: 769
Best silhouette score: 0.601
Best Davies-Bouldin index: 0.525

Problem

Environmental audio is hard to label at scale, so a strong unsupervised pipeline needs to recover structure without supervision and stay interpretable enough for audit.

Approach

Extracted MFCC, chroma, spectral, rhythm, and energy features; compared PCA, whitened PCA, and UMAP spaces; benchmarked K-Means, GMM, Spectral Clustering, and HDBSCAN; then repeated clustering with AST, wav2vec2, and HuBERT embeddings.
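The benchmarking loop described above can be sketched with scikit-learn. This is a minimal illustration, not the project's code: the feature matrix is synthetic stand-in data (in the real pipeline it would come from librosa feature extraction on the audio clips), only two of the latent spaces and two of the four algorithms are shown, and HDBSCAN is omitted because it lives in a separate `hdbscan` package.

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the handcrafted feature matrix
# (rows = audio clips, columns = MFCC/chroma/spectral/rhythm/energy features).
X, _ = make_blobs(n_samples=300, n_features=40, centers=5, random_state=0)

# Two of the latent spaces compared: plain PCA and whitened PCA.
spaces = {
    "pca": PCA(n_components=10, random_state=0).fit_transform(X),
    "pca_whitened": PCA(n_components=10, whiten=True, random_state=0).fit_transform(X),
}

# A subset of the clustering algorithms benchmarked.
def cluster(Z, k=5):
    return {
        "kmeans": KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Z),
        "gmm": GaussianMixture(n_components=k, random_state=0).fit_predict(Z),
    }

# Score every (space, algorithm) pair with the same internal metric.
scores = {
    (space, algo): silhouette_score(Z, labels)
    for space, Z in spaces.items()
    for algo, labels in cluster(Z).items()
}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

Scoring every combination with one shared metric is what makes the cross-pipeline comparison fair, rather than tuning a single model in isolation.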

Results

The best classical pipeline reached a silhouette score of 0.376, while the AST + UMAP + K-Means pipeline reached 0.601 with a Davies-Bouldin index of 0.525, showing a clear representation learning advantage.
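Both metrics cited above are computed from the clustered embeddings alone, with no labels. A minimal sketch of how they are obtained with scikit-learn, using synthetic blobs as a stand-in for the reduced AST embeddings:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score

# Synthetic stand-in for the UMAP-reduced embedding matrix.
X, _ = make_blobs(n_samples=500, n_features=8, centers=6, random_state=42)
labels = KMeans(n_clusters=6, n_init=10, random_state=42).fit_predict(X)

sil = silhouette_score(X, labels)      # higher is better, in [-1, 1]
db = davies_bouldin_score(X, labels)   # lower is better, >= 0
print(f"silhouette={sil:.3f}  davies_bouldin={db:.3f}")
```

Because silhouette rewards tight, well-separated clusters and Davies-Bouldin penalizes overlapping ones, reporting both guards against a pipeline gaming a single metric.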

What is in the repository

Benchmarked classical features against three Transformer embedding families.
Compared four clustering algorithms across several latent spaces instead of tuning one model in isolation.
Audited cluster quality with both internal metrics and semantic tag inspection.
Produced a clear final comparison between the best classical and best deep representation pipelines.
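The semantic tag inspection mentioned in the audit step can be made quantitative with a simple purity check: for each cluster, count how many clips share that cluster's majority tag. The tags and cluster assignments below are hypothetical examples; in the real audit they would come from dataset metadata and the clustering output.

```python
from collections import Counter

# Hypothetical per-clip semantic tags and cluster assignments.
tags = ["rain", "rain", "dog", "dog", "dog", "siren", "siren", "rain"]
clusters = [0, 0, 1, 1, 0, 2, 2, 0]

def cluster_purity(clusters, tags):
    """Fraction of clips whose tag matches their cluster's majority tag."""
    by_cluster = {}
    for c, t in zip(clusters, tags):
        by_cluster.setdefault(c, []).append(t)
    majority_hits = sum(
        Counter(ts).most_common(1)[0][1] for ts in by_cluster.values()
    )
    return majority_hits / len(tags)

print(round(cluster_purity(clusters, tags), 3))  # 7 of 8 clips match: 0.875
```

Pairing a label-based check like this with the internal metrics catches cases where a clustering is geometrically tight but semantically mixed.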

Role and scope

Research pipeline design, feature engineering, benchmarking, and qualitative audit

Project context

Advanced unsupervised learning and audio representation project

Main stack

Python · librosa · scikit-learn · UMAP · Transformers · PyTorch