Environmental Audio Clustering
Unsupervised audio pipeline with handcrafted features and Transformer embeddings
This project explores how far unsupervised learning can go on environmental sound data. It starts with acoustic feature engineering, tests multiple clustering algorithms, and then compares them with Transformer-based audio representations to quantify the gain in semantic grouping.
769 audio clips
Best silhouette: 0.601
Best Davies-Bouldin index: 0.525
Problem
Environmental audio is hard to label at scale, so a strong unsupervised pipeline needs to recover structure without supervision and stay interpretable enough for audit.
Approach
Extracted MFCC, chroma, spectral, rhythm, and energy features; compared PCA, whitened PCA, and UMAP spaces; benchmarked K-Means, GMM, Spectral Clustering, and HDBSCAN; then repeated clustering with AST, wav2vec2, and HuBERT embeddings.
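The classical branch described above (handcrafted feature vectors → whitened PCA → K-Means) can be sketched with NumPy alone. This is a minimal, illustrative version: the toy feature matrix, component count, and deterministic centroid initialization are assumptions for the sketch, not the repository's actual configuration.

```python
import numpy as np

def whitened_pca(X, n_components):
    """Project X onto its top principal components and scale each axis to unit variance."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = Xc @ Vt[:n_components].T                          # PCA projection
    return Z / (S[:n_components] / np.sqrt(len(X) - 1))   # whitening step

def kmeans(X, k, init_idx, n_iter=50):
    """Minimal Lloyd's algorithm with deterministic initial centroids (for reproducibility)."""
    centroids = X[init_idx].copy()
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None] - centroids[None], axis=2)  # point-centroid distances
        labels = d.argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

# Toy "feature vectors": two well-separated blobs standing in for acoustic features
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 12)), rng.normal(8, 0.5, (20, 12))])
Z = whitened_pca(X, n_components=2)
labels, _ = kmeans(Z, k=2, init_idx=[0, -1])  # one seed point from each end of the data
```

In the real pipeline the rows of `X` would be the concatenated MFCC, chroma, spectral, rhythm, and energy descriptors per clip (e.g. as extracted with librosa), and the init scheme would typically be k-means++ rather than fixed indices.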
Results
The best classical pipeline reached a silhouette score of 0.376, while the AST + UMAP + K-Means pipeline reached 0.601 with a Davies-Bouldin index of 0.525 (lower is better), showing a clear advantage for learned Transformer representations.
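Both metrics quoted above have simple closed-form definitions and can be computed from scratch; this NumPy sketch evaluates them on toy clusters (in practice one would use sklearn's `silhouette_score` and `davies_bouldin_score`, which implement the same definitions). The toy data is illustrative, not the project's.

```python
import numpy as np

def silhouette(X, labels):
    """Mean silhouette: (b - a) / max(a, b) per point, averaged over all points."""
    D = np.linalg.norm(X[:, None] - X[None], axis=2)   # pairwise distance matrix
    scores = []
    for i, li in enumerate(labels):
        same = (labels == li) & (np.arange(len(X)) != i)
        a = D[i, same].mean()                                          # intra-cluster
        b = min(D[i, labels == lj].mean() for lj in set(labels) if lj != li)
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

def davies_bouldin(X, labels):
    """Mean over clusters of the worst (S_i + S_j) / ||c_i - c_j|| ratio."""
    ks = sorted(set(labels))
    cents = np.array([X[labels == k].mean(axis=0) for k in ks])
    S = np.array([np.linalg.norm(X[labels == k] - c, axis=1).mean()    # cluster spread
                  for k, c in zip(ks, cents)])
    ratios = [max((S[i] + S[j]) / np.linalg.norm(cents[i] - cents[j])
                  for j in range(len(ks)) if j != i)
              for i in range(len(ks))]
    return float(np.mean(ratios))

# Two tight, well-separated clusters: silhouette near 1, Davies-Bouldin near 0
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [10, 10], [10, 11], [11, 10], [11, 11]], dtype=float)
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
```

This direction of the two scales is why the headline numbers read the way they do: a higher silhouette (0.601 vs 0.376) and a low Davies-Bouldin index (0.525) both indicate tighter, better-separated clusters.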
What is in the repository
Role and scope
Research pipeline design, feature engineering, benchmarking, and qualitative audit
Project context
Advanced unsupervised learning and audio representation project
Main stack