Hi everyone! I recently finished a project that allowed me to thoroughly explore the intersection between signal processing and **deep learning**.
The goal was clear: build a model that classifies music genres not from the raw audio waveform, but by “teaching it to see” the music through spectrograms.
How did I do it? The workflow:
Data Engineering: I used the GTZAN dataset and wrote my own yt-dlp-based script to expand the corpus with contemporary genres such as Trap, Lofi, and Reggaeton.
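
As a rough sketch of what a yt-dlp-based collection script could look like (this is not the project's actual code): the helper names `build_ydl_options` and `download_genre`, the `dataset/<genre>/` folder layout, and the WAV output format are all assumptions made for illustration. The audio conversion step relies on ffmpeg being installed.

```python
from pathlib import Path


def build_ydl_options(genre, out_dir="dataset"):
    """Return yt-dlp options that store each download as WAV under out_dir/genre/."""
    target = Path(out_dir) / genre
    return {
        "format": "bestaudio/best",                      # prefer the best audio-only stream
        "outtmpl": str(target / "%(id)s.%(ext)s"),       # one file per video id
        "noplaylist": True,                              # download single videos only
        "postprocessors": [
            {"key": "FFmpegExtractAudio", "preferredcodec": "wav"},
        ],
    }


def download_genre(genre, urls, out_dir="dataset"):
    """Download every URL in `urls` into the folder for `genre` (hypothetical helper)."""
    import yt_dlp  # third-party; imported lazily so the options helper works without it

    with yt_dlp.YoutubeDL(build_ydl_options(genre, out_dir)) as ydl:
        ydl.download(list(urls))
```

Keeping one folder per genre means the folder name can double as the class label later on.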

Critical Preprocessing: I implemented a pipeline with Librosa to normalize the audio, trim silences, and transform the waveforms into 128×128 px Mel spectrograms. The music was officially converted into RGB images!
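
The steps above could be sketched roughly as follows. This is an illustration under assumptions, not the project's pipeline: the 22,050 Hz sample rate, the `top_db=30` trim threshold, the nearest-neighbour resize, and the grayscale-copied-into-three-channels RGB conversion are placeholder choices, and `load_mel_db` and `to_rgb_image` are hypothetical names.

```python
import numpy as np


def load_mel_db(path, sr=22050, n_mels=128, top_db=30):
    """Load a clip, trim silence, peak-normalize, and return a dB-scaled Mel spectrogram."""
    import librosa  # third-party; imported lazily so to_rgb_image works without it

    y, sr = librosa.load(path, sr=sr, mono=True)       # resample and mix down to mono
    y, _ = librosa.effects.trim(y, top_db=top_db)      # drop leading/trailing silence
    y = librosa.util.normalize(y)                      # peak-normalize to [-1, 1]
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)        # shape: (n_mels, frames)


def to_rgb_image(mel_db, size=128):
    """Turn a (n_mels, frames) spectrogram into a (size, size, 3) uint8 image."""
    # Nearest-neighbour resize: pick `size` evenly spaced rows and columns.
    rows = np.linspace(0, mel_db.shape[0] - 1, size).round().astype(int)
    cols = np.linspace(0, mel_db.shape[1] - 1, size).round().astype(int)
    resized = mel_db[np.ix_(rows, cols)]
    # Min-max scale to 0..255; a constant input maps to all zeros.
    lo, hi = resized.min(), resized.max()
    scaled = (resized - lo) / (hi - lo) if hi > lo else np.zeros_like(resized)
    gray = (scaled * 255).astype(np.uint8)
    # Assumption: replicate the single channel three times to get an RGB array.
    return np.stack([gray] * 3, axis=-1)
```

Applying a colormap instead of replicating the channel would also yield RGB images; the post does not say which route was taken.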

CNN Architecture: I designed a Convolutional Neural Network (CNN) with Conv2D layers, MaxPooling layers to downsample the feature maps, and Dropout to reduce overfitting.
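
The layer names suggest Keras, so here is a minimal sketch assuming TensorFlow/Keras. The post does not give the number of blocks, filter counts, dense width, or dropout rate, so every value below is a placeholder, and `build_cnn` and `pooled_size` are hypothetical names.

```python
def pooled_size(size, n_blocks, pool=2):
    """Spatial size after n_blocks of 'same'-padded convolution plus pool x pool max pooling."""
    for _ in range(n_blocks):
        size //= pool  # 'same' convolutions keep the size; each pooling step divides it
    return size


def build_cnn(num_classes, input_shape=(128, 128, 3)):
    """Build and compile a small Conv2D / MaxPooling / Dropout classifier."""
    from tensorflow import keras          # third-party; imported lazily
    from tensorflow.keras import layers

    model = keras.Sequential([
        keras.Input(shape=input_shape),
        layers.Conv2D(32, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(2),           # 128 -> 64
        layers.Conv2D(64, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(2),           # 64 -> 32
        layers.Conv2D(128, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(2),           # 32 -> 16
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.5),              # randomly zero half the activations in training
        layers.Dense(num_classes, activation="softmax"),
    ])
    # Integer genre labels are assumed, hence the sparse loss.
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

With three pooling steps a 128×128 input shrinks to 16×16 feature maps before the Flatten layer, which is what keeps the dense layer at a manageable size.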

Results: I achieved an overall accuracy of 80%. The most satisfying part was seeing the model reach an F1-Score of 1.00 for the Lofi genre, which I take as a good sign for the quality of the custom dataset.
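
For reference, accuracy and per-genre F1 can be computed from true and predicted labels as in the sketch below. F1 for a genre is 2·TP / (2·TP + FP + FN), so a score of 1.00 means that genre had no false positives and no false negatives on the evaluated set. The labels in the usage example are toy values invented for illustration, not the project's predictions, and the function names are hypothetical.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_f1(y_true, y_pred):
    """Return {label: F1}, using F1 = 2*TP / (2*TP + FP + FN) per label."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1   # predicted label gains a false positive
            fn[t] += 1   # true label gains a false negative
    scores = {}
    for label in set(y_true) | set(y_pred):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


# Toy labels, invented purely for illustration.
toy_true = ["lofi", "lofi", "trap", "trap", "rock", "rock"]
toy_pred = ["lofi", "lofi", "rock", "trap", "rock", "rock"]
toy_scores = per_class_f1(toy_true, toy_pred)
```

In this toy case every Lofi clip is predicted correctly and nothing else is mistaken for Lofi, so its F1 is perfect even though overall accuracy is not.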

Lessons learned:
It’s not always about having the most complex model. I learned that data cleaning and standardization (such as precisely trimming clips to 30 seconds) are just as critical as the network architecture.
Although I initially planned to do fine-tuning, I decided to first consolidate the base architecture to ensure solid learning of the fundamental features of the new genres. We’re still learning!
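
The clip-length standardization mentioned above can be illustrated with a short sketch. It assumes each clip is already loaded as a 1-D NumPy array, truncates from the start of the clip, and zero-pads clips that are too short; the post does not say how short clips were handled, and `fix_duration` is a hypothetical name.

```python
import numpy as np


def fix_duration(y, sr, seconds=30.0):
    """Return `y` truncated or zero-padded to exactly `seconds` at sample rate `sr`."""
    target = int(round(sr * seconds))
    if len(y) >= target:
        return y[:target]                        # keep the first `seconds` of audio
    return np.pad(y, (0, target - len(y)))       # pad the tail with silence
```

Giving every clip the same number of samples means every spectrogram covers the same time span, so the network never has to compare a 10-second excerpt with a 3-minute track.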
Passionate Software Developer (DAM) currently specializing in AI and Big Data. I enjoy building robust backends with Java and Spring Boot, while exploring the world of data processing, machine learning models, and audio analysis with Python.