Deep neural networks have been used in all sorts of classification tasks, either aiding humans in making important decisions or making those decisions themselves.
Distinguishing between music genres is one such task to which we can apply deep learning. I started this deep learning project to learn about audio classification and manipulation.
Now I am sharing my findings with you.
This article has the following sections:
- Exploring the GTZAN dataset
- Loading and augmenting the data
- Building our model
- Results and conclusion
Let's get started.
The GTZAN Dataset
The GTZAN dataset is a set of 1000 audio files spread across 10 different classes of music, namely:
- blues
- classical
- country
- disco
- hip hop
- jazz
- metal
- pop
- reggae
- rock
It consists of 100 files for each class, and each audio clip is 30 seconds long.
As the snapshot above suggests, we have 4 different types of data here.
The ‘genres_original’ folder consists of the original 1000 audio files segregated into different folders based on their genres (label).
The ‘images_original’ folder consists of the images of the Mel-spectrograms of each of these audio files.
The ‘features_30_sec.csv’ and ‘features_3_sec.csv’ files each contain features extracted from the audio files (over the full 30-second clips and over 3-second segments respectively), such as the mean and variance of the MFCCs and of the spectral roll-off frequency.
Let's plot one of the audio files.
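A waveform plot of this kind can be sketched as follows. This is only an illustration under a few assumptions: the helper name `plot_waveform` is hypothetical, and a synthetic 220 Hz tone stands in for a real GTZAN clip so that the snippet runs without the dataset. With the dataset at hand, the samples would come from an audio loader such as `librosa.load` instead.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch also runs without a display
import matplotlib.pyplot as plt
import numpy as np


def plot_waveform(samples, sample_rate, title="Waveform"):
    """Plot amplitude against time (in seconds) for a mono signal."""
    times = np.arange(len(samples)) / sample_rate
    fig, ax = plt.subplots(figsize=(10, 3))
    ax.plot(times, samples, linewidth=0.5)
    ax.set_xlabel("Time (s)")
    ax.set_ylabel("Amplitude")
    ax.set_title(title)
    fig.tight_layout()
    return fig, times


# Placeholder signal (assumption): two seconds of a 220 Hz tone at 22050 Hz.
sample_rate = 22050
t = np.arange(2 * sample_rate) / sample_rate
samples = 0.5 * np.sin(2 * np.pi * 220.0 * t)

fig, times = plot_waveform(samples, sample_rate, title="Placeholder 220 Hz tone")
```

In a notebook, calling `plt.show()` after this would display the figure.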
Let's see what the images_original folder contains. We’ll plot one of the images.
The image above shows the pictorial representation of an audio file using its Mel-spectrogram.
Loading the data
Even though we have the spectrograms available to us, I will still apply the augmentations and extract the features myself as well.
The augmentation parameters, as well as the model architecture, have been adapted from the work of Marcharla Vaibhavi and P. Radha Krishna in “Music Genre Classification using Neural Networks with Data Augmentation”.
Let's import the required libraries first.
The audiomentations library is an easy-to-use tool that allows us to augment audio files in various ways.
Let's create functions to augment our data.
The add_noise and pitch_shift instances can now be used to augment any audio file.
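To get a feel for what a noise augmentation does to a signal, here is a plain NumPy sketch. It is a stand-in written for illustration, not the audiomentations implementation, and it covers only the noise case (pitch shifting is not attempted here). The function name `add_gaussian_noise`, the `rng` parameter and the amplitude range are my own assumptions, not the parameters from the paper.

```python
import numpy as np


def add_gaussian_noise(samples, min_amplitude=0.001, max_amplitude=0.015, rng=None):
    """Return a copy of `samples` with Gaussian noise of a random amplitude added.

    The amplitude range is an arbitrary assumption chosen for illustration.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Draw one noise level per call, then scale unit-variance noise by it.
    amplitude = float(rng.uniform(min_amplitude, max_amplitude))
    noise = rng.normal(0.0, 1.0, size=samples.shape).astype(samples.dtype)
    return samples + amplitude * noise


# Placeholder clip (assumption): one second of a 440 Hz tone at 22050 Hz.
sr = 22050
clean = np.sin(2 * np.pi * 440.0 * np.arange(sr) / sr).astype(np.float32)
noisy = add_gaussian_noise(clean, rng=np.random.default_rng(0))
```

The augmented clip keeps the length and dtype of the original, so it can go through the same feature extraction as the clean audio.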
Now we need to prepare our data for training and testing.
It will be done in the following order: load the audio files → split them into train and test sets → augment the training data → extract the features from both the train and test sets → encode the labels.
Quite a long journey!
Let's start by loading a single file and checking how we can extract its Mel-spectrogram.
Here we have computed the Mel-spectrogram of the same file we loaded previously.
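To make the Mel-spectrogram step more concrete, here is a NumPy-only sketch of the computation: frame the signal, take the power spectrum of each frame, and pool the frequency bins with triangular filters spaced evenly on the mel scale. In practice a library routine such as `librosa.feature.melspectrogram` would normally do this. The simplified version here (HTK-style mel formula, unnormalised filters) is an assumption-laden illustration and will not reproduce a library's values exactly. All function names and default parameters are hypothetical.

```python
import numpy as np


def hz_to_mel(hz):
    """HTK-style conversion from Hz to mel."""
    return 2595.0 * np.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(sr, n_fft, n_mels):
    """Triangular filters evenly spaced on the mel scale, shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    mel_points = np.linspace(0.0, float(hz_to_mel(sr / 2.0)), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    bank = np.zeros((n_mels, fft_freqs.size))
    for i in range(n_mels):
        left, center, right = hz_points[i:i + 3]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        bank[i] = np.maximum(0.0, np.minimum(rising, falling))
    return bank


def mel_spectrogram_db(samples, sr=22050, n_fft=2048, hop_length=512, n_mels=128):
    """Return a (n_mels, n_frames) Mel-spectrogram in decibels."""
    # Reflect-pad so that frames are centred on multiples of hop_length.
    padded = np.pad(samples, n_fft // 2, mode="reflect")
    n_frames = 1 + (len(padded) - n_fft) // hop_length
    window = np.hanning(n_fft)
    frames = np.stack([
        padded[i * hop_length:i * hop_length + n_fft] * window
        for i in range(n_frames)
    ])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2       # (n_frames, n_fft // 2 + 1)
    mel_power = mel_filterbank(sr, n_fft, n_mels) @ power.T  # (n_mels, n_frames)
    return 10.0 * np.log10(np.maximum(mel_power, 1e-10))


# Placeholder clip (assumption): one second of a 440 Hz tone.
sr = 22050
tone = np.sin(2 * np.pi * 440.0 * np.arange(sr) / sr)
spec = mel_spectrogram_db(tone, sr=sr)
```

For a pure tone, almost all of the energy ends up in the few mel bands around the tone's frequency, which is a handy sanity check before moving on to real music.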
Now we write a routine that loads all the audio files along with their labels.
The arrays X and Y now contain our audio data and their corresponding labels.
Shape of X : (999, 617400)
Shape of Y : (999, 10)
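A loading routine of this kind can be sketched as follows. It assumes the folder layout described earlier (one sub-folder per genre inside ‘genres_original’) and takes the label from the folder name, so in this sketch the labels come back as plain genre strings. The function name `load_dataset`, the `load_fn` parameter and the choice to skip unreadable or too-short files are my assumptions. The demo at the bottom uses a temporary folder and a fake loader so that it runs without the dataset.

```python
import tempfile
from pathlib import Path

import numpy as np


def load_dataset(root, load_fn, n_samples):
    """Walk root/<genre>/*.wav and return (X, Y).

    X has shape (n_files, n_samples); Y holds the genre name of each file.
    Files that fail to load or are shorter than n_samples are skipped.
    """
    signals, labels = [], []
    for genre_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        for wav_path in sorted(genre_dir.glob("*.wav")):
            try:
                samples = np.asarray(load_fn(wav_path), dtype=np.float32)
            except Exception:
                continue  # unreadable file: leave it out of the dataset
            if samples.shape[0] < n_samples:
                continue
            signals.append(samples[:n_samples])  # trim every clip to a common length
            labels.append(genre_dir.name)
    return np.stack(signals), np.array(labels)


def fake_loader(path):
    """Stand-in for a real audio loader; pretends one file is corrupt."""
    if path.name == "jazz.00001.wav":
        raise ValueError("unreadable file")
    return np.ones(12)


with tempfile.TemporaryDirectory() as tmp:
    for genre in ("blues", "jazz"):
        genre_dir = Path(tmp) / genre
        genre_dir.mkdir()
        for i in range(2):
            (genre_dir / f"{genre}.{i:05d}.wav").touch()
    X_demo, Y_demo = load_dataset(tmp, fake_loader, n_samples=8)
```

With the real dataset, `load_fn` would wrap an audio loader and `n_samples` would be the common clip length (617400 in the shapes printed above).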
Now we split this data into test and train sets and augment the train set.
```python
from sklearn.model_selection import train_test_split

# Split the data using the scikit-learn library
audio_train, audio_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.20, random_state=6)
```
Note: the above code can take several minutes to complete, since it does a lot of processing on roughly 800 training files.
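One way to grow the training set is sketched here, under the assumption that every original clip is kept and one augmented copy is added per augmenter. With two augmenters this triples the training set, which would match the training-set size reported in this article. The function name `augment_training_set` is hypothetical, and the two lambdas are crude placeholders for the real augmenters (in particular, the second one is not a pitch shift).

```python
import numpy as np


def augment_training_set(audio_train, y_train, augmenters):
    """Return the original clips plus one augmented copy per augmenter, with labels."""
    augmented_audio = list(audio_train)
    augmented_labels = list(y_train)
    for augment in augmenters:
        for clip, label in zip(audio_train, y_train):
            augmented_audio.append(augment(clip))
            augmented_labels.append(label)  # an augmented clip keeps its genre
    return np.stack(augmented_audio), np.array(augmented_labels)


# Placeholder data (assumption): four short random "clips" with text labels.
rng = np.random.default_rng(0)
audio_train = rng.normal(size=(4, 1000)).astype(np.float32)
y_train = np.array(["blues", "jazz", "rock", "pop"])

placeholder_augmenters = [
    # Stand-in for a noise augmenter.
    lambda clip: clip + 0.01 * rng.normal(size=clip.shape).astype(clip.dtype),
    # Stand-in for a second augmenter; a circular shift, NOT a pitch shift.
    lambda clip: np.roll(clip, 100),
]

X_aug, y_aug = augment_training_set(audio_train, y_train, placeholder_augmenters)
```

Only the training clips go through this function. The test clips stay untouched so that the evaluation reflects real, unmodified audio.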
Now we have the following variables:
X_train → all the training data in Mel-spectrogram form (shape = (2397, 128, 1206))
Y_train → the labels for the training data
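The shapes quoted in this article can be sanity-checked with a little arithmetic. The sketch below rests on assumptions that are not stated in the article: a sample rate of 22050 Hz, a hop length of 512 samples with centred frames, a test share that is rounded up by the splitter, and three copies of every training clip (the original plus two augmented versions).

```python
import math

n_clips = 999          # number of loaded audio files
test_size = 0.20       # share passed to the train/test split

n_test = math.ceil(test_size * n_clips)  # assuming the test share is rounded up
n_train = n_clips - n_test

copies_per_clip = 3    # assumption: original + noise copy + pitch-shift copy
n_train_augmented = n_train * copies_per_clip

n_samples = 617400     # samples per clip, from the shape of X
sample_rate = 22050    # assumption: a common default sample rate
hop_length = 512       # assumption: a common default hop length

n_frames = 1 + n_samples // hop_length  # centred framing
duration_s = n_samples / sample_rate

print(n_test, n_train, n_train_augmented)  # 200 799 2397
print(n_frames, duration_s)                # 1206 28.0
```

Under these assumptions the numbers line up with the (2397, 128, 1206) training shape, with 128 being the number of mel bands.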
But we also need to extract the mel-spec features from the test data.
Notice that this time we haven’t augmented the data, since the test data should remain in its original form for a fair evaluation.
```python
# Convert the train and test data to NumPy arrays
X_train = np.stack(X_train)
X_test = np.stack(X_test)
```
There is one step left for the labels: they are still in text form, so we need to encode them as numbers. Good thing there's a library for everything :D.
Only one last thing remains now. The Keras Conv2D layer expects a channel dimension, so we need to add an extra dimension to the data. It is quite a simple reshaping process.
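To show what label encoding boils down to, here is a NumPy-only sketch. It is a stand-in for a library encoder (for example scikit-learn's `LabelEncoder`), not necessarily the code used for this article. The variable names and the tiny label list are made up for illustration.

```python
import numpy as np

# Placeholder labels (assumption): a handful of genre names in text form.
labels = np.array(["rock", "blues", "jazz", "blues"])

# np.unique sorts the distinct labels and returns each label's index into them.
classes, label_ids = np.unique(labels, return_inverse=True)

# One row per label, with a 1 in the column of its class.
one_hot = np.eye(len(classes), dtype=np.float32)[label_ids]

# Going back from integer ids to text is a simple lookup.
decoded = classes[label_ids]

print(classes)    # ['blues' 'jazz' 'rock']
print(label_ids)  # [2 0 1 0]
```

Whether the integer ids or the one-hot rows are needed depends on the loss function the model is compiled with.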
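```python
# Add a trailing channel dimension: (samples, mel bands, frames) -> (samples, mel bands, frames, 1)
X_train = X_train.reshape(X_train.shape[0], X_train.shape[1], X_train.shape[2], 1)
X_test = X_test.reshape(X_test.shape[0], X_test.shape[1], X_test.shape[2], 1)
```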
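As a quick check of what this reshaping does, the sketch below uses a small dummy array (an assumption, standing in for the real spectrograms) and shows that `np.expand_dims` produces the same result as the explicit reshape.

```python
import numpy as np

# Dummy batch (assumption): 5 spectrograms with 128 mel bands and 1206 frames.
X = np.zeros((5, 128, 1206), dtype=np.float32)

# Explicit reshape: keep the three existing axes and append a channel axis of size 1.
X_reshaped = X.reshape(X.shape[0], X.shape[1], X.shape[2], 1)

# Equivalent shortcut that does not need the shape spelled out.
X_expanded = np.expand_dims(X, axis=-1)

print(X_reshaped.shape)  # (5, 128, 1206, 1)
```

No values change here. The spectrogram is simply treated as a one-channel image.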
Building the Model
Now that our data is ready, we can move on to building and compiling the model. We will use the Keras API for TensorFlow to construct the model.
As mentioned before, the model architecture has been taken from a research paper, a link to which is added at the end.
Let's code this.
Now we just have to train it.
Training a 2D CNN model takes a long time, so I did my processing in a Kaggle notebook. Make sure you have a GPU environment set up before starting the training.
```python
history = model.fit(X_train, Y_train, validation_data=(X_test, Y_test), epochs=50)
```
Finally, let's plot the accuracies.
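A plot of this kind can be sketched as follows. It assumes the model was compiled with an accuracy metric, so that the dictionary stored in `history.history` has the keys `"accuracy"` and `"val_accuracy"`. The helper name `plot_accuracy` is hypothetical, and the numbers in `fake_history` are made up so that the snippet runs on its own. They are not results from this project.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch also runs without a display
import matplotlib.pyplot as plt


def plot_accuracy(history_dict):
    """Plot training and validation accuracy per epoch from a Keras-style history dict."""
    epochs = range(1, len(history_dict["accuracy"]) + 1)
    fig, ax = plt.subplots()
    ax.plot(epochs, history_dict["accuracy"], label="train")
    ax.plot(epochs, history_dict["val_accuracy"], label="validation")
    ax.set_xlabel("Epoch")
    ax.set_ylabel("Accuracy")
    ax.legend()
    return fig


# Made-up values for illustration only; with a trained model, pass history.history.
fake_history = {
    "accuracy": [0.20, 0.35, 0.50],
    "val_accuracy": [0.18, 0.30, 0.41],
}
fig = plot_accuracy(fake_history)
best_val_accuracy = max(fake_history["val_accuracy"])
```

A widening gap between the two curves would be a hint that the model is overfitting the training set.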
Results and Conclusion
The best accuracy we achieved with our current model was 52.5%. This suggests that music genre classification is a tough task: even with such a complex model and augmented data, we were barely able to cross the 50% threshold.
The paper that was implemented actually mentions six different augmentation techniques. We have only used two, so perhaps using more of them could improve the results. You are free to play around with your own experiments.
There are a lot of possibilities to try out with this model. One is to shift from a CNN model to an RCNN. Another is to try extracting features other than the Mel-spectrogram.
If you find that any of these techniques work, do mention them in the comments.