Deep neural networks have been used in all sorts of classification tasks, either helping humans make important decisions or making those decisions outright.
Distinguishing between music genres is one such task to which we can apply deep learning. I decided to start this deep learning project to learn about audio classification and manipulation.
Here I am sharing my findings with you.
This article has the following sub-sections:
- Exploring the GTZAN dataset.
- Loading and Augmenting the data.
- Building our model.
- Results and conclusion.
Let’s get started.
The GTZAN Dataset
The GTZAN dataset is a set of 1,000 audio files spread across 10 classes of music, including:
- hip hop
It contains 100 files for each class, and each audio clip is 30 seconds long.
As the snapshot above suggests, we have 4 different types of data here.
The ‘genres_original’ folder consists of the original 1000 audio files segregated into different folders based on their genres (label).
The ‘images_original’ folder consists of the images of the Mel-spectrograms of each of these audio files.
The ‘features_30_sec.csv’ and ‘features_3_sec.csv’ files each contain features extracted from the audio files, such as the means and standard deviations of different mel-spectrogram components and roll-off frequencies.
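To check the "100 files per class" figure on your own copy of the data, a minimal sketch like the one below counts the clips in each genre folder. It assumes the ‘genres_original’ layout described above (one sub-folder per genre) and that the clips are stored as `.wav` files; the function name `count_clips_per_genre` is hypothetical and not part of any library.

```python
from pathlib import Path


def count_clips_per_genre(genres_dir):
    """Map each genre sub-folder name to the number of .wav files in it."""
    counts = {}
    for genre_dir in sorted(Path(genres_dir).iterdir()):
        if genre_dir.is_dir():
            # Assumption: every clip is stored as a .wav file
            counts[genre_dir.name] = sum(
                1 for f in genre_dir.iterdir() if f.suffix.lower() == ".wav"
            )
    return counts
```

Calling it on the ‘genres_original’ folder should report ten genres with roughly 100 clips each.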
Let’s plot an audio file.
Let’s see what the images_original folder contains. We’ll plot one of the images.
The image above shows the pictorial representation of an audio file using its Mel-spectrogram.
Loading the data
Even though we have the spectrograms available to us, I will still apply the augmentations and extract the features myself as well.
The augmentation parameters, as well as the model architecture, have been adapted from the work of Marcharla Vaibhavi and P. Radha Krishna in “Music Genre Classification using Neural Networks with Data Augmentation”.
Let’s import the required libraries first.
from audiomentations import Compose, AddGaussianNoise, TimeStretch, PitchShift, Shift
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import librosa
from tqdm import tqdm
import os
The audiomentations library is an easy-to-use tool that allows us to augment audio files in various ways.
Let’s create functions to augment our data.
add_noise = Compose([
    AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.015, p=0.7),
])
pitch_shift = Compose([
    PitchShift(min_semitones=-4, max_semitones=12, p=0.5),
])
The add_noise and pitch_shift instances can now be used to augment any audio file.
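To make the noise augmentation less of a black box, here is a rough NumPy sketch of the idea behind `AddGaussianNoise`: with probability `p`, draw an amplitude between `min_amplitude` and `max_amplitude` and add Gaussian noise scaled by it. This is a simplified illustration and not the library's actual implementation; the function name `add_gaussian_noise_sketch` and the `rng` parameter are hypothetical.

```python
import numpy as np


def add_gaussian_noise_sketch(samples, min_amplitude=0.001, max_amplitude=0.015,
                              p=0.7, rng=None):
    """Return a noisy copy of `samples` with probability p, else an unchanged copy."""
    rng = np.random.default_rng() if rng is None else rng
    samples = np.asarray(samples, dtype=np.float32)
    if rng.random() >= p:
        # Augmentation skipped this time
        return samples.copy()
    # Pick one noise level for the whole clip
    amplitude = rng.uniform(min_amplitude, max_amplitude)
    noise = rng.standard_normal(samples.shape).astype(np.float32)
    return samples + np.float32(amplitude) * noise
```

With the real library, a composed transform is called on a waveform and its sample rate, for example `add_noise(samples=audio, sample_rate=sr)`.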
Now we need to prepare our data for training and testing.
This will be done in the following order: load the audio files → split them into train and test sets → augment the training data → extract the features from both the train and test sets → encode the labels.
Quite a long journey!
Let’s start by loading a single file and checking how we can extract its mel-spectrogram.
import librosa.display

# `sample` and `sr` are the waveform and sample rate of the file loaded previously

# setting melspec parameters
n_mels = 128
hop_length = 512
n_fft = 1024

# extract melspec features using librosa (recent versions require y= as a keyword)
S = librosa.feature.melspectrogram(y=sample, sr=sr, n_fft=n_fft,
                                   hop_length=hop_length, n_mels=n_mels)

# convert it to dB scale
S_DB = librosa.power_to_db(S, ref=np.max)

# display the spectrogram
librosa.display.specshow(S_DB, sr=sr, hop_length=hop_length, x_axis='time', y_axis='mel')
plt.colorbar(format='%+2.0f dB')
Here we have computed the mel-spectrogram of the same file we loaded previously.
Now we write a routine that loads all the audio files along with their labels.
# temporary list for the input data
data = []
# list to append all the labels
Y = []

base_path = '../input/gtzan-dataset-music-genre-classification/Data/genres_original'

# looping through all label directories
for label in tqdm(os.listdir(base_path)):
    file_path = base_path + '/' + label
    # looping through each file in the directory
    for pth in os.listdir(file_path):
        try:
            final_path = file_path + '/' + pth
            # loading the first 28 seconds of the original file
            audio, sr = librosa.load(final_path, duration=28)
            # appending data to a list
            data.append(audio)
            # appending labels to the label list
            Y.append(label)
        except Exception:
            print("Error in file", pth)

# converting list to a numpy array
X = np.stack(data)
The arrays X and Y now contain our audio data and their corresponding labels.
Shape of X : (999, 617400)
Shape of Y : (999,)
Now we split this data into test and train sets and augment the train set.
from sklearn.model_selection import train_test_split

# split the data using the scikit-learn library
audio_train, audio_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.20, random_state=6)
def get_melspec(audio, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels):
    # calculate the mel-spectrogram of the provided audio wave
    # (recent librosa versions require y= as a keyword)
    S = librosa.feature.melspectrogram(y=audio, sr=sr, n_fft=n_fft,
                                       hop_length=hop_length, n_mels=n_mels)
    return S
# temporary list for the input data
X_train = []
# list to append all the labels
Y_train = []

# looping through train data to create melspec and augment data
for i, dat in tqdm(enumerate(audio_train)):
    try:
        # adding noise to the file
        noisy_audio = add_noise(samples=dat, sample_rate=sr)
        # changing pitch of the audio
        pitch_audio = pitch_shift(samples=dat, sample_rate=sr)

        # generate melspec for original and augmented files
        mel = get_melspec(dat)
        noise_mel = get_melspec(noisy_audio)
        pitch_mel = get_melspec(pitch_audio)

        # appending augmented data to original training data
        X_train.append(mel)
        Y_train.append(y_train[i])
        X_train.append(noise_mel)
        Y_train.append(y_train[i])
        X_train.append(pitch_mel)
        Y_train.append(y_train[i])
    except Exception as e:
        # report the index of the failing clip (no file name is available here)
        print("Error in training sample:", i)
        print("Error:", e)
!! The above code could take several minutes to complete, since it processes roughly 800 training clips and computes three spectrograms for each !!
Now we have the following variables:
X_train → All the training data in mel-spectrogram form (shape = (2397, 128, 1206)).
Y_train → Labels for the training data.
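The training shape above can be sanity-checked with a little arithmetic. The sketch below assumes librosa's defaults (a 22,050 Hz sample rate when loading, and centered frames, which give 1 + n_samples // hop_length columns) and that scikit-learn rounds the test split up; the constant and helper names are hypothetical.

```python
import math

SAMPLE_RATE = 22050      # librosa.load default (assumed, since sr was not set)
DURATION_S = 28          # duration passed to librosa.load
HOP_LENGTH = 512
N_MELS = 128
N_CLIPS = 999            # clips that loaded successfully
TEST_SIZE = 0.20
COPIES_PER_CLIP = 3      # original + noisy + pitch-shifted


def melspec_frames(n_samples, hop_length):
    """Number of spectrogram columns for a centered STFT."""
    return 1 + n_samples // hop_length


n_samples = SAMPLE_RATE * DURATION_S
n_test = math.ceil(TEST_SIZE * N_CLIPS)
n_train = N_CLIPS - n_test
train_shape = (n_train * COPIES_PER_CLIP, N_MELS, melspec_frames(n_samples, HOP_LENGTH))

print(n_samples, n_train, train_shape)  # 617400 799 (2397, 128, 1206)
```

So the 2397 rows correspond to 799 training clips, each present three times (original, noisy and pitch-shifted).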
But we also need to extract the mel-spec features from the test data.
# temporary list for the input data
X_test = []
# list to append all the labels
Y_test = []

# looping through test data to create melspec (no augmentation here)
for i, dat in tqdm(enumerate(audio_test)):
    try:
        # generate melspec for the original file
        mel = get_melspec(dat)
        # appending test melspec to list
        X_test.append(mel)
        Y_test.append(y_test[i])
    except Exception as e:
        # report the index of the failing clip (no file name is available here)
        print("Error in test sample:", i)
        print("Error:", e)
Notice that this time we haven’t augmented the data, since the test set should stay unmodified for a fair evaluation.
# converting the test and train data to numpy arrays
X_train = np.stack(X_train)
X_test = np.stack(X_test)
There is one last step left. Our labels are still in text form, so we need to encode them as numbers. Good thing there’s a library for everything :D.
from sklearn.preprocessing import LabelEncoder

# fit the encoder once on the training labels and reuse it for the test labels,
# so both sets share the same genre-to-integer mapping
encoder = LabelEncoder()
encoder.fit(Y_train)
Y_train = encoder.transform(Y_train).reshape([len(Y_train), 1])
Y_test = encoder.transform(Y_test).reshape([len(Y_test), 1])
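A general caveat with label encoding: the integer given to each genre depends on which labels the encoder saw when it was fitted, so the train and test labels should go through one shared mapping. The pure-Python sketch below mimics the sorted-classes behaviour of a label encoder to show how two separate fits can disagree; `fit_mapping` and `encode` are hypothetical helpers, not scikit-learn functions, and the labels are a toy example.

```python
def fit_mapping(labels):
    """Assign integers to the sorted unique labels, as a label encoder would."""
    return {label: index for index, label in enumerate(sorted(set(labels)))}


def encode(labels, mapping):
    """Translate text labels to integers using an already-fitted mapping."""
    return [mapping[label] for label in labels]


train_labels = ["blues", "jazz", "rock", "jazz"]
test_labels = ["rock", "jazz"]          # toy split that happens to lack "blues"

shared = fit_mapping(train_labels)      # fit once, on the training labels
separate = fit_mapping(test_labels)     # a second, independent fit

print(encode(test_labels, shared))      # [2, 1]
print(encode(test_labels, separate))    # [1, 0]
```

With ten genres and about 200 test clips, every class will very likely appear in both splits, so separate fits usually agree, but reusing one fitted encoder removes the doubt.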
One more thing remains: the Keras Conv2D layer expects a trailing channel dimension, so we need to add an extra axis to the data. It is quite a simple reshaping process.
X_train = X_train.reshape(X_train.shape[0], X_train.shape[1], X_train.shape[2], 1)
X_test = X_test.reshape(X_test.shape[0], X_test.shape[1], X_test.shape[2], 1)
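An equivalent way to add the channel axis, without spelling out every dimension, is `np.expand_dims` or `np.newaxis`. A small sketch on dummy data (the array sizes are made up to keep it fast):

```python
import numpy as np

# Dummy stand-in for a batch of mel-spectrograms: (clips, n_mels, frames)
batch = np.zeros((4, 128, 10), dtype=np.float32)

# Append a trailing channel axis so the shape becomes (clips, n_mels, frames, 1)
with_channel = np.expand_dims(batch, axis=-1)
also_with_channel = batch[..., np.newaxis]

print(with_channel.shape)  # (4, 128, 10, 1)
```

Both forms leave the data untouched and only change how the array is indexed.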
Building the Model
Now that our data is ready, we can move on to building and compiling the model. We will use the Keras API for TensorFlow to construct the model.
As mentioned before, the model architecture has been taken from a research paper, a link to which is added at the end.
Let’s code this.
# importing the keras modules
from keras.models import Sequential
from keras.layers import Dense, Dropout, Conv2D, MaxPooling2D, Flatten, GRU
from keras.callbacks import Callback, EarlyStopping

# initiating the model as Sequential
model = Sequential()

# adding the CNN layers along with some dropouts and max pooling
model.add(Conv2D(64, 2, activation='relu', input_shape=X_train.shape[1:]))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.1))
model.add(Conv2D(128, 2, activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.1))
model.add(Conv2D(256, 2, activation='relu'))
model.add(MaxPooling2D(pool_size=(4, 4)))
model.add(Dropout(0.1))
model.add(Conv2D(512, 2, activation='relu'))
model.add(MaxPooling2D(pool_size=(8, 8)))
model.add(Dropout(0.1))

# flattening the data to be passed to a dense layer
model.add(Flatten())

# adding the dense layers
model.add(Dense(2048, activation='relu'))
model.add(Dense(1024, activation='relu'))
model.add(Dense(256, activation='relu'))

# final output layer with 10 predictions to be made
model.add(Dense(10, activation='softmax'))

# Optimizer = Adam
# Loss = Sparse Categorical CrossEntropy
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.summary()
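Before training, it can help to trace how the feature-map size shrinks through the convolution and pooling stack. The sketch below is plain arithmetic and assumes the Keras defaults for these layers: `'valid'` padding with stride 1 for `Conv2D`, and a stride equal to the pool size for `MaxPooling2D`. The helper names are hypothetical.

```python
def conv_valid(size, kernel):
    """Output length of a stride-1 convolution with 'valid' padding."""
    return size - kernel + 1


def pool_valid(size, pool):
    """Output length of max pooling with 'valid' padding and stride == pool."""
    return size // pool


def trace_shapes(height, width, pool_sizes, kernel=2):
    """Return the (height, width) after each Conv2D + MaxPooling2D pair."""
    shapes = []
    for pool in pool_sizes:
        height, width = conv_valid(height, kernel), conv_valid(width, kernel)
        height, width = pool_valid(height, pool), pool_valid(width, pool)
        shapes.append((height, width))
    return shapes


# Input of 128 mel bands x 1206 frames, with pools of 2, 2, 4 and 8 as in the model
print(trace_shapes(128, 1206, [2, 2, 4, 8]))
# [(63, 602), (31, 300), (7, 74), (0, 9)]
```

Under these assumptions the height is only 6 going into the final 8x8 pooling layer, which is smaller than the pool window. If Keras reports a zero or negative dimension when building the model, `padding='same'` on that last pooling layer or a smaller pool size is one possible workaround, though that would be a change from the architecture as listed here.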
Now we just have to train it.
Training a 2D CNN model takes a long time, so I did my processing in a Kaggle notebook. Make sure you have a GPU environment set up before starting the training.
history = model.fit(X_train, Y_train, validation_data=(X_test, Y_test),epochs=50)
Finally, let’s plot the accuracies.
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'val'], loc='upper left')
plt.show()
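To read the best validation accuracy off the training run instead of eyeballing the plot, a small helper like the one below works on the `history.history` dictionary. The function name `best_epoch` is hypothetical, and the numbers in the example are made up for illustration; they are not results from this model.

```python
def best_epoch(history_dict, metric="val_accuracy"):
    """Return (1-based epoch, value) where the chosen metric peaked."""
    values = history_dict[metric]
    best_index = max(range(len(values)), key=values.__getitem__)
    return best_index + 1, values[best_index]


# Made-up numbers purely to show the call; pass history.history in practice
example_history = {"accuracy": [0.20, 0.35, 0.50], "val_accuracy": [0.18, 0.31, 0.29]}
print(best_epoch(example_history))  # (2, 0.31)
```

In the notebook this would be called as `best_epoch(history.history)` once training has finished.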
Results and Conclusion
The best accuracy we achieved with our current model was 52.5%. This suggests that music genre classification is a tough task: even with such a complex model and augmented data, we were barely able to cross the 50% threshold.
The paper that was implemented actually mentions six different augmentation techniques. We have used only two, so adding more of them could perhaps improve the results. Feel free to play around with your own experiments.
There are a lot of possibilities to try out with this model. One is to move from a CNN to an RCNN. Another is to try extracting features other than the mel-spectrogram.
If you find that any of these techniques work well, do mention them in the comments.