Deep neural networks have been used in all sorts of classification tasks, either aiding humans in making important decisions or making those decisions entirely on their own.
Distinguishing between music genres is one such task we can tackle with deep learning. I started this deep learning project to learn about audio classification and manipulation.
Now here I am sharing my findings with you.
This article has the following sub-sections:
- Exploring the GTZAN dataset.
- Loading and Augmenting the data.
- Building our model.
- Results and conclusion.
Let’s get started.
The GTZAN Dataset
The GTZAN dataset is a collection of 1000 audio files covering 10 different classes (genres) of music, namely:
- blues
- classical
- country
- disco
- hip hop
- jazz
- metal
- pop
- reggae
- rock
It contains 100 files for each class, and each audio clip is 30 seconds long.

As the snapshot above suggests, we have 4 different types of data here.
The ‘genres_original’ folder consists of the original 1000 audio files segregated into different folders based on their genres (label).
The ‘images_original’ folder consists of the images of the Mel-spectrograms of each of these audio files.
The ‘features_30_sec.csv’ and ‘features_3_sec.csv’ files each contain features extracted from the audio files, such as the means and standard deviations of various Mel-spectrogram components and roll-off frequencies.
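For a quick look at what these precomputed features are, the 30-second CSV can be loaded with pandas. This is just a peek at the data; the path below is an assumption based on the Kaggle dataset layout used later in this article.
import pandas as pd
#peek at the precomputed features (path assumed from the Kaggle dataset layout)
features = pd.read_csv('../input/gtzan-dataset-music-genre-classification/Data/features_30_sec.csv')
print(features.shape)
print(list(features.columns[:8]))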
Let’s plot an audio file:
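Here is a minimal sketch of how such a plot can be produced. The specific clip chosen is arbitrary; any file from the genres_original folder works.
import librosa
import librosa.display
import matplotlib.pyplot as plt
#load one example clip (file name assumed; use librosa.display.waveplot on older librosa versions)
sample, sr = librosa.load('../input/gtzan-dataset-music-genre-classification/Data/genres_original/blues/blues.00000.wav')
#plot the raw waveform
librosa.display.waveshow(sample, sr=sr)
plt.title('blues.00000.wav')
plt.show()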

Let’s see what the images_original folder contains. We’ll plot one of the images.

The image above shows the pictorial representation of an audio file using its Mel-spectrogram.
Loading the data
Even though the spectrogram images are already available to us, I will apply the augmentations and extract the features myself.
The augmentation parameters, as well as the model architecture, have been adapted from the work of Macharla Vaibhavi and P. Radha Krishna in “Music Genre Classification using Neural Networks with Data Augmentation”.
Let’s import the required libraries first.
from audiomentations import Compose, AddGaussianNoise, TimeStretch, PitchShift, Shift
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import librosa
from tqdm import tqdm
import os
The audiomentations library is an easy-to-use tool that allows us to augment audio files in various ways.
Let’s create functions to augment our data.
add_noise = Compose([
    AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.015, p=0.7),
])
pitch_shift = Compose([
    PitchShift(min_semitones=-4, max_semitones=12, p=0.5),
])
The add_noise and pitch_shift instances can now be used to augment any audio file.
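As a quick usage check, they can be applied to the clip we loaded earlier; audiomentations Compose objects take the raw samples and the sample rate.
#apply the augmentations to the previously loaded clip
noisy_sample = add_noise(samples=sample, sample_rate=sr)
shifted_sample = pitch_shift(samples=sample, sample_rate=sr)
#the augmented clips keep the same length as the original
print(sample.shape, noisy_sample.shape, shifted_sample.shape)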
Now we need to prepare our data for training and testing.
It will be done in the following order: load the audio files → split them into train and test sets → augment the training data → extract features from both the train and test sets → encode the labels.
Quite a long journey!

Let’s start with the single file we loaded above and check how we can extract its Mel-spectrogram.
import librosa.display
#setting the Mel-spectrogram parameters
n_mels = 128
hop_length = 512
n_fft = 1024
#extract the Mel-spectrogram using librosa ('sample' and 'sr' come from the clip loaded earlier)
S = librosa.feature.melspectrogram(y=sample, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
#convert the power spectrogram to dB scale
S_DB = librosa.power_to_db(S, ref=np.max)
#display the spectrogram
librosa.display.specshow(S_DB, sr=sr, hop_length=hop_length, x_axis='time', y_axis='mel')
plt.colorbar(format='%+2.0f dB')
Here we have computed the Mel-spectrogram of the same file we loaded previously.

Now we write a routine that loads all the audio files along with their labels.
#temporary list for the input data
data = []
#list to append all the labels
Y = []
base_path = '../input/gtzan-dataset-music-genre-classification/Data/genres_original'
#looping through all label directories
for label in tqdm(os.listdir(base_path)):
    file_path = base_path + '/' + label
    #looping through each file in the directory
    for pth in os.listdir(file_path):
        try:
            final_path = file_path + '/' + pth
            #loading the original file (first 28 seconds)
            audio, sr = librosa.load(final_path, duration=28)
            #appending the audio data to the list
            data.append(audio)
            #appending the label to the label list
            Y.append(label)
        except Exception:
            print("Error in file", pth)
#converting the list to a numpy array
X = np.stack(data)
The arrays X and Y now contain our audio data and their corresponding labels.
Shape of X : (999, 617400)
Shape of Y : (999,)
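One clip fails to load (hence 999 instead of 1000), so it can be worth a quick check that all ten genres are still represented.
from collections import Counter
#number of successfully loaded clips per genre
print(Counter(Y))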
Now we split this data into test and train sets and augment the train set.
from sklearn.model_selection import train_test_split
#split the data using the SkLearn library
audio_train, audio_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.20, random_state=6)
def get_melspec(audio, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels):
    #calculate the Mel-spectrogram of the provided audio wave
    S = librosa.feature.melspectrogram(y=audio, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
    return S
#temporary list for the input data
X_train = []
#list to append all the labels
Y_train = []
#looping through the training data to create melspecs and augment the data
for i, dat in tqdm(enumerate(audio_train)):
    try:
        #adding noise to the file
        noisy_audio = add_noise(dat, sr)
        #changing the pitch of the audio
        pitch_audio = pitch_shift(dat, sr)
        #generate melspecs for the original and augmented files
        mel = get_melspec(dat)
        noise_mel = get_melspec(noisy_audio)
        pitch_mel = get_melspec(pitch_audio)
        #appending the original and augmented data to the training lists
        X_train.append(mel)
        Y_train.append(y_train[i])
        X_train.append(noise_mel)
        Y_train.append(y_train[i])
        X_train.append(pitch_mel)
        Y_train.append(y_train[i])
    except Exception as e:
        print("Error at index:", i)
        print("Error:", e)
!! The above code can take several minutes to complete since it is doing a lot of processing on roughly 800 files !!
Now we have the following variables:
X_train → all the training data in Mel-spectrogram form (shape = (2397, 128, 1206))
Y_train → labels for the training data
But we also need to extract the mel-spec features from the test data.
#temporary list for the input data
X_test = []
#list to append all the labels
Y_test = []
#looping through the test data to create melspecs (no augmentation here)
for i, dat in tqdm(enumerate(audio_test)):
    try:
        #generate the melspec for the original file
        mel = get_melspec(dat)
        #appending the test melspec and its label to the lists
        X_test.append(mel)
        Y_test.append(y_test[i])
    except Exception as e:
        print("Error at index:", i)
        print("Error:", e)
Notice that this time we haven’t augmented the data, since the test set should remain untouched for a fair evaluation.
#converting the test and train data to numpy array
X_train = np.stack(X_train)
X_test = np.stack(X_test)
There is one last step left. Our labels are still in text form. We need to encode them as numbers. Good thing there’s a library for everything :D.
from sklearn.preprocessing import LabelEncoder
#fit the encoder on the training labels and reuse it for the test labels
#so that both sets share the same label-to-number mapping
encoder = LabelEncoder()
encoder.fit(Y_train)
Y_train = encoder.transform(Y_train).reshape([len(Y_train), 1])
Y_test = encoder.transform(Y_test).reshape([len(Y_test), 1])
Only one thing left now. The Keras Conv2D layer expects a channel dimension, so we add an extra dimension to the data. Quite a simple reshaping process.
X_train = X_train.reshape(X_train.shape[0], X_train.shape[1], X_train.shape[2], 1)
X_test = X_test.reshape(X_test.shape[0], X_test.shape[1], X_test.shape[2], 1)
Building the Model
Now that our data is ready, we can move on to building and compiling the model. We will use the Keras API for TensorFlow to construct the model.
As mentioned before, the model architecture has been taken from a research paper, a link to which is added at the end.

Let’s code this.
#importing the keras modules
from keras.models import Sequential
from keras.layers import Dense, Dropout, Conv2D, MaxPooling2D, Flatten, GRU
from keras.callbacks import Callback, EarlyStopping
#Initiating the model as Sequential
model = Sequential()
#Adding the CNN layers along with some drop outs and maxpooling
model.add(Conv2D(64, 2, activation = 'relu', input_shape = (X_train.shape[1:])))
model.add(MaxPooling2D(pool_size = (2,2)))
model.add(Dropout(0.1))
model.add(Conv2D(128, 2, activation = 'relu'))
model.add(MaxPooling2D(pool_size = (2,2)))
model.add(Dropout(0.1))
model.add(Conv2D(256, 2, activation = 'relu'))
model.add(MaxPooling2D(pool_size = (4,4)))
model.add(Dropout(0.1))
model.add(Conv2D(512, 2, activation = 'relu'))
model.add(MaxPooling2D(pool_size = (8,8)))
model.add(Dropout(0.1))
#flattening the data to be passed to a dense layer
model.add(Flatten())
#Adding the dense layers
model.add(Dense(2048, activation = 'relu'))
model.add(Dense(1024, activation = 'relu'))
model.add(Dense(256, activation = 'relu'))
#final output layer with 10 predictions to be made
model.add(Dense(10, activation = 'softmax'))
'''
Optimizer = Adam
Loss = Sparse Categorical CrossEntropy
'''
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.summary()

Now we just have to train it.
Training a 2D CNN model takes a long time; I did my processing on a Kaggle notebook. Make sure you have a GPU environment set up before starting the training.
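A quick way to confirm that TensorFlow can actually see a GPU before starting:
import tensorflow as tf
#an empty list here means training would run on the CPU
print(tf.config.list_physical_devices('GPU'))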
history = model.fit(X_train, Y_train, validation_data=(X_test, Y_test),epochs=50)
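As a side note, we imported EarlyStopping earlier but never used it; if you want training to stop once validation accuracy stops improving, a sketch along these lines could replace the plain fit call above.
#optional: stop when val_accuracy stops improving and keep the best weights seen
early_stop = EarlyStopping(monitor='val_accuracy', patience=5, restore_best_weights=True)
history = model.fit(X_train, Y_train,
                    validation_data=(X_test, Y_test),
                    epochs=50,
                    callbacks=[early_stop])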
Finally, let’s plot the accuracies.
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'val'], loc='upper left')
plt.show()
Results and Conclusion

The best accuracy we achieved with our current model was 52.5%. This goes to show that music genre classification is a tough task: even with a fairly deep model and augmented data, we barely crossed the 50% threshold.
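Assuming that figure corresponds to the peak validation accuracy over the training run, it can be read straight from the history object:
#epoch with the highest validation accuracy recorded during training
best_epoch = int(np.argmax(history.history['val_accuracy']))
print(history.history['val_accuracy'][best_epoch], "at epoch", best_epoch + 1)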
The paper we implemented actually mentions six different augmentation techniques. We have only used two; perhaps applying more of them could improve the results. Feel free to play around with your own experiments.
Final Thoughts
There are a lot of possibilities left to try with this model. One is to move from a plain CNN to an RCNN. Another is to extract features other than the Mel-spectrogram, as sketched below.
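As an example of that last idea, MFCCs are a commonly used alternative to Mel-spectrograms and can be extracted with librosa in much the same way (just a sketch, not something used in the model above):
#extract 20 MFCCs per frame from the clip loaded earlier, instead of a full Mel-spectrogram
mfcc = librosa.feature.mfcc(y=sample, sr=sr, n_mfcc=20, n_fft=n_fft, hop_length=hop_length)
print(mfcc.shape)  #(n_mfcc, number_of_frames)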
If you try any of these new techniques, do mention them in the comments.
Paper Reference: Macharla Vaibhavi and P. Radha Krishna, “Music Genre Classification using Neural Networks with Data Augmentation” (2021).