Music Genre Classification Using Deep Learning with Keras | Python

Deep Neural Nets have been used in all sorts of classification tasks, aiding humans in making important decisions or making those decisions themselves altogether.

The distinction between different music genres is one such task on which we can apply deep learning. I decided to start this deep learning project to learn about audio classification and manipulation.

Now here I am sharing my findings with you.

This article has the following sub-sections:

  • Exploring the GTZAN dataset.
  • Loading and Augmenting the data.
  • Building our model.
  • Results and conclusion.

Let’s get started.

The GTZAN Dataset

The GTZAN dataset is a set of 1000 audio files spanning 10 different classes of music, namely:

  • blues
  • classical
  • country
  • disco
  • hip hop
  • jazz
  • metal
  • pop
  • reggae
  • rock

It consists of 100 files for each class, and each audio clip is 30 seconds long.

The directory structure of the dataset.

As the snapshot above suggests, we have 4 different types of data here.

The ‘genres_original’ folder consists of the original 1000 audio files segregated into different folders based on their genres (label).

The ‘images_original’ folder consists of the images of the Mel-spectrograms of each of these audio files.

The ‘features_30_sec.csv’ and ‘features_3_sec.csv’ files each contain different features extracted from the audio files, such as the means and standard deviations of different melspec components and roll-off frequencies.

Let’s plot an audio file:

Audio Sample Plotted

Let’s see what the images_original folder contains. We’ll plot one of the images.

Mel-Spectrogram for a Jazz music file

The image above shows the pictorial representation of an audio file using its Mel-spectrogram.

Loading the data

Even though we have the spectrograms available to us, I will still apply the augmentations and extract the features myself as well.

The augmentation parameters, as well as the model architecture, have been adapted from the work of Macharla Vaibhavi and P. Radha Krishna in “Music Genre Classification using Neural Networks with Data Augmentation.”

Let’s import the required libraries first.

    from audiomentations import Compose, AddGaussianNoise, TimeStretch, PitchShift, Shift
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import librosa
    from tqdm import tqdm
    import os

The audiomentations library is an easy-to-use tool that allows us to augment audio files in various ways.

Let’s create functions to augment our data.

    add_noise = Compose([
        AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.015, p=0.7),
    ])
    pitch_shift = Compose([
        PitchShift(min_semitones=-4, max_semitones=12, p=0.5),
    ])

The add_noise and pitch_shift instances can now be used to augment any audio file.
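To make the noise augmentation less of a black box, here is a minimal NumPy sketch of the idea behind it. It is not the audiomentations implementation: the function name add_gaussian_noise, the demo tone, and the variable names are my own assumptions, and only the amplitude range and probability are taken from the settings above.

```python
import numpy as np

def add_gaussian_noise(samples, min_amplitude=0.001, max_amplitude=0.015, p=0.7, rng=None):
    """Illustrative stand-in: with probability p, add zero-mean Gaussian noise
    whose standard deviation is drawn uniformly from [min_amplitude, max_amplitude]."""
    rng = np.random.default_rng() if rng is None else rng
    if rng.random() >= p:
        return samples.copy()  # the transform is skipped this time
    amplitude = float(rng.uniform(min_amplitude, max_amplitude))
    noise = rng.normal(0.0, 1.0, size=samples.shape).astype(samples.dtype)
    return samples + amplitude * noise

# Hypothetical one-second 440 Hz tone at 22,050 Hz, standing in for a real clip.
sr_demo = 22050
t = np.arange(sr_demo) / sr_demo
tone = (0.5 * np.sin(2 * np.pi * 440 * t)).astype(np.float32)

# p=1.0 forces the noise to be applied so the effect is visible.
noisy = add_gaussian_noise(tone, p=1.0, rng=np.random.default_rng(0))
print(noisy.shape, float(np.std(noisy - tone)))
```

Because the transform is only applied with probability p, some "augmented" copies will be identical to the original clip when p is below 1.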

Now we need to prepare our data for training and testing.

It will be done in the following order: Load audio files → Split into train and test sets → Augment the training data → Extract the features from both the test and train sets → Encode the labels.

Quite a long journey!

Let’s start by loading a single file and checking how we can extract its Mel-spectrogram.

    import librosa.display

    # `sample` and `sr` are the waveform and sample rate of the file loaded earlier
    # setting melspec features
    n_mels = 128
    hop_length = 512
    n_fft = 1024

    # extract melspec features using librosa (the signal is passed by keyword as y=)
    S = librosa.feature.melspectrogram(y=sample, sr=sr, n_fft=n_fft,
                                       hop_length=hop_length, n_mels=n_mels)

    # convert it to dB scale
    S_DB = librosa.power_to_db(S, ref=np.max)

    # display the spectrogram
    librosa.display.specshow(S_DB, sr=sr, hop_length=hop_length,
                             x_axis='time', y_axis='mel')
    plt.colorbar(format='%+2.0f dB')

Here we have computed the Mel-spectrogram of the same file we loaded previously.
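The dB conversion step is easy to reproduce by hand. The sketch below shows what a power-to-dB conversion relative to the spectrogram's own maximum looks like in plain NumPy. The function name power_to_db_sketch and the toy matrix are assumptions of mine, and the floor (amin) and 80 dB dynamic range are the defaults I believe librosa uses, so treat them as assumptions too.

```python
import numpy as np

def power_to_db_sketch(S, amin=1e-10, top_db=80.0):
    """Convert a power spectrogram to decibels relative to its own maximum."""
    S = np.asarray(S, dtype=float)
    ref = max(amin, S.max())
    # 10 * log10(S / ref), with very small values floored at amin
    log_spec = 10.0 * np.log10(np.maximum(amin, S)) - 10.0 * np.log10(ref)
    # keep at most top_db of dynamic range below the loudest bin
    return np.maximum(log_spec, log_spec.max() - top_db)

# Toy 2x2 "spectrogram" (hypothetical values) covering several orders of magnitude.
S_demo = np.array([[1.0, 0.1], [0.01, 1e-12]])
S_db_demo = power_to_db_sketch(S_demo)
print(S_db_demo)
```

The loudest bin maps to 0 dB, each factor of ten in power costs 10 dB, and anything quieter than 80 dB below the peak is clipped, which is why the colour bar of the plot runs from 0 dB downwards.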


Now we write a routine that loads all the audio files along with their labels.

    # temporary list for the input data
    data = []
    # list to append all the labels
    Y = []

    base_path = '../input/gtzan-dataset-music-genre-classification/Data/genres_original'

    # looping through all label directories
    for label in tqdm(os.listdir(base_path)):
        file_path = os.path.join(base_path, label)
        # looping through each file in the directory
        for pth in os.listdir(file_path):
            try:
                final_path = os.path.join(file_path, pth)
                # loading the first 28 seconds of the original file
                audio, sr = librosa.load(final_path, duration=28)
                # appending data to a list
                data.append(audio)
                # appending labels to the label list
                Y.append(label)
            except Exception as e:
                # skip files that cannot be decoded
                print("Error in file", pth, e)

    # converting list to a numpy array
    X = np.stack(data)

The arrays X and Y now contain our audio data and their corresponding labels.

Shape of X : (999, 617400)

Shape of Y : (999,)
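The second dimension of X follows directly from the loading settings. As a quick check, assuming librosa.load is left at its default sample rate of 22,050 Hz (the loading loop does not pass sr=), 28 seconds of audio gives:

```python
SAMPLE_RATE = 22050  # librosa.load resamples to 22,050 Hz unless sr= is given
DURATION_S = 28      # duration passed to librosa.load in the loading loop

samples_per_clip = SAMPLE_RATE * DURATION_S
print(samples_per_clip)  # 617400
```

Trimming every clip to the same 28 seconds is also what lets np.stack work, since it needs all arrays to have the same length.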

Now we split this data into test and train sets and augment the train set.

    from sklearn.model_selection import train_test_split

    # split the data using the scikit-learn library
    audio_train, audio_test, y_train, y_test = train_test_split(
        X, Y, test_size=0.20, random_state=6)

    def get_melspec(audio, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels):
        # calculate the mel-spectrogram of the provided audio wave
        S = librosa.feature.melspectrogram(y=audio, sr=sr, n_fft=n_fft,
                                           hop_length=hop_length, n_mels=n_mels)
        return S

    # temporary list for the input data
    X_train = []
    # list to append all the labels
    Y_train = []

    # looping through train data to create melspec and augment data
    for i, dat in tqdm(enumerate(audio_train)):
        try:
            # adding noise to the file
            noisy_audio = add_noise(samples=dat, sample_rate=sr)
            # changing pitch of the audio
            pitch_audio = pitch_shift(samples=dat, sample_rate=sr)

            # generate melspec for original and augmented files
            mel = get_melspec(dat)
            noise_mel = get_melspec(noisy_audio)
            pitch_mel = get_melspec(pitch_audio)

            # appending augmented data to original training data
            X_train.append(mel)
            Y_train.append(y_train[i])
            X_train.append(noise_mel)
            Y_train.append(y_train[i])
            X_train.append(pitch_mel)
            Y_train.append(y_train[i])
        except Exception as e:
            print("Error in training clip:", i)
            print("Error:", e)

!! The above code could take several minutes to complete, since it processes roughly 800 training clips and creates two augmented copies of each !!

Now we have the following variables:

X_train → all the training data in Mel-spectrogram form (Shape = 2397, 128, 1206)

Y_train → labels for the training data
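The reported shape of X_train can be reconstructed from the numbers used so far. This is only a back-of-the-envelope sketch: it assumes scikit-learn rounds the test share up to a whole number of clips and that librosa pads the signal (its default centred framing), which gives one frame more than the plain division by the hop length.

```python
import math

n_clips = 999
test_size = 0.20
n_test = math.ceil(n_clips * test_size)  # assumed: the test share is rounded up
n_train = n_clips - n_test

copies_per_clip = 3  # original + noisy + pitch-shifted version
n_train_specs = n_train * copies_per_clip

samples_per_clip = 617400  # 28 s at 22,050 Hz
hop_length = 512
n_frames = 1 + samples_per_clip // hop_length  # assumed: centred (padded) framing

n_mels = 128
print((n_train_specs, n_mels, n_frames))  # (2397, 128, 1206)
```

So the first axis counts spectrograms (799 clips times three versions), the second is the number of Mel bands, and the third is the number of time frames.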

But we also need to extract the mel-spec features from the test data.

    # temporary list for the input data
    X_test = []
    # list to append all the labels
    Y_test = []

    # looping through test data to create melspec (no augmentation here)
    for i, dat in tqdm(enumerate(audio_test)):
        try:
            # generate melspec for the original file
            mel = get_melspec(dat)
            # appending test melspec to list
            X_test.append(mel)
            Y_test.append(y_test[i])
        except Exception as e:
            print("Error in test clip:", i)
            print("Error:", e)

Notice that this time we haven’t augmented the data, since the test data should remain original for a fair evaluation.

    # converting the test and train data to numpy arrays
    X_train = np.stack(X_train)
    X_test = np.stack(X_test)

There is one last step left. Our labels are still in text form, so we need to encode them as numbers. Good thing there’s a library for everything :D.

    from sklearn.preprocessing import LabelEncoder

    # fit the encoder once on the training labels, then reuse it for the test labels
    encoder = LabelEncoder()
    Y_train = encoder.fit_transform(Y_train).reshape([len(Y_train), 1])
    Y_test = encoder.transform(Y_test).reshape([len(Y_test), 1])
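What the encoder does is simple enough to sketch in plain Python: it sorts the distinct class names and replaces each label with its position in that sorted list. The label lists below are hypothetical, and the names classes and label_to_id are mine. The point of the sketch is that the mapping is learned once from the training labels and then reused for the test labels, so both sets share the same numbering.

```python
# Hypothetical labels standing in for the real genre lists.
train_labels = ["jazz", "blues", "rock", "blues"]
test_labels = ["rock", "jazz"]

# Learn the mapping from the training labels only (sorted, like LabelEncoder).
classes = sorted(set(train_labels))
label_to_id = {name: idx for idx, name in enumerate(classes)}

# Apply the same mapping to both sets.
train_ids = [label_to_id[name] for name in train_labels]
test_ids = [label_to_id[name] for name in test_labels]

print(classes)    # ['blues', 'jazz', 'rock']
print(train_ids)  # [1, 0, 2, 0]
print(test_ids)   # [2, 1]
```

Integer labels like these are what the sparse categorical cross-entropy loss expects, which is why no one-hot encoding is needed.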

Only one thing is left now. The Keras Conv2D layer requires us to add an extra (channel) dimension to the data. Quite a simple reshaping process.

    X_train = X_train.reshape(X_train.shape[0], X_train.shape[1], X_train.shape[2], 1)
    X_test = X_test.reshape(X_test.shape[0], X_test.shape[1], X_test.shape[2], 1)

Building the Model

Now that our data is ready, we can move on to building and compiling the model. We will use the Keras API for TensorFlow to construct the model.

As mentioned before, the model architecture has been taken from a research paper, a link to which is added at the end.

Model Architecture

Let’s code this.

    # importing the keras modules
    from keras.models import Sequential
    from keras.layers import Dense, Dropout, Conv2D, MaxPooling2D, Flatten, GRU
    from keras.callbacks import Callback, EarlyStopping

    # Initiating the model as Sequential
    model = Sequential()

    # Adding the CNN layers along with some drop outs and maxpooling
    model.add(Conv2D(64, 2, activation='relu', input_shape=(X_train.shape[1:])))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(0.1))
    model.add(Conv2D(128, 2, activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(0.1))
    model.add(Conv2D(256, 2, activation='relu'))
    model.add(MaxPooling2D(pool_size=(4, 4)))
    model.add(Dropout(0.1))
    model.add(Conv2D(512, 2, activation='relu'))
    model.add(MaxPooling2D(pool_size=(8, 8)))
    model.add(Dropout(0.1))

    # flattening the data to be passed to a dense layer
    model.add(Flatten())

    # Adding the dense layers
    model.add(Dense(2048, activation='relu'))
    model.add(Dense(1024, activation='relu'))
    model.add(Dense(256, activation='relu'))

    # final output layer with 10 predictions to be made
    model.add(Dense(10, activation='softmax'))

    # Optimizer = Adam
    # Loss = Sparse Categorical CrossEntropy
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])

    model.summary()
Model Summary
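If you want to sanity-check the layer shapes before building the model, the arithmetic can be traced by hand. The helper below is a sketch under stated assumptions, not Keras code: it assumes the default 'valid' padding, a stride of 1 for the convolutions, and a pooling stride equal to the pool size. The function names and the layers list are mine, and the input size is the (128, 1206) spectrogram shape reported earlier.

```python
def conv_valid(size, kernel):
    """Output length of a stride-1 convolution without padding."""
    return size - kernel + 1

def pool_valid(size, pool):
    """Output length of non-overlapping pooling without padding."""
    return max(0, size // pool)

def trace_shapes(height, width, layers):
    """Return the (height, width) after the input and after every layer."""
    shapes = [(height, width)]
    for kind, k in layers:
        step = conv_valid if kind == "conv" else pool_valid
        height, width = step(height, k), step(width, k)
        shapes.append((height, width))
    return shapes

# Kernel and pool sizes of the four Conv2D / MaxPooling2D pairs, in order.
layers = [("conv", 2), ("pool", 2), ("conv", 2), ("pool", 2),
          ("conv", 2), ("pool", 4), ("conv", 2), ("pool", 8)]

shapes = trace_shapes(128, 1206, layers)
for (kind, k), shape in zip(layers, shapes[1:]):
    print(f"{kind:4s} {k}: {shape}")
```

Under these assumptions the frequency axis is down to 6 rows before the final (8, 8) pooling, which leaves no full window to pool over. If Keras reports a shape error at that layer in your environment, adding padding='same' to the pooling layers or using a smaller final pool size are possible adjustments. Neither is something this article or the paper specifies.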

Now we just have to train it.

Training a 2D CNN model takes a long time, so I did my processing in a Kaggle notebook. Make sure you have a GPU environment set up before starting the training.

    history = model.fit(X_train, Y_train, validation_data=(X_test, Y_test), epochs=50)

Finally, let’s plot the accuracies.

    plt.plot(history.history['accuracy'])
    plt.plot(history.history['val_accuracy'])
    plt.title('model accuracy')
    plt.ylabel('accuracy')
    plt.xlabel('epoch')
    plt.legend(['train', 'val'], loc='upper left')

Results and Conclusion

Train and Validation Accuracy

The best accuracy we achieved with our current model was 52.5%. This suggests that music genre classification is a tough task: even with such a complex model and augmented data, we were barely able to cross the 50% threshold.
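Rather than reading the best value off the plot, it can be pulled straight from the training history. The sketch below assumes the figure of interest is the validation accuracy. The list is a made-up stand-in for history.history['val_accuracy'], so the numbers are purely illustrative and are not results from this model.

```python
# Hypothetical values standing in for history.history['val_accuracy'].
val_accuracy = [0.31, 0.42, 0.47, 0.45, 0.49]

# Index of the epoch with the highest validation accuracy (0-based).
best_epoch = max(range(len(val_accuracy)), key=lambda i: val_accuracy[i])
best_val = val_accuracy[best_epoch]

print(f"best validation accuracy {best_val:.3f} at epoch {best_epoch + 1}")
```

Knowing the best epoch is also useful if you later add early stopping or checkpointing, since the final epoch is not necessarily the best one.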

The paper that was implemented actually mentions six different augmentation techniques. We have only used two; using more of them could perhaps improve the results. You are free to play around with your own experiments.

Final Thoughts

There are a lot of possibilities to try out with this model. One is to shift from a CNN model to an RCNN. Another is to extract features other than the Mel-spectrogram.

If you find any of these new techniques, do mention them down in the comments.

Paper Reference: Vaibhavi, Macharla, and P. Radha Krishna. “Music Genre Classification using Neural Networks with Data Augmentation.” (2021).
