Classify music genre from a 10 second sound stream using a Neural Network.

Overview

MusicGenreClassification


Academic research in the field of Deep Learning (Deep Neural Networks) and Sound Processing, Tel Aviv University.

Featured in Medium.

Abstract

This paper discusses the task of classifying the music genre of a sound sample.

Introduction

When I decided to work in the field of sound processing, I thought that genre classification was a problem parallel to image classification. To my surprise, I did not find many deep learning works that tackled this exact problem. One paper that did tackle this classification problem is Tao Feng’s paper [1] from the University of Illinois. I learned a lot from this paper, but honestly, the results it presented were not impressive.

So I had to look at other papers that were related but not an exact match. A very influential one was Deep content-based music recommendation [2], which is about content-based music recommendation using deep learning techniques. The way the authors obtained their dataset and the preprocessing they applied to the sound strongly shaped my implementation. This paper was also mentioned recently on the “Spotify” blog [3]: Spotify recruited a deep learning intern who, based on the above work, implemented a music recommendation engine. His simple yet very efficient network made me think that Tao’s RBM was not the best approach, and therefore my implementation uses a CNN instead, as in the Spotify blog. One very important note is that Tao’s work published results only for 2-, 3- and 4-class classification. Unsurprisingly, he got really good results for 2-class classification, but the more classes he tried to classify, the poorer the results became. My work tackles the full 10-class challenge, a much more difficult task. A sub-task of this project was to learn a new SDK for deep learning; I had been waiting for an opportunity to learn Google’s new TensorFlow [4]. This project is implemented in Python, and the machine learning part uses TensorFlow.

The Dataset

Getting the dataset might be the most time-consuming part of this work. Working with music is a big pain: every file is usually a couple of MBs, and there is a variety of recording qualities and parameters (number of frequencies, bits per second, etc.). But the biggest pain is copyright: there is no legitimate dataset of famous songs, as it would cost money. Tao’s paper is based on a dataset called GTZAN [5]. This dataset is quite small (100 songs per genre × 10 genres = 1,000 songs overall), and its copyright permission is questionable. From my perspective, this is one of the reasons that held him back from getting better results. So I looked for a way to get more data to learn from. Eventually I found the MSD [6] (Million Song Dataset). It is a freely available collection of audio features and metadata for a million contemporary popular music tracks: around 280 GB of pure metadata. There is a project on top of MSD called tagtraum [7], which classifies MSD songs into genres. The problem now was to get the sound itself, and here is where I got a little creative. I found that one of the tags every song in the dataset has is an id from a provider called 7Digital [8]. 7Digital is a SaaS provider for music applications; it basically lets you stream music for money. I signed up to 7Digital as a developer, and after their approval I could access their API. Any song stream still costs money, but I found out that they let the user preview a random 30 seconds of a song before paying for it. This is more than enough for my deep learning task, so I wrote “previewDownloader.py”, which downloads a 30-second preview for every song in the MSD dataset. Unfortunately I had only my laptop for this mission, so I had to settle for only 1% of the dataset (around 2.8 GB).

The genres I am classifying are:

  1. blues
  2. classical
  3. country
  4. disco
  5. hiphop
  6. jazz
  7. metal
  8. pop
  9. reggae
  10. rock

Music genre popularity

Preprocessing the data

Having a big dataset isn't enough. As opposed to image tasks, I cannot work directly on the raw sound sample. A quick calculation: 30 seconds × 22050 samples/second = 661500 values per vector, which would be a heavy load for a conventional machine learning method.

Following all the papers I read and a little research on acoustic analysis, it is quite obvious that the industry uses Mel-frequency cepstral coefficients (MFCC) as the feature vector for a sound sample. I used the librosa [9] implementation.

MFCCs are derived as follows:

  1. Take the Fourier transform of (a windowed excerpt of) a signal.
  2. Map the powers of the spectrum obtained above onto the mel scale, using triangular overlapping windows.
  3. Take the logs of the powers at each of the mel frequencies.
  4. Take the discrete cosine transform of the list of mel log powers, as if it were a signal.
  5. The MFCCs are the amplitudes of the resulting spectrum.
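The steps above can be sketched in plain NumPy. This is only an illustration, not the librosa implementation used in this project: the function names, the Hann window, the HTK-style mel formula and the unnormalised DCT-II are all assumptions made for the example. The defaults use the 100 ms window and 40 ms stride reported in this section, 128 mel bands and 13 coefficients.

```python
import numpy as np


def hz_to_mel(hz):
    # HTK-style mel scale (an assumption; other variants exist).
    return 2595.0 * np.log10(1.0 + np.asarray(hz, dtype=float) / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sr):
    """Triangular overlapping filters, shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    # n_mels + 2 edge frequencies, evenly spaced on the mel scale.
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    bank = np.zeros((n_mels, fft_freqs.size))
    for m in range(n_mels):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        bank[m] = np.maximum(0.0, np.minimum(rising, falling))
    return bank


def frame_signal(signal, frame_len, hop_len):
    """Cut the signal into overlapping frames, shape (n_frames, frame_len)."""
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    idx = np.arange(frame_len)[None, :] + hop_len * np.arange(n_frames)[:, None]
    return signal[idx]


def dct_ii(x, n_out):
    """Unnormalised DCT-II along the last axis, keeping n_out coefficients."""
    n = x.shape[-1]
    k = np.arange(n_out)[:, None]
    basis = np.cos(np.pi * k * (2 * np.arange(n)[None, :] + 1) / (2 * n))
    return x @ basis.T


def mel_and_mfcc(signal, sr=22050, win_s=0.100, hop_s=0.040, n_mels=128, n_mfcc=13):
    frame_len = int(round(sr * win_s))
    hop_len = int(round(sr * hop_s))
    frames = frame_signal(signal, frame_len, hop_len) * np.hanning(frame_len)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2             # step 1
    mel_power = power @ mel_filterbank(n_mels, frame_len, sr).T  # step 2
    log_mel = np.log(mel_power + 1e-10)                          # step 3
    mfcc = dct_ii(log_mel, n_mfcc)                               # steps 4 and 5
    return log_mel, mfcc


# Usage: a 2-second 440 Hz test tone instead of a real song preview.
sr = 22050
t = np.arange(2 * sr) / sr
tone = np.sin(2 * np.pi * 440.0 * t)
log_mel, mfcc = mel_and_mfcc(tone, sr)
print(log_mel.shape, mfcc.shape)
```

The function returns both the log mel powers (the output of step 3) and the MFCCs (step 5), so the two feature types can be compared on the same frames.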

I tried several window sizes and stride values; the best result I got was with a window size of 100 ms and a stride of 40 ms.

One more point is that Tao’s paper used MFCC features (step 5), while Sander used straight mel-frequencies (step 2).

Mel power over time

I tried both approaches and found that I got much better results using just the mel-frequencies, but the trade-off was, of course, the training time. Before continuing to build a network I wanted to visualise the preprocessed dataset, which I did using the t-SNE [10] algorithm. Below you can see the t-SNE graphs for MFCC (step 5) and mel-frequencies (step 2):
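As a rough illustration of what these values mean in samples, the sketch below converts the window and stride to sample counts and counts the frames in a 30-second preview. It assumes the 22050 samples/second rate from the quick calculation above and no padding at the clip edges; the helper name `frame_params` is hypothetical. Libraries that pad the signal, or a different clip length, will give a different frame count, so the number is illustrative only.

```python
def frame_params(sr, win_s, hop_s, clip_s):
    """Return (window, stride, total samples, frame count) without padding."""
    win = int(round(sr * win_s))
    hop = int(round(sr * hop_s))
    n_samples = int(round(sr * clip_s))
    n_frames = 1 + (n_samples - win) // hop
    return win, hop, n_samples, n_frames


win, hop, n_samples, n_frames = frame_params(22050, 0.100, 0.040, 30.0)
print(win, hop, n_samples, n_frames)  # 2205 882 661500 748
```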

t-SNE MFCC samples as genres

t-SNE mel-spectrogram samples as genres

The Graph

After seeing the results Tao and Sander reached, I decided to go with a convolutional neural network implementation. The network receives a vector of 599 mel-frequency bins, each containing 128 frequencies that describe its window. The network consists of 3 hidden layers with max pooling between them, followed by a fully connected layer and then a softmax, ending up with a 10-dimensional vector for our ten genre classes.

Neural Network

I also implemented another network for MFCC features instead of mel-frequencies; the only differences are in the sizes (13 frequencies per window instead of 128).
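To make the data flow concrete, here is a minimal NumPy sketch of a forward pass with that shape: a 599 × 128 input, three convolutional layers over time with max pooling between them, a fully connected layer and a softmax over 10 classes. It is not the TensorFlow code of this project. The kernel width (4), the number of filters (32), the ReLU activation, the pooling size (2), flattening before the fully connected layer and the random weights are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def conv1d_relu(x, w, b):
    """Valid 1-D convolution over time. x: (T, C_in), w: (K, C_in, C_out)."""
    k = w.shape[0]
    # windows has shape (T - K + 1, C_in, K)
    windows = np.lib.stride_tricks.sliding_window_view(x, k, axis=0)
    out = np.einsum("tck,kco->to", windows, w) + b
    return np.maximum(out, 0.0)


def max_pool1d(x, size):
    """Non-overlapping max pooling over time; drops a trailing remainder."""
    t = (x.shape[0] // size) * size
    return x[:t].reshape(-1, size, x.shape[1]).max(axis=1)


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def init_params(n_frames=599, n_mels=128, n_filters=32, kernel=4, n_classes=10):
    # All sizes except n_frames, n_mels and n_classes are hypothetical.
    conv, c_in, t = [], n_mels, n_frames
    for layer in range(3):
        w = rng.normal(0.0, 0.05, (kernel, c_in, n_filters))
        conv.append((w, np.zeros(n_filters)))
        t = t - kernel + 1      # valid convolution shortens the time axis
        if layer < 2:
            t //= 2             # pooling between the conv layers
        c_in = n_filters
    fc_w = rng.normal(0.0, 0.05, (t * n_filters, n_classes))
    return {"conv": conv, "fc_w": fc_w, "fc_b": np.zeros(n_classes)}


def forward(x, params):
    h = x
    last = len(params["conv"]) - 1
    for i, (w, b) in enumerate(params["conv"]):
        h = conv1d_relu(h, w, b)
        if i < last:
            h = max_pool1d(h, 2)
    logits = h.reshape(-1) @ params["fc_w"] + params["fc_b"]
    return softmax(logits)


# Usage: random input standing in for one preprocessed sample.
params = init_params()
x = rng.normal(size=(599, 128))
probs = forward(x, params)
print(probs.shape, int(np.argmax(probs)))
```

With these assumed sizes the time axis shrinks from 599 to 596, 298, 295, 147 and finally 144 frames before the fully connected layer. For the MFCC variant, only `n_mels` would change (13 instead of 128).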

Visualisation of various filters (from Sander’s paper):

Filters visualization

• Filter 14 seems to pick up vibrato singing.
• Filter 242 picks up some kind of ringing ambience.
• Filter 250 picks up vocal thirds, i.e. multiple singers singing the same thing, but the notes are a major third (4 semitones) apart.
• Filter 253 picks up various types of bass drum sounds.

Results

As I explained in the introduction, the papers I based my work on did not solve the exact problem I did. For example, Tao’s paper published results for classifying 2, 3 and 4 classes (genres).

Tao Feng's results

I also looked for benchmarks outside the deep learning field and found a paper titled “A BENCHMARK DATASET FOR AUDIO CLASSIFICATION AND CLUSTERING” [11]. This paper benchmarks a task very similar to mine; the genres it classifies are: Blues, Electronic, Jazz, Pop, HipHop, Rock, Folk, Alternative, Funk.

Benchmark results

My results:

My results

Code

Documentation

• previewDownloader.py: USAGE: python previewDownloader.py [path to MSD data] This script iterates over all ‘.h5’ files in a directory and downloads a 30-second sample from 7digital.

• preproccess.py: USAGE: python preproccess.py [path to MSD mp3 data] This script pre-processes the sound files: it calculates MFCC for a sliding window and saves the result in a ‘.pp’ file.

• formatInput.py: USAGE: python formatInput.py [path to MSD pp data] The script iterates over all ‘.pp’ files and generates ‘data’ and ‘labels’ that will be used as an input to the NN. Moreover, the script outputs a t-SNE graph at the end.

• train.py: USAGE: python train.py This script builds the neural network and feeds it with ‘data’ and ‘labels’. When it is done it will save ‘model.final’.

Complete Installation

References

[1] Tao Feng, Deep learning for music genre classification, University of Illinois. https://courses.engr.illinois.edu/ece544na/fa2014/Tao_Feng.pdf
[2] Aäron van den Oord, Sander Dieleman, Benjamin Schrauwen, Deep content-based music recommendation. http://papers.nips.cc/paper/5004-deep-content-based-music-recommendation.pdf
[3] Sander Dieleman, Recommending music on Spotify with deep learning, August 05, 2014. http://benanne.github.io/2014/08/05/spotify-cnns.html
[4] https://www.tensorflow.org
[5] GTZAN Genre Collection. http://marsyasweb.appspot.com/download/data_sets/
[6] Thierry Bertin-Mahieux, Daniel P.W. Ellis, Brian Whitman, and Paul Lamere. The Million Song Dataset. In Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR 2011), 2011. http://labrosa.ee.columbia.edu/millionsong/
[7] Hendrik Schreiber. Improving genre annotations for the million song dataset. In Proceedings of the 16th International Conference on Music Information Retrieval (ISMIR), pages 241-247, 2015. http://www.tagtraum.com/msd_genre_datasets.html
[8] https://www.7digital.com
[9] https://github.com/bmcfee/librosa
[10] http://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html
[11] Helge Homburg, Ingo Mierswa, Bülent Möller, Katharina Morik and Michael Wurst, A Benchmark Dataset for Audio Classification and Clustering, University of Dortmund, AI Unit. http://sfb876.tu-dortmund.de/PublicPublicationFiles/homburg_etal_2005a.pdf

Author

Matan Lachmish a.k.a. The Big Fat Ninja
https://thebigfatninja.xyz

Attribution

Icon made by Freepik from www.flaticon.com

License

MusicGenreClassification is available under the MIT license. See the LICENSE file for more info.
