Analyzing the Spotify Dataset with Machine Learning and Deep Learning Models

Introduction

Spotify is one of the most popular music streaming platforms in the world, with 345 million monthly active users as of December 2020. As a music streaming service, Spotify has a wealth of data on user preferences and behavior, including information on which songs users like and dislike.

In this analysis, I use a dataset from Kaggle that contains the audio features of songs and a label indicating whether each song is liked or not. My goal is to build machine learning models that predict whether a song is liked based on its audio features.

Dataset Description

The Spotify dataset contains information on over 170,000 songs, including 13 audio features such as danceability, energy, and loudness. The target variable is whether the song is liked or not, represented by a binary value of 1 or 0.

  • danceability: How suitable the track is for dancing, ranging from 0 to 1.
  • energy: How energetic the song is, ranging from 0 to 1.
  • key: The key the track is in.
  • loudness: How loud the song is, in decibels (dB).
  • mode: Whether the track is in a major or minor key.
  • speechiness: Detects the presence of spoken words, ranging from 0 to 1.
  • acousticness: Confidence that the track is acoustic, ranging from 0 to 1.
  • instrumentalness: Predicts whether a track contains no vocals, ranging from 0 to 1.
  • liveness: Measures how likely the song was recorded live, ranging from 0 to 1.
  • valence: Musical positiveness conveyed by a track, ranging from 0 to 1.
  • tempo: Tempo of the song, in beats per minute.
  • duration_ms: Duration of the song, in milliseconds.
  • time_signature: Number of beats per measure.
  • liked: Whether the song is liked (1) or not (0).

Exploratory Data Analysis

To get a better understanding of the Spotify dataset, I created some visualizations of the data. The histograms show that most of the audio features are roughly normally distributed, with the exception of duration_ms, which is skewed to the right.
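One way to back the visual impression with numbers is to compute each feature's skewness next to its histogram counts. The sketch below assumes pandas and NumPy are available and uses synthetic stand-in data rather than the Kaggle file; only the column names come from the feature list above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real dataset (values are illustrative only).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "danceability": rng.normal(0.6, 0.1, 1000).clip(0, 1),
    "tempo": rng.normal(120, 20, 1000),
    # A log-normal draw mimics a right-skewed duration distribution.
    "duration_ms": rng.lognormal(mean=12.2, sigma=0.5, size=1000),
})

# Skewness near 0 suggests a roughly symmetric distribution;
# a clearly positive value indicates a right-skewed one.
skewness = df.skew()
print(skewness.sort_values(ascending=False))

# Bin counts for one feature: the same numbers a histogram plot would draw.
counts, edges = np.histogram(df["duration_ms"], bins=20)
print(counts)
```

With matplotlib installed, `df.hist(bins=20)` would draw the corresponding plots for every column.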

Like vs Dislike

Comparing the number of liked and disliked songs shows the proportion of songs that users tend to like and helps identify any class imbalance in the dataset that may need to be addressed during modeling and analysis.
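A quick check of the class balance could look like the following sketch. The label values are hypothetical; in practice the series would be the dataset's `liked` column.

```python
import pandas as pd

# Hypothetical labels standing in for the dataset's `liked` column.
liked = pd.Series([1, 0, 1, 1, 0, 1, 0, 0, 1, 1], name="liked")

counts = liked.value_counts()                 # absolute count per class
shares = liked.value_counts(normalize=True)   # fraction of each class

print(counts)
print(shares)

# Ratio of the rarer class to the more common one; values far below 1
# would point to an imbalance worth handling (for example with class weights).
balance_ratio = counts.min() / counts.max()
print(round(balance_ratio, 3))
```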

Correlation Matrix

A correlation matrix holds, in each cell, the correlation coefficient between two features. The diagonal of the matrix is always 1, since it represents the correlation between a feature and itself. The matrix can be visualized as a heatmap, with high correlation coefficients represented by brighter colors and low or negative correlation coefficients represented by darker colors.
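The matrix itself can be computed with pandas. This is a minimal sketch on synthetic data, where the relationships between columns are invented so that the example has something to show; they are not measurements from the Spotify dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
energy = rng.uniform(0, 1, 500)
df = pd.DataFrame({
    "energy": energy,
    # Loudness is made to rise with energy so the pair correlates positively.
    "loudness": -20 + 15 * energy + rng.normal(0, 1, 500),
    # Acousticness is made to fall with energy (negative correlation).
    "acousticness": (1 - energy + rng.normal(0, 0.1, 500)).clip(0, 1),
    # Tempo is drawn independently of the other columns.
    "tempo": rng.normal(120, 20, 500),
})

corr = df.corr()  # Pearson correlation by default
print(corr.round(2))
```

With seaborn installed, `sns.heatmap(corr, annot=True, cmap="viridis")` would render this matrix as a heatmap in which higher coefficients appear brighter.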

Feature Engineering

Feature engineering could involve selecting relevant audio features such as danceability, tempo, loudness, and acousticness, and transforming them into more useful features for the machine learning algorithms. For example, the duration of a song in milliseconds might not be a useful feature, but transforming it into duration in minutes could make it more meaningful.
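The duration example amounts to a single unit conversion. A small sketch with made-up durations (the `duration_min` column name is my own choice):

```python
import pandas as pd

# Made-up durations in milliseconds.
df = pd.DataFrame({"duration_ms": [215_000, 187_500, 301_200]})

# 1 minute = 60,000 milliseconds.
df["duration_min"] = df["duration_ms"] / 60_000

print(df)
```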

Duration of Song vs likes/dislikes

A duration of less than 4 minutes is relatively short, especially compared to classical or instrumental music where pieces can often be much longer. This information could be useful for understanding the preferences of listeners in the dataset and may also inform decisions related to song selection or playlist curation.

Higher Danceability, More Likes

“Intensity” is a new variable created by multiplying the “loudness” and “tempo” variables for each song in the dataset. It is meant as a measure of how “intense” a song is, based on the combination of its loudness and tempo. The scatter plot shows a general trend: tracks with higher danceability tend to have higher intensity scores and to receive the most likes.
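The derived feature can be sketched as below, using made-up values. One point to keep in mind when reading the result: Spotify reports loudness in decibels, which are typically negative, so the product with tempo is negative as well and its ordering needs some care.

```python
import pandas as pd

# Made-up tracks; loudness is in dB and tempo in beats per minute.
df = pd.DataFrame({
    "loudness": [-5.0, -12.0, -7.5],
    "tempo": [128.0, 90.0, 110.0],
    "danceability": [0.82, 0.41, 0.65],
})

# Intensity as described in the text: loudness multiplied by tempo.
df["intensity"] = df["loudness"] * df["tempo"]

print(df[["danceability", "intensity"]])
```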

Generate the mood of the song

The aim of the code is to categorize songs by their overall mood, which is determined from the valence, energy, and tempo of each song.
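The exact rules are not spelled out here, so the following is only a sketch of what such a categorization might look like. The function name `classify_mood`, the mood labels, and every threshold are hypothetical.

```python
import pandas as pd

def classify_mood(valence: float, energy: float, tempo: float) -> str:
    """Assign a coarse mood label; all thresholds are illustrative."""
    if valence >= 0.5 and energy >= 0.5:
        # Positive and energetic: tempo separates the two labels.
        return "happy" if tempo < 120 else "energetic"
    if valence >= 0.5:
        return "calm"      # positive but low energy
    if energy >= 0.5:
        return "angry"     # negative and high energy
    return "sad"           # negative and low energy

# Made-up tracks covering each branch of the rule set.
df = pd.DataFrame({
    "valence": [0.9, 0.8, 0.7, 0.2, 0.1],
    "energy": [0.8, 0.9, 0.2, 0.9, 0.2],
    "tempo": [100, 140, 80, 150, 70],
})

df["mood"] = [
    classify_mood(v, e, t)
    for v, e, t in zip(df["valence"], df["energy"], df["tempo"])
]
print(df)
```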

Identify the Genre of the song with KMeans Clustering (Unsupervised)

Genres can be identified in the Spotify dataset using KMeans clustering. The code selects a subset of features, performs feature engineering, drops missing values, and applies KMeans clustering to the data to obtain genre cluster labels. The cluster labels are then mapped to genre names using a dictionary, and a countplot is created to visualize the distribution of genre names in the Spotify dataset.
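The pipeline described above might be sketched as follows with scikit-learn. The data is synthetic, the choice of three clusters and three features is an assumption, and the genre names in the dictionary are placeholders: cluster ids are arbitrary, so real names can only be assigned after inspecting each cluster.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Synthetic stand-in: three well-separated groups of tracks (illustrative).
centers = np.array([[0.8, 0.8, 0.1], [0.3, 0.2, 0.9], [0.2, 0.9, 0.05]])
X = np.vstack([c + rng.normal(0, 0.04, (100, 3)) for c in centers])
df = pd.DataFrame(X, columns=["danceability", "energy", "acousticness"])
df = df.dropna()  # no-op here, mirrors the missing-value step in the text

# Scale features so no single one dominates the distance computation.
scaled = StandardScaler().fit_transform(df)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
df["cluster"] = kmeans.fit_predict(scaled)

# Placeholder mapping from cluster id to a genre name (hypothetical names).
genre_names = {0: "genre_a", 1: "genre_b", 2: "genre_c"}
df["genre"] = df["cluster"].map(genre_names)

print(df["genre"].value_counts())
print(df.groupby("genre")[["danceability", "energy", "acousticness"]].mean())
```

The per-genre feature averages printed at the end are what one would look at before replacing the placeholder names with meaningful ones.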

Machine Learning and Deep Learning Models

I built four different models to predict whether a song is liked or not based on its audio features.

Logistic Regression

Logistic regression is a simple but powerful classification algorithm that uses a linear model to predict the probability of a binary outcome. I used the logistic regression implementation from scikit-learn with default hyperparameters.
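As a minimal sketch of this step, the example below fits scikit-learn's `LogisticRegression` with its defaults on synthetic data. The two features and the rule that generates the labels are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Synthetic stand-in: two audio features in [0, 1] (illustrative only).
X = rng.uniform(0, 1, size=(400, 2))
# Invented label rule: a song is "liked" when its features sum high.
y = (X[:, 0] + X[:, 1] + rng.normal(0, 0.1, 400) > 1.0).astype(int)

model = LogisticRegression()  # scikit-learn defaults, as in the text
model.fit(X, y)

# predict_proba returns [P(disliked), P(liked)] for each row;
# column 1 is the probability of the positive class.
proba = model.predict_proba([[0.9, 0.9], [0.1, 0.1]])[:, 1]
print(proba)
```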

Random Forest

Random forest is an ensemble learning algorithm that combines multiple decision trees to improve accuracy and reduce overfitting. I used the random forest implementation from scikit-learn with 100 trees and default hyperparameters.
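A sketch of the random forest setup with 100 trees follows. The data is synthetic, with a third column of pure noise added so that the feature importances have something to distinguish; `random_state` is set only to make the run repeatable.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Columns: two informative features and one pure-noise feature (synthetic).
X = rng.uniform(0, 1, size=(400, 3))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# Importances sum to 1; the noise column should rank lowest.
importances = forest.feature_importances_
print(importances.round(3))
```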

XGBoost

XGBoost is another ensemble learning algorithm, based on gradient-boosted decision trees, that is designed to be highly scalable and efficient. I used the XGBoost implementation from the xgboost library with 100 trees and default hyperparameters.

Neural Network

A neural network is a type of deep learning algorithm that is inspired by the structure of the human brain. I used a simple neural network with two hidden layers and the ReLU activation function.
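The framework and layer sizes are not stated here, so the sketch below swaps in scikit-learn's `MLPClassifier` as a stand-in for a network with two hidden layers and ReLU activations. The layer widths (32 and 16), the scaling step, and the synthetic data are all assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in data with an invented label rule.
X = rng.uniform(0, 1, size=(600, 3))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

net = make_pipeline(
    StandardScaler(),  # neural networks usually train better on scaled inputs
    MLPClassifier(
        hidden_layer_sizes=(32, 16),  # two hidden layers (assumed widths)
        activation="relu",
        max_iter=1000,
        random_state=0,
    ),
)
net.fit(X, y)

train_accuracy = net.score(X, y)
print(round(train_accuracy, 3))
```

Accuracy on the training data is shown only to confirm that the network fits; it says nothing about performance on unseen songs.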

Model Evaluation and Accuracy

To evaluate the performance of each model, I split the dataset into training and testing sets, trained each model on the training set, and computed its accuracy on the test set.
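The split-and-score procedure can be sketched as follows for two of the models. The 80/20 split, the stratification, and the synthetic data are assumptions, so the printed accuracies are not the results of this analysis.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-in data with an invented label rule.
X = rng.uniform(0, 1, size=(500, 4))
y = (X[:, 0] + X[:, 1] + rng.normal(0, 0.1, 500) > 1.0).astype(int)

# Hold out 20% of the rows for testing, keeping the class ratio similar.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

models = {
    "logistic_regression": LogisticRegression(),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)                  # train on the training set only
    predictions = model.predict(X_test)          # predict on unseen rows
    scores[name] = accuracy_score(y_test, predictions)

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```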

Complete Code

Conclusion

In this analysis, I used machine learning models to predict whether a song is liked or not based on its audio features. I found that XGBoost and Logistic Regression performed the best out of the four models, with an accuracy score of 85%. This type of analysis could be useful for music streaming platforms like Spotify to personalize their recommendations to users. Because of the limited amount of data, however, the deep learning model's accuracy is not a reliable indicator and can be ignored.

2 Comments

  1. jack resy

    looks good. why deep learning model’s accuracy is low?

  2. Julia Fernz

    Awesome. Keep it up!
