Song Popularity Prediction

Data Analysis and Feature Engineering

Python, Pandas, Feature Engineering, Linear Regression, EDA

Problem to Solve

This project identifies the factors that influence the number of likes a song video receives on YouTube, allowing prediction of popularity based on musical characteristics and engagement metrics. The analysis uses a dataset with 28 variables that combine information from Spotify, YouTube, and musical composition properties.

Model used: Linear Regression

1. Dataset Information

The dataset contains information about Spotify songs and their corresponding videos on YouTube, with 28 variables including musical characteristics, engagement metrics, and metadata.

Categorical Variables (object)

  • Artist: Artist name
  • Url_spotify: Artist URL on Spotify
  • Track: Song name as contained in Spotify
  • Album: Album containing the song
  • Album_type: Album type (album, single, or compilation)
  • Uri: Spotify link to find the song in the API
  • Url_youtube: Song video URL on YouTube
  • Title: Video title on YouTube
  • Channel: Name of the channel that published the video
  • official_video: Boolean variable indicating if the video is official
  • Description: Video description on YouTube
  • Licensed: Indicates if the video is licensed

Numeric Variables (float64)

  • Danceability: How suitable the song is for dancing
  • Energy: Measure of intensity and activity
  • Key: The key the track is in
  • Loudness: The overall volume of the track in decibels (dB)
  • Speechiness: Detects the presence of spoken words in the track
  • Acousticness: Confidence measure to know if the track is acoustic
  • Instrumentalness: Indicates if a track contains no vocals
  • Liveness: Detects the presence of an audience in the recording
  • Valence: Describes the musical positivity conveyed by a track
  • Tempo: The overall estimated tempo in beats per minute (BPM)
  • Duration_ms: Song duration in milliseconds
  • Views: Views on YouTube
  • Likes: Likes on YouTube
  • Comments: Comments on YouTube
  • Stream: Number of song plays on Spotify

Dataset Example

Artist   | Track           | Album         | Album_type | Danceability | Energy | Views       | Likes     | Comments | Stream
Gorillaz | Feel Good Inc.  | Demon Days    | album      | 0.818        | 0.705  | 693,555,221 | 6,220,896 | 169,907  | 1.04e+09
Gorillaz | Rhinestone Eyes | Plastic Beach | album      | 0.676        | 0.703  | 72,011,645  | 1,079,128 | 31,003   | 3.10e+08

2. Exploratory Data Analysis (EDA)

The exploratory analysis focused on three questions about the factors that influence song popularity.

Is there a relationship between song duration and number of likes?

Relationship between duration and likes

Is there a relationship between likes, comments, and views?

Relationship between likes, comments, and views
Views, comments, and likes
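
One way to put a number on this question is a pairwise correlation matrix. The sketch below is illustrative only: the first two rows reuse values from the dataset example, the last two rows are hypothetical, and the names `songs` and `engagement_corr` are not from the original project.

```python
import pandas as pd

# First two rows come from the dataset example; the last two are made up
# so that the correlation has more than two points to work with.
songs = pd.DataFrame({
    "Views": [693_555_221, 72_011_645, 1_000_000, 250_000],
    "Likes": [6_220_896, 1_079_128, 20_000, 4_000],
    "Comments": [169_907, 31_003, 900, 150],
})

# Pearson correlation between every pair of engagement metrics
engagement_corr = songs[["Likes", "Comments", "Views"]].corr()
print(engagement_corr.round(2))
```

Values close to 1 indicate that the metrics rise together, which is the pattern the plots above are meant to show.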

Does license influence the number of likes?

The relationship between a video's license status and its number of likes was analyzed to determine whether licensing influences popularity.

License influence on likes
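
A comparison like this can be sketched with a group-by summary. The values below are partly hypothetical and the names `videos` and `likes_by_license` are assumptions for illustration, not the project's actual code.

```python
import pandas as pd

# Hypothetical sample: Licensed flag and like counts per video
videos = pd.DataFrame({
    "Licensed": [True, True, False, False, True],
    "Likes": [6_220_896, 1_079_128, 15_000, 42_000, 300_000],
})

# Summarize likes separately for licensed and unlicensed videos.
# The median is included because like counts tend to be heavily skewed.
likes_by_license = videos.groupby("Licensed")["Likes"].agg(["count", "median", "mean"])
print(likes_by_license)
```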

3. Normality Analysis of Numeric Variables

The distribution of each numeric variable was analyzed to determine whether it followed a normal distribution, because the result decides which imputation, outlier-handling, and scaling techniques are applied in the following steps.

Normality analysis
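
The page does not say which normality check was used. As one possible approach, the sketch below applies the D'Agostino-Pearson test from SciPy to split columns into normal and non-normal groups. The helper `is_roughly_normal`, the 0.05 threshold, and the synthetic columns are all assumptions.

```python
import numpy as np
import pandas as pd
from scipy import stats

def is_roughly_normal(values: pd.Series, alpha: float = 0.05) -> bool:
    """Return True when the D'Agostino-Pearson test does not reject normality."""
    clean = values.dropna()
    _, p_value = stats.normaltest(clean)
    return bool(p_value > alpha)

# Synthetic stand-ins: one bell-shaped column and one heavy-tailed column
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "bell_shaped": rng.normal(loc=0.6, scale=0.1, size=2000),
    "heavy_tailed": rng.lognormal(mean=10, sigma=2, size=2000),
})

normal_cols = [c for c in demo.columns if is_roughly_normal(demo[c])]
non_normal_cols = [c for c in demo.columns if c not in normal_cols]
print("normal:", normal_cols)
print("non-normal:", non_normal_cols)
```

With large samples, formal tests reject normality for even small deviations, so histograms or Q-Q plots are a useful complement.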

4. Missing Data Treatment

Different imputation strategies were applied according to variable type and distribution:

Normal Variables

Missing data is replaced with the mean

Non-Normal Variables

Missing data is replaced with the median

Categorical Variables

Missing data is replaced with the mode

Implementation Example - Non-Normal Variables

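import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# variables_NOnormales is assumed to hold the non-normal numeric columns
# Replace missing values in non-normal variables with the column median
imputador = SimpleImputer(missing_values=np.nan, strategy='median')

# Fit the imputer and transform the dataset (returns a NumPy array)
transformados = imputador.fit_transform(variables_NOnormales)

# Verify that no nulls remain in any column
missing_count = pd.DataFrame(transformados).isnull().sum()
print('The number of nulls in each column is:')
print(missing_count)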
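
The other two strategies follow the same pattern. The sketch below uses pandas `fillna` instead of `SimpleImputer` only to keep the example short; the `demo` frame and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one normal numeric column and one categorical column
demo = pd.DataFrame({
    "Danceability": [0.818, 0.676, np.nan, 0.500],
    "Album_type": ["album", "single", "album", np.nan],
})

# Normal variable: fill missing values with the column mean
demo["Danceability"] = demo["Danceability"].fillna(demo["Danceability"].mean())

# Categorical variable: fill missing values with the most frequent value (mode)
demo["Album_type"] = demo["Album_type"].fillna(demo["Album_type"].mode().iloc[0])

print(demo)
```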

5. Outlier Treatment

Atypical values were identified and treated using different methods according to variable distribution.

Normal Variables - Z-Score

For variables with normal distribution, the Z-Score method was used to identify outliers. Values that are more than 3 standard deviations from the mean are considered outliers and are replaced with the mean.

Z-Score: z = (x - μ) / σ

If |z| > 3, then x is an outlier
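
A minimal sketch of this rule, assuming a pandas Series as input. The function name `replace_zscore_outliers` and the synthetic `tempo` data are illustrative and not taken from the project.

```python
import numpy as np
import pandas as pd

def replace_zscore_outliers(values: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Replace values whose |z| exceeds the threshold with the column mean."""
    mean = values.mean()
    std = values.std(ddof=0)
    z = (values - mean) / std
    return values.mask(z.abs() > threshold, mean)

# Synthetic, roughly normal data with one injected extreme value
rng = np.random.default_rng(42)
tempo = pd.Series(rng.normal(120, 10, 500))
tempo.iloc[0] = 400.0

cleaned = replace_zscore_outliers(tempo)
print(tempo.iloc[0], "->", round(cleaned.iloc[0], 2))
```

Note that the mean and standard deviation here are computed with the outliers still present, so extreme values inflate σ and make the rule slightly more permissive.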

Non-Normal Variables - IQR (Interquartile Range)

For variables without a normal distribution, the IQR (Interquartile Range) method was used to identify outliers. However, because so many values were flagged, the outliers were kept in the dataset to preserve information.

IQR: IQR = Q3 - Q1

Lower limit = Q1 - 1.5×IQR

Upper limit = Q3 + 1.5×IQR

Values below the lower limit or above the upper limit are flagged as outliers.
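
The limits can be computed directly with pandas quantiles. This is a sketch with hypothetical values; `iqr_limits` and `views` are illustrative names. Consistent with the decision above, it only flags outliers and does not remove them.

```python
import pandas as pd

def iqr_limits(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) limits Q1 - k*IQR and Q3 + k*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical skewed sample with one very large value
views = pd.Series([10, 12, 13, 15, 16, 18, 20, 500])

lower, upper = iqr_limits(views)
outlier_mask = (views < lower) | (views > upper)
print(f"limits: [{lower}, {upper}], outliers flagged: {outlier_mask.sum()}")
```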

Visual Example - Danceability Variable

With Outliers

Danceability with outliers

Without Outliers

Danceability without outliers

6. Data Processing

6.1 Categorical Variable Encoding

LabelEncoder from scikit-learn was used to convert categorical variables into numeric values, allowing the model to process them.

Before Encoding

Artist      | Album_type | Licensed
Gorillaz    | album      | True
SICK LEGEND | single     | True

After Encoding

Artist | Album_type | Licensed
687    | 0          | 1
1234   | 2          | 1
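
A short sketch of how `LabelEncoder` assigns these codes, assuming Album_type takes the three values album, compilation, and single. The encoder sorts the classes alphabetically, which would explain why "album" maps to 0 and "single" maps to 2 in the table above.

```python
from sklearn.preprocessing import LabelEncoder

# Assumed set of Album_type values; the sample list is illustrative
album_types = ["album", "single", "compilation", "album"]

encoder = LabelEncoder()
codes = encoder.fit_transform(album_types)

print(list(encoder.classes_))  # classes in sorted order
print(list(codes))             # integer code for each input value
```

The integer codes imply an ordering that the categories do not really have, which is worth keeping in mind when such columns feed a linear model.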

6.2 Variable Scaling

Different scaling techniques were applied according to variable distribution:

Standard Scaler

For normal variables. Standardizes the data by subtracting the mean and dividing by the standard deviation, so each variable ends up with zero mean and unit variance.

z = (x - μ) / σ

Robust Scaler

For non-normal variables. Centers the data on the median and scales by the IQR, which makes it less sensitive to outliers.

x_scaled = (x - median) / IQR
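
Both formulas correspond to scikit-learn's `StandardScaler` and `RobustScaler` with default settings. The sketch below applies them to two small hypothetical columns; the array names and values are illustrative.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Hypothetical columns: one symmetric, one with an extreme value
normal_like = np.array([[0.50], [0.60], [0.70], [0.80], [0.90]])
skewed = np.array([[10.0], [12.0], [15.0], [18.0], [500.0]])

# z = (x - mean) / std
standardized = StandardScaler().fit_transform(normal_like)

# x_scaled = (x - median) / IQR
robust = RobustScaler().fit_transform(skewed)

print(standardized.ravel().round(3))
print(robust.ravel().round(3))
```

In the robust output the median value maps to 0, and the extreme value stays large instead of compressing the rest of the column, as it would under mean and standard deviation scaling.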

6.3 Correlation Analysis

The correlation map was analyzed to identify highly correlated variables, so that the model is not given redundant information.

Correlation map

Variables with high correlation were eliminated to avoid multicollinearity in the linear regression model.
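
One common way to do this is to scan the upper triangle of the absolute correlation matrix and drop one variable from each highly correlated pair. The page does not state the cutoff that was used, so the 0.9 threshold, the function name `drop_highly_correlated`, and the synthetic data below are assumptions.

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(features: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Drop the later column of every pair whose |correlation| exceeds the threshold."""
    corr = features.corr().abs()
    # Keep only the upper triangle so each pair is checked once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return features.drop(columns=to_drop)

# Synthetic data: Loudness is built to be almost a linear function of Energy
rng = np.random.default_rng(1)
energy = rng.uniform(0, 1, 200)
demo = pd.DataFrame({
    "Energy": energy,
    "Loudness": energy * 20 - 25 + rng.normal(0, 0.5, 200),
    "Valence": rng.uniform(0, 1, 200),
})

reduced = drop_highly_correlated(demo)
print(list(reduced.columns))
```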

7. Model Results

A Linear Regression model was trained and its performance was evaluated using multiple metrics.

Most Important Variables

It was determined that the following variables are the most important for predicting likes:

  • Album_type
  • Valence
  • Views
  • Comments

Evaluation Metrics

  • R-squared (R²): 0.87. The model explains 87% of the variance in likes.
  • MAE (Mean Absolute Error): 0.51. Average absolute error of 0.51 units.
  • MSE (Mean Squared Error): 1.66
  • RMSE (Root Mean Squared Error): 1.29
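
The metrics are related: RMSE is the square root of MSE, and √1.66 ≈ 1.29 matches the reported values. The sketch below shows how these four metrics can be computed with scikit-learn. It uses synthetic data in place of the real features, and the train/test split, coefficients, and variable names are assumptions rather than the project's actual setup.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for four scaled predictors and a scaled target
rng = np.random.default_rng(7)
X = rng.normal(size=(500, 4))
y = X @ np.array([0.2, 0.1, 1.5, 0.8]) + rng.normal(scale=0.5, size=500)

# Hold out 20% of the rows for evaluation (an assumed split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

r2 = r2_score(y_test, pred)
mae = mean_absolute_error(y_test, pred)
mse = mean_squared_error(y_test, pred)
rmse = np.sqrt(mse)  # RMSE is the square root of MSE

print(f"R2={r2:.2f}  MAE={mae:.2f}  MSE={mse:.2f}  RMSE={rmse:.2f}")
```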

Results Interpretation

The Linear Regression model achieved an R² of 0.87, indicating that 87% of the variability in the number of likes can be explained by the musical and engagement characteristics included in the model. These results suggest that:

  • Musical characteristics (danceability, energy, tempo, etc.) contribute to video popularity, with valence among the most important predictors
  • Engagement metrics (views, comments, streams) are strongly related to likes
  • The model predicts the number of likes a video will receive with good accuracy based on these characteristics