Project Spotify Haseeb

project-spotify-haseeb
November 4, 2023
0.1 “A Complete EDA of Top Spotify Songs in 73 Countries”

By Muhammad Haseeb Abbasi
Date: 30-10-2023
0.2 1.0 DATA SET:

This data was collected from kaggle.com and can be accessed from here.
(Note: Since this data is updated daily, the data you find through this link may be more recent
than the data used in this notebook. The exact dataset used in this notebook can be accessed
through this Google Drive Link.)
Author/Collaborator of Dataset: asaniczka (kaggle account)
0.2.1 General Information:

This dataset contains the daily top 50 songs on Spotify for each country. The data is updated
daily and includes features such as song duration, artist details, album information, and
song popularity. The copy loaded in this notebook has 47,472 rows and 25 columns. The columns
are as follows:
1. spotify_id: The unique identifier for the song in the Spotify database.
2. name: The title of the song.
3. artists: The name(s) of the artist(s) associated with the song.
4. daily_rank: The daily rank of the song among the top 50 songs for this country.
5. daily_movement: The change in ranking compared to the previous day for the same country.
6. weekly_movement: The change in ranking compared to the previous week for the same country.
7. country: The ISO code of the country. (If NULL, then the playlist is 'Global'. Since Global doesn't have an ISO code, it is not put here.)
8. snapshot_date: The date on which the data was collected from the Spotify API.
9. popularity: A measure of the song's current popularity on Spotify.
10. is_explicit: Indicates whether the song contains explicit lyrics.
11. duration_ms: The duration of the song in milliseconds.
12. album_name: The title of the album the song belongs to.
13. album_release_date: The release date of the album the song belongs to.
14. danceability: A measure of how suitable the song is for dancing based on various musical elements.
15. energy: A measure of the intensity and activity level of the song.
16. key: The key of the song.
17. loudness: The overall loudness of the song in decibels.
18. mode: Indicates whether the song is in a major or minor key.
19. speechiness: A measure of the presence of spoken words in the song.
20. acousticness: A measure of the acoustic quality of the song.
21. instrumentalness: A measure of the likelihood that the song does not contain vocals.
22. liveness: A measure of the presence of a live audience in the recording.
23. valence: A measure of the musical positiveness conveyed by the song.
24. tempo: The tempo of the song in beats per minute.
25. time_signature: The estimated overall time signature of the song.
1.1 Provenance:
Source: Data was collected via the Spotify API.
Collection methodology: Data is collected daily by querying the Spotify API for the top 50
songs for each country.
0.3 2 Cleaning the Data

0.4 Importing Libraries and settings
[152]: # Importing Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.subplots as sp
[153]: # Numerical values may need to be formatted with thousands separators
# and a fixed number of decimals, so the number formats are defined here.
# nf0 is the number format with zero decimals; nf2 is the format with two decimals.
nf0 = lambda x: f'{x:,.0f}' if isinstance(x, (int, float)) else x

nf2 = lambda x: f'{x:,.2f}' if isinstance(x, (int, float)) else x
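As a quick illustration of what the two formatters return, here is a minimal sketch. The formatters are repeated so the snippet runs on its own; the row and column counts are the ones reported by df.shape in this notebook, and 73 is the number of charts named in the title.

```python
# Same formatters as in the cell above, repeated so the snippet is self-contained
nf0 = lambda x: f'{x:,.0f}' if isinstance(x, (int, float)) else x
nf2 = lambda x: f'{x:,.2f}' if isinstance(x, (int, float)) else x

rows, cols = 47472, 25      # shape reported by df.shape in this notebook
print(nf0(rows * cols))     # thousands separators, no decimals
print(nf2(rows / 73))       # average rows per chart, two decimals
print(nf0('n/a'))           # non-numeric values pass through unchanged
```

Note that the isinstance check covers plain Python numbers; values of other numeric types would be returned unformatted.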
[154]: # loading the dataset...

df = pd.read_csv('./project_data.csv')
[155]: # setting options to show maximum of row and columns

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
[156]: # disabling Warnings

import warnings
warnings.simplefilter(action='ignore')
[157]: df.columns
[157]: Index(['spotify_id', 'name', 'artists', 'daily_rank', 'daily_movement',

'weekly_movement', 'country', 'snapshot_date', 'popularity',
'is_explicit', 'duration_ms', 'album_name', 'album_release_date',
'danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness',
'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo',
'time_signature'],
dtype='object')
0.5 2.1 Counting the Data
[158]: # checking the dimensions of the dataset
print(df.shape)
(47472, 25)
[159]: # no of rows, columns, and cells in the data

print("Rows=",len(df))
print("Columns=",len(df.columns))
print("Size=",df.size)
Rows= 47472
Columns= 25
Size= 1186800
The dataset has 47,472 rows and 25 columns, for a total of 1,186,800 cells.
0.6 2.2 Studying Datatype

[160]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47472 entries, 0 to 47471
Data columns (total 25 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 spotify_id 47472 non-null object
1 name 47471 non-null object
2 artists 47471 non-null object
3 daily_rank 47472 non-null int64
4 daily_movement 47472 non-null int64
5 weekly_movement 47472 non-null int64
6 country 46820 non-null object
7 snapshot_date 47472 non-null object
8 popularity 47472 non-null int64
9 is_explicit 47472 non-null bool
10 duration_ms 47472 non-null int64
11 album_name 47471 non-null object
12 album_release_date 47471 non-null object
13 danceability 47472 non-null float64
14 energy 47472 non-null float64
15 key 47472 non-null int64
16 loudness 47472 non-null float64
17 mode 47472 non-null int64
18 speechiness 47472 non-null float64
19 acousticness 47472 non-null float64
20 instrumentalness 47472 non-null float64
21 liveness 47472 non-null float64
22 valence 47472 non-null float64
23 tempo 47472 non-null float64
24 time_signature 47472 non-null int64
dtypes: bool(1), float64(9), int64(8), object(7)
memory usage: 8.7+ MB
[161]: # examining the data types of the columns

print(df.dtypes)
spotify_id object
name object
artists object
daily_rank int64
daily_movement int64
weekly_movement int64
country object
snapshot_date object
popularity int64
is_explicit bool
duration_ms int64
album_name object
album_release_date object
danceability float64
energy float64
key int64
loudness float64
mode int64
speechiness float64
acousticness float64
instrumentalness float64
liveness float64
valence float64
tempo float64
time_signature int64
dtype: object
Observation:
The output lists the data type of each column. Nine columns are float64 (the audio features
such as 'danceability', 'energy', and 'tempo'), eight are int64 (including 'key', 'mode', and
'time_signature'), seven are object (including 'country'), and 'is_explicit' is bool. The two
date columns, 'snapshot_date' and 'album_release_date', are stored as object (strings) rather
than as datetimes.
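The dtype listing above shows snapshot_date and album_release_date stored as object, that is, as strings. If date arithmetic is needed later, they could be converted along the following lines. This is a sketch on two made-up rows, not a step the notebook performs; the days_since_release column is a hypothetical name used only for illustration.

```python
import pandas as pd

# Two made-up rows that mimic the string dates in the dataset
sample = pd.DataFrame({
    'snapshot_date': ['2023-10-21', '2023-10-27'],
    'album_release_date': ['2023-06-22', None],
})

# errors='coerce' turns missing or unparseable values into NaT instead of raising
for col in ['snapshot_date', 'album_release_date']:
    sample[col] = pd.to_datetime(sample[col], errors='coerce')

# Days between album release and chart snapshot (NaN where a date is missing)
sample['days_since_release'] = (
    sample['snapshot_date'] - sample['album_release_date']
).dt.days
print(sample.dtypes)
print(sample)
```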
0.7 Observing Summary

[162]: # summary of the dataset
print(df.describe())
daily_rank daily_movement weekly_movement popularity \
count 47472.000000 47472.000000 47472.000000 47472.000000
mean 25.504655 2.331227 12.008658 77.245682
std 14.440137 9.818369 17.135496 17.971609
min 1.000000 -46.000000 -43.000000 0.000000
25% 13.000000 -1.000000 0.000000 66.000000
50% 25.000000 0.000000 6.000000 83.000000
75% 38.000000 2.000000 25.000000 90.000000
max 50.000000 49.000000 49.000000 100.000000
duration_ms danceability energy key loudness \

count 47472.000000 47472.000000 47472.000000 47472.000000 47472.000000
mean 194995.998673 0.690909 0.646884 5.481273 -6.632776
std 49294.869237 0.134432 0.160396 3.476956 2.640518
min 0.000000 0.222000 0.024200 0.000000 -22.497000
25% 163034.000000 0.598000 0.548000 2.000000 -8.043000
50% 188729.000000 0.706000 0.670000 6.000000 -6.197000
75% 221217.000000 0.799000 0.755000 8.000000 -4.912000
max 641941.000000 0.974000 0.997000 11.000000 1.155000
mode speechiness acousticness instrumentalness \

count 47472.000000 47472.000000 47472.000000 47472.000000
mean 0.499073 0.109554 0.285404 0.018097
std 0.500004 0.100132 0.253707 0.094120
min 0.000000 0.023200 0.000008 0.000000
25% 0.000000 0.042100 0.085000 0.000000
50% 0.000000 0.065150 0.196000 0.000002
75% 1.000000 0.144000 0.442000 0.000093
max 1.000000 0.784000 0.984000 0.968000
liveness valence tempo time_signature

count 47472.000000 47472.000000 47472.000000 47472.000000
mean 0.171740 0.527867 122.018005 3.914034
std 0.122397 0.226121 27.648212 0.427218
min 0.015400 0.037300 47.914000 1.000000
25% 0.098000 0.362000 99.940000 4.000000
50% 0.120000 0.524000 120.011000 4.000000
75% 0.211000 0.707000 140.054000 4.000000
max 0.968000 0.978000 217.969000 5.000000
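Intuition:
The summary statistics of the dataset reveal important information. The 'duration_ms' column
has a wide range, from 0 ms to 641,941 ms (about 10.7 minutes); a duration of 0 ms cannot be a
real song and is examined in the next section. The 'popularity' column has a mean of 77.25, a
median of 83, and a standard deviation of 17.97, so most charting songs score high but the
spread is considerable. The 'loudness' column has a negative mean (-6.63 dB). This is expected,
because Spotify reports loudness in decibels with values typically between -60 and 0 dB; it
does not mean that the songs are quiet.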
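Durations in milliseconds are hard to judge at a glance. A small helper such as the one below (ms_to_min_sec is a hypothetical name, not part of the notebook) converts them to minutes and seconds; the two inputs are the median and maximum of duration_ms from the describe() output above.

```python
def ms_to_min_sec(ms):
    """Format a duration given in milliseconds as m:ss."""
    total_seconds = round(ms / 1000)
    minutes, seconds = divmod(total_seconds, 60)
    return f'{minutes}:{seconds:02d}'

# Median and maximum duration_ms taken from the describe() output
print(ms_to_min_sec(188729))
print(ms_to_min_sec(641941))
```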
0.8 2.3 Missing Values and Anomaly

Observations:
The describe() output shows that the minimum of duration_ms is 0 milliseconds. This is an
anomaly, because every real song must have a duration greater than zero.
0.9 2.3.a: Dealing with Anomaly

[163]: # Checking for the anomaly:
# songs whose duration_ms == 0
df[df['duration_ms']==0]
[163]: spotify_id name artists daily_rank daily_movement \

34282 6yxtsR3nc3aUL1wcbLn8A3 NaN NaN 30 1
weekly_movement country snapshot_date popularity is_explicit \

34282 20 NG 2023-10-21 0 False
duration_ms album_name album_release_date danceability energy key \

34282 0 NaN NaN 0.791 0.515 1
loudness mode speechiness acousticness instrumentalness liveness \

34282 -8.178 0 0.168 0.554 0.288 0.0821
valence tempo time_signature

34282 0.507 102.932 4
[164]: # Now checking the missing values:

print(df.isnull().sum())
spotify_id 0
name 1
artists 1
daily_rank 0
daily_movement 0
weekly_movement 0
country 652
snapshot_date 0
popularity 0
is_explicit 0
duration_ms 0
album_name 1
album_release_date 1
danceability 0
energy 0
key 0
loudness 0
mode 0
speechiness 0
acousticness 0
instrumentalness 0
liveness 0
valence 0
tempo 0
time_signature 0
dtype: int64
0.10 2.3 b) : Imputing the missing values

[165]: # replacing missing values in country with 'GLO' (the Global playlist)
df['country'] = df['country'].fillna('GLO')
[166]: # Dealing with anomaly

# modifying the df to exclude song whose duration_ms ==0
df=df[df['duration_ms']!=0]
[167]: df.isnull().sum()
[167]: spotify_id 0
name 0
artists 0
daily_rank 0
daily_movement 0
weekly_movement 0
country 0
snapshot_date 0
popularity 0
is_explicit 0
duration_ms 0
album_name 0
album_release_date 0
danceability 0
energy 0
key 0
loudness 0
mode 0
speechiness 0
acousticness 0
instrumentalness 0
liveness 0
valence 0
tempo 0
time_signature 0
dtype: int64
Observations:
The country column had 652 missing values, which belong to the Global playlist; they were
filled with the keyword 'GLO' using the fillna() method. The single missing value in each of
name, artists, album_name and album_release_date came from the anomalous row with
duration_ms == 0, so removing that row also removed those missing values.
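The two cleaning steps can be bundled into one function so that they are easy to repeat whenever the file is reloaded. The sketch below runs on a three-row toy frame; clean_charts is a hypothetical helper name, and resetting the index is an extra choice made here that the notebook itself does not make.

```python
import numpy as np
import pandas as pd

def clean_charts(raw):
    """Fill the missing Global country code and drop zero-length tracks."""
    cleaned = raw.copy()                              # leave the input untouched
    cleaned['country'] = cleaned['country'].fillna('GLO')
    cleaned = cleaned[cleaned['duration_ms'] > 0].reset_index(drop=True)
    return cleaned

# Toy data: one Global row (missing country) and one anomalous zero-length row
toy = pd.DataFrame({
    'name': ['Song A', 'Song B', None],
    'country': [np.nan, 'NG', 'NG'],
    'duration_ms': [200600, 171712, 0],
})
cleaned = clean_charts(toy)
print(cleaned)
print(cleaned.isnull().sum().sum())   # total number of missing cells left
```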
0.11 2.4 Duplicate values

[168]: # checking duplicate rows
df.duplicated().value_counts()
[168]: False 47471

Name: count, dtype: int64
The dataset has zero duplicate rows.
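duplicated() with no arguments only flags rows that match in every column. A stricter check, sketched below on toy data, looks for repeats of the same song in the same country on the same day. Treating spotify_id, country and snapshot_date as the identifying key is an assumption made for this example.

```python
import pandas as pd

toy = pd.DataFrame({
    'spotify_id': ['a1', 'a1', 'a1', 'b2'],
    'country': ['US', 'US', 'PK', 'US'],
    'snapshot_date': ['2023-10-27'] * 4,
    'daily_rank': [1, 2, 1, 3],
})

key = ['spotify_id', 'country', 'snapshot_date']     # assumed identifying columns
full_row_dupes = toy.duplicated().sum()              # identical in every column
key_dupes = toy.duplicated(subset=key).sum()         # same song, chart and day
print(full_row_dupes, key_dupes)
```

In the toy frame the first two rows differ only in daily_rank, so the full-row check misses them while the key-based check flags one repeat.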
0.12 2.5 Unique Values and their counts

[169]: # checking for unique values
unique_countries = df['country'].unique()
print(unique_countries)
['GLO' 'ZA' 'VN' 'VE' 'UY' 'US' 'UA' 'TW' 'TR' 'TH' 'SV' 'SK' 'SG' 'SE'
'SA' 'RO' 'PY' 'PT' 'PL' 'PK' 'PH' 'PE' 'PA' 'NZ' 'NO' 'NL' 'NI' 'NG'
'MY' 'MX' 'MA' 'LV' 'LU' 'LT' 'KZ' 'KR' 'JP' 'IT' 'IS' 'IN' 'IL' 'IE'
'ID' 'HU' 'HN' 'HK' 'GT' 'GR' 'GB' 'FR' 'FI' 'ES' 'EG' 'EE' 'EC' 'DO'
'DK' 'DE' 'CZ' 'CR' 'CO' 'CL' 'CH' 'CA' 'BY' 'BR' 'BO' 'BG' 'BE' 'AU'
'AT' 'AR' 'AE']
[170]: # valuecounts for each country

country_counts = df['country'].value_counts()
print(country_counts)
country
ZA 655
AE 653
HK 653
RO 653
GR 653
DO 653
SK 653
PK 653
HN 653
KR 653
US 653
NI 653
LV 653
AT 653
IL 652
LT 652
IE 652
IS 652
GT 652
HU 652
MX 652
CR 652
CO 652
CH 652
CA 652
BY 652
BG 652
AU 652
MA 652
GLO 652
TW 652
SA 652
TH 652
SV 652
VE 652
PA 652
SG 652
EC 651
NG 651
EG 651
EE 651
DK 651
FR 651
DE 651
BR 651
CL 651
TR 651
BO 651
BE 651
UA 651
PY 651
PL 651
ID 651
VN 651
PE 651
IN 651
IT 651
NZ 651
NO 651
KZ 651
NL 651
UY 650
AR 650
MY 650
ES 650
CZ 650
FI 650
GB 650
SE 650
PT 650
PH 650
JP 650
LU 557
Name: count, dtype: int64
[171]: plt.figure(figsize=(20,10))
sns.countplot(x='country', data=df)
plt.xticks(rotation=90)
plt.show()
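This plot shows the number of rows (chart entries) for each country. The bars are nearly equal
in height, at 650 to 655 rows per country, with 'ZA' the highest at 655; the exception is 'LU',
which has only 557 rows.
We can use similar methods to check the unique values in other columns of the dataset.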
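Countries with unusually few rows can also be picked out numerically rather than by reading the bars. The sketch below uses a handful of counts copied from the value_counts() output above; the 5% threshold is an arbitrary choice for illustration.

```python
import pandas as pd

# Row counts copied from the value_counts() output (a subset of the countries)
country_counts = pd.Series({'ZA': 655, 'US': 653, 'GLO': 652, 'JP': 650, 'LU': 557})

typical = country_counts.median()
# Flag charts whose row count is more than 5% below the typical count
short = country_counts[country_counts < 0.95 * typical]
print(short)
```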
[172]: # using plotly
# Create a box plot of the 'popularity' column grouped by 'country'
fig = px.box(df, x='country', y='popularity',
             title='Distribution of Song Popularity by Country')
fig.show()
Observations:
The plot generated using Plotly is a box plot that visualizes the distribution of song popularity
across the countries. The x-axis shows the ISO country codes, while the y-axis shows the song
popularity.
The summary statistics above put the overall median popularity at 83, with the middle half of
the values between 66 and 90. Values above 90 are therefore common rather than outliers; the
unusual values lie at the low end, down to a popularity of 0.
Insights:
Popularity is skewed toward high values: the mean (77.25) lies below the median (83), and a tail
of low-popularity songs stretches toward 0. 'GLO' and 'US' stand out with a wider range of song
popularity compared to other countries. This plot aids in identifying differences in song
popularity between countries and detecting any outliers in the data.
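Statements about medians and spread per country can be checked against numbers instead of being read off the plot. A minimal sketch on toy data follows; the popularity values are invented, and only the column names come from the dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    'country': ['US', 'US', 'US', 'US', 'PK', 'PK', 'PK', 'PK'],
    'popularity': [60, 80, 90, 100, 70, 75, 80, 85],
})

# Quartiles per country, one row per country and one column per quantile
summary = toy.groupby('country')['popularity'].quantile([0.25, 0.5, 0.75]).unstack()
summary.columns = ['q1', 'median', 'q3']
summary['iqr'] = summary['q3'] - summary['q1']      # width of the box in a box plot
print(summary.sort_values('iqr', ascending=False))
```

Sorting by the interquartile range puts the countries with the widest boxes first.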
0.13 2.6 Converting ISO Codes into Country Names (Feature Engineering)
[173]: # inserting new column of countries name
df_a = {
'AE': 'United Arab Emirates',
'AR': 'Argentina',
'AT': 'Austria',
'AU': 'Australia',
'BE': 'Belgium',
'BG': 'Bulgaria',
'BO': 'Bolivia',
'BR': 'Brazil',
'BY': 'Belarus',
'CA': 'Canada',
'CH': 'Switzerland',
'CL': 'Chile',
'CO': 'Colombia',
'CR': 'Costa Rica',
'CZ': 'Czech Republic',
'DE': 'Germany',
'DK': 'Denmark',
'DO': 'Dominican Republic',
'EC': 'Ecuador',
'EE': 'Estonia',
'EG': 'Egypt',
'ES': 'Spain',
'FI': 'Finland',
'FR': 'France',
'GB': 'United Kingdom',
'GR': 'Greece',
'GT': 'Guatemala',
'HK': 'Hong Kong',
'HN': 'Honduras',
'HU': 'Hungary',
'ID': 'Indonesia',
'IE': 'Ireland',
'IL': 'Israel',
'IN': 'India',
'IS': 'Iceland',
'IT': 'Italy',
'JP': 'Japan',
'KR': 'South Korea',
'KZ': 'Kazakhstan',
'LT': 'Lithuania',
'LU': 'Luxembourg',
'LV': 'Latvia',
'MA': 'Morocco',
'MX': 'Mexico',
'MY': 'Malaysia',
'NG': 'Nigeria',
'NI': 'Nicaragua',
'NL': 'Netherlands',
'NO': 'Norway',
'NZ': 'New Zealand',
'PA': 'Panama',
'PE': 'Peru',
'PH': 'Philippines',
'PK': 'Pakistan',
'PL': 'Poland',
'PT': 'Portugal',
'PY': 'Paraguay',
'RO': 'Romania',
'SA': 'Saudi Arabia',
'SE': 'Sweden',
'SG': 'Singapore',
'SK': 'Slovakia',
'SV': 'El Salvador',
'TH': 'Thailand',
'TR': 'Turkey',
'TW': 'Taiwan',
'UA': 'Ukraine',
'US': 'United States',
'UY': 'Uruguay',
'VE': 'Venezuela',
'VN': 'Vietnam',
'ZA': 'South Africa',
'GLO': 'Global'
}
# Create the 'country_name' column by mapping the ISO codes in 'country' to country names

df['country_name'] = df['country'].map(df_a)
0.14 Converting ISO Codes into Continent Names
[174]: # Create a dictionary to map countries to continents
df_a = {
'AE': 'Asia',
'AR': 'South America',
'AT': 'Europe',
'AU': 'Australia',
'BE': 'Europe',
'BG': 'Europe',
'BO': 'South America',
'BR': 'South America',
'BY': 'Europe',
'CA': 'North America',
'CH': 'Europe',
'CL': 'South America',
'CO': 'South America',
'CR': 'North America',
'CZ': 'Europe',
'DE': 'Europe',
'DK': 'Europe',
'DO': 'North America',
'EC': 'South America',
'EE': 'Europe',
'EG': 'Africa',
'ES': 'Europe',
'FI': 'Europe',
'FR': 'Europe',
'GB': 'Europe',
'GR': 'Europe',
'GT': 'North America',
'HK': 'Asia',
'HN': 'North America',
'HU': 'Europe',
'ID': 'Asia',
'IE': 'Europe',
'IL': 'Asia',
'IN': 'Asia',
'IS': 'Europe',
'IT': 'Europe',
'JP': 'Asia',
'KR': 'Asia',
'KZ': 'Asia',
'LT': 'Europe',
'LU': 'Europe',
'LV': 'Europe',
'MA': 'Africa',
'MX': 'North America',
'MY': 'Asia',
'NG': 'Africa',
'NI': 'North America',
'NL': 'Europe',
'NO': 'Europe',
'NZ': 'Australia',
'PA': 'North America',
'PE': 'South America',
'PH': 'Asia',
'PK': 'Asia',
'PL': 'Europe',
'PT': 'Europe',
'PY': 'South America',
'RO': 'Europe',
'SA': 'Asia',
'SE': 'Europe',
'SG': 'Asia',
'SK': 'Europe',
'SV': 'North America',
'TH': 'Asia',
'TR': 'Asia',
'TW': 'Asia',
'UA': 'Europe',
'US': 'North America',
'UY': 'South America',
'VE': 'South America',
'VN': 'Asia',
'ZA': 'Africa',
'GLO': 'Global'
}
# Create the 'continent' column by mapping 'country' to continents

df['continent'] = df['country'].map(df_a)
[175]: df.sample(3)
[175]: spotify_id name artists \

25238 3PqU6gHWUaBTtIZi5oxEJ7 Um Mês E Pouco - Ao Vivo Zé Neto & Cristiano
14706 7DSAEUvxU8FajXtRloy8M0 Flowers Miley Cyrus
27210 00zk0uua6s2ifh0Nc3ppfW Nun id change Yeat
daily_rank daily_movement weekly_movement country snapshot_date \

25238 13 0 37 BR 2023-10-24
14706 44 2 -3 AE 2023-10-27
27210 35 -1 15 LV 2023-10-23
popularity is_explicit duration_ms album_name \
25238 86 False 171712 Escolhas, Vol. 2 (Ao Vivo)
14706 91 False 200600 Endless Summer Vacation
27210 82 True 211253 AftërLyfe
album_release_date danceability energy key loudness mode \

25238 2023-06-22 0.521 0.825 8 -6.025 0
14706 2023-08-18 0.706 0.691 0 -4.775 1
27210 2023-02-24 0.774 0.577 2 -12.132 1
speechiness acousticness instrumentalness liveness valence \

25238 0.1260 0.3220 0.00000 0.6890 0.692
14706 0.0633 0.0584 0.00007 0.0232 0.632
27210 0.1070 0.0370 0.62700 0.1220 0.148
tempo time_signature country_name continent

25238 144.667 4 Brazil South America
14706 118.048 4 United Arab Emirates Asia
27210 100.003 4 Latvia Europe
Observation:
With feature engineering, we now have two more columns, country_name and continent. They
will be helpful in drawing insights from the data.
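Series.map() returns NaN for any code that has no entry in the lookup dictionary, so it is worth checking for unmapped codes after creating the new columns. The sketch below uses a shortened, illustrative dictionary (code_to_name is a hypothetical name) and a made-up code 'XK' that is deliberately absent from it.

```python
import pandas as pd

# A shortened, illustrative version of the ISO-code lookup used above
code_to_name = {'BR': 'Brazil', 'AE': 'United Arab Emirates', 'GLO': 'Global'}

toy = pd.DataFrame({'country': ['BR', 'AE', 'GLO', 'XK']})
toy['country_name'] = toy['country'].map(code_to_name)

# Codes without an entry in the lookup come back as NaN
unmapped = toy.loc[toy['country_name'].isna(), 'country'].unique().tolist()
print(unmapped)
```

An empty list means every code was mapped.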
0.15 2.7 Correlation

[176]: # Identify the columns with non-numeric values
non_numeric_cols = []
for col in df.columns:
    if df[col].dtype == 'object':
        non_numeric_cols.append(col)
# Drop the non-numeric columns into a separate DataFrame, so that df keeps
# 'country', 'country_name' and 'continent' for later use

numeric_df = df.drop(non_numeric_cols, axis=1)
# Check the correlation between the numerical columns

print(numeric_df.corr())
daily_rank daily_movement weekly_movement popularity \

daily_rank 1.000000 -0.202837 -0.476818 -0.093087
daily_movement -0.202837 1.000000 0.003312 -0.217029
weekly_movement -0.476818 0.003312 1.000000 -0.068968
popularity -0.093087 -0.217029 -0.068968 1.000000
is_explicit -0.094777 -0.015045 0.015129 0.202825
duration_ms 0.047917 -0.000751 -0.019818 0.004735
danceability -0.096442 0.008072 0.038549 0.001009
energy 0.023183 0.013170 0.005626 -0.001062
key -0.003984 -0.017157 -0.005418 0.028665
loudness 0.012768 0.002956 0.000534 0.135782
mode -0.018533 0.026016 0.021759 -0.004080
speechiness -0.030605 -0.011524 -0.005243 -0.030781
acousticness -0.038884 -0.025673 -0.009280 0.063380
instrumentalness 0.035882 -0.006505 -0.017926 -0.018431
liveness 0.020771 -0.000941 -0.009572 -0.017512
valence -0.030000 -0.003513 0.017122 0.012218
tempo 0.002532 -0.008484 -0.012380 0.029008
time_signature 0.061793 -0.001543 -0.025036 -0.093753
is_explicit duration_ms danceability energy key \

daily_rank -0.094777 0.047917 -0.096442 0.023183 -0.003984
daily_movement -0.015045 -0.000751 0.008072 0.013170 -0.017157
weekly_movement 0.015129 -0.019818 0.038549 0.005626 -0.005418
popularity 0.202825 0.004735 0.001009 -0.001062 0.028665
is_explicit 1.000000 0.005871 0.341163 0.125671 -0.030689
duration_ms 0.005871 1.000000 -0.212485 -0.076485 -0.069841
danceability 0.341163 -0.212485 1.000000 0.221496 -0.003772
energy 0.125671 -0.076485 0.221496 1.000000 0.077245
key -0.030689 -0.069841 -0.003772 0.077245 1.000000
loudness 0.150566 -0.051677 0.222857 0.759102 0.033091
mode -0.071510 0.081638 -0.161835 -0.044441 -0.072021
speechiness 0.319181 -0.007336 0.229433 -0.003128 -0.026426
acousticness -0.154516 0.035808 -0.277582 -0.576558 0.016141
instrumentalness 0.002087 -0.007077 -0.063672 0.000012 0.022817
liveness -0.023425 -0.030132 -0.107498 0.100477 0.010573
valence -0.015870 -0.175182 0.358088 0.347186 0.117340
tempo -0.009463 -0.030159 -0.151640 0.101774 0.126824
time_signature -0.080238 0.117771 0.031002 0.010674 -0.036058
loudness mode speechiness acousticness \

daily_rank 0.012768 -0.018533 -0.030605 -0.038884
daily_movement 0.002956 0.026016 -0.011524 -0.025673
weekly_movement 0.000534 0.021759 -0.005243 -0.009280
popularity 0.135782 -0.004080 -0.030781 0.063380
is_explicit 0.150566 -0.071510 0.319181 -0.154516
duration_ms -0.051677 0.081638 -0.007336 0.035808
danceability 0.222857 -0.161835 0.229433 -0.277582
energy 0.759102 -0.044441 -0.003128 -0.576558
key 0.033091 -0.072021 -0.026426 0.016141
loudness 1.000000 -0.028347 -0.068384 -0.452328
mode -0.028347 1.000000 -0.050305 -0.023538
speechiness -0.068384 -0.050305 1.000000 -0.029820
acousticness -0.452328 -0.023538 -0.029820 1.000000
instrumentalness -0.118486 -0.015892 -0.023043 0.010056
liveness 0.080229 -0.031324 -0.010866 -0.063926
valence 0.308823 -0.069406 0.022732 -0.168473
tempo 0.055602 -0.054428 0.098449 -0.016943
time_signature -0.088716 0.063387 0.149629 -0.068462
instrumentalness liveness valence tempo \

daily_rank 0.035882 0.020771 -0.030000 0.002532
daily_movement -0.006505 -0.000941 -0.003513 -0.008484
weekly_movement -0.017926 -0.009572 0.017122 -0.012380
popularity -0.018431 -0.017512 0.012218 0.029008
is_explicit 0.002087 -0.023425 -0.015870 -0.009463
duration_ms -0.007077 -0.030132 -0.175182 -0.030159
danceability -0.063672 -0.107498 0.358088 -0.151640
energy 0.000012 0.100477 0.347186 0.101774
key 0.022817 0.010573 0.117340 0.126824
loudness -0.118486 0.080229 0.308823 0.055602
mode -0.015892 -0.031324 -0.069406 -0.054428
speechiness -0.023043 -0.010866 0.022732 0.098449
acousticness 0.010056 -0.063926 -0.168473 -0.016943
instrumentalness 1.000000 -0.022629 -0.120227 0.028423
liveness -0.022629 1.000000 -0.006966 0.080257
valence -0.120227 -0.006966 1.000000 0.036039
tempo 0.028423 0.080257 0.036039 1.000000
time_signature 0.019382 0.025719 -0.138847 -0.000777
time_signature
daily_rank 0.061793
daily_movement -0.001543
weekly_movement -0.025036
popularity -0.093753
is_explicit -0.080238
duration_ms 0.117771
danceability 0.031002
energy 0.010674
key -0.036058
loudness -0.088716
mode 0.063387
speechiness 0.149629
acousticness -0.068462
instrumentalness 0.019382
liveness 0.025719
valence -0.138847
tempo -0.000777
time_signature 1.000000
[177]: # Interactive map from plotly to show the correlation

# Import the necessary libraries
import plotly.graph_objs as go
# Keep only the numeric and boolean columns; df itself is left unchanged

numeric_df = df.select_dtypes(exclude='object')
# Calculate the correlation matrix

corr_matrix = numeric_df.corr()
# Create a heatmap of the correlation matrix

fig = go.Figure(data=go.Heatmap(
z=corr_matrix.values,
x=corr_matrix.columns,
y=corr_matrix.columns,
colorscale='RdBu',
zmin=-1,
zmax=1))
# Set the title and axis labels

fig.update_layout(title='Correlation Matrix of Spotify Song Features',
xaxis_title='Features',
yaxis_title='Features')
# Show the plot

fig.show()
Observation:
The plot is a heatmap of the correlation matrix, where the color of each cell represents the
strength and direction of the correlation between two features. The diagonal cells all show a
correlation of 1, since each feature is perfectly correlated with itself. The off-diagonal cells
are colored according to the strength and direction of the correlation between two different
features. The color scale is diverging, running between red and blue with a pale midpoint at
zero correlation; the color bar next to the plot shows which end is positive. We can use this
plot to identify any strong correlations between features and gain insights into the
relationships between different features in the dataset.
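The strongest relationships can also be listed directly from the correlation matrix instead of being read off the colors. The sketch below uses four features whose coefficients are copied from the correlation output above; on the full data the same steps would apply to corr_matrix.

```python
import numpy as np
import pandas as pd

# Coefficients copied from the correlation output above (four features only)
cols = ['energy', 'loudness', 'acousticness', 'valence']
corr = pd.DataFrame(
    [[1.000000, 0.759102, -0.576558, 0.347186],
     [0.759102, 1.000000, -0.452328, 0.308823],
     [-0.576558, -0.452328, 1.000000, -0.168473],
     [0.347186, 0.308823, -0.168473, 1.000000]],
    index=cols, columns=cols)

# Keep only the upper triangle so each pair appears once and the diagonal is dropped
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Order the pairs by absolute strength, keeping the sign of each coefficient
ranked = pairs.reindex(pairs.abs().sort_values(ascending=False).index)
print(ranked)
```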
0.16 2.8 Overview (what we have done so far in this chapter):

EDA is a flexible approach to exploring data using descriptive statistics and visualizations. It is
an iterative cycle where you generate questions about your data, search for answers by visualizing,
transforming, and modeling your data, and use what you learn to refine your questions and/or
generate new questions. By performing EDA, you can identify any differences in the distribution
of various song features across different countries. Data cleaning is just one application of EDA,
where you ask questions about whether your data meets your expectations or not. The summary
statistics of the dataset reveal important information such as the range of the ‘duration_ms’ column
and the mean and standard deviation of the ‘popularity’ column. An anomaly was detected in
the ‘duration_ms’ column, which was cleaned. The box plot of song popularity reveals that the
median song popularity is consistent across countries with minor variations. Feature engineering
was performed to create two more columns, ‘country_name’ and ‘continent’. A heatmap of the
correlation matrix was used to identify any strong correlations between features. The distribution
of different time signatures of songs, the valence of songs, the distribution of different keys of songs,
and the tempo of songs were also analyzed using count plots, box plots, and scatter plots. These
visualizations helped to identify any differences in the prevalence of these features across different
countries.
0.17 3 Data visualization and understanding of the data

0.18 3.1 data visualization on the basis of numerical columns
Histogram
[178]: # Select the numerical columns
num_cols = ['daily_rank', 'danceability', 'energy', 'loudness', 'speechiness',
            'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo',
            'duration_ms']
# Visualize the distribution of the numerical columns using histograms

df[num_cols].hist(bins=20, figsize=(20,15))
plt.show()
Density Plot
# Select the numerical columns
num_cols = ['daily_rank', 'danceability', 'energy', 'loudness', 'speechiness',
            'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo',
            'duration_ms']
# Create subplots for each numerical column

fig = sp.make_subplots(rows=len(num_cols), cols=1)
# Add a histogram for each numerical column to the subplots

for i, col in enumerate(num_cols):
    fig.add_trace(go.Histogram(x=df[col], nbinsx=50, name=col), row=i+1, col=1)
# Update the layout of the subplots

fig.update_layout(height=2000, width=800,
                  title_text='Distribution of Numerical Columns', showlegend=True)
# Update the x-axis and y-axis labels

fig.update_xaxes(title_text='Value')
fig.update_yaxes(title_text='Frequency')
# Make sure every trace appears in the legend
for i in range(len(num_cols)):
    fig.update_traces(row=i+1, col=1, showlegend=True)
# Show the plot

fig.show()
Observation:
The histogram of daily rank is roughly uniform, as expected, because each daily chart contains
the ranks 1 to 50.
The histogram of danceability is roughly bell-shaped with a slight skew to the left; most songs
fall between about 0.6 and 0.8.
The histogram of energy is skewed to the left, indicating that most songs have moderate to high
energy, with a tail of low-energy songs.
The histogram of loudness is roughly bell-shaped, with most songs clustered around -6 dB.
The histogram of speechiness is skewed to the right, indicating that most songs have low
speechiness.
The histogram of acousticness is skewed to the right, indicating that most songs have low
acousticness.
The histogram of instrumentalness is strongly skewed to the right, indicating that most songs
have an instrumentalness close to zero.
The histogram of liveness is skewed to the right, indicating that most songs have low liveness.
The histogram of valence is roughly symmetric around 0.5, indicating that positive and negative
moods are about equally common.
The histogram of tempo is roughly bell-shaped and centered near 120 beats per minute.
The histogram of duration_ms is skewed to the right, indicating that most songs are short
(around three minutes), with a few much longer ones.
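Skewness can be confirmed numerically with DataFrame.skew(): a positive value means a longer tail to the right (most values are low), and a negative value means a longer tail to the left (most values are high). The sketch below uses invented values; on the real data the call would be df[num_cols].skew().

```python
import pandas as pd

toy = pd.DataFrame({
    'speechiness': [0.03, 0.04, 0.05, 0.06, 0.45],   # long tail to the right
    'energy': [0.20, 0.70, 0.75, 0.80, 0.85],        # long tail to the left
})

skew = toy.skew()
print(skew)
```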
0.19 3.2 Heatmap showing relationship between the numerical columns

# Select the numerical columns
num_cols = ['daily_rank', 'danceability', 'energy', 'loudness', 'speechiness',
            'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo',
            'duration_ms']
# Create a correlation matrix for the numerical columns

corr_matrix = df[num_cols].corr()
# Create a heat map of the correlation matrix

fig = px.imshow(corr_matrix, x=num_cols, y=num_cols,
                color_continuous_scale='RdBu')
# Update the layout of the heat map

fig.update_layout(title='Heat Map of Correlation Matrix')
# Show the plot
fig.show()
Observations:
The heat map shows a strong positive correlation between 'loudness' and 'energy' (0.76), and
moderate negative correlations between 'acousticness' and 'energy' (-0.58) and between
'acousticness' and 'loudness' (-0.45). This indicates that songs with higher energy tend to be
louder, and songs with higher acousticness tend to be quieter and less energetic. Additionally,
there is a moderate positive correlation between 'valence' and 'energy' (0.35), indicating that
songs with higher energy tend to have a more positive mood.
0.20 4 Preparation by informing of the facts

Task:
Gaining insights into the distribution of song attributes and identifying any trends or patterns in
the data. These insights can be used to inform decisions related to music production, marketing,
and distribution. For example, we can use the insights gained from the analysis to identify the most
popular genres, the most successful song attributes, and the most effective marketing strategies for
promoting music.
0.21 4.1 Question and Answers:

What is the distribution of daily rank, weekly movement, and popularity of songs across different
countries?
How does the duration of songs vary across different countries?
What is the correlation between danceability and energy of songs?
How does the loudness of songs vary across different countries?
What is the distribution of explicit and non-explicit songs across different countries?
How does the acousticness of songs vary across different countries?
What is the distribution of different time signatures of songs across different countries?
How does the valence of songs vary across different countries?
What is the distribution of different keys of songs across different countries?
How does the tempo of songs vary across different countries?
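Most of these questions reduce to grouping by country and summarizing one column. As an illustration for the question on explicit songs, the sketch below computes the share of explicit entries per country on invented data; the mean of a boolean column is the fraction of True values.

```python
import pandas as pd

toy = pd.DataFrame({
    'country': ['US', 'US', 'US', 'US', 'JP', 'JP', 'JP', 'JP'],
    'is_explicit': [True, True, True, False, False, False, True, False],
})

# Fraction of chart entries flagged as explicit, per country
explicit_share = (
    toy.groupby('country')['is_explicit'].mean().sort_values(ascending=False)
)
print(explicit_share)
```

The same pattern, with 'loudness', 'tempo' or 'valence' in place of 'is_explicit', gives per-country averages for the other questions.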
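[181]: # Load the dataset again and repeat the cleaning steps from section 2.3,
# so that the Global rows carry the 'GLO' code used in the mappings below
df = pd.read_csv('./project_data.csv')
df['country'] = df['country'].fillna('GLO')
df = df[df['duration_ms'] != 0].copy()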
[182]: # inserting new column of countries name

df_a = {
'AE': 'United Arab Emirates',
'AR': 'Argentina',
'AT': 'Austria',
'AU': 'Australia',
'BE': 'Belgium',
'BG': 'Bulgaria',
'BO': 'Bolivia',
'BR': 'Brazil',
'BY': 'Belarus',
'CA': 'Canada',
'CH': 'Switzerland',
'CL': 'Chile',
'CO': 'Colombia',
'CR': 'Costa Rica',
'CZ': 'Czech Republic',
'DE': 'Germany',
'DK': 'Denmark',
'DO': 'Dominican Republic',
'EC': 'Ecuador',
'EE': 'Estonia',
'EG': 'Egypt',
'ES': 'Spain',
'FI': 'Finland',
'FR': 'France',
'GB': 'United Kingdom',
'GR': 'Greece',
'GT': 'Guatemala',
'HK': 'Hong Kong',
'HN': 'Honduras',
'HU': 'Hungary',
'ID': 'Indonesia',
'IE': 'Ireland',
'IL': 'Israel',
'IN': 'India',
'IS': 'Iceland',
'IT': 'Italy',
'JP': 'Japan',
'KR': 'South Korea',
'KZ': 'Kazakhstan',
'LT': 'Lithuania',
'LU': 'Luxembourg',
'LV': 'Latvia',
'MA': 'Morocco',
'MX': 'Mexico',
'MY': 'Malaysia',
'NG': 'Nigeria',
'NI': 'Nicaragua',
'NL': 'Netherlands',
'NO': 'Norway',
'NZ': 'New Zealand',
'PA': 'Panama',
'PE': 'Peru',
'PH': 'Philippines',
'PK': 'Pakistan',
'PL': 'Poland',
'PT': 'Portugal',
'PY': 'Paraguay',
'RO': 'Romania',
'SA': 'Saudi Arabia',
'SE': 'Sweden',
'SG': 'Singapore',
'SK': 'Slovakia',
'SV': 'El Salvador',
'TH': 'Thailand',
'TR': 'Turkey',
'TW': 'Taiwan',
'UA': 'Ukraine',
'US': 'United States',
'UY': 'Uruguay',
'VE': 'Venezuela',
'VN': 'Vietnam',
'ZA': 'South Africa',
'GLO': 'Global'
}
# Create the 'country_name' column by mapping the ISO codes in 'country' to country names

df['country_name'] = df['country'].map(df_a)
[183]: # Create a dictionary to map countries to continents

df_a = {
'AE': 'Asia',
'AR': 'South America',
'AT': 'Europe',
'AU': 'Australia',
'BE': 'Europe',
'BG': 'Europe',
'BO': 'South America',
'BR': 'South America',
'BY': 'Europe',
'CA': 'North America',
'CH': 'Europe',
'CL': 'South America',
'CO': 'South America',
'CR': 'North America',
'CZ': 'Europe',
'DE': 'Europe',
'DK': 'Europe',
'DO': 'North America',
'EC': 'South America',
'EE': 'Europe',
'EG': 'Africa',
'ES': 'Europe',
'FI': 'Europe',
'FR': 'Europe',
'GB': 'Europe',
'GR': 'Europe',
'GT': 'North America',
'HK': 'Asia',
'HN': 'North America',
'HU': 'Europe',
'ID': 'Asia',
'IE': 'Europe',
'IL': 'Asia',
'IN': 'Asia',
'IS': 'Europe',
'IT': 'Europe',
'JP': 'Asia',
'KR': 'Asia',
'KZ': 'Asia',
'LT': 'Europe',
'LU': 'Europe',
'LV': 'Europe',
'MA': 'Africa',
'MX': 'North America',
'MY': 'Asia',
'NG': 'Africa',
'NI': 'North America',
'NL': 'Europe',
'NO': 'Europe',
'NZ': 'Australia',
'PA': 'North America',
'PE': 'South America',
'PH': 'Asia',
'PK': 'Asia',
'PL': 'Europe',
'PT': 'Europe',
'PY': 'South America',
'RO': 'Europe',
'SA': 'Asia',
'SE': 'Europe',
'SG': 'Asia',
'SK': 'Europe',
'SV': 'North America',
'TH': 'Asia',
'TR': 'Asia',
'TW': 'Asia',
'UA': 'Europe',
'US': 'North America',
'UY': 'South America',
'VE': 'South America',
'VN': 'Asia',
'ZA': 'Africa',
'GLO': 'Global'
}
# Create the 'continent' column by mapping 'country' to continents

df['continent'] = df['country'].map(df_a)
[184]: df.sample(5)
[184]: spotify_id name \

18624 6rjuKpPydT2SxN15TZpV7r 500lbs
29237 2LBqCSwhJGcFQeTHMVGwy3 Die For You
27792 2KQoXxQzVL7h4rMsmP8t5L Százszorszép
4351 7x9aauaA9cu6tyfpHnqDLo Seven (feat. Latto) (Explicit Ver.)
4375 1dtdhDZ5ApMKaI6YGMDyaX ��
artists daily_rank daily_movement weekly_movement country \

18624 Lil Tecca 16 3 6 US
29237 The Weeknd 35 1 15 AE
27792 Ekhoe 40 5 10 HU
4351 Jung Kook, Latto 2 0 -1 SA
4375 �� 26 -5 10 SA
snapshot_date popularity is_explicit duration_ms \

18624 2023-10-25 81 True 144390
29237 2023-10-23 90 False 260253
27792 2023-10-23 59 True 146013
4351 2023-10-29 97 True 184400
4375 2023-10-29 45 False 235164
album_name album_release_date danceability energy key \

18624 TEC 2023-09-22 0.722 0.785 7
29237 Starboy 2016-11-24 0.586 0.525 1
27792 Sose Alszok 2023-04-07 0.650 0.465 0
4351 Seven (feat. Latto) 2023-07-14 0.802 0.832 11
4375 �� 2010-05-17 0.544 0.992 9
loudness mode speechiness acousticness instrumentalness liveness \

18624 -5.451 0 0.0763 0.0942 0.0 0.1190
29237 -7.163 0 0.0615 0.1110 0.0 0.1340
27792 -10.832 1 0.0828 0.1490 0.0 0.1250
4351 -4.107 1 0.0434 0.3110 0.0 0.0815
4375 0.856 0 0.0424 0.1960 0.0 0.2700
valence tempo time_signature country_name continent

18624 0.529 122.986 4 United States North America
29237 0.508 133.629 4 United Arab Emirates Asia
27792 0.202 142.837 4 Hungary Europe
4351 0.890 124.997 4 Saudi Arabia Asia
4375 0.968 149.957 4 Saudi Arabia Asia
Question 1: What is the distribution of daily rank, weekly movement, and popularity of songs
across different countries?
[185]: # Create a box plot for daily rank
fig, ax = plt.subplots(figsize=(18, 8))
sns.boxplot(data=df, x='country_name', y='daily_rank', ax=ax)
ax.set_title('Distribution of Daily Rank Across Different Countries')
ax.set_xlabel('Country Name')
ax.set_ylabel('Daily Rank')
# Rotate the x-axis labels so the country names stay readable
ax.tick_params(axis='x', labelrotation=90)
# Show the plot

plt.show()
Observations:
The x-axis shows the country names and the y-axis shows the daily rank, where rank 1 is the top of
the chart, so a higher median means songs sit lower in the chart. Because each country's chart lists
ranks 1 to 50 on every snapshot date, the boxes are expected to look almost identical across
countries. A box that looks visibly different would more likely point to missing or incomplete
snapshots for that country than to a difference in listening behaviour.
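[186]: # Create a box plot for weekly movement
# Start a fresh figure for this plot
fig, ax = plt.subplots(figsize=(18, 8))
sns.boxplot(data=df, x='country_name', y='weekly_movement', ax=ax)
ax.set_title('Distribution of Weekly Movement Across Different Countries')
ax.set_xlabel('Country Name')
ax.set_ylabel('Weekly Movement')
# Rotate the x-axis labels so the country names stay readable
ax.tick_params(axis='x', labelrotation=90)
# Show the plot
plt.show()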
Observations:
Distribution of Weekly Movement Across Different Countries:
The x-axis shows the country names and the y-axis shows the weekly movement, i.e. the change in
rank compared with the previous week. The distribution varies across countries: a higher median
means that songs in that country's chart shifted more, on balance, over the week, and a wider box
means that the size of those shifts is more spread out.
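[187]: # Create a box plot for popularity
# Start a fresh figure for this plot
fig, ax = plt.subplots(figsize=(18, 8))
sns.boxplot(data=df, x='country_name', y='popularity', ax=ax)
ax.set_title('Distribution of Popularity Across Different Countries')
ax.set_xlabel('Country Name')
ax.set_ylabel('Popularity')
# Rotate the x-axis labels so the country names stay readable
ax.tick_params(axis='x', labelrotation=90)
# Show the plot
plt.show()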
Observations:
Distribution of Popularity Across Different Countries:
The x-axis shows the country names and the y-axis shows the Spotify popularity score of the songs.
The distribution of popularity varies across countries. Countries with a higher median are those
whose top 50 is made up of songs that score higher on Spotify's popularity measure, while a low
median suggests a chart dominated by songs with a smaller overall audience.
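The medians that the box plot shows can also be read off as numbers, which makes it easier to rank the countries. This is a minimal sketch on a toy dataframe, assuming the same `country_name` and `popularity` column names as above; the values are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy data with invented popularity scores for two countries.
toy = pd.DataFrame({
    "country_name": ["Japan", "Japan", "Japan", "Brazil", "Brazil", "Brazil"],
    "popularity": [60, 70, 80, 85, 90, 95],
})

# Median popularity per country, highest first.
medians = (
    toy.groupby("country_name")["popularity"]
    .median()
    .sort_values(ascending=False)
)
print(medians)
```

Replacing `toy` with `df` would give the full per-country table, and the same pattern applies to any other numeric column.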
Question 2: How does the duration of songs vary across different countries?
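[188]: # Create a box plot of the 'duration_ms' column grouped by country
# Start a fresh figure for this plot
fig, ax = plt.subplots(figsize=(18, 8))
sns.boxplot(data=df, x='country_name', y='duration_ms', ax=ax)
ax.set_title('Distribution of Song Duration Across Different Countries')
ax.set_xlabel('Country Name')
ax.set_ylabel('Duration (ms)')
# Rotate the x-axis labels so the country names stay readable
ax.tick_params(axis='x', labelrotation=90)

# Show the plot

plt.show()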
Insights: The duration of songs varies across countries. Countries with a higher median duration
are those whose charting songs tend to be longer. The points plotted beyond the whiskers are
individual tracks that are unusually long or short for that country.
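Durations in milliseconds are hard to read on an axis, so converting them to minutes before plotting may help. The sketch below uses three `duration_ms` values from the sample rows shown earlier; the column name `duration_min` is an assumption introduced here.

```python
import pandas as pd

# Three duration_ms values taken from the df.sample(5) output above.
toy = pd.DataFrame({"duration_ms": [144390, 260253, 184400]})

# 1 minute = 60,000 milliseconds.
toy["duration_min"] = toy["duration_ms"] / 60_000
print(toy)
```

Plotting `duration_min` instead of `duration_ms` leaves the shape of the distribution unchanged and only rescales the axis.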
Question 3: What is the correlation between danceability and energy of songs?
[189]: # Create a box plot for danceability and energy
fig = px.box(df, x='danceability', y='energy',
             title='Correlation Between Danceability and Energy of Songs')
fig.update_layout(xaxis_title='Danceability', yaxis_title='Energy')
fig.show()
Insight: The plot suggests a positive association between danceability and energy: songs that are
more danceable tend to have higher energy. Note that with a continuous variable on the x-axis, the
box plot draws one box per distinct danceability value, so it shows the spread of energy at each
value (median, quartiles and outliers) rather than the strength of the relationship. The correlation
matrix from section 2.7, or a scatter plot, is a more direct check of how strong the relationship is.
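A correlation coefficient puts a number on the relationship. This is a minimal sketch using only the five sample rows printed earlier, so the result says nothing about the full dataset; it only shows the calls. Spearman's coefficient is computed here as the Pearson correlation of the ranks, which is a choice made for this sketch.

```python
import pandas as pd

# danceability and energy of the five rows from the df.sample(5) output above.
toy = pd.DataFrame({
    "danceability": [0.722, 0.586, 0.650, 0.802, 0.544],
    "energy": [0.785, 0.525, 0.465, 0.832, 0.992],
})

# Pearson correlation (linear relationship); Series.corr defaults to Pearson.
pearson = toy["danceability"].corr(toy["energy"])

# Spearman correlation (monotonic relationship), computed as the
# Pearson correlation of the ranked values.
spearman = toy["danceability"].rank().corr(toy["energy"].rank())

print(f"Pearson: {pearson:.3f}, Spearman: {spearman:.3f}")
```

On the real data the same two lines would be applied to `df['danceability']` and `df['energy']`.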
Question 4: How does the loudness of songs vary across different countries?
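[190]: # Create a box plot for loudness
# Start a fresh figure for this plot
fig, ax = plt.subplots(figsize=(18, 8))
sns.boxplot(data=df, x='country_name', y='loudness', ax=ax)
ax.set_title('Distribution of Song Loudness Across Different Countries')
ax.set_xlabel('Country Name')
ax.set_ylabel('Loudness')
# Rotate the x-axis labels so the country names stay readable
ax.tick_params(axis='x', labelrotation=90)
# Show the plot

plt.show()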
Insights:
The loudness of songs varies across countries. Countries with a higher median loudness (values
closer to zero) are those whose charting songs tend to be louder, and the length of each box shows
how much loudness varies within that country's chart.
Question 5: What is the distribution of explicit and non-explicit songs across different countries?
[191]: # Create a count plot for explicit and non-explicit songs
fig, ax = plt.subplots(figsize=(20, 10))
sns.countplot(data=df, x='country_name', hue='is_explicit', ax=ax)
ax.set_title('Distribution of Explicit and Non-Explicit Songs Across Different Countries')
ax.set_xlabel('Country Name')
ax.set_ylabel('Count')
# Rotate the x-axis labels so the country names stay readable
ax.tick_params(axis='x', labelrotation=90)

# Show the plot
plt.show()
Insights:
The balance of explicit and non-explicit songs varies across countries. In some countries the
explicit bar is much taller than in others, which indicates that explicit songs make up a larger
part of that country's chart. Since the plot shows raw counts, the two bars of a country are best
compared with each other; comparing bar heights between countries is only fair when the countries
have the same number of rows.
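Because raw counts depend on how many rows each country has, the share of explicit songs is often easier to compare. This is a minimal sketch on a toy dataframe with invented values, assuming the same `country_name` and `is_explicit` column names as above.

```python
import pandas as pd

# Toy data: invented rows for two countries.
toy = pd.DataFrame({
    "country_name": ["United States"] * 4 + ["Japan"] * 2,
    "is_explicit": [True, True, False, True, False, False],
})

# The mean of a boolean column is the fraction of True values,
# i.e. the share of explicit songs per country.
explicit_share = (
    toy.groupby("country_name")["is_explicit"]
    .mean()
    .sort_values(ascending=False)
)
print(explicit_share)
```

The resulting series can be passed to a bar plot to rank countries by their share of explicit songs.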
Question 6: How does the acousticness of songs vary across different countries?
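[192]: # Create a violin plot for acousticness by country
# Start a fresh figure for this plot
fig, ax = plt.subplots(figsize=(18, 8))
sns.violinplot(data=df, x='country_name', y='acousticness', ax=ax)
ax.set_title('Distribution of Acousticness Across Different Countries')
ax.set_xlabel('Country Name')
ax.set_ylabel('Acousticness')
# Rotate the x-axis labels so the country names stay readable
ax.tick_params(axis='x', labelrotation=90)

# Show the plot

plt.show()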
[193]: # Create a violin plot for acousticness by continent
fig = px.violin(df, x='continent', y='acousticness',
                title='Distribution of Acousticness Across Different Continents',
                labels={'continent': 'Continent', 'acousticness': 'Acousticness'})
# Show the plot

fig.show()
Insights:
The distribution of acousticness varies across countries in the first figure, and it also varies
when the songs are grouped by continent in the second. A violin that is wider at high acousticness
values means that a larger part of that chart consists of acoustic songs. The violin plots therefore
let us compare the prevalence of acoustic songs both between countries and between continents.
Question 7: What is the distribution of different time signatures of songs across different countries?
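[194]: # Create a count plot for time signature
# Start a fresh figure for this plot
fig, ax = plt.subplots(figsize=(18, 8))
sns.countplot(data=df, x='country_name', hue='time_signature', ax=ax)
ax.set_title('Distribution of Time Signatures Across Different Countries')
ax.set_xlabel('Country Name')
# Rotate the x-axis labels so the country names stay readable
ax.tick_params(axis='x', labelrotation=90)
# Show the plot

plt.show()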
Insight:
The distribution of time signatures varies across countries. Some countries have a higher count of
songs with a particular time signature than others, which indicates that their charts lean towards
that time signature. The count plot lets us compare how common each time signature is from one
country to the next.
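With many countries and several time signatures, a table of within-country shares can be easier to scan than grouped bars. The following is a minimal sketch on a toy dataframe with invented values, assuming the same `country_name` and `time_signature` column names as above.

```python
import pandas as pd

# Toy data: invented time signatures for two countries.
toy = pd.DataFrame({
    "country_name": ["India", "India", "India", "India", "Spain", "Spain"],
    "time_signature": [4, 4, 4, 3, 4, 4],
})

# normalize="index" divides each row by its total, so every country's
# row sums to 1 and the cells are within-country shares.
shares = pd.crosstab(toy["country_name"], toy["time_signature"], normalize="index")
print(shares)
```

The same call with `toy['key']` in place of the time signature would give the share of each key per country.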
Question 8: How does the valence of songs vary across different countries?
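[195]: # Create a box plot for valence
# Start a fresh figure for this plot
fig, ax = plt.subplots(figsize=(18, 8))
sns.boxplot(data=df, x='country_name', y='valence', ax=ax)
ax.set_title('Distribution of Valence Across Different Countries')
ax.set_xlabel('Country Name')
ax.set_ylabel('Valence')
# Rotate the x-axis labels so the country names stay readable
ax.tick_params(axis='x', labelrotation=90)

# Show the plot

plt.show()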
Insight:
Valence is Spotify's measure of how musically positive a track sounds, on a scale from 0 to 1.
The valence of songs varies across countries. Countries with a higher median valence are those
whose charting songs tend to sound more positive, and the box plot makes these differences easy
to compare.
Question 9: What is the distribution of different keys of songs across different countries?
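[196]: # Create a count plot for key
# Start a fresh figure for this plot
fig, ax = plt.subplots(figsize=(18, 8))
sns.countplot(data=df, x='country_name', hue='key', ax=ax)
ax.set_title('Distribution of Keys Across Different Countries')
ax.set_xlabel('Country Name')
# Rotate the x-axis labels so the country names stay readable
ax.tick_params(axis='x', labelrotation=90)

# Show the plot

plt.show()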
Insight:
The distribution of different keys of songs varies across different countries. Some countries have
a higher count of songs in a particular key compared to others, indicating that songs in those
countries tend to be in a certain key. By visualizing the distribution of keys using a count plot, we
can identify any differences in the prevalence of different keys across different countries.
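The `key` column stores integers, which makes the legend of the count plot hard to interpret. Spotify encodes keys in standard pitch class notation (0 = C, 1 = C#/Db, and so on), so the integers can be translated into note names. This is a minimal sketch using the five `key` values from the sample rows shown earlier; `PITCH_CLASSES` and `key_name` are names introduced here for illustration.

```python
import pandas as pd

# Pitch class notation: index 0 is C, index 11 is B.
PITCH_CLASSES = ["C", "C#/Db", "D", "D#/Eb", "E", "F",
                 "F#/Gb", "G", "G#/Ab", "A", "A#/Bb", "B"]

# The five key values from the df.sample(5) output above.
toy = pd.DataFrame({"key": [7, 1, 0, 11, 9]})

# Map each integer to its note name; values outside 0-11
# (for example -1 for "no key detected") become NaN.
toy["key_name"] = toy["key"].map(dict(enumerate(PITCH_CLASSES)))
print(toy)
```

Using `key_name` as the `hue` of the count plot would give a legend with note names instead of numbers.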
Question 10: How does the tempo of songs vary across different countries?
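[197]: # Create a box plot for tempo
# Start a fresh figure for this plot
fig, ax = plt.subplots(figsize=(18, 8))
sns.boxplot(data=df, x='country_name', y='tempo', ax=ax)
ax.set_title('Distribution of Tempo Across Different Countries')
ax.set_xlabel('Country Name')
ax.set_ylabel('Tempo')
# Rotate the x-axis labels so the country names stay readable
ax.tick_params(axis='x', labelrotation=90)

# Show the plot

plt.show()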
Insights:
The tempo of songs, measured in beats per minute, varies across countries. Countries with a higher
median tempo are those whose charting songs tend to be faster, and the box plot makes these
differences easy to compare.
0.22 Conclusions:
The distribution of daily rank, weekly movement, and popularity of songs varies across different
countries.
The duration of songs varies across different countries.
There is a positive correlation between danceability and energy of songs.
The loudness of songs varies across different countries.
The distribution of explicit and non-explicit songs varies across different countries.
The distribution of acousticness values varies across different countries.
The distribution of different time signatures of songs varies across different countries.
The valence of songs varies across different countries.
The distribution of different keys of songs varies across different countries.
The tempo of songs varies across different countries.
By performing EDA on the ‘Top Spotify Songs in 73 Countries’ dataset, we can identify differences
in the distribution of various song features across countries. This can help us understand the
preferences of listeners in different countries and inform music production and marketing
strategies.