Project Spotify Haseeb
November 4, 2023
time signature of the song.
### 1.1 Provenance
#### Source
Data was collected via the Spotify API.
#### Collection Methodology
The data is collected by querying the Spotify API every day for the top 50 songs in each country.
[153]: # Since the data can contain numerical values to be formatted with thousands
# separators and decimals, the number formats are defined here.
# nf0 is the number format with zero decimals and nf2 is the number format with
# two decimals.
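The body of this cell did not survive in the export, so the actual definitions of nf0 and nf2 are unknown. A minimal sketch, assuming they are plain Python format strings (this is an assumption, not the original cell):

```python
# Assumed definitions: the original cell body is not shown in this export.
nf0 = '{:,.0f}'  # thousands separators, zero decimals
nf2 = '{:,.2f}'  # thousands separators, two decimals

rows = 47472                 # row count reported by df.shape
mean_popularity = 77.245682  # mean popularity reported by describe()

print(nf0.format(rows))             # 47,472
print(nf2.format(mean_popularity))  # 77.25
```

Keeping the formats in two named variables means every printed number in the notebook can share the same style.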
[157]: df.columns
0.5 2.1 Counting the Data
[158]: # checking the dimensions of the dataset
print(df.shape)
(47472, 25)
Rows= 47472
Columns= 25
Size= 1186800
The dataset has 47472 rows and 25 columns, for a total size of 1186800 values.
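The three figures are linked: the size of a DataFrame is its number of rows multiplied by its number of columns (47472 × 25 = 1186800). A small sketch on a made-up frame, where df_demo and its values are illustrative stand-ins for the real data:

```python
import pandas as pd

# Toy frame standing in for the Spotify data (values are made up).
df_demo = pd.DataFrame({
    'spotify_id': ['a1', 'b2', 'c3'],
    'daily_rank': [1, 2, 3],
    'popularity': [90, 85, 70],
})

n_rows, n_cols = df_demo.shape
print(f"Rows= {n_rows}")
print(f"Columns= {n_cols}")
print(f"Size= {df_demo.size}")  # size is always rows * columns
```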
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47472 entries, 0 to 47471
Data columns (total 25 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 spotify_id 47472 non-null object
1 name 47471 non-null object
2 artists 47471 non-null object
3 daily_rank 47472 non-null int64
4 daily_movement 47472 non-null int64
5 weekly_movement 47472 non-null int64
6 country 46820 non-null object
7 snapshot_date 47472 non-null object
8 popularity 47472 non-null int64
9 is_explicit 47472 non-null bool
10 duration_ms 47472 non-null int64
11 album_name 47471 non-null object
12 album_release_date 47471 non-null object
13 danceability 47472 non-null float64
14 energy 47472 non-null float64
15 key 47472 non-null int64
16 loudness 47472 non-null float64
17 mode 47472 non-null int64
18 speechiness 47472 non-null float64
19 acousticness 47472 non-null float64
20 instrumentalness 47472 non-null float64
21 liveness 47472 non-null float64
22 valence 47472 non-null float64
23 tempo 47472 non-null float64
24 time_signature 47472 non-null int64
dtypes: bool(1), float64(9), int64(8), object(7)
memory usage: 8.7+ MB
spotify_id object
name object
artists object
daily_rank int64
daily_movement int64
weekly_movement int64
country object
snapshot_date object
popularity int64
is_explicit bool
duration_ms int64
album_name object
album_release_date object
danceability float64
energy float64
key int64
loudness float64
mode int64
speechiness float64
acousticness float64
instrumentalness float64
liveness float64
valence float64
tempo float64
time_signature int64
dtype: object
Observation:
From the output, we can see the data type of each column in the dataset. float64 is the most
common type (9 of the 25 columns), while eight columns, such as ‘key’, ‘mode’, and
‘time_signature’, are of the int64 type. The ‘is_explicit’ column is of the boolean type, and seven
columns, including ‘country’ and ‘snapshot_date’, are of the object type.
count 47472.000000 47472.000000 47472.000000 47472.000000
mean 25.504655 2.331227 12.008658 77.245682
std 14.440137 9.818369 17.135496 17.971609
min 1.000000 -46.000000 -43.000000 0.000000
25% 13.000000 -1.000000 0.000000 66.000000
50% 25.000000 0.000000 6.000000 83.000000
75% 38.000000 2.000000 25.000000 90.000000
max 50.000000 49.000000 49.000000 100.000000
From the describe() output, we can see that the minimum of duration_ms is 0 milliseconds. This is
an anomaly, because every real song must have a duration greater than zero.
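One way to locate such records is a boolean filter on the column. The sketch below runs on a made-up frame (df_demo is hypothetical), and dropping the affected rows is only one possible treatment, assumed here for illustration:

```python
import pandas as pd

# Toy data: the second record mimics the anomaly (no name, zero duration).
df_demo = pd.DataFrame({
    'name': ['Song A', None, 'Song C'],
    'duration_ms': [201000, 0, 185500],
})

# Rows whose duration is not strictly positive are suspicious.
zero_duration = df_demo[df_demo['duration_ms'] <= 0]
print(zero_duration)

# One possible treatment (an assumption): keep only rows with a real duration.
df_clean = df_demo[df_demo['duration_ms'] > 0].reset_index(drop=True)
print(df_clean)
```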
spotify_id 0
name 1
artists 1
daily_rank 0
daily_movement 0
weekly_movement 0
country 652
snapshot_date 0
popularity 0
is_explicit 0
duration_ms 0
album_name 1
album_release_date 1
danceability 0
energy 0
key 0
loudness 0
mode 0
speechiness 0
acousticness 0
instrumentalness 0
liveness 0
valence 0
tempo 0
time_signature 0
dtype: int64
[167]: df.isnull().sum()
[167]: spotify_id 0
name 0
artists 0
daily_rank 0
daily_movement 0
weekly_movement 0
country 0
snapshot_date 0
popularity 0
is_explicit 0
duration_ms 0
album_name 0
album_release_date 0
danceability 0
energy 0
key 0
loudness 0
mode 0
speechiness 0
acousticness 0
instrumentalness 0
liveness 0
valence 0
tempo 0
time_signature 0
dtype: int64
Observations:
The country column has 652 missing values, which are imputed with the fillna() method using the
placeholder ‘GLO’ (global). The name, artists, album_name and album_release_date columns each
had one missing value caused by an anomalous record, and these were cleaned as well.
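The cleaning cells themselves are not shown in this export. A minimal sketch of the two steps on a made-up frame, assuming the anomalous record was dropped (the notebook may have handled it differently):

```python
import numpy as np
import pandas as pd

# Toy data: one row without a country code, one row without track metadata.
df_demo = pd.DataFrame({
    'name': ['Song A', 'Song B', None],
    'country': ['US', np.nan, 'PK'],
})

# Rows without a country code are treated as the global chart.
df_demo['country'] = df_demo['country'].fillna('GLO')

# Assumed cleaning step: drop the record with missing track metadata.
df_demo = df_demo.dropna(subset=['name']).reset_index(drop=True)

print(df_demo.isnull().sum())
```

Re-running isnull().sum() afterwards, as the notebook does, confirms that no missing values remain.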
['GLO' 'ZA' 'VN' 'VE' 'UY' 'US' 'UA' 'TW' 'TR' 'TH' 'SV' 'SK' 'SG' 'SE'
'SA' 'RO' 'PY' 'PT' 'PL' 'PK' 'PH' 'PE' 'PA' 'NZ' 'NO' 'NL' 'NI' 'NG'
'MY' 'MX' 'MA' 'LV' 'LU' 'LT' 'KZ' 'KR' 'JP' 'IT' 'IS' 'IN' 'IL' 'IE'
'ID' 'HU' 'HN' 'HK' 'GT' 'GR' 'GB' 'FR' 'FI' 'ES' 'EG' 'EE' 'EC' 'DO'
'DK' 'DE' 'CZ' 'CR' 'CO' 'CL' 'CH' 'CA' 'BY' 'BR' 'BO' 'BG' 'BE' 'AU'
'AT' 'AR' 'AE']
country
ZA 655
AE 653
HK 653
RO 653
GR 653
DO 653
SK 653
PK 653
HN 653
KR 653
US 653
NI 653
LV 653
AT 653
IL 652
LT 652
IE 652
IS 652
GT 652
HU 652
MX 652
CR 652
CO 652
CH 652
CA 652
BY 652
BG 652
AU 652
MA 652
GLO 652
TW 652
SA 652
TH 652
SV 652
VE 652
PA 652
SG 652
EC 651
NG 651
EG 651
EE 651
DK 651
FR 651
DE 651
BR 651
CL 651
TR 651
BO 651
BE 651
UA 651
PY 651
PL 651
ID 651
VN 651
PE 651
IN 651
IT 651
NZ 651
NO 651
KZ 651
NL 651
UY 650
AR 650
MY 650
ES 650
CZ 650
FI 650
GB 650
SE 650
PT 650
PH 650
JP 650
LU 557
Name: count, dtype: int64
[171]: plt.figure(figsize=(20,10))
sns.countplot(x='country', data=df)
plt.xticks(rotation=90)
plt.show()
This plot shows the number of rows for each country. The bars are nearly level: the value counts
above range from 650 to 655 for every country except ‘LU’, which has 557 rows.
We can use similar methods to check the unique values in other columns of the dataset.
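The same three calls work on any column. A short sketch on a made-up column, where df_demo is a hypothetical stand-in:

```python
import pandas as pd

df_demo = pd.DataFrame({'country': ['US', 'PK', 'US', 'GLO', 'PK', 'US']})

print(df_demo['country'].unique())   # distinct values, in order of appearance
print(df_demo['country'].nunique())  # number of distinct values

counts = df_demo['country'].value_counts()  # rows per value, most frequent first
print(counts)
```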
[172]: # using plotly
# Create a box plot of the 'popularity' column grouped by 'country'
fig = px.box(df, x='country', y='popularity',
             title='Distribution of Song Popularity by Country')
fig.show()
Observations:
The plot generated using Plotly is a box plot that visualizes the distribution of song popularity
across the countries. The x-axis shows the country codes, while the y-axis shows the song
popularity.
For reference, the summary statistics above put the overall median popularity at 83, with the
middle half of the values between 66 and 90 and a minimum of 0, so low-popularity entries are the
ones that sit far from the bulk of the data.
Insights:
Popularity is skewed toward high values rather than symmetric: the overall mean (77.2) lies below
the median (83), which points to a tail of low-popularity entries. According to the plot, the median
song popularity is broadly consistent across countries, with minor variations, while ‘GLO’ and ‘US’
stand out with a wider range of song popularity than the other countries. This plot helps to
compare song popularity between countries and to detect outliers in the data.
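The comparison of medians and ranges between countries can also be tabulated, which makes it easier to check than reading it off the boxes. A sketch on made-up values, where df_demo is hypothetical:

```python
import pandas as pd

df_demo = pd.DataFrame({
    'country': ['US', 'US', 'US', 'PK', 'PK', 'PK'],
    'popularity': [95, 90, 40, 80, 78, 76],
})

# Median, minimum and maximum popularity per country.
summary = (
    df_demo.groupby('country')['popularity']
    .agg(['median', 'min', 'max'])
)
summary['range'] = summary['max'] - summary['min']

print(summary.sort_values('median', ascending=False))
```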
0.13 2.6 Converting ISO Codes into Country Names (Feature Engineering)
[173]: # inserting new column of countries name
df_a = {
'AE': 'United Arab Emirates',
'AR': 'Argentina',
'AT': 'Austria',
'AU': 'Australia',
'BE': 'Belgium',
'BG': 'Bulgaria',
'BO': 'Bolivia',
'BR': 'Brazil',
'BY': 'Belarus',
'CA': 'Canada',
'CH': 'Switzerland',
'CL': 'Chile',
'CO': 'Colombia',
'CR': 'Costa Rica',
'CZ': 'Czech Republic',
'DE': 'Germany',
'DK': 'Denmark',
'DO': 'Dominican Republic',
'EC': 'Ecuador',
'EE': 'Estonia',
'EG': 'Egypt',
'ES': 'Spain',
'FI': 'Finland',
'FR': 'France',
'GB': 'United Kingdom',
'GR': 'Greece',
'GT': 'Guatemala',
'HK': 'Hong Kong',
'HN': 'Honduras',
'HU': 'Hungary',
'ID': 'Indonesia',
'IE': 'Ireland',
'IL': 'Israel',
'IN': 'India',
'IS': 'Iceland',
'IT': 'Italy',
'JP': 'Japan',
'KR': 'South Korea',
'KZ': 'Kazakhstan',
'LT': 'Lithuania',
'LU': 'Luxembourg',
'LV': 'Latvia',
'MA': 'Morocco',
'MX': 'Mexico',
'MY': 'Malaysia',
'NG': 'Nigeria',
'NI': 'Nicaragua',
'NL': 'Netherlands',
'NO': 'Norway',
'NZ': 'New Zealand',
'PA': 'Panama',
'PE': 'Peru',
'PH': 'Philippines',
'PK': 'Pakistan',
'PL': 'Poland',
'PT': 'Portugal',
'PY': 'Paraguay',
'RO': 'Romania',
'SA': 'Saudi Arabia',
'SE': 'Sweden',
'SG': 'Singapore',
'SK': 'Slovakia',
'SV': 'El Salvador',
'TH': 'Thailand',
'TR': 'Turkey',
'TW': 'Taiwan',
'UA': 'Ukraine',
'US': 'United States',
'UY': 'Uruguay',
'VE': 'Venezuela',
'VN': 'Vietnam',
'ZA': 'South Africa',
'GLO': 'Global'
}
0.14 Converting ISO Codes into Continent Names
[174]: # Create a dictionary to map countries to continents
df_a = {
'AE': 'Asia',
'AR': 'South America',
'AT': 'Europe',
'AU': 'Australia',
'BE': 'Europe',
'BG': 'Europe',
'BO': 'South America',
'BR': 'South America',
'BY': 'Europe',
'CA': 'North America',
'CH': 'Europe',
'CL': 'South America',
'CO': 'South America',
'CR': 'North America',
'CZ': 'Europe',
'DE': 'Europe',
'DK': 'Europe',
'DO': 'North America',
'EC': 'South America',
'EE': 'Europe',
'EG': 'Africa',
'ES': 'Europe',
'FI': 'Europe',
'FR': 'Europe',
'GB': 'Europe',
'GR': 'Europe',
'GT': 'North America',
'HK': 'Asia',
'HN': 'North America',
'HU': 'Europe',
'ID': 'Asia',
'IE': 'Europe',
'IL': 'Asia',
'IN': 'Asia',
'IS': 'Europe',
'IT': 'Europe',
'JP': 'Asia',
'KR': 'Asia',
'KZ': 'Asia',
'LT': 'Europe',
'LU': 'Europe',
'LV': 'Europe',
'MA': 'Africa',
'MX': 'North America',
'MY': 'Asia',
'NG': 'Africa',
'NI': 'North America',
'NL': 'Europe',
'NO': 'Europe',
'NZ': 'Australia',
'PA': 'North America',
'PE': 'South America',
'PH': 'Asia',
'PK': 'Asia',
'PL': 'Europe',
'PT': 'Europe',
'PY': 'South America',
'RO': 'Europe',
'SA': 'Asia',
'SE': 'Europe',
'SG': 'Asia',
'SK': 'Europe',
'SV': 'North America',
'TH': 'Asia',
'TR': 'Asia',
'TW': 'Asia',
'UA': 'Europe',
'US': 'North America',
'UY': 'South America',
'VE': 'South America',
'VN': 'Asia',
'ZA': 'Africa',
'GLO': 'Global'
}
[175]: df.sample(3)
popularity is_explicit duration_ms album_name \
25238 86 False 171712 Escolhas, Vol. 2 (Ao Vivo)
14706 91 False 200600 Endless Summer Vacation
27210 82 True 211253 AftërLyfe
Observation:
With feature engineering, we have two more columns, country_name and continent. They will
be helpful in drawing insights from the data.
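The lines that apply the dictionaries are not visible in this export. A minimal sketch of how such a mapping is commonly done with Series.map follows; the names country_names and continents are hypothetical. Both dictionaries are called df_a in the cells above, so each must be applied before the next one overwrites it, or they need distinct names as used here.

```python
import pandas as pd

# Short excerpts of the two dictionaries defined above (hypothetical names).
country_names = {'PK': 'Pakistan', 'US': 'United States', 'GLO': 'Global'}
continents = {'PK': 'Asia', 'US': 'North America', 'GLO': 'Global'}

# 'XX' is a made-up code that is absent from both dictionaries.
df_demo = pd.DataFrame({'country': ['US', 'PK', 'GLO', 'XX']})
df_demo['country_name'] = df_demo['country'].map(country_names)
df_demo['continent'] = df_demo['country'].map(continents)

# Codes missing from the dictionary map to NaN, which makes gaps easy to spot.
unmapped = df_demo.loc[df_demo['country_name'].isna(), 'country'].tolist()
print(df_demo)
print(unmapped)
```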
key -0.003984 -0.017157 -0.005418 0.028665
loudness 0.012768 0.002956 0.000534 0.135782
mode -0.018533 0.026016 0.021759 -0.004080
speechiness -0.030605 -0.011524 -0.005243 -0.030781
acousticness -0.038884 -0.025673 -0.009280 0.063380
instrumentalness 0.035882 -0.006505 -0.017926 -0.018431
liveness 0.020771 -0.000941 -0.009572 -0.017512
valence -0.030000 -0.003513 0.017122 0.012218
tempo 0.002532 -0.008484 -0.012380 0.029008
time_signature 0.061793 -0.001543 -0.025036 -0.093753
tempo 0.055602 -0.054428 0.098449 -0.016943
time_signature -0.088716 0.063387 0.149629 -0.068462
time_signature
daily_rank 0.061793
daily_movement -0.001543
weekly_movement -0.025036
popularity -0.093753
is_explicit -0.080238
duration_ms 0.117771
danceability 0.031002
energy 0.010674
key -0.036058
loudness -0.088716
mode 0.063387
speechiness 0.149629
acousticness -0.068462
instrumentalness 0.019382
liveness 0.025719
valence -0.138847
tempo -0.000777
time_signature 1.000000
# Identify the columns with non-numeric values
non_numeric_cols = []
for col in df.columns:
    if df[col].dtype == 'object':
        non_numeric_cols.append(col)
Observation:
The plot is a heatmap of the correlation matrix, where the color of each cell represents the strength
and direction of the correlation between two features. The diagonal of the heatmap is all red,
indicating the perfect correlation of each feature with itself. The off-diagonal cells are colored
according to the strength and direction of the correlation between two different features. The color
scale ranges from blue (negative correlation) through white (no correlation) to red (positive
correlation). We can use this plot to identify strong correlations between features and to gain
insight into the relationships between the features in the dataset.
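Colors are hard to rank by eye, so it can help to list the feature pairs by the absolute value of their correlation. The sketch below uses synthetic data in which loudness is deliberately built to follow energy; df_demo and its values are illustrative and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
energy = rng.uniform(0, 1, 200)
df_demo = pd.DataFrame({
    'energy': energy,
    'loudness': -12 + 8 * energy + rng.normal(0, 0.5, 200),  # built to track energy
    'tempo': rng.uniform(60, 180, 200),                      # independent noise
    'name': ['track'] * 200,                                 # non-numeric column
})

# Correlation is only defined for numeric columns.
corr = df_demo.select_dtypes(include='number').corr()

# Keep the upper triangle so that each pair appears exactly once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Rank the pairs by absolute correlation, strongest first.
ranked = pairs.reindex(pairs.abs().sort_values(ascending=False).index)
print(ranked)
```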
of various song features across different countries. Data cleaning is just one application of EDA,
where you ask questions about whether your data meets your expectations or not. The summary
statistics of the dataset reveal important information such as the range of the ‘duration_ms’ column
and the mean and standard deviation of the ‘popularity’ column. An anomaly was detected in
the ‘duration_ms’ column, which was cleaned. The box plot of song popularity reveals that the
median song popularity is consistent across countries with minor variations. Feature engineering
was performed to create two more columns, ‘country_name’ and ‘continent’. A heatmap of the
correlation matrix was used to identify any strong correlations between features. The distribution
of different time signatures of songs, the valence of songs, the distribution of different keys of songs,
and the tempo of songs were also analyzed using count plots, box plots, and scatter plots. These
visualizations helped to identify any differences in the prevalence of these features across different
countries.
Density Plot
[179]: # Select the numerical columns
num_cols = ['daily_rank', 'danceability', 'energy', 'loudness', 'speechiness',
            'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo',
            'duration_ms']
# Show the legend entry for the trace in each subplot row
for i in range(len(num_cols)):
    fig.update_traces(row=i+1, col=1, showlegend=True)
Observation:
The histogram of daily rank is roughly flat between 1 and 50 (mean 25.5, median 25), as expected
for a top-50 chart in which each rank occurs once per country per day.
The histogram of danceability is roughly normal, indicating that most songs cluster symmetrically
around a central danceability value.
The histogram of energy is skewed to the left, indicating that most songs have high energy.
The histogram of loudness is roughly normal, indicating that most songs cluster symmetrically
around a central loudness value.
The histogram of speechiness is skewed to the right, indicating that most songs have low
speechiness.
The histogram of acousticness is skewed to the right, indicating that most songs have low
acousticness.
The histogram of instrumentalness is skewed to the right, indicating that most songs have low
instrumentalness.
The histogram of liveness is skewed to the right, indicating that most songs have low liveness.
The histogram of valence is roughly normal, indicating that most songs cluster symmetrically
around a central valence value.
The histogram of tempo is roughly normal, indicating that most songs cluster symmetrically around
a central tempo.
The histogram of duration_ms is skewed to the right, indicating that most songs are relatively
short, with a tail of longer tracks.
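The direction of a skew is easy to confuse by eye, so it can be checked numerically: in a right-skewed column the mean sits above the median and the skewness is positive, and in a left-skewed column the opposite holds. A sketch on made-up values, where df_demo is hypothetical:

```python
import pandas as pd

df_demo = pd.DataFrame({
    # Most values small with a few large ones: long right tail.
    'speechiness': [0.03, 0.04, 0.04, 0.05, 0.05, 0.06, 0.08, 0.30, 0.45],
    # Most values large with a few small ones: long left tail.
    'energy': [0.20, 0.35, 0.70, 0.72, 0.75, 0.78, 0.80, 0.82, 0.85],
})

report = pd.DataFrame({
    'mean': df_demo.mean(),
    'median': df_demo.median(),
    'skew': df_demo.skew(),  # > 0: right-skewed, < 0: left-skewed
})
print(report)
```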
# Show the plot
fig.show()
Observations:
The heatmap shows that there is a strong positive correlation between ‘loudness’ and ‘energy’, and
a strong negative correlation between ‘acousticness’ and ‘loudness’. This indicates that songs with
higher energy tend to be louder, and songs with higher acousticness tend to be quieter. Additionally,
there is a moderate positive correlation between ‘valence’ and ‘energy’, indicating that songs with
higher energy tend to have a more positive mood.
[184]: df.sample(5)
29237 0.508 133.629 4 United Arab Emirates Asia
27792 0.202 142.837 4 Hungary Europe
4351 0.890 124.997 4 Saudi Arabia Asia
4375 0.968 149.957 4 Saudi Arabia Asia
Question 1: What is the distribution of daily rank, weekly movement, and popularity of songs
across different countries?
[185]: # Create a box plot for daily rank
fig, ax = plt.subplots(figsize=(18, 8))
sns.boxplot(data=df, x='country_name', y='daily_rank', ax=ax)
ax.set_title('Distribution of Daily Rank Across Different Countries')
ax.set_xlabel('Country_Name')
ax.set_ylabel('Daily Rank')
# Rotate the x-axis labels
ax.set_xticklabels(ax.get_xticklabels(), rotation=90, ha='right')
Observations:
Every country's chart contains ranks 1 to 50 each day, so the distribution of daily rank should be
nearly identical across countries, with a median close to 25. Where a box does differ, the likely
cause is missing chart entries for that country (‘LU’, for example, has only 557 rows) rather than
a difference in listening habits. The x-axis shows the country names, and the y-axis shows the daily
rank of songs.
[186]: # Create a box plot for weekly movement
fig, ax = plt.subplots(figsize=(18, 8))
sns.boxplot(data=df, x='country_name', y='weekly_movement', ax=ax)
ax.set_title('Distribution of Weekly Movement Across Different Countries')
ax.set_xlabel('Country_Name')
ax.set_ylabel('Weekly Movement')
ax.set_xticklabels(ax.get_xticklabels(), rotation=90, ha='right')
# Show the plot
plt.show()
Observation:
Distribution of Weekly Movement Across Different Countries:
The distribution of weekly movement of songs varies across different countries. Some countries have
a higher median weekly movement compared to others, indicating that songs in those countries tend
to have a higher weekly movement. The x-axis shows the country names, and the y-axis shows the
weekly movement of songs. We can use this plot to identify any differences in the distribution of
weekly movement across different countries.
[187]: # Create a box plot for popularity
fig, ax = plt.subplots(figsize=(18, 8))
sns.boxplot(data=df, x='country_name', y='popularity', ax=ax)
ax.set_title('Distribution of Popularity Across Different Countries')
ax.set_xlabel('Country_Name')
ax.set_ylabel('Popularity')
ax.set_xticklabels(ax.get_xticklabels(), rotation=90, ha='right')
# Show the plot
plt.show()
Observations:
Distribution of Popularity Across Different Countries:
The distribution of popularity of songs varies across different countries. Some countries have a
higher median popularity compared to others, indicating that songs in those countries tend to be
more popular. The x-axis shows the country names, and the y-axis shows the popularity of songs.
We can use this plot to identify any differences in the distribution of popularity across different
countries.
Question 2: How does the duration of songs vary across different countries?
[188]: # Create a box plot of the 'duration_ms' column grouped by 'country_name'
fig, ax = plt.subplots(figsize=(18, 8))
sns.boxplot(data=df, x='country_name', y='duration_ms', ax=ax)
ax.set_title('Distribution of Song Duration Across Different Countries')
ax.set_xlabel('Country_Name')
ax.set_ylabel('Duration (ms)')
Insights: The duration of songs varies across different countries. Some countries have a higher
median song duration compared to others, indicating that songs in those countries tend to be
longer. By visualizing the distribution of song duration across different countries, we can identify
any differences in the duration of songs across different countries.
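Durations in milliseconds are hard to compare at a glance. A sketch that converts them to minutes and ranks the countries by median duration, using made-up values (df_demo is hypothetical):

```python
import pandas as pd

df_demo = pd.DataFrame({
    'country_name': ['Pakistan', 'Pakistan', 'Japan', 'Japan'],
    'duration_ms': [240000, 300000, 180000, 210000],
})

# There are 60,000 milliseconds in a minute.
df_demo['duration_min'] = df_demo['duration_ms'] / 60_000

median_duration = (
    df_demo.groupby('country_name')['duration_min']
    .median()
    .sort_values(ascending=False)
)
print(median_duration)
```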
Question 3: What is the correlation between danceability and energy of songs?
[189]: # Create a box plot for danceability and energy
fig = px.box(df, x='danceability', y='energy',
             title='Correlation Between Danceability and Energy of Songs')
fig.update_layout(xaxis_title='Danceability', yaxis_title='Energy')
fig.show()
Insight: The plot suggests a positive correlation between the danceability and energy of songs:
songs that are more danceable tend to have higher energy levels. The box plot also shows the
spread of energy values through the median, quartiles, and outliers. Because both variables are
continuous, a correlation coefficient or a scatter plot gives a more direct reading of this
relationship than a box plot does.
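The strength of such a relationship can be quantified with the Pearson coefficient. The sketch below uses synthetic data in which energy is constructed to rise with danceability, so the resulting number illustrates the method and says nothing about the real dataset. Binning danceability is shown as one way to give a box plot a categorical axis.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: energy is built to rise with danceability.
rng = np.random.default_rng(42)
danceability = rng.uniform(0.2, 0.9, 500)
energy = 0.3 + 0.5 * danceability + rng.normal(0, 0.1, 500)
df_demo = pd.DataFrame({'danceability': danceability, 'energy': energy})

r = df_demo['danceability'].corr(df_demo['energy'])  # Pearson by default
print(f"Pearson r = {r:.3f}")

# Binning danceability gives a box-plot-friendly categorical axis.
df_demo['dance_bin'] = pd.cut(df_demo['danceability'], bins=4)
medians = df_demo.groupby('dance_bin', observed=True)['energy'].median()
print(medians)
```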
Question 4: How does the loudness of songs vary across different countries?
[190]: # Create a box plot for loudness
fig, ax = plt.subplots(figsize=(18, 8))
sns.boxplot(data=df, x='country_name', y='loudness', ax=ax)
ax.set_title('Distribution of Song Loudness Across Different Countries')
ax.set_xlabel('Country_Name')
ax.set_ylabel('Loudness')
# Rotate the x-axis labels
ax.set_xticklabels(ax.get_xticklabels(), rotation=90, ha='right')
Insights:
The loudness of songs varies across different countries. Some countries have a higher median
loudness compared to others, indicating that songs in those countries tend to be louder. By visualizing
the distribution of song loudness across different countries, we can identify any differences in the
loudness of songs across different countries.
Question 5: What is the distribution of explicit and non-explicit songs across different countries?
[191]: # Create a count plot for explicit and non-explicit songs
fig, ax = plt.subplots(figsize=(20,10))
sns.countplot(data=df, x='country_name', hue='is_explicit', ax=ax)
ax.set_title('Distribution of Explicit and Non-Explicit Songs Across Different Countries')
ax.set_xlabel('Country_Name')
ax.set_ylabel('Count')
plt.show()
Insights:
The distribution of explicit and non-explicit songs varies across different countries. Some countries
have a higher number of explicit songs compared to others, indicating that songs in those countries
tend to be more explicit. By visualizing the distribution of explicit and non-explicit songs across
different countries, we can identify any differences in the prevalence of explicit songs across different
countries.
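Raw counts depend on how many rows each country has, so the share of explicit songs is a fairer basis for comparison. A sketch on made-up values, where df_demo is hypothetical:

```python
import pandas as pd

df_demo = pd.DataFrame({
    'country_name': ['Brazil', 'Brazil', 'Brazil', 'Brazil', 'Japan', 'Japan'],
    'is_explicit': [True, True, True, False, False, False],
})

# The mean of a boolean column is the fraction of True values.
explicit_share = (
    df_demo.groupby('country_name')['is_explicit']
    .mean()
    .sort_values(ascending=False)
)
print(explicit_share)
```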
Question 6: How does the acousticness of songs vary across different countries?
[192]: # Create a violin plot for acousticness
fig, ax = plt.subplots(figsize=(18, 8))
sns.violinplot(data=df, x='country_name', y='acousticness', ax=ax)
ax.set_title('Distribution of Acousticness Across Different Countries')
ax.set_xlabel('Country_Name')
ax.set_ylabel('Acousticness')
[193]: # Create a violin plot for acousticness
fig = px.violin(df, x='continent', y='acousticness',
                title='Distribution of Acousticness Across Different Countries',
Insights:
The distribution of acousticness values varies across countries in the country-level figure, and it
also varies when drawn by continent. Some countries have a higher density of acoustic songs than
others, indicating that songs in those countries tend to be more acoustic. By visualizing the
distribution of acousticness values with violin plots, we can identify differences in the prevalence
of acoustic songs across countries and continents.
Question 7: What is the distribution of different time signatures of songs across different countries?
[194]: # Create a count plot for time signature
fig, ax = plt.subplots(figsize=(18, 8))
sns.countplot(data=df, x='country_name', hue='time_signature', ax=ax)
ax.set_title('Distribution of Time Signatures Across Different Countries')
ax.set_xlabel('Country_Name')
ax.set_ylabel('Count')
ax.set_xticklabels(ax.get_xticklabels(), rotation=90, ha='right')
Insight:
The distribution of different time signatures of songs varies across different countries. Some
countries have a higher count of songs with a particular time signature compared to others, indicating
that songs in those countries tend to have a certain time signature. By visualizing the distribution
of time signatures using a count plot, we can identify any differences in the prevalence of different
time signatures across different countries.
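The same comparison can be tabulated as proportions, which removes the effect of countries having different row counts. A sketch using pd.crosstab on made-up values, where df_demo is hypothetical:

```python
import pandas as pd

df_demo = pd.DataFrame({
    'country_name': ['India', 'India', 'India', 'India', 'Spain', 'Spain'],
    'time_signature': [4, 4, 4, 3, 4, 5],
})

# normalize='index' turns each row into proportions that sum to 1.
ts_share = pd.crosstab(
    df_demo['country_name'], df_demo['time_signature'], normalize='index'
)
print(ts_share)
```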
Question 8: How does the valence of songs vary across different countries?
[195]: # Create a box plot for valence
fig, ax = plt.subplots(figsize=(18, 8))
sns.boxplot(data=df, x='country_name', y='valence', ax=ax)
ax.set_title('Distribution of Valence Across Different Countries')
ax.set_xlabel('Country_Name')
ax.set_ylabel('Valence')
Insight:
The valence of songs varies across different countries. Some countries have a higher median valence
compared to others, indicating that songs in those countries tend to be more positive. By visualizing
the distribution of valence using a box plot, we can identify any differences in the valence of songs
across different countries.
Question 9: What is the distribution of different keys of songs across different countries?
Insight:
The distribution of different keys of songs varies across different countries. Some countries have
a higher count of songs in a particular key compared to others, indicating that songs in those
countries tend to be in a certain key. By visualizing the distribution of keys using a count plot, we
can identify any differences in the prevalence of different keys across different countries.
Question 10: How does the tempo of songs vary across different countries?
[197]: # Create a box plot for tempo
fig, ax = plt.subplots(figsize=(18, 8))
sns.boxplot(data=df, x='country_name', y='tempo', ax=ax)
ax.set_title('Distribution of Tempo Across Different Countries')
ax.set_xlabel('Country_Name')
ax.set_ylabel('Tempo')
Insights:
The tempo of songs varies across different countries. Some countries have a higher median tempo
compared to others, indicating that songs in those countries tend to be faster. By visualizing the
distribution of tempo using a box plot, we can identify any differences in the tempo of songs across
different countries.
0.22 Conclusions:
The distribution of daily rank, weekly movement, and popularity of songs varies across different
countries.
The duration of songs varies across different countries.
There is a positive correlation between danceability and energy of songs.
The loudness of songs varies across different countries.
The distribution of explicit and non-explicit songs varies across different countries.
The distribution of acousticness values varies across different countries.
The distribution of different time signatures of songs varies across different countries.
The valence of songs varies across different countries.
The distribution of different keys of songs varies across different countries.
The tempo of songs varies across different countries.
By performing EDA on the ‘top spotify songs in 73 countries’ dataset, we can identify any
differences in the distribution of various song features across different countries. This can help us
understand the preferences of listeners in different countries and inform music production and
marketing strategies.