DAV Project

Uploaded by

sahabaheer860
Copyright
© © All Rights Reserved

COMPUTER SCIENCE 1

DATA ANALYSIS AND


VISUALIZATION

Deen Dayal Upadhyaya College


(University of Delhi)
Sector-3, Dwarka · New Delhi-110078

Submitted to:
Mr. Raj Kumar Sharma
Mrs. Megha Bansal
Professor

Submitted by:
Peeyush Verma
Kunal Sharma
Yash Kumar
Roll no.-22HCS4178
B.Sc. CS (H)

Spotify-2023 Analysis
Kaggle

Clean the given Data


import pandas as pd

# Load the dataset
df_spotify = pd.read_csv(r'C:\Users\preml\Desktop\3rd sem\DAV\spotify-2023.csv', encoding='ISO-8859-1')

# Check for missing values
missing_values = df_spotify.isnull().sum()

# Check for duplicate rows
duplicate_rows = df_spotify.duplicated().sum()

# Output the findings
print('Missing values in each column:\n', missing_values)
print('\nNumber of duplicate rows:', duplicate_rows)

Missing values in each column:


track_name 0
artist(s)_name 0
artist_count 0
released_year 0
released_month 0
released_day 0
in_spotify_playlists 0
in_spotify_charts 0
streams 0
in_apple_playlists 0
in_apple_charts 0
in_deezer_playlists 0
in_deezer_charts 0
in_shazam_charts 50
bpm 0
key 95

mode 0
danceability_% 0
valence_% 0
energy_% 0
acousticness_% 0
instrumentalness_% 0
liveness_% 0
speechiness_% 0
dtype: int64

Number of duplicate rows: 0

Result:

There are no duplicate rows in the dataset, but two columns contain missing values:
in_shazam_charts (50) and key (95). All other columns are complete. These two columns
need to be handled before any analysis that relies on them.
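One possible way to handle the two columns that the output above reports as incomplete is sketched here. The small `sample` frame is a hypothetical stand-in for the real file, the 'Unknown' label is an arbitrary choice, and the thousands separator in `in_shazam_charts` is an assumption about how that column may be stored as text.

```python
import pandas as pd

# Hypothetical miniature stand-in for the Spotify data (not real rows)
sample = pd.DataFrame({
    'track_name': ['A', 'B', 'C', 'D'],
    'key': ['C#', None, 'F', None],
    'in_shazam_charts': ['10', None, '1,021', '0'],
})

cleaned = sample.copy()

# Label missing musical keys explicitly instead of dropping the rows
cleaned['key'] = cleaned['key'].fillna('Unknown')

# Assumption: values may be text with thousands separators.
# Remove the commas, convert to numbers, and keep missing entries as NaN.
cleaned['in_shazam_charts'] = pd.to_numeric(
    cleaned['in_shazam_charts'].str.replace(',', '', regex=False),
    errors='coerce',
)

print(cleaned.isnull().sum())
```

Whether to fill, drop, or keep the missing chart positions depends on the analysis; leaving them as NaN keeps pandas aggregations from treating them as zeros.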

1.0.1 Visualization for artist name and their number of tracks

[13]: import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the dataset
spotify_data = pd.read_csv(r'C:\Users\preml\Desktop\3rd sem\DAV\spotify-2023.csv', encoding='ISO-8859-1')

# Count unique tracks per artist entry and keep the 10 largest
artist_tracks = spotify_data.groupby('artist(s)_name')['track_name'].nunique().sort_values(ascending=False).head(10)

# Visualization for artist name and their number of tracks
plt.figure(figsize=(14, 7))
sns.barplot(x=artist_tracks.values, y=artist_tracks.index, hue=artist_tracks.index, palette='muted')
plt.title('Top 10 Artists with the Most Tracks')
plt.xlabel('Number of Tracks')
plt.ylabel('Artist(s) Name')
plt.show()


Result:

The visualization highlights the top 10 artists with the most unique tracks in the dataset. These
artists demonstrate prolific output, with each having a significant number of tracks to their name.
While the number of tracks is indicative of an artist's productivity, it's essential to consider other
factors, such as streaming numbers or listener engagement, to gauge an artist's overall impact
and popularity. Nonetheless, the data underscores the diversity and richness of the music
landscape, showcasing artists who have made substantial contributions in terms of content
creation.
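The grouping above treats each distinct value of `artist(s)_name` as one artist, so a collaboration string counts as its own entry. A minimal sketch of counting tracks per individual artist follows, assuming collaborators are separated by commas in that column; the `sample` rows and artist names are hypothetical.

```python
import pandas as pd

# Hypothetical rows; 'Song 2' is a collaboration between two artists
sample = pd.DataFrame({
    'track_name': ['Song 1', 'Song 2', 'Song 3', 'Song 4'],
    'artist(s)_name': ['Artist A', 'Artist A, Artist B', 'Artist B', 'Artist A'],
})

# Split the collaboration string into a list, then give each artist its own row
artists = sample.assign(artist=sample['artist(s)_name'].str.split(',')).explode('artist')
artists['artist'] = artists['artist'].str.strip()

# Unique tracks credited to each individual artist
tracks_per_artist = (
    artists.groupby('artist')['track_name']
    .nunique()
    .sort_values(ascending=False)
)
print(tracks_per_artist)
```

With this approach a collaboration is credited to every listed artist, which may or may not be the desired definition of "number of tracks".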

1.0.2 The histogram for the distribution of ‘danceability’

[14]: # The histogram for the distribution of 'danceability'
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the dataset
df_spotify = pd.read_csv(r'C:\Users\preml\Desktop\3rd sem\DAV\spotify-2023.csv', encoding='ISO-8859-1')

# Create a histogram for the distribution of 'danceability_%' across the tracks
plt.figure(figsize=(12, 6))
sns.histplot(df_spotify['danceability_%'], bins=30, kde=True, color='skyblue')
plt.title('Distribution of Danceability Percentage')
plt.xlabel('Danceability (%)')
plt.ylabel('Frequency')
plt.show()


Result

The histogram shows how danceability_% is distributed across the tracks, using 30 bins with a
kernel density estimate overlaid. It gives a quantitative overview of how danceable the music
in the dataset tends to be. This analysis can serve as a foundation for further exploration into the
relationship between danceability and other musical attributes or listener preferences.
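A histogram can be complemented with a numeric summary of the same column. The sketch below uses a hypothetical `sample` series and arbitrary bin edges; it is meant only to show the kind of figures (quartiles, counts per range) that back up a reading of the plot.

```python
import pandas as pd

# Hypothetical danceability values (percent), not taken from the dataset
sample = pd.Series([35, 50, 62, 67, 70, 74, 80, 91], name='danceability_%')

# Descriptive statistics: count, mean, std, min, quartiles, max
summary = sample.describe()

# Count how many tracks fall into each range; the bin edges are an arbitrary choice
bins = [0, 50, 75, 100]
counts = pd.cut(sample, bins=bins).value_counts().sort_index()

print(summary)
print(counts)
```

Each interval is closed on the right by default, so a value of exactly 50 falls into the first range.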

1.0.3 Distribution of energy percentages for the tracks

[16]: import pandas as pd
import matplotlib.pyplot as plt

# Load the data
spotify_data = pd.read_csv(r'C:\Users\preml\Desktop\3rd sem\DAV\spotify-2023.csv', encoding='ISO-8859-1')

# Plotting the distribution of the 'energy_%' column
plt.figure(figsize=(10, 6))
plt.hist(spotify_data['energy_%'], bins=30, color='lightgreen', alpha=0.7)
plt.title('Distribution of Energy')
plt.xlabel('Energy (%)')
plt.ylabel('Frequency')
plt.show()


Result

The histogram provides a quantitative overview of the energy levels present in the music tracks.
This analysis can offer insights into the overall vibe or intensity of the music dataset, aiding in
further exploration or comparison with other musical attributes.

[5]: import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset
df_spotify = pd.read_csv(r'C:\Users\preml\Desktop\3rd sem\DAV\spotify-2023.csv', encoding='ISO-8859-1')

# Compute the average danceability per release year
avg_danceability_per_year = df_spotify.groupby('released_year')['danceability_%'].mean()

# Visualize the average danceability per year
plt.figure(figsize=(10, 6))
avg_danceability_per_year.plot(kind='line', marker='o', color='orange')
plt.title('Average Danceability per Year')
plt.xlabel('Year')
plt.ylabel('Average Danceability')
plt.xticks(rotation=45)
plt.grid(True)
plt.show()


Result

In summary, the visualization offers a chronological perspective on how average danceability has
evolved over the years. This analysis can be instrumental for music analysts, researchers, or
enthusiasts aiming to understand temporal patterns in musical attributes and their potential
correlations with broader cultural or industry shifts.
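A yearly average is only as reliable as the number of tracks behind it. The following sketch computes the mean together with the track count per year and keeps only years above a threshold; the `sample` rows and the threshold of 2 are hypothetical and chosen purely for illustration.

```python
import pandas as pd

# Hypothetical rows, not taken from the dataset
sample = pd.DataFrame({
    'released_year': [2021, 2021, 2022, 2022, 2022, 2023],
    'danceability_%': [60, 80, 50, 70, 90, 65],
})

# Mean danceability and number of tracks for each release year
per_year = sample.groupby('released_year')['danceability_%'].agg(['mean', 'count'])

# Keep only years with at least 2 tracks (arbitrary illustrative threshold)
reliable = per_year[per_year['count'] >= 2]

print(per_year)
print(reliable)
```

Plotting `reliable['mean']` instead of the unfiltered averages avoids points on the line chart that rest on a single track.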

The line chart below shows the total streams per release year, giving a visual
representation of how streaming volumes vary across years.
[6]: import pandas as pd
import matplotlib.pyplot as plt

# Convert 'streams' to numeric; values that cannot be parsed become NaN
df_spotify['streams'] = pd.to_numeric(df_spotify['streams'], errors='coerce')

# Group by 'released_year' and sum the streams
streams_per_year = df_spotify.groupby('released_year')['streams'].sum()

# Plot the total streams per year
plt.figure(figsize=(14, 7))
streams_per_year.plot(kind='line', marker='o', color='purple')
plt.title('Total Streams per Year')
plt.xlabel('Year')
plt.ylabel('Total Streams')
plt.xticks(rotation=45)
plt.grid(True)
plt.show()


Result

The visualization offers a comprehensive overview of the total streaming landscape over the
years, reflecting both general trends and specific anomalies. Such insights can be invaluable for
stakeholders in the music industry, helping them understand consumption patterns and make
informed decisions related to content promotion, artist collaborations, and platform strategies.
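The conversion with `errors='coerce'` in the cell above silently turns unparseable values into NaN, and the grouped sum then skips them. A small sketch with hypothetical rows shows that behaviour and how to count the affected rows before trusting the totals.

```python
import pandas as pd

# Hypothetical rows; one 'streams' entry is deliberately not a number
sample = pd.DataFrame({
    'released_year': [2022, 2022, 2023, 2023],
    'streams': ['100', '250', 'not a number', '400'],
})

# Unparseable strings become NaN instead of raising an error
sample['streams'] = pd.to_numeric(sample['streams'], errors='coerce')

# Count how many rows were lost to coercion
n_invalid = sample['streams'].isna().sum()

# sum() skips NaN by default, so the invalid row contributes nothing
streams_per_year = sample.groupby('released_year')['streams'].sum()

print('Rows with invalid stream counts:', n_invalid)
print(streams_per_year)
```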

1.0.4 Distribution of music releases over the years included in the dataset

[15]: # Plotting the distribution of the 'released_year' column
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset
spotify_data = pd.read_csv(r'C:\Users\preml\Desktop\3rd sem\DAV\spotify-2023.csv', encoding='ISO-8859-1')

# Count tracks per release year and plot them in chronological order
plt.figure(figsize=(10, 6))
spotify_data['released_year'].value_counts().sort_index().plot(kind='bar', color='skyblue')
plt.title('Number of Tracks Released Each Year')
plt.xlabel('Year')
plt.ylabel('Count')
plt.show()


Result

The bar chart shows how many tracks in the dataset were released in each year. It indicates
which release years are most heavily represented, which is useful context when interpreting
the per-year averages and totals in the other sections.

[7]: import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

# Set up the design of the plots
sns.set(style='whitegrid')

# Plotting the relationship between danceability, valence, and energy
# (spotify_data is the DataFrame loaded in the earlier cells)
plt.figure(figsize=(10, 6))
sns.scatterplot(data=spotify_data,
                x='danceability_%', y='valence_%',
                size='energy_%', hue='energy_%',
                palette='coolwarm', sizes=(20, 200))
plt.title('Relationship between Danceability, Valence, and Energy')
plt.xlabel('Danceability (%)')
plt.ylabel('Valence (%)')
plt.legend(title='Energy (%)', loc='upper right')
plt.grid(True)
plt.show()

Result:

This scatter plot helps to understand how these three attributes correlate with each other across different
songs in the dataset. The size and color of the points represent the energy level, providing a
multi-dimensional view of the music characteristics.
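The visual impression from a scatter plot can be checked numerically with a correlation matrix. This is a minimal sketch on hypothetical values, using Pearson correlation as one reasonable choice; it does not report results for the actual dataset.

```python
import pandas as pd

# Hypothetical values for the three attributes (percent), not real tracks
sample = pd.DataFrame({
    'danceability_%': [40, 55, 60, 75, 90],
    'valence_%': [30, 50, 45, 70, 85],
    'energy_%': [80, 60, 65, 50, 55],
})

# Pairwise Pearson correlation coefficients, each between -1 and 1
corr = sample.corr(method='pearson')

print(corr.round(2))
```

Values near 1 or -1 indicate a strong linear relationship between two attributes, while values near 0 indicate little linear relationship.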
