Lab Numpy Pandas Matplot

The document outlines a lab practice for a programming course focused on using Python libraries NumPy, Pandas, and Matplotlib for data analysis. It includes submission instructions, an introduction to a dataset of top Spotify tracks from 2000-2019, and eight tasks that guide students through importing data, performing statistical analysis, and visualizing correlations. Students are required to follow academic honesty guidelines and submit their work in a specified format by the deadline.

Uploaded by

Yến Lê

Lab Practice: NumPy, Pandas, and Matplotlib
BANA3020 Introduction to Programming with Python Fall 2024

Lab Practice Submission Instructions:


• This is an individual lab practice and will typically be assigned in the laboratory (computer lab). You can
use your personal computer but all quizzes and practical exams will be performed with a lab computer.
• Your program should work correctly on all inputs. If there are any specifications about how the program
should be written (or how the output should appear), those specifications should be followed.
• Your code and functions/modules should be appropriately commented. However, try to avoid making
your code overly busy (e.g., by commenting every line).
• Variables and functions should have meaningful names, and code should be organized into
functions/methods where appropriate.
• Academic honesty is required in all work you submit to be graded. You should NOT copy or share your
code with other students to avoid plagiarism issues.
• Use the template provided to prepare your solutions.
• You should upload your .py file(s) to Canvas by the deadline.
• Submit a separate .py file for each Lab problem with the following naming format: Lab12_Q1.py. Note:
If you are working in Jupyter Notebook, you need to download/convert it to a Python .py file for
submission.
• Late submission of lab practice without an approved extension will incur penalties.



Introduction to Exploring a Data Set with Python
In the lecture you were introduced to NumPy, Pandas, and Matplotlib. These make up the essential
toolkit for data analysis in Python. In this lab, you will learn how to use these tools to work with a
data set. This lab contains 8 small tasks that aim to give you a tutorial on how to use these powerful Python
libraries.
We will be working with the Top Hits Spotify from 2000-2019 data set from Kaggle. The CSV file for
the data set is provided on Canvas. The description of the data set provided on the site is as follows:

Context:
This dataset contains audio statistics of the top 2000 tracks on Spotify from 2000-2019. The data
contains about 18 columns, each describing the track and its qualities.

Columns that we will use:


• song: Name of the Track.
• duration_ms: Duration of the track in milliseconds.
• year: Release Year of the track.
• popularity: The higher the value the more popular the song is.
• danceability: Danceability describes how suitable a track is for dancing.
• energy: Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity
and activity.
• loudness: The overall loudness of a track in decibels (dB). Loudness values are averaged across
the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality
of a sound that is the primary psychological correlate of physical strength (amplitude).
• acousticness: A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0
represents high confidence the track is acoustic.
• tempo: The overall estimated tempo of a track in beats per minute (BPM). In musical
terminology, tempo is the speed or pace of a given piece and derives directly from the average beat
duration.



Installing Pandas
Import the pandas library using the following code:

import pandas as pd

If your Anaconda environment doesn’t have Pandas installed, please follow this guide: Installing Pandas for
Anaconda.

Task 1 - Import the data set


In this task, you need to download the songs_normalize.csv file from Canvas, place it in the same directory
(folder) as your Python lab file, and import it into Python with Pandas. If you want to see what the data set
looks like, you can open it in Microsoft Excel or Google Sheets.

To read a .csv (comma-separated values) file with Pandas, use the pd.read_csv(path) function. This
function takes a string representing the file path and opens the file as a Pandas DataFrame object.

The file path already given in the template notebook is ./songs_normalize.csv, which means the file is read
from the current directory. The dot (.) represents the current directory.

You should store the imported data frame in a variable called dataset or df.
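
The import step can be sketched as follows. So that the sketch runs without the Canvas file, it first writes a tiny stand-in CSV to a temporary folder; the file name toy_songs.csv and its three rows are made up for illustration. In the lab you would pass ./songs_normalize.csv to pd.read_csv instead.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for songs_normalize.csv: three made-up rows.
csv_text = (
    "song,duration_ms,year\n"
    "Song A,210000,2001\n"
    "Song B,185000,2010\n"
    "Song C,240000,2019\n"
)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "toy_songs.csv")
    with open(csv_path, "w", encoding="utf-8") as f:
        f.write(csv_text)

    # read_csv takes the file path as a string and returns a DataFrame.
    df = pd.read_csv(csv_path)

print(df.shape)  # (number of rows, number of columns)
```

The first line of the CSV becomes the column names, and each following line becomes one row of the data frame.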

Task 2 - Preview the data frame


Let’s see what the data set looks like when imported into Python! We can quickly preview the first 5 rows of
the data set along with the header (column names) using just 1 line of code.

Use the dataframe.head() method provided by Pandas to do this.

Task 3 - Descriptive statistics


Usually when investigating a new data set, we would like to quickly look at the basic descriptive statistics of
each variable (column) in our data set.

Pandas has a convenient built-in method for a data frame to do this called dataframe.describe(). Call
this method to see the output.
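
A short sketch of the preview and summary calls from Tasks 2 and 3, run on a small made-up data frame rather than the real data set (the song names and numbers below are invented for illustration):

```python
import pandas as pd

# Made-up rows standing in for the real data set.
df = pd.DataFrame({
    "song": ["Song A", "Song B", "Song C", "Song D", "Song E", "Song F"],
    "duration_ms": [210000, 185000, 240000, 199000, 230000, 205000],
    "popularity": [70, 55, 81, 64, 77, 59],
})

preview = df.head()      # first 5 rows together with the column names
summary = df.describe()  # count, mean, std, min, quartiles and max

print(preview)
print(summary)
```

With mixed column types, describe() by default summarises only the numeric columns, so the text column song does not appear in the summary.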

Task 4 - Milliseconds to seconds


From Task 1 we could see that the duration is currently stored in milliseconds, which is a bit cumbersome
for us to read. Create a new column to store the duration in seconds instead.

To access and retrieve a single column in a Pandas data frame as an array, you can use syntax similar to
a Python dictionary: dataframe['column_name'].



Use NumPy to calculate the new duration in seconds and in minutes. To perform element-wise operations on
arrays with NumPy, simply use the array as a term in your mathematical expression. E.g. array / 5 will
divide each element in the array by 5.

Store these 2 new arrays as 2 new columns in your dataframe by using dictionary-like syntax, with the
new column names duration_sec and duration_min for seconds and minutes, respectively. Adding a new
column is as simple as: dataframe['new_column'] = my_array.
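
The conversion described above might look like this on a toy data frame (the three rows are made up; only the column names duration_ms, duration_sec and duration_min come from the lab text):

```python
import pandas as pd

# Made-up rows standing in for the real data set.
df = pd.DataFrame({
    "song": ["Song A", "Song B", "Song C"],
    "duration_ms": [210000, 185000, 240000],
})

# Dictionary-like access returns the column; to_numpy() gives a NumPy array.
duration_ms = df["duration_ms"].to_numpy()

# Element-wise arithmetic: every entry of the array is divided by the scalar.
duration_sec = duration_ms / 1000
duration_min = duration_sec / 60

# Assigning to a new key adds a new column.
df["duration_sec"] = duration_sec
df["duration_min"] = duration_min

print(df)
```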

Task 5 - Duration statistics


We also want to see some basic statistics about the duration, such as its mean (average), longest duration
(max) and shortest duration (min). Use np.mean(array), np.max(array), np.min(array) to get these
values and print them.

Next you should find what is the range of our song durations (difference between the longest and short-
est duration).

Finally, find the percentage of songs that have a duration longer than the average value. To do this you
will need to use the np.where function.
indices = np.where(condition)
This function returns a tuple, containing arrays of indices of elements in your array that satisfy the condition
given. In a 2-D array, the tuple will have 2 arrays corresponding to the row indices and column indices. In a
1-D array the tuple will contain only 1 array. For example:
negatives = np.where(a < 0)
This returns negatives = ([...],), which is a tuple containing a single array that holds the indices of the
elements in a that are negative. To access this array use negatives[0].
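
To make the shape of the np.where result concrete, here is a small sketch on made-up arrays (the names a and grid and their values are chosen only for illustration):

```python
import numpy as np

# 1-D case: the result is a tuple holding a single index array.
a = np.array([3, -1, 4, -5, 9])
negatives = np.where(a < 0)
print(negatives)     # tuple with one array of indices
print(negatives[0])  # the index array itself

# 2-D case: the tuple holds one array of row indices and one of column indices.
grid = np.array([[1, -2],
                 [-3, 4]])
rows, cols = np.where(grid < 0)
print(rows, cols)
```

Here the negative entries of a sit at positions 1 and 3, and the negative entries of grid sit at (row 0, column 1) and (row 1, column 0).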

You should use where to find the indices of durations greater than the average. The percentage of songs that
have duration greater than average is the length of this array divided by the length of duration_sec times 100.

Next, find the names of the songs that have durations over the average. First you need to convert your
column to a NumPy array using the column.to_numpy() method. Then you can pass the result of where()
directly as an index to a NumPy array to retrieve the elements in the array that satisfy the condition. For
example, if you use:
a[np.where(a < 0)]
it will return an array of the numbers in a that are smaller than zero.
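
Putting the steps of this task together, a minimal sketch on a made-up data frame could look like the following. The four songs and their durations are invented, and duration_sec is assumed to already exist as a column.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the real data set.
df = pd.DataFrame({
    "song": ["Song A", "Song B", "Song C", "Song D"],
    "duration_sec": [210.0, 180.0, 240.0, 190.0],
})

duration_sec = df["duration_sec"].to_numpy()
songs = df["song"].to_numpy()

# Basic statistics.
mean_duration = np.mean(duration_sec)
longest = np.max(duration_sec)
shortest = np.min(duration_sec)
duration_range = longest - shortest

# Indices of durations above the mean, and their share of all songs.
above_avg = np.where(duration_sec > mean_duration)
percent_above = len(above_avg[0]) / len(duration_sec) * 100

# Index the song names with the same np.where result.
long_songs = songs[above_avg]

print(mean_duration, longest, shortest, duration_range)
print(percent_above, long_songs)
```

Because songs and duration_sec come from the same rows, an index found in one array points at the matching entry in the other.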

Task 6 - Pearson correlation


NumPy also provides other useful statistical tools, such as computing the correlation between two
variables. The Pearson correlation coefficient measures the linear association between variables. In NumPy,
the np.corrcoef(x, y) function gives a Pearson correlation matrix:
\[
\begin{bmatrix}
\mathrm{corr}(x,x) & \mathrm{corr}(x,y) \\
\mathrm{corr}(y,x) & \mathrm{corr}(y,y)
\end{bmatrix}
\]
We want the correlation between the variables x and y, so we choose either the [0,1] element or the
[1,0] element (corr(x,y) and corr(y,x) are the same). The correlation lies in the range [-1,1], with 0
meaning no linear correlation, 1 meaning perfect positive correlation and -1 meaning perfect negative
correlation.

Refer to the list of columns in the introduction for an explanation of what each variable means. In our case,
let's see what the correlation is between some pairs of variables. In this task you need to write the code to
calculate:

• Correlation between energy and tempo and print it.

• Correlation between energy and loudness and print it.

• Correlation between energy and acousticness and print it.
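
Before applying np.corrcoef to the real columns, it can help to try it on toy arrays where the answer is known. The arrays x, y and z below are made up: y rises linearly with x and z falls linearly with x.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + 1.0  # increases linearly with x
z = -x             # decreases linearly with x

# corrcoef returns the full 2x2 matrix; the off-diagonal entry is corr(x, y).
corr_matrix = np.corrcoef(x, y)
corr_xy = corr_matrix[0, 1]
corr_xz = np.corrcoef(x, z)[0, 1]

print(corr_matrix)
print(corr_xy, corr_xz)
```

Up to floating-point rounding, corr_xy is 1 and corr_xz is -1. For the lab, x and y would be two columns of the data frame, for example energy and tempo.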

Task 7 - Finding unexpected entries


Our data set is titled Top Hits Spotify 2000-2019, so we would expect the songs included to be within this
time range. However, there are songs in this data set that are outside of this range. Retrieve the list of names
for these songs. Use np.where to find and print the names of songs whose year value is less than 2000 and
of songs whose year value is greater than 2019.
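
The same np.where indexing pattern used for durations applies here. A sketch on made-up rows (the song names and years are invented):

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the real data set.
df = pd.DataFrame({
    "song": ["Song A", "Song B", "Song C", "Song D"],
    "year": [1999, 2005, 2019, 2020],
})

years = df["year"].to_numpy()
songs = df["song"].to_numpy()

# Names of songs released before 2000 and after 2019.
before_2000 = songs[np.where(years < 2000)]
after_2019 = songs[np.where(years > 2019)]

print(before_2000)
print(after_2019)
```

Note that the comparisons are strict, so a song from 2000 or 2019 counts as inside the expected range.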

Task 8 - Plotting correlations


Use matplotlib to visualize the correlations between our variables using scatter plots. You can use
plt.scatter(xdata, ydata). You will need to write the code to plot:

• energy vs. loudness.

• energy vs. acousticness.

• energy vs. danceability.

• tempo vs. popularity.

• speechiness vs. popularity.

To do this you just need to create an array xdata = your column for x and ydata = your column for
y, and provide them as arguments for plt.scatter. Optionally, if you want to make your plot look a bit
nicer you can look into using cmap to set a colourmap, and c to assign a dimension to map the colours. For
example you can use:
plt.scatter(xdata,ydata,cmap="plasma",c=xdata)
This will assign the x dimension to the plasma colour map. Refer to this page for a list of available colourmaps:
Matplotlib Colormap.
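
A minimal scatter-plot sketch follows. It uses synthetic data as a stand-in for the energy and loudness columns (the numbers are randomly generated, not taken from the data set), and it calls ax.scatter on an explicit axes object, which accepts the same arguments as plt.scatter. The Agg backend line is only there so the sketch runs without a display; in a notebook you would drop it and call plt.show().

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend; remove when plotting interactively
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-ins for two columns of the data frame.
rng = np.random.default_rng(0)
xdata = rng.uniform(0.0, 1.0, size=50)                  # plays the role of energy
ydata = -20.0 + 15.0 * xdata + rng.normal(0.0, 1.0, 50)  # plays the role of loudness

fig, ax = plt.subplots()
# c=xdata colours each point by its x value using the plasma colourmap.
points = ax.scatter(xdata, ydata, c=xdata, cmap="plasma")
ax.set_xlabel("energy")
ax.set_ylabel("loudness")
ax.set_title("energy vs. loudness (synthetic data)")
fig.colorbar(points, ax=ax, label="energy")
```

Labelling the axes and adding a colour bar is optional, but it makes each of the five plots easier to tell apart.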
