Lab Numpy Pandas Matplot
Matplotlib
BANA3020 Introduction to Programming with Python Fall 2024
Context:
This dataset contains audio statistics of the top 2000 tracks on Spotify from 2000-2019. The data
contains about 18 columns, each describing the track and its qualities.
import pandas as pd
If your Anaconda environment doesn’t have Pandas installed, please follow this guide: Installing Pandas for
Anaconda.
To read a .csv (comma-separated values) file with Pandas, use the pd.read_csv(path) function. This
function takes a string representing the file path and returns the file's contents as a Pandas DataFrame object.
The file path already given in the template notebook is ./songs_normalize.csv, which means the file is read
from the current directory. The dot . symbol represents the current directory.
You should store the imported data frame in a variable called dataset or df.
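As a rough illustration of the loading step, the sketch below reads CSV text into a data frame. So that it runs without the real file, it uses a small in-memory stand-in with two invented rows; the column names artist, song and duration_ms are assumptions, not taken from the lab's dataset description.

```python
import io
import pandas as pd

# Stand-in for ./songs_normalize.csv: two invented rows, used only so this
# example runs without the real file. The column names are assumptions.
csv_text = (
    "artist,song,duration_ms\n"
    "Artist A,Song A,200000\n"
    "Artist B,Song B,250000\n"
)

# With the real file you would write:
#   df = pd.read_csv("./songs_normalize.csv")
df = pd.read_csv(io.StringIO(csv_text))

print(type(df))   # the object is a pandas DataFrame
print(df.shape)   # (number of rows, number of columns)
print(df.head())  # first few rows, useful for a quick sanity check
```

In the notebook, only the commented-out pd.read_csv line is needed; the in-memory text exists purely for demonstration.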
Pandas data frames have a convenient built-in method to do this, called dataframe.describe(). Call
this method to see the output.
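A minimal sketch of calling describe() is shown below on a tiny made-up data frame (the values and column names are invented for illustration). By default the method summarises only the numeric columns, so the text column does not appear in the result.

```python
import pandas as pd

# Invented sample data; the real lab uses the imported Spotify data frame.
df = pd.DataFrame({
    "song": ["Song A", "Song B", "Song C"],
    "duration_ms": [200000, 250000, 180000],
    "year": [2001, 2010, 2019],
})

# describe() returns a new data frame with one row per statistic
# (count, mean, std, min, quartiles, max) and one column per numeric column.
summary = df.describe()
print(summary)

# Individual statistics can be looked up by row and column label.
print(summary.loc["mean", "duration_ms"])
```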
To access and retrieve a single column in a Pandas data frame as an array you can use syntax similar to
a Python dictionary: dataframe['column_name']. The result is a Pandas Series, which behaves much like an array.
Store these 2 new arrays as 2 new columns in your data frame by using dictionary-like syntax, with the
new column names duration_sec and duration_min for seconds and minutes, respectively. Adding a new
column is as simple as: dataframe['new_column'] = my_array.
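The column access and assignment described above might look like the following sketch. It assumes the durations start out in milliseconds in a column named duration_ms; that name and the sample values are assumptions made for illustration.

```python
import pandas as pd

# Invented sample data; duration_ms is an assumed column name.
df = pd.DataFrame({"duration_ms": [200000, 250000, 180000]})

# Retrieve one column with dictionary-like syntax.
duration_ms = df["duration_ms"]

# Arithmetic on a column is applied element by element.
duration_sec = duration_ms / 1000   # milliseconds -> seconds
duration_min = duration_sec / 60    # seconds -> minutes

# Store the results as new columns, again with dictionary-like syntax.
df["duration_sec"] = duration_sec
df["duration_min"] = duration_min

print(df)
```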
Next, find the range of the song durations (the difference between the longest and shortest
duration).
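One way to compute the range is to subtract the column's minimum from its maximum, as in this sketch with invented values:

```python
import pandas as pd

# Invented sample durations in seconds.
df = pd.DataFrame({"duration_sec": [200.0, 250.0, 180.0]})

longest = df["duration_sec"].max()
shortest = df["duration_sec"].min()

# Range = longest duration minus shortest duration.
duration_range = longest - shortest
print(duration_range)
```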
Finally, find the percentage of songs whose duration is longer than the average value. To do this you
will need to use the np.where function.
indices = np.where(condition)
This function returns a tuple, containing arrays of indices of elements in your array that satisfy the condition
given. In a 2-D array, the tuple will have 2 arrays corresponding to the row indices and column indices. In a
1-D array the tuple will contain only 1 array. For example:
negatives = np.where(a < 0)
This returns negatives = ([...],), which is a tuple containing a single array that holds the indices of the
elements in a that are negative. To access this array use negatives[0].
Use np.where to find the indices of durations greater than the average. The percentage of songs with
above-average duration is the length of this index array divided by the length of duration_sec, times 100.
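The percentage calculation described above can be sketched as follows, using a small invented array in place of the real duration_sec column:

```python
import numpy as np

# Invented sample durations in seconds (stand-in for duration_sec).
duration_sec = np.array([200.0, 250.0, 180.0, 230.0])

average = duration_sec.mean()

# np.where returns a tuple; for a 1-D array it holds a single index array.
indices = np.where(duration_sec > average)
longer_than_average = indices[0]

# Share of songs above the average, expressed as a percentage.
percentage = len(longer_than_average) / len(duration_sec) * 100
print(longer_than_average, percentage)
```

Here the average is 215 seconds, so the songs at positions 1 and 3 qualify and the result is 50 percent.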
Next, find the names of the songs whose durations are above average. First, convert your column
to a NumPy array using the column.to_numpy() method. Then you can pass the result of where() directly
as an index to a NumPy array to retrieve the elements that satisfy the condition. For example, if
you use:
a[np.where(a < 0)]
It will return an array of numbers in a that are smaller than zero.
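Applied to the song names, the same indexing idea might look like this sketch. The column names song and duration_sec and all values are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Invented sample data; column names are assumptions.
df = pd.DataFrame({
    "song": ["Song A", "Song B", "Song C", "Song D"],
    "duration_sec": [200.0, 250.0, 180.0, 230.0],
})

# Convert both columns to NumPy arrays.
names = df["song"].to_numpy()
durations = df["duration_sec"].to_numpy()

# Positions where the duration is above average...
above_average = np.where(durations > durations.mean())

# ...used directly as an index into the array of names.
long_song_names = names[above_average]
print(long_song_names)
```

Because both arrays come from the same data frame, position i in names and position i in durations refer to the same song, which is what makes this indexing valid.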
Refer to Page 2 for an explanation of what each variable means. In our case, let's look at the correlation
between some pairs of variables. In this task you need to write the code to calculate:
To do this you just need to create an array xdata = your column for x and ydata = your column for
y, and provide them as arguments for plt.scatter. Optionally, if you want to make your plot look a bit
nicer you can look into using cmap to set a colourmap, and c to assign a dimension to map the colours. For
example you can use:
plt.scatter(xdata,ydata,cmap="plasma",c=xdata)
This will assign the x dimension to the plasma colour map. Refer to this page for a list of available colourmaps:
Matplotlib Colormap.
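Put together, a coloured scatter plot could be produced roughly as below. The data points are invented, and the axis labels, colour bar and output file name scatter.png are illustrative additions rather than lab requirements.

```python
import matplotlib
matplotlib.use("Agg")  # draw without a display; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np

# Invented sample data standing in for two columns of the data frame.
xdata = np.array([0.2, 0.5, 0.7, 0.9])
ydata = np.array([0.3, 0.4, 0.8, 0.85])

# c=xdata colours each point by its x value using the plasma colourmap.
points = plt.scatter(xdata, ydata, cmap="plasma", c=xdata)

plt.colorbar(points, label="x value")  # legend for the colour scale
plt.xlabel("x variable")
plt.ylabel("y variable")
plt.savefig("scatter.png")  # in a notebook, plt.show() displays the plot instead
```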