Python - Adv - 2 - Jupyter Notebook (Student)

The document discusses Pandas, an open source data analysis and manipulation library for Python. It introduces Pandas Series as the basic data structure, which are like dictionaries with added functionality. Series can be initialized from lists, dictionaries or custom indices. Elements can be selected from Series using .loc for labels or .iloc for integer positions. Series can be combined by appending, though this may result in duplicate indices.

Uploaded by Mesbahur Rahman

8/22/22, 9:22 PM Python_Day5_MC - Jupyter Notebook

In [ ]: #!conda install pandas


import pandas as pd
from IPython.display import display

In [ ]: import numpy as np

Pandas
Pandas
Introduction to Pandas
Series
Initializing Series
Selecting Elements
loc
iloc
Combining Series
Exercises
Exercise 1
Exercise 2
Exercise 3
DataFrames
Creating DataFrames
Series as Rows
Series as Columns
Summary
Pandas Pretty Print in Jupyter
Importing and Exporting Data
Reading CSV
Writing CSV
Selecting Data
Exercises
Exercise 1
Exercise 2
Exercise 3
Exercise 4
Data Processing
Aggregation
Arithmetic
Grouping
Unique and Duplicate Values
unique


duplicate
Exercises
Exercise 1
Exercise 2
Exercise 3
Exercise 4
Merge Data Frames
Exercises
Exercise 1
Reshaping Data Frames
Exploratory Data Analysis
Exercise 1
Exercise 2
Exercise 3
Exercise 4
Stacking and Unstacking Data Frames
Exercise 5
Exercise 6

Introduction to Pandas
Pandas (http://pandas.pydata.org/) is a software library written for the Python programming
language for data manipulation and analysis. In particular, it offers data structures and operations
for manipulating numerical tables and time series. Pandas is free software released under the
three-clause BSD license. The name is derived from the term panel data, an econometrics term for
multidimensional structured data sets.

At its core, Pandas consists of NumPy arrays and additional functions to perform typical data
analysis tasks.

Resources:

Pandas Documentation (http://pandas.pydata.org/pandas-docs/stable/index.html), especially


10 minutes to pandas (http://pandas.pydata.org/pandas-docs/stable/10min.html)
Hernan Rojas' learn-pandas (https://bitbucket.org/hrojas/learn-pandas)
Harvard CS109 lab1 content (https://github.com/cs109/2015lab1)

Series
Series form the basis of Pandas. They are essentially Python dictionaries with some added bells
and whistles. However, Pandas Series 'keys' are called indices.

Initializing Series
Series can be initialized from Python objects like lists or tuples. If only values are given, Pandas
generates default indices.


In [ ]: animals = ['Tiger', 'Bear', 'Moose']


pd.Series(animals)

In [ ]: numbers = [1, 2, 3]


pd.Series(numbers)

Series can be mixed type

In [ ]: # Create a mixed series


mixed = [1, 2, "Three"]
print(pd.Series(mixed))
print()
print(type(mixed[0]))
print(type(mixed[1]))
print(type(mixed[2]))

Series also support missing values via the None type.

In [ ]: #create a pandas series with None


#observe the dtype
animals = ['Tiger', 'Bear', None]
print(pd.Series(animals))
print("")
print(type(animals[0]))
print(type(animals[1]))
print(type(animals[2]))

In [ ]: numbers = [1, 2, None]


print(pd.Series(numbers))
print("")
print(type(numbers[0]))
print(type(numbers[1]))
print(type(numbers[2]))

We can define custom keys during initialization.

In [ ]: sports = pd.Series(


data=["Bhutan", "Scotland", "Japan", "South Korea"],
index=["Archery", "Golf", "Sumo", "Taekwondo"])
print(sports)

Alternatively, Series can also be initialized with dictionaries. Indices are then generated from the
dictionary keys.


In [ ]: #create a pandas series from dictionary


sports = pd.Series({
'Archery': 'Bhutan',
'Golf': 'Scotland',
'Sumo': 'Japan',
'Taekwondo': 'South Korea'})
print(sports)

We can list values and indices of series.

In [ ]: print(sports.index)
print(sports.values)

Series type

In [ ]: type(sports)

Selecting Elements
As a result of iterative development of the Pandas library, there are several ways to select
elements of a Series. Most of them are considered "legacy", however, and the best practice is to
use *.loc[...] and *.iloc[...] . Take care to use square brackets with loc and
iloc , not parentheses as you would with functions.

loc

Select elements by their indices. If the index is invalid, either a TypeError or a KeyError will
be thrown.

In [ ]: print(sports.loc['Golf'])
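If the label is missing, the lookup fails as described above. A minimal sketch of catching the error, using the sports series defined earlier (the label 'Cricket' is an invented example):

```python
import pandas as pd

sports = pd.Series({
    'Archery': 'Bhutan',
    'Golf': 'Scotland',
    'Sumo': 'Japan',
    'Taekwondo': 'South Korea'})

try:
    sports.loc['Cricket']  # 'Cricket' is not in the index
except KeyError as err:
    print("KeyError raised for:", err)
```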

iloc

Select elements by their numerical IDs, i.e. the n-th element.

In [ ]: print(sports.iloc[1])

If the indices were autogenerated then both loc and iloc seem to be identical.

In [ ]: sports_noindex = pd.Series(sports.values)


print(sports_noindex)
print("")
print(sports_noindex.loc[0])
print(sports_noindex.iloc[0])

Take care to keep your code semantically correct, however. For example, if the series is re-sorted,
the index of each element stays the same, but the ID changes!

In [ ]: sports_noindex_sorted = sports_noindex.sort_values()


print(sports_noindex_sorted)
print("")
print(sports_noindex_sorted.loc[1])
print(sports_noindex_sorted.iloc[1])

If you want to select by index then use loc , if you want to select by ID then use iloc . Do not
use them interchangeably just because they return the same results right now. This will eventually
lead to bugs in your code.

Combining Series
Series can be combined by concatenating one to another. (The older Series.append method was removed in pandas 2.0; pd.concat is the replacement.)

In [ ]: s1 = pd.Series(["A", "B", "C"])


s2 = pd.Series(["D", "E", "F"])
print(s1)
print("")
print(s2)
print("")

s3 = pd.concat([s1, s2])  # Series.append was removed in pandas 2.0
print(s3)

Notice the duplicate indices! Pandas permits this and selecting by loc will return both entries

In [ ]: print(s3.loc[0])
print("")
print(s3.iloc[0])
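If the duplicate indices are unwanted, pd.concat can renumber the result via its standard ignore_index option. A small sketch:

```python
import pandas as pd

s1 = pd.Series(["A", "B", "C"])
s2 = pd.Series(["D", "E", "F"])

# ignore_index=True discards the old indices and renumbers 0..n-1
s3 = pd.concat([s1, s2], ignore_index=True)
print(s3)
```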

Also notice that if your selection of a Series results in a single entry, Pandas automatically converts
it to its base type, i.e. a string in this case. If the selection consists of more than 1 entry, however, a
Series is returned.

In [ ]: print(s3.loc[0])
print(type(s3.loc[0]))
print("")
print(s3.iloc[0])
print(type(s3.iloc[0]))

Exercises

Exercise 1

Create a pandas Series object from the following movie ratings


The Avengers: 9.2

Mr. Bean: 7.4

Garfield: 2.1

Star Wars The Force Awakens: 8.8

In [ ]: ### Your code here

In [ ]: # MC
movies = pd.Series(
data = [9.2, 7.4, 2.1, 8.8],
index = ["The Avengers", "Mr. Bean",
"Garfield", "Star Wars The Force Awakens"])
print(movies)
print()

movies = pd.Series({
"The Avengers": 9.2,
"Mr. Bean": 7.4,
"Garfield": 2.1,
"Star Wars The Force Awakens": 8.8})
print(movies)

Exercise 2

Select the rating for the movie 'Garfield'.

In [ ]: ### Your code here

In [ ]: # MC
movies.loc["Garfield"]

Exercise 3

Select the index of the 2nd entry

In [ ]: ### Your code here

In [ ]: # MC
movies.index[1]

DataFrames
Multiple series with common indices can form a data frame. A data frame is like a table, with rows
and columns (e.g., as in SQL or Excel).

         Animal    Capital
India    a         b
Sweden   a         b

Each row usually denotes an entry in our data and each column a feature we're interested in.

Creating DataFrames

Series as Rows

Data frames can be created by gluing together Series objects as rows. In this case, the series
indices become the data frame columns

In [ ]: row1 = pd.Series(("Elephant", "New Delhi"), index=("Animal", "Capital"))


row2 = pd.Series(("Reindeer", "Stockholm"), index=("Animal", "Capital"))
print(row1)
print()
print(row2)

In [ ]: df = pd.DataFrame(data=[row1, row2], index=("India", "Sweden"))


df

As before, we can make use of Pandas' flexibility and replace the Series objects with a dictionary

In [ ]: df = pd.DataFrame(
data=[
{"Animal": "Elephant", "Capital": "New Delhi"},
{"Animal": "Reindeer", "Capital": "Stockholm"}],
index=("India", "Sweden"))
df

Or even a list of lists

In [ ]: df = pd.DataFrame(
data=[["Elephant", "New Delhi"],
["Reindeer", "Stockholm"]],
index=["India", "Sweden"],
columns=["Animal", "Capital"])
df

Make sure to match indices and columns when combining series. Pandas won't necessarily raise
an error but will instead perform flexible merging.


In [ ]: row1 = pd.Series(("Elephant", "New Delhi"), index=("Animal", "City"))


row2 = pd.Series(("Reindeer", "Stockholm"), index=("Animal", "Capital"))
print(row1)
print()
print(row2)
df = pd.DataFrame(data=[row1, row2], index=("India", "Sweden"))
display(df)

Series as Columns

We can also create data frames column-wise

In [ ]: col1 = pd.Series(["Elephant", "Reindeer"])


col2 = pd.Series(["New Delhi", "Stockholm"])
print(col1)
print()
print(col2)

In [ ]: # Pass the series in a dict so each one becomes a column
df = pd.DataFrame({"Animal": col1, "Capital": col2})
df.index = ["India", "Sweden"]
df

Summary

Series are pasted together to become data frames. They can be pasted as:

rows: data=[series1, series2, ...]

columns: data={"col1": series1, "col2": series2, ...}

We can use the same *.index and *.values attributes as for Series

In [ ]: print(df.index)
print(df.columns)
print(df.values)

Pandas Pretty Print in Jupyter


Jupyter has a 'pretty print' option for Pandas dataframes. Using print will show the dataframes in
Jupyter as they would appear in a standard console. Omitting print , or using IPython's
display function, will render them as HTML tables

In [ ]: print(df)


In [ ]: from IPython.display import display


display(df)

Jupyter allows a shortcut for the display function. If we execute a Python command or line of
code that results in a data frame, Jupyter will assume we want to display it and do so using its
built-in function. Note, however, that it will only ever do this with the last relevant line in each cell.

In [ ]: df

Importing and Exporting Data


Most often we don't create data within our code but read it from external sources. Pandas has a
large collection of importing (and corresponding exporting) functions available.

Data Reader Writer

CSV read_csv to_csv

JSON read_json to_json

HTML read_html to_html

Local clipboard read_clipboard to_clipboard

Excel read_excel to_excel

HDF5 read_hdf to_hdf

Feather read_feather to_feather

Parquet read_parquet to_parquet

Msgpack read_msgpack to_msgpack

Stata read_stata to_stata

SAS read_sas

Python Pickle Format read_pickle to_pickle

SQL read_sql to_sql

Google Big Query read_gbq to_gbq

http://pandas.pydata.org/pandas-docs/stable/io.html (http://pandas.pydata.org/pandas-
docs/stable/io.html)

Reading CSV

We will read a tabular CSV file as an example.

In [ ]: cars = pd.read_csv("../data/cars.csv")


cars

In [ ]: %pwd


We can also define one of the columns as an index column using either the column header (if it
exists) or the column ID (remember, Python starts counting at 0)

In [ ]: cars = pd.read_csv("../data/cars.csv", index_col="model")



# Use head() to print only the first few lines
cars.head()

In [ ]: cars = pd.read_csv("../data/cars.csv", index_col=0)


cars.head(3)

Writing CSV

Writing CSV files is as straightforward as it gets. Notice that these functions are now methods of
the specific objects, not of base Pandas

In [ ]: !ls
# For windows:
# !dir

In [ ]: cars.to_csv("cars2.csv")

In [ ]: !ls

Selecting Data
Selecting data from Pandas data frames works much as it does for NumPy arrays, except that loc and
iloc are used.

In [ ]: cars.head(10)

In [ ]: cars.iloc[9]

In [ ]: cars.iloc[5, 3]

In [ ]: cars.iloc[4:7]

In [ ]: cars.iloc[1:9:2]

As with Series, we can also select items by their index names.

In [ ]: cars.loc["Datsun 710"]

In [ ]: cars.loc[["Datsun 710", "Ferrari Dino"]]


Notice how a single entry is shown as a series but multiple entries as a data frame. This is
analogous to how a single entry of a series is shown as a base type and multiple entries as a
smaller series

Base Type --> Series --> Data Frame

Selecting columns can be done just as with dictionaries except that we can select multiple Pandas
columns simultaneously. As with row selection, selecting a single column results in a Series object
but selecting multiple columns results in a new DataFrame object.

In [ ]: cars["disp"]

In [ ]: cars[["disp", "wt"]].head()

Alternatively, we can also use the *.loc / *.iloc syntax. In this case, we have to include both
the row and column indices to select. As with base Python, the colon : instructs Pandas to select
all rows or columns

In [ ]: cars.loc[:, "disp"]

In [ ]: cars.loc["Mazda RX4", "disp"]
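One detail worth remembering: label-based slices with loc include both endpoints, while iloc slices follow Python's usual exclusive behavior. A small sketch with a toy series:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

print(s.loc["b":"d"])   # label slice: includes "d"
print(s.iloc[1:3])      # position slice: excludes position 3
```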

Take note that if we want to mix ID and index selection, we need to chain together loc and
iloc calls. There is no way to combine this into a single index tuple.

In [ ]: print(cars.iloc[4])
print()
print(cars.iloc[4].loc["mpg"])

In [ ]: print(cars.loc[:, "mpg"])


print()
print(cars.loc[:, "mpg"].iloc[4])

We can see the names of all columns with the columns property (notice that this is also an index
object, just as the row index is).

In [ ]: print(cars.columns)
print(cars.index)

We can also use boolean masks to select rows or columns, i.e.

cars.loc[[True, True, False, True, False, ...]]

However, as we're dealing with large datasets, typing them out by hand is suboptimal. So let's use
some simple boolean conditions instead.

In [ ]: # Pandas applies the operation to each individual entry


print(cars["mpg"] > 25)

In [ ]: # Use loc, not iloc, to select based on boolean masks


cars.loc[cars["mpg"] > 25]

We can also select specific rows of certain columns with boolean masks.

In [ ]: cars.loc[cars["mpg"] > 25, ["hp", "disp"]]

Exercises
Familiarize yourselves with data frame creation and handling.

Exercise 1

Manually create a dataframe from the following data. EmployeeID should be the index of the
dataframe. Try using different methods (e.g. nested dictionaries, list of lists, series objects as rows
or columns)

EmployeeID,EmployeeName,Salary,Department

2044,James,2500,Finance

1082,Hannah,4000,Sales

7386,Victoria,3700,IT

In [ ]: ### Your code here


In [ ]: # MC
df1 = pd.DataFrame({
"EmployeeName": {
"2044": "James",
"1082": "Hannah",
"7386": "Victoria"},
"Salary": {
"2044": 2500,
"1082": 4000,
"7386": 3700},
"Department": {
"2044": "Finance",
"1082": "Sales",
"7386": "IT"}})
display(df1)

df2 = pd.DataFrame(
data=[
["James", 2500, "Finance"],
["Hannah", 4000, "Sales"],
["Victoria", 3700, "IT"]],
index=["2044","1082","7386"],
columns=["EmployeeName", "Salary", "Department"])
display(df2)

# We can set the name of the index column as follows
df1.index.name = "EmployeeID"
display(df1)

Exercise 2

Read in the chocolate.csv data set and display the first 8 lines

In [ ]: ### Your code here

In [ ]: # MC
choc = pd.read_csv("../data/chocolate.csv")
choc.head(8)

Exercise 3

Select only the chocolates with "Congo" as the country of origin and show only the rating, the
cocoa percent, and the country of origin (to make sure we've selected the right products)

In [ ]: ### Your code here


In [ ]: # MC
choc.loc[
choc["Country of Origin"] == "Congo",
["Cocoa Percent", "Rating", "Country of Origin"]]

Exercise 4

Oh no! There was a mistake in the data entry. One of the products has a missing country of origin.
Please find it, replace it with "Venezuela", and save the fixed data frame as "chocolate_fixed.csv"

You can use *.isna() to identify which entry of a series is either NaN or None , e.g.
mySeries.isna()
You can assign values to data frames just like you would to lists, e.g. df.iloc[0, 5] = 15

In [ ]: choc.loc[choc["Country of Origin"].isna()]

In [ ]: # MC
choc.loc[choc["Country of Origin"].isna(), "Country of Origin"] = "Venezuela"
choc.to_csv("chocolate_fixed.csv")

Data Processing
Pandas contains many functions to process and transform data. These can be called either on
data frames or individual series. Describing every function in detail is far too time-consuming and
application-dependent. A thorough list and description of all Pandas functionality can be found
here: https://pandas.pydata.org/pandas-docs/stable/api.html (https://pandas.pydata.org/pandas-
docs/stable/api.html)

Many of the functions are more or less self-explanatory and/or well-documented

Aggregation

In [ ]: numbers = pd.Series([1, 2, 3, 4, 5, 5, 6, 6, 6])


print(numbers.sum())
print(numbers.mean())
print(numbers.max())
print(numbers.min())
print(numbers.idxmax())
print(numbers.idxmin())

Functions can be applied to Series or data frames. In the case of data frames, they are applied to
each row or column individually

In [ ]: df = pd.DataFrame([[1,1,1], [2,2,2], [3,3,3]])


df


In [ ]: df.sum()

In [ ]: df.idxmax()

We can decide whether the aggregation should occur along columns or rows. Note, however, that
the syntax can be confusing: axis=X indicates along which dimension the function will "travel". For
example, axis='columns' indicates that all columns will be collapsed into the function, so the
function is applied to each individual row. Likewise, axis='rows' means that the function travels
along rows and computes the aggregate value for each column individually.

In [ ]: df.sum(axis='columns')

In [ ]: df.sum(axis='rows')
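The difference is easier to see with a non-square frame whose row and column sums cannot be confused. A quick sketch:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Collapse the columns: one value per row
print(df.sum(axis='columns'))  # 11, 22, 33

# Collapse the rows: one value per column
print(df.sum(axis='rows'))     # a: 6, b: 60
```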

The most important aggregation function is *.apply() , which applies an arbitrary function to
each row/column.

In [ ]: df.apply(lambda x: sum(x**2), axis='columns')

*.apply() is slower than the built-in functions, so should not be used for simple operations that
can also be solved with direct operations on data frames.

In [ ]: df = pd.DataFrame([[1,1,1], [2,2,2], [3,3,3]])



%timeit df.apply(lambda x: sum(x**2))
%timeit (df**2).sum()

Also take care that the function will be applied to all columns, regardless of type. The built-in
functions are clever enough to skip columns for which they are not defined.

In [ ]: df = pd.DataFrame({
"Age": [10, 12, 12],
"Name": ["Liz", "John", "Sam"]})
df.sum()

# Uncomment for exception
#df.apply(lambda x: sum(x**2), axis="rows")

Arithmetic
We can also perform element-wise operations on dataframe columns or rows, e.g.


In [ ]: df = pd.DataFrame(
data=[[1,2,3], [4,5,6], [7,8,9]],
columns=["ColA", "ColB", "ColC"],
index=["RowA", "RowB", "RowC"])
df

In [ ]: df["ColA"] + df["ColB"]

In [ ]: # Pandas is smart enough to convert our list into a series and then add the two columns
df["ColA"] + [10, 11, 12]

In [ ]: # Remember, both rows AND columns can be represented as Pandas series
df.loc["RowA"] * df.loc["RowB"]

Pandas adheres to the same broadcasting rules as NumPy

In [ ]: df = pd.DataFrame(
data=[[1,2], [3,4], [5,6]],
columns=["ColA", "ColB"],
index=["RowA", "RowB", "RowC"])
df

In [ ]: df * 2

In [ ]: df * [1, -1]

In [ ]: df.loc["RowA"] / 5

In [ ]: df["ColB"] ** 3

Grouping
A core functionality of Pandas is the ability to group data frames and apply functions to each
individual group. The function *.groupby(...) defines groups based on common labels.
Aggregators applied to this grouped data frame are then applied to each group individually.

In [ ]: df = pd.DataFrame({
"Height": [178, 182, 158, 167, 177, 174, 175, 185],
"Age": [24, 33, 32, 18, 21, 28, 22, 29],
"Gender": ["M", "M", "F", "F", "M", "F", "M", "F"]})
display(df)

In [ ]: print(df.groupby("Gender"))
display(df.groupby("Gender").mean())

We can also select columns without disturbing the grouping



In [ ]: display(df.groupby("Gender")["Height"].mean())

A useful function is size() , which counts how large each of the groups is.

In [ ]: df.groupby("Gender").size()
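Beyond a single aggregator, grouped data also accepts .agg with a list of functions (a standard pandas feature), shown here with the same df as above:

```python
import pandas as pd

df = pd.DataFrame({
    "Height": [178, 182, 158, 167, 177, 174, 175, 185],
    "Age":    [24, 33, 32, 18, 21, 28, 22, 29],
    "Gender": ["M", "M", "F", "F", "M", "F", "M", "F"]})

# Several aggregates per group in one call
summary = df.groupby("Gender")["Height"].agg(["mean", "min", "max"])
print(summary)
```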

Unique and Duplicate Values


Two functions can help us identify unique and duplicate values within Series objects. They are
aptly named unique() and duplicated() , respectively.

unique

*.unique() returns only unique values of a Series object.

In [ ]: s = pd.Series([1,2,3,2,3,4,3,5])
s.unique()

duplicate

*.duplicated() identifies duplicated values in Series objects and returns a boolean Series.
Entries that have already been seen are marked as True while new values are marked as
False .

In [ ]: s = pd.Series([1,2,3,2,3,4,3,5])
s.duplicated()

When applied to Dataframes, duplicated() compares entire rows for duplicates.

In [ ]: df = pd.DataFrame([
["Dog", 5],
["Cat", 4],
["Dog", 5],
["Fish", 2],
["Cat", 8]],
columns=["Animal", "Age"])
display(df)
display(df.duplicated())

To remove duplicate rows from a data frame we can use the drop_duplicates() function.

In [ ]: df.drop_duplicates()
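drop_duplicates also takes subset and keep parameters (standard pandas options) to control which columns are compared and which occurrence survives. A small sketch with a toy frame:

```python
import pandas as pd

df = pd.DataFrame([
    ["Dog", 5],
    ["Cat", 4],
    ["Dog", 7]],
    columns=["Animal", "Age"])

# Compare only the 'Animal' column and keep the last occurrence
print(df.drop_duplicates(subset="Animal", keep="last"))
```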

Exercises

Exercise 1

Load the "cars.csv" dataframe and calculate the average miles per gallon (column "mpg")

In [ ]: ### Your code here

In [ ]: # MC
cars = pd.read_csv("../data/cars.csv")
cars["mpg"].mean()

Exercise 2

Cars can have 4, 6, or 8 cylinders (column "cyl"). Find the mean miles per gallon (column "mpg")
for each of these classes without using the groupby(...) function.

BONUS: Write a function that takes the number of cylinders and returns the mean miles per gallon.

In [ ]: ### Your code here

In [ ]: # MC
# 4 cyl
print(cars.loc[cars["cyl"] == 4, "mpg"].mean())
# 6 cyl
print(cars.loc[cars["cyl"] == 6, "mpg"].mean())
# 8 cyl
print(cars.loc[cars["cyl"] == 8, "mpg"].mean())

def avg_mpg(df, cyl, col):
return df.loc[df["cyl"] == cyl, col].mean()

print(avg_mpg(cars, 8, "mpg"))

Exercise 3

Repeat the above exercise but this time make use of the groupby(...) function.

In [ ]: ### Your code here

In [ ]: # MC
cars.groupby("cyl")["mpg"].mean()

Exercise 4

Your client has a proprietary metric for car engine quality that is calculated as Q = hp / wt² . Calculate
this metric for all cars and then find the average for cars with a manual (column "am" == 1) or
automatic (column "am" == 0) transmission.


HINT You can add the new metric as a column to your data frame via cars["q_metric"] =
... . Assignments to unknown column (or row) index names will result in new columns (or rows)
being appended to the data frame.

In [ ]: ### Your code here

In [ ]: # MC
cars["q_metric"] = cars["hp"] / cars["wt"]**2

# Manual transmission
print(cars.loc[cars["am"] == 1, "q_metric"].mean())
print(cars.loc[cars["am"] == 0, "q_metric"].mean())

Merge Data Frames


Pandas data frames can be treated like SQL tables and joined.

In [ ]: sales = pd.DataFrame({


"Date": pd.date_range(start="2018-10-01", end="2018-10-07"),
"ItemID": ["A401", "C776", "A401", "FY554", "Y98R", "Y98R", "FY554"]})
sales

In [ ]: item_info = pd.DataFrame({


"ID": ["A401", "C776", "FY554", "Y98R"],
"Name": ["Toaster", "Vacuum Cleaner", "Washing Machine", "Clothes Iron"],
"Price": [25, 220, 540, 85]})
item_info

In [ ]: sales.merge(right=item_info, how="inner", left_on="ItemID", right_on="ID")

Merge types:

Inner: keep only rows with corresponding IDs found in both data frames
Left: use only rows with IDs found in the left data frame
Right: use only rows with IDs found in the right data frame
Outer: use all keys that are in at least one of the data frames. This is essentially the
combination of left and right joins

Missing data will be replaced by NaN values
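A quick way to see which side each row (and its NaN values) came from is the indicator option of merge (a standard pandas parameter), sketched here with toy frames:

```python
import pandas as pd

left = pd.DataFrame({"ItemID": ["A401", "ZZZ1"]})
right = pd.DataFrame({"ID": ["A401", "U1776"], "Price": [25, 899]})

# _merge labels each row as 'both', 'left_only', or 'right_only'
merged = left.merge(right, how="outer", left_on="ItemID", right_on="ID",
                    indicator=True)
print(merged)
```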


In [ ]: sales = pd.DataFrame({


"Date": pd.date_range(start="2018-10-01", end="2018-10-07"),
"ItemID": ["A401", "ZZZC776", "A401", "ZZZFY554", "Y98R", "Y98R", "FY554"]})
display(sales)
item_info = pd.DataFrame({
"ID": ["A401", "C776", "FY554", "Y98R", "U1776"],
"Name": ["Toaster", "Vacuum Cleaner", "Washing Machine", "Clothes Iron", "Computer"],  # last name truncated in the source; "Computer" assumed
"Price": [25, 220, 540, 85, 899]})
display(item_info)

In [ ]: sales.merge(right=item_info, how="inner", left_on="ItemID", right_on="ID")

In [ ]: sales.merge(right=item_info, how="left", left_on="ItemID", right_on="ID")

In [ ]: sales.merge(right=item_info, how="right", left_on="ItemID", right_on="ID")

In [ ]: sales.merge(right=item_info, how="outer", left_on="ItemID", right_on="ID")

We can also merge on indices, either of one or both of the data frames

In [ ]: df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],


'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
'hire_date': [2004, 2008, 2012, 2014]})
display(df1, df2)

In [ ]: df1 = df1.set_index("employee")


df2 = df2.set_index("employee")
display(df1, df2)

In [ ]: df1.merge(df2, left_index=True, right_index=True)

In [ ]: df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],


'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
'hire_date': [2004, 2008, 2012, 2014]})
df2 = df2.set_index("employee")
display(df1, df2)

In [ ]: df1.merge(df2, left_on="employee", right_index=True)

Exercises

Exercise 1

Merge the three data frames so that we have all information available for Bob, Alice, Kevin, and
Joshua in a single data frame

In [ ]: salaries = pd.DataFrame(


data=[["Bob", 5000], ["Alice", 4000], ["Kevin", 8000]],
columns=["Name", "Salary"])
departments = pd.DataFrame(
data=[["Kevin", "IT"], ["Joshua", "Data Science"], ["Bob", "Data Science"]],
columns=["Name", "Department"])
supervisors = pd.DataFrame(
data=[["IT", "Jeremy"], ["Data Science", "Darren"], ["Sales", "Yvonne"]],
columns=["Department", "Supervisor"])

In [ ]: display(salaries, departments, supervisors)

In [ ]: # MC
df1 = salaries.merge(departments, how="outer",
left_on="Name", right_on="Name")
df1

In [ ]: # MC
df2 = df1.merge(supervisors, how="left",
left_on="Department", right_on="Department")
df2

In [ ]: # MC
salaries.merge(departments, how="outer").merge(supervisors, how="left")

Reshaping Data Frames


In data analysis, we speak of 'tall' and 'wide' data formats when referring to the structure of a data
frame. A 'wide' data frame lists each feature in a separate column, e.g.

Name Age Hair Color

Joe 41 Brown

Carl 32 Blond

Mike 22 Brown

Sue 58 Black

Liz 27 Blond

A 'tall' data frame, on the other hand, collapses all features into a single column and uses an ID
("Name" in the example here) to keep track of which data point the feature value belongs to, e.g.

Name Feature Value

Joe Age 41

Carl Age 32

Mike Age 22

Sue Age 58

Liz Age 27
Joe Hair Color Brown

Carl Hair Color Blond

Mike Hair Color Brown

Sue Hair Color Black

Liz Hair Color Blond

Pandas lets us transform between these two formats.

In [ ]: df_wide = pd.DataFrame(


data=[
["Joe", 41, "Brown", 55.7, 157],
["Carl", 32, "Blond", 68.4, 177],
["Mike", 22, "Brown", 44.4, 158],
["Sue", 58, "Black", 82.2, 159],
["Liz", 27, "Blond", 55.1, 169]],
columns=["Name", "Age", "Hair Color", "Weight", "Height"])
df_wide

melt(...) transforms a wide-format dataframe into a tall format. The parameter id_vars
takes a single or tuple of column names to be used as IDs. The remaining columns are treated as
features and collapsed into (variable, value) pairs.

In [ ]: # Age,Hair Color,Weight and Height are implicitly assigned as value_vars


df_tall = df_wide.melt(id_vars="Name")
df_tall

Transforming a data frame from the tall to the wide format is called pivoting.

In [ ]: df_tall.pivot(index="Name", columns="variable", values="value")

Exploratory Data Analysis


A large part of our task as data scientists and analysts is to find patterns and interesting
phenomena within data. We can make use of Pandas' vast assortment of functions to help us with
this. The following exercises are designed to help you get an idea of the kind of questions you can
answer with Pandas.

This dataset describes all Olympic athletes, the year they participated, the event they participated
in, and whether they received a medal. The data is split into two files, olympics_events.csv
and olympics_games.csv , describing the events and metadata of the games, respectively. The
data has been adjusted from https://www.kaggle.com/heesoo37/120-years-of-olympic-history-
athletes-and-results (https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-
results)

Exercise 1
1. Load the two files, olympics_events.csv and olympics_games.csv , and display the first
10 lines of each data frame.

In [ ]: ### Your code here

In [ ]: # MC
events = pd.read_csv("../data/olympics_events.csv")
games = pd.read_csv("../data/olympics_games.csv")
display(events.head(10), games.head(10))

2. Merge the two data frames on the GamesID and ID columns. Drop the now-unnecessary ID
columns afterwards.

In [ ]: ### Your code here

In [ ]: # MC
events = events.merge(
right=games, left_on="GamesID", right_on="ID", how="outer")
events = events.drop(["GamesID", "ID"], axis='columns')
display(events.head())
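If the CSV files aren't at hand, the same merge-then-drop pattern can be sketched on two tiny, made-up frames; the column names mirror the exercise, but the data here is invented:

```python
import pandas as pd

# Hypothetical stand-ins for the two CSV files.
events_toy = pd.DataFrame(
    {"Name": ["Joe", "Liz"], "GamesID": [1, 2]})
games_toy = pd.DataFrame(
    {"ID": [1, 2], "Year": [2012, 2016]})

# Join each event row to its games metadata, matching GamesID to ID.
merged = events_toy.merge(
    right=games_toy, left_on="GamesID", right_on="ID", how="outer")

# The key columns carry no extra information after the merge, so drop them.
merged = merged.drop(["GamesID", "ID"], axis="columns")

print(merged.columns.tolist())  # ['Name', 'Year']
```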

Exercise 2
History lesson! Malaysia's Olympic nationality code is MAS . Before that, the Federation of Malaya
competed under the code MAL . Likewise, Sarawak and Sabah competed as North Borneo
( NBO ).

1. In which years did the Federation of Malaya compete in the Olympics?

In [ ]: ### Your code here

In [ ]: # MC
events.loc[events["Nationality"] == "MAL", "Year"].unique()

2. How many athletes did they send?

In [ ]: ### Your code here

In [ ]: # MC
len(events.loc[events["Nationality"] == "MAL", "Name"].unique())

In [ ]: events.loc[events["Nationality"] == "MAL"].nunique()

3. Who were the first countries to participate in the Olympic games (as per this data set)?

In [ ]: ### Your code here

In [ ]: # MC
earliest_year = events["Year"].min()
earliest_year

In [ ]: first_event = events.loc[events["Year"] == earliest_year]


first_event

In [ ]: countries = first_event["Nationality"].unique()


print(countries)

4. How many men and women has Malaysia ( MAS ) sent to the Olympics in total? Keep in mind
that athletes can participate in multiple events and multiple years. Each person should only
ever be counted once.

HINT: As we're only interested in athlete names and their genders, it's easiest to drop the other
columns and not have to worry about them. Create a new data frame rather than overwriting
events , as we'll need it for later exercises.

In [ ]: ### Your code here

In [ ]: # MC
# We're only interested in Malaysian athletes and their genders
athletes_mas = events.loc[events["Nationality"] == "MAS", ["Name", "Sex"]]

# Remove duplicates from this data frame
athletes_mas_unique = athletes_mas.drop_duplicates()

# Print the output
athletes_mas_unique["Sex"].value_counts()
# Note: .size is an attribute, not a method, and gives the total count only:
# athletes_mas_unique["Sex"].size

Exercise 3
1. How many men and women has Malaysia ( MAS ) sent to the Olympics each year?

Hint: This is a lot like the previous question, except that an athlete now only counts as a
duplicate if they compete in multiple events in the same year. An athlete competing across
multiple years is no longer a duplicate.

In [ ]: ### Your code here

In [ ]: # MC
athletes_mas = events.loc[events["Nationality"] == "MAS", ["Name", "Sex", "Year"]]

athletes_mas_unique = athletes_mas.drop_duplicates()

display(athletes_mas_unique.groupby("Year")["Sex"].value_counts())

# The previous answer feels a bit clumsy. We're grouping
# by year and then counting values in another column.
# A more elegant solution is to group by multiple columns
# and then simply check the size of each group
display(athletes_mas_unique.groupby(["Year", "Sex"]).size())

2. How does the ratio of male to female athletes sent by Malaysia compare to the global ratio for
the year 2016?

In [ ]: ### Your code here

In [ ]: # MC
events_2016 = events.loc[events["Year"] == 2016]
athletes_2016 = events_2016[["Name", "Sex", "Nationality"]]

athletes_2016 = athletes_2016.drop_duplicates()

athletes_2016.head()

In [ ]: # MC
# Global ratio
athlete_count_global = athletes_2016["Sex"].value_counts()
athlete_count_global

In [ ]: # MC
athlete_ratio_global = athlete_count_global.loc["M"] / athlete_count_global.loc["F"]
print("Global male-to-female ratio: {:.2f}".format(athlete_ratio_global))

# Malaysian ratio
athletes_2016_mas = athletes_2016.loc[athletes_2016["Nationality"] == "MAS"]
athlete_count_mas = athletes_2016_mas["Sex"].value_counts()
athlete_ratio_mas = athlete_count_mas.loc["M"] / athlete_count_mas.loc["F"]
print("Malaysian male-to-female ratio: {:.2f}".format(athlete_ratio_mas))

Exercise 4
Let's start looking at some of the numerical data!

1. How many gold medals has each country won? How about Malaysia ( MAS )?

In [ ]: ### Your code here

In [ ]: # MC
medals = events.loc[~events["Medal"].isna()]
medal_table = medals.groupby(["Nationality", "Medal"]).size()
print(medal_table)
print()
print(medal_table.loc["MAS"])

Stacking and Unstacking Data Frames

The previous solution is in an acceptable format, but it's not the most human-friendly way to
present data. Instead, we can unstack our data and bring it into wide format.

In [ ]: medal_table_wide = medal_table.unstack(fill_value=0)


medal_table_wide

The opposite operation, .stack() , brings it back into the original long format.

In [ ]: medal_table_wide.stack()
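As a self-contained illustration of the same pattern, here is unstack(fill_value=0) followed by .stack() on a toy medal count; the nationalities and numbers are invented:

```python
import pandas as pd

# A long-format count with a two-level index, like groupby(...).size() produces.
counts = pd.Series(
    [3, 1, 2],
    index=pd.MultiIndex.from_tuples(
        [("MAS", "Silver"), ("MAS", "Bronze"), ("USA", "Gold")],
        names=["Nationality", "Medal"]))

# Long -> wide: the inner index level (Medal) becomes the columns;
# missing combinations are filled with 0 instead of NaN.
wide = counts.unstack(fill_value=0)

# Wide -> long again.
long_again = wide.stack()

# Filled in by fill_value, since MAS won no toy gold here.
print(wide.loc["MAS", "Gold"])
```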

Exercise 5
1. What is the median age of gold medalists?

In [ ]: ### Your code here

In [ ]: # MC
events.loc[events["Medal"] == "Gold", "Age"].median()

2. What is the median age of gold, silver, and bronze medalists for each individual sport?

In [ ]: ### Your code here

In [ ]: # MC
events.groupby(["Sport", "Medal"])["Age"].median().unstack()

3. Look at only swimmers. How has the mean weight of all competitors changed throughout the
years? Use .plot() to get a visual sense of the trend.

In [ ]: ### Your code here

In [ ]: # MC
events_swimming = events.loc[events["Sport"] == "Swimming"]
events_swimming.groupby("Year")["Weight"].mean().plot()

4. What is the mean and standard deviation of the BMI of athletes in each sports discipline? The
BMI can be computed as

BMI = Weight / (Height / 100)^2

with the values in this dataset. To solve this question, break it down into individual steps:

Calculate the BMI for all athletes
Group by 'Sport'
Calculate the mean and standard deviation of the BMI of the grouped data frame

Hint: Use `.agg([..., ...])` to apply "mean" and "std" (standard deviation) simultaneously.

In [ ]: ### Your code here

In [ ]: # MC
events["BMI"] = events["Weight"] / (events["Height"]/100)**2
bmi_table = events.groupby("Sport")["BMI"].agg(["mean", "std"])
bmi_table.sort_values("mean")

Exercise 6
1. What country has the most gold medals in wrestling?

In [ ]: ### Your code here

In [ ]: # MC
events.loc[
(events["Sport"] == "Wrestling") &
(events["Medal"] == "Gold"), "Nationality"].value_counts()

2. How many different types of events have ever been held for fencing?

In [ ]: ### Your code here

In [ ]: # MC
len(events.loc[events["Sport"] == "Fencing", "Event"].unique())

3. Typically, only one of each medal is awarded per year for each event. This is not the case for
team sports, however. If a team wins the gold, then each team member is awarded a gold
medal. What is the largest team to have ever been awarded gold medals for a single event in
a single year?

In [ ]: ### Your code here

In [ ]: # MC
events.groupby(["Nationality", "Event", "Medal", "Year"]).size().idxmax()
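The idxmax() step is worth a quick look: on a Series produced by groupby(...).size(), it returns the tuple of group keys belonging to the largest group. A minimal sketch with made-up data:

```python
import pandas as pd

# Three toy rows for team A's event and one for team B's.
df = pd.DataFrame({
    "Team":  ["A", "A", "A", "B"],
    "Event": ["Hockey", "Hockey", "Hockey", "Fencing"]})

# Size of each (Team, Event) group, indexed by the group keys.
sizes = df.groupby(["Team", "Event"]).size()

# idxmax() returns the index label of the largest entry — here a key tuple.
largest = sizes.idxmax()

print(largest)  # ('A', 'Hockey')
```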
