Rec 11

Recitation 11
Python Programming for Engineers
Tel-Aviv University / 0509-1820 / Fall 2021-2022

Agenda
Pandas
Data fetching
Data cleaning
Data analysis
Data visualization
What is the Pandas package used for?
Calculate statistics and answer questions about (mostly tabular) data
e.g., What's the average, median, max, or min of each column?
Clean the data by doing things like removing missing values and filtering rows or
columns by some criteria
Visualize the data with help from Matplotlib. Plot bars, lines, histograms, bubbles,
and more.
Store the cleaned, transformed data back into a CSV, another file, or a database
Dataframe
Dataframe is the primary pandas data structure.

A dataframe is a two-dimensional size-mutable, potentially heterogeneous tabular
data structure with labeled axes (rows and columns).
What is the main difference from a numpy array?
A dataframe has labeled rows and columns and may hold a different type in each
column; arithmetic operations align on both row and column labels, not on position.
It can be thought of as a dict-like container for Series objects.
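A minimal sketch of what labeled axes mean for arithmetic, and of the "dict of Series" view. The names `s1`, `s2` and `toy` and their values are made up for illustration.

```python
import pandas as pd

# Two Series with partially overlapping labels (made-up data)
s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Addition matches values by label, not by position;
# labels that appear in only one Series produce NaN
total = s1 + s2
print(total)

# A dataframe behaves like a dict of Series: each column is a Series
toy = pd.DataFrame({"x": s1, "y": s2})
print(type(toy["x"]))
```

With a plain numpy array the same addition would pair elements by position instead.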
Basic operations (1)
Create a dataframe from a dictionary
In [ ]: import pandas as pd

import numpy as np # This is not mandatory for using Pandas
import matplotlib.pyplot as plt # For plotting later
In [ ]: d = {'Name': ["Rick", "Morty"], 'Dimension': [137, 132]}
df = pd.DataFrame(data=d)
print('"Raw" display. This is how dataframes are usually displayed on the console')
print(df)
In [ ]: print('Formatted display (only in Jupyter Notebooks)')
display(df)
Create a dataframe from a numpy array
In [ ]: df = pd.DataFrame(np.array([["Rick", 136, 1, True], ["Morty", 134, 3, True], ["Summer", 130, 7, False]]),
columns=['Name', 'Dimension', 'Encounters', 'Is original'])  # Column names
display(df)
In [ ]: print(df.columns)
In [ ]: print(df["Name"])
Kahoot! (1): What will be printed?
In [ ]: print(df["Dimension"] + df["Encounters"])

In [ ]: df["Dimension"].dtype
In [ ]: df2 = pd.DataFrame(np.array([[136, 1], [134, 3], [130, 7]]),
columns=['Dimension', 'Encounters'])  # Column names
In [ ]: df2["Dimension"].dtype
Kahoot! (2): What will be printed?
In [ ]: df2 = pd.DataFrame(np.array([[136, 1], [134, 3], [130, 7]]),
columns=['Dimension', 'Encounters'])  # Column names
display(df2["Dimension"] + df2["Encounters"])
Make sure your dataframe types are correct!
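The sketch below shows why the types matter: a numpy array holds a single type, so mixing names and numbers stores everything as strings. Converting with `astype` is one possible fix; the names `mixed` and `fixed` are chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Mixing strings and numbers in one numpy array turns every element into a string
mixed = pd.DataFrame(
    np.array([["Rick", 136, 1], ["Morty", 134, 3], ["Summer", 130, 7]]),
    columns=["Name", "Dimension", "Encounters"],
)
concatenated = mixed["Dimension"] + mixed["Encounters"]  # string concatenation
print(concatenated.tolist())  # ['1361', '1343', '1307']

# Convert the numeric columns explicitly before doing arithmetic
fixed = mixed.astype({"Dimension": int, "Encounters": int})
summed = fixed["Dimension"] + fixed["Encounters"]  # integer addition
print(summed.tolist())  # [137, 137, 137]
```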

iloc and loc
In [ ]: df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
columns=['a', 'b', 'c'],index=[3, 4, 5])
display(df)
In [ ]: display(df.iloc[:2])
In [ ]: display(df.loc[:3])
Do not get confused between `df[]`, `df.loc[]` and `df.iloc[]`!
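A short side-by-side sketch of the three access styles on the same dataframe. The variable names `by_position`, `by_label` and `column_b` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                  columns=["a", "b", "c"], index=[3, 4, 5])

by_position = df.iloc[:2]  # first two rows by position; the end is exclusive
by_label = df.loc[:3]      # rows up to and including the label 3
column_b = df["b"]         # plain brackets select a column by name

print(by_position.index.tolist())  # [3, 4]
print(by_label.index.tolist())     # [3]
print(column_b.tolist())           # [2, 5, 8]
```

Because the index labels here are integers that do not start at 0, `iloc[:2]` and `loc[:3]` return different rows.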

Pandas' classes
In [ ]: print(type(df))
print(type(df.iloc[0]))
print(type(df.iloc[:,0]))
Example: Countries of the world dataset
Analysis steps
1. Install pandas with anaconda as described here (via Pycharm)
(http://courses.cs.tau.ac.il/pyProg/2122a/resources/pycharm%20tutorial.pdf) or
here (on Windows)
(https://docs.anaconda.com/anaconda/navigator/tutorials/pandas/)
2. Get basic information
3. Data cleaning – fill missing data
4. Basic data analysis
5. Data Visualization
1. Get basic information (and some more basic operations)
Read a csv
In [ ]: df = pd.read_csv("files/countries-of-the-world.csv")
Write dataframe to csv
In [ ]: df.to_csv("files/countries-of-the-world_out.csv")
Print the top 3 rows using the head() function
In [ ]: display(df.head(3))
Note the NaN in the second row!

Print the last 5 rows using the tail() function
In [ ]: display(df.tail(5))
Print 5 random rows using the sample() function
In [ ]: display(df.sample(5))
Get statistics
In [ ]: df.info()
In [ ]: display(df.dtypes)
2. Data cleaning
Fill missing values
Option 1: replace NaN values with 0 (or any other constant value)
In [ ]: df = pd.read_csv("files/countries-of-the-world.csv").loc[[0,1,2,3],["Country","Region","Population","Area"]]
display(df)
df = df.fillna(0)
display(df)
Option 2: replace nan values with the average of the column
In [ ]: df = pd.read_csv("files/countries-of-the-world.csv").loc[[0,1,2,3],["Country","Region","Population","Area"]]
display(df)
df['Population'] = df['Population'].fillna(df['Population'].mean())  # assign back instead of inplace=True on a column
print(df['Population'].mean())
display(df)
Option 3: drop all rows that have any NaN value
In [ ]: df = pd.read_csv("files/countries-of-the-world.csv").loc[[0,1,2,3],["Country","Region","Population","Area"]]
display(df)
df = df.dropna()
display(df)
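The three options can be compared on a tiny made-up table. This is a sketch: the frame `toy` and its values are not from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing population value
toy = pd.DataFrame({"Country": ["A", "B", "C"],
                    "Population": [100.0, np.nan, 300.0]})

# Option 1: constant fill
zero_filled = toy.fillna(0)
# Option 2: fill with the column mean (mean() skips NaN)
mean_filled = toy.fillna({"Population": toy["Population"].mean()})
# Option 3: drop rows that contain any NaN
dropped = toy.dropna()

print(zero_filled["Population"].tolist())  # [100.0, 0.0, 300.0]
print(mean_filled["Population"].tolist())  # [100.0, 200.0, 300.0]
print(len(dropped))                        # 2
```

The choice affects later statistics: filling with 0 lowers the column mean, filling with the mean keeps it, and dropping rows shrinks the table.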
Convert square miles to square kilometers
Formula: 1 square mile ≈ 2.59 square kilometers
In [ ]: df['Area km2'] = df['Area'].apply(lambda x: x * 2.59)  # Apply this lambda function to every cell in the Area column
display(df.loc[:, ["Area km2", "Area"]])
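Add a new column
In [ ]: df = pd.read_csv("files/countries-of-the-world.csv").loc[[0,1,2,3],["Country","Region","Population","Area", "Birthrate", "Deathrate"]]  # rows 0-3 assumed, as in the cells above
display(df)
df["Growing Rate"] = df.apply(lambda row: row["Birthrate"] - row["Deathrate"], axis=1)
display(df)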
A detailed (and longer) version
In [ ]: def get_growing_rate(row):
    return row["Birthrate"] - row["Deathrate"]
In [ ]: df["Growing Rate"] = df.apply(get_growing_rate, axis=1)
display(df)
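For simple arithmetic such as this, subtracting the columns directly is a common alternative to a row-wise `apply`. A small sketch with made-up rates:

```python
import pandas as pd

# Made-up values, not taken from the dataset
toy = pd.DataFrame({"Birthrate": [46.6, 15.1, 17.1],
                    "Deathrate": [20.3, 5.2, 4.6]})

# Row-wise apply, as in the cells above
by_apply = toy.apply(lambda row: row["Birthrate"] - row["Deathrate"], axis=1)

# Column arithmetic gives the same values without a Python-level call per row
by_columns = toy["Birthrate"] - toy["Deathrate"]

print(by_apply.tolist())
print(by_columns.tolist())
```

`apply` with `axis=1` remains useful when the per-row logic cannot be written as column operations.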
Add a new country (row) to the dataframe
In [ ]: israel = {"Country": "Israel", "Region": "ASIA", "Population": 8000000}
df = pd.concat([df, pd.DataFrame([israel])], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
display(df.iloc[4].loc['Area'])
In [ ]: display(df)
NaNs
In [ ]: nan_val=df.iloc[4].loc['Area']
print(nan_val)
In [ ]: print(np.nan=="nan")
print(np.nan=="NaN")
print(nan_val==np.nan)
print(np.isnan(nan_val))
print(pd.isnull(nan_val))
print(np.isnan(np.nan))
print(pd.isnull(np.nan))
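The checks above come down to one rule: NaN never compares equal to anything, so missing values must be detected with a dedicated function. A compact sketch (the names `value` and `s` are illustrative):

```python
import numpy as np
import pandas as pd

value = np.nan

print(value == np.nan)   # False: NaN is not equal to anything, even itself
print(value == "nan")    # False: NaN is a float, not a string
print(np.isnan(value))   # True
print(pd.isnull(value))  # True

# pd.isnull also works element-wise on a Series
s = pd.Series([1.0, np.nan, 3.0])
print(pd.isnull(s).tolist())  # [False, True, False]
```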
Delete the Area column
display(df)
df = df.drop("Area", axis=1)
# axis=1 selects columns. Use df.drop(["A", "B"], axis=1) to drop both the A and B columns
display(df)
Delete the country Angola
df = df[df.Country != "Angola"]
display(df)
# This is equivalent
# df = df[df['Country'] != "Angola"]
Keep only the countries Angola and Afghanistan
display(df)
countries = ["Angola","Afghanistan"]
df = df[df.Country.isin(countries)]
display(df)
What will happen if we add ~ before isin ?
Kahoot! (6): Which countries will be printed?
display(df)
countries = ["Angola","Afghanistan"]
df = df[~df.Country.isin(countries)] # ~ is for not.
display(df.loc[:,"Country"])
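A small sketch of how `isin` and `~` work together, using a made-up four-row table. `isin` returns a boolean mask, and `~` flips it.

```python
import pandas as pd

toy = pd.DataFrame({"Country": ["Afghanistan", "Albania", "Algeria", "Angola"]})
countries = ["Angola", "Afghanistan"]

mask = toy["Country"].isin(countries)  # boolean Series, one value per row
kept = toy[mask]                       # rows whose country is in the list
removed = toy[~mask]                   # ~ inverts the mask

print(kept["Country"].tolist())     # ['Afghanistan', 'Angola']
print(removed["Country"].tolist())  # ['Albania', 'Algeria']
```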
Join Tables
Given a new table with the same column names, merge the two tables into a single
table
In [ ]: df1 = pd.read_csv("files/countries-of-the-world.csv").iloc[:3]

df2 = pd.read_csv("files/countries-of-the-world.csv").iloc[3:6]
display(df1)
display(df2)
frames = [df1, df2]
result = pd.concat(frames)
display(result)
Inner join – consider only the intersection of the tables
How many rows are in the intersection?
Outer join – consider the union of the tables; fill missing values with NaN.
How many rows are in the union?

In [ ]: df1 = pd.read_csv("files/ex1.csv")
display(df1)
df2 = pd.read_csv("files/ex2.csv")
display(df2)
In [ ]: pd.concat([df1,df2], axis=1)

In [ ]: df1 = pd.read_csv("files/ex1.csv")
display(df1)
df2 = pd.read_csv("files/ex2.csv")
display(df2)
In [ ]: inner = pd.merge(df1, df2, on='Product', how = 'inner')

display(inner)
Kahoot! (7): How many rows in the printed table?
In [ ]: outer = pd.merge(df1, df2, on='Product', how = 'outer')

display(outer)
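Since the contents of ex1.csv and ex2.csv are not shown here, the following sketch uses two made-up tables, `left` and `right`, as stand-ins to show how the row counts of inner and outer joins differ.

```python
import pandas as pd

# Made-up stand-ins for the two CSV files
left = pd.DataFrame({"Product": ["apple", "banana", "cherry"],
                     "Price": [3, 1, 8]})
right = pd.DataFrame({"Product": ["banana", "cherry", "date"],
                      "Stock": [10, 0, 5]})

inner = pd.merge(left, right, on="Product", how="inner")
outer = pd.merge(left, right, on="Product", how="outer")

print(len(inner))  # 2: only products present in both tables
print(len(outer))  # 4: every product from either table
print(outer)       # missing Price or Stock values appear as NaN
```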
3. Data analysis
Which country has the largest population?
1. Find the label of the row with the maximum value in the population column
(idxmax())
2. Get the country name of the row with the obtained label (loc)
In [ ]: label_max_pop = df["Population"].idxmax()

display(df.loc[label_max_pop]["Country"])
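The two steps can be tried on a made-up table with a non-default index, which makes it visible that `idxmax()` returns a label and not a position.

```python
import pandas as pd

# Made-up data; the index labels are deliberately not 0, 1, 2
toy = pd.DataFrame({"Country": ["A", "B", "C"],
                    "Population": [5, 50, 20]},
                   index=[10, 20, 30])

label = toy["Population"].idxmax()  # index label of the maximum value
name = toy.loc[label, "Country"]    # row label and column name in one lookup

print(label)  # 20
print(name)   # B
```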
How many countries have more than 1M people?
1. Select all rows with Population > 1,000,000

2. Count the number of selected rows (use len() to get the num of rows)
In [ ]: print(len(df.loc[df['Population'] > 1000000]))
In [ ]: print(len(df[df['Population'] > 1000000]))
In [ ]: print((df['Population'] > 1000000).sum())

How many countries in Oceania have more than 1M people?
Select all countries from Oceania, with Population > 1,000,000

Count the number of selected rows (use len() to get the num of rows)
In [ ]: df = pd.read_csv("files/countries-of-the-world.csv").loc[:,["Country","Region","Population","Area"]]
df = df[(df['Population'] > 1000000) & (df['Region'] == "OCEANIA")]
display(df)
print(len(df))
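A sketch of how the two conditions combine, on made-up rows. Naming each mask separately (`big`, `oceania`) makes the role of `&` easier to see.

```python
import pandas as pd

# Made-up rows for illustration
toy = pd.DataFrame({"Country": ["A", "B", "C", "D"],
                    "Region": ["OCEANIA", "OCEANIA", "ASIA", "OCEANIA"],
                    "Population": [5_000_000, 20_000, 9_000_000, 1_500_000]})

big = toy["Population"] > 1_000_000
oceania = toy["Region"] == "OCEANIA"

# & is the element-wise "and". When the comparisons are written inline,
# each one needs its own parentheses because & binds more tightly than > and ==
selected = toy[big & oceania]

print(selected["Country"].tolist())  # ['A', 'D']
print((big & oceania).sum())         # 2
```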
Get all countries in Oceania with Deathrate > 7
df = df[(df['Region'] == "OCEANIA") & (df['Deathrate'] > 7)]
display(df)
Sort the countries according to the population size
df = df.sort_values(['Population'], ascending=True)
display(df)
How to sort countries with the same/NaN population values?
Break ties according to area size
df = df.sort_values(['Population', 'Area'], ascending=True)
display(df)
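A sketch of tie-breaking and NaN handling on made-up data: rows with equal population are ordered by area, and rows with a missing population go last by default.

```python
import numpy as np
import pandas as pd

# Made-up data: A and C tie on population, D has no population value
toy = pd.DataFrame({"Country": ["A", "B", "C", "D"],
                    "Population": [100.0, 50.0, 100.0, np.nan],
                    "Area": [30, 10, 20, 5]})

ordered = toy.sort_values(["Population", "Area"], ascending=True)
print(ordered["Country"].tolist())  # ['B', 'C', 'A', 'D']
```

Passing `na_position="first"` to `sort_values` would place the NaN rows at the top instead.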
groupby - extra
What is the average population in each region?
Groupby region
Get the mean of the population column in every group
In [ ]: display(df.groupby(['Region'])['Population'].mean())
Which country in each region has the largest population?
1. Group by region
2. Get the country with the maximum population in every group
In [ ]: regions = df.groupby("Region")  # a single column name keeps the group keys as plain strings
for name, group in regions:
    label_max = group["Population"].idxmax()
    print(name, df.loc[label_max]["Country"])
Better yet:
In [ ]: display(df.loc[df.groupby(["Region"])['Population'].idxmax()])
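Both groupby patterns can be checked by hand on a made-up table with two regions. This is a sketch; `toy`, `mean_pop`, `labels` and `largest` are illustrative names.

```python
import pandas as pd

# Made-up data: two regions with two countries each
toy = pd.DataFrame({"Country": ["A", "B", "C", "D"],
                    "Region": ["ASIA", "ASIA", "OCEANIA", "OCEANIA"],
                    "Population": [10, 30, 7, 5]})

# Mean population per region
mean_pop = toy.groupby("Region")["Population"].mean()
print(mean_pop)

# idxmax per group returns one row label per region;
# loc then pulls those rows out of the original table
labels = toy.groupby("Region")["Population"].idxmax()
largest = toy.loc[labels, ["Region", "Country"]]
print(largest)
```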
Print the highest mean Deathrate among all regions
Compute the mean deathrate per region

Get the maximum value
regions = df.groupby(['Region'])
print(regions['Deathrate'].mean().max())
display(df)
4. Data visualization
Plot a histogram of the GDP column: create a matplotlib figure from the Dataframe
In [ ]: axarr = df.hist(column='GDP ($ per capita)', bins=10, grid=False, color='#86bf91')
for ax in axarr.flatten():
    ax.set_xlabel("GDP")
    ax.set_ylabel("Count")
plt.show()
Another (more useful) way: pass a Series object to matplotlib
In [ ]: fig, ax = plt.subplots(1,1,figsize=(10,10))
ax.hist(df.loc[:,'GDP ($ per capita)'], bins=10, color='#86bf91', label="GDP ($ per capita)")
ax.set_xlabel("GDP")
ax.set_ylabel("Count")
ax.legend()
plt.show()
Plot a histogram of the Birthrate and Deathrate columns: create a matplotlib figure from the
Dataframe
df = df[["Birthrate", "Deathrate"]]
ax = df.plot.hist(bins=12, alpha=0.5) # alpha for transparent colors
plt.show()
Another (more useful) way: pass a Series object to matplotlib
In [ ]: fig, ax = plt.subplots(1, 1, figsize=(10, 10))  # create the axes to draw on
ax.hist(df["Birthrate"], bins=12, alpha=0.5, label='Birthrate')  # alpha for transparent colors
ax.hist(df["Deathrate"], bins=12, alpha=0.5, label='Deathrate')
ax.legend()
plt.show()
Create a boxplot of the Infant mortality, Birthrate and Deathrate columns: create a matplotlib
figure from the Dataframe
columns = ['Infant mortality (per 1000 births)','Birthrate','Deathrate']
boxplot = df.boxplot(column=columns)
plt.show()
Another (useful) way: pass the selected columns to matplotlib
In [ ]: fig, ax = plt.subplots(1, 1, figsize=(10, 10))  # create the axes to draw on
columns = ['Infant mortality (per 1000 births)', 'Birthrate', 'Deathrate']
ax.boxplot(df.loc[:, columns].dropna(axis=0), labels=columns)
plt.show()
Pandas summary
We saw how to:

Create a dataframe (from a file, dict, numpy array)
Get basic information (head(), tail(), dtypes, info())
Clean the data (add/remove columns/rows, fillna())
Merge tables (concat(), merge())
Analyze the dataset (select rows based on a condition, groupby(), mean(),
idxmax(), etc.)
Visualize the data (hist(), boxplot())
For more info visit the pandas documentation website (https://pandas.pydata.org/pandas-docs/stable/).
The end! (except for the exam practice session 🤓)
