Recitation 11

Python Programming for Engineers

Tel-Aviv University / 0509-1820 / Fall 2021-2022


Agenda
Pandas

Data fetching

Data cleaning

Data analysis

Data visualization
What is the Pandas package used for?
Calculate statistics and answer questions about (mostly tabular) data
e.g., What's the average, median, max, or min of each column?
Clean the data by doing things like removing missing values and filtering rows or
columns by some criteria
Visualize the data with help from Matplotlib. Plot bars, lines, histograms, bubbles,
and more.
Store the cleaned, transformed data back into a CSV, another file, or a database
Dataframe

Dataframe is the primary pandas data structure.


A dataframe is a two-dimensional size-mutable, potentially heterogeneous tabular
data structure with labeled axes (rows and columns).
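What is the main difference from a numpy array?
A dataframe has labeled rows and columns, and each column can hold a different
type, whereas a plain numpy array stores a single dtype. Arithmetic operations are
supported along both the rows and the columns of a dataframe.
It can be thought of as a dict-like container for Series objects.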
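A minimal sketch of the "dict-like container for Series" idea, using toy values chosen only for illustration. It shows that each column is a Series with its own dtype, while the equivalent numpy array has to settle on one dtype for every cell.

```python
import numpy as np
import pandas as pd

# Toy values, chosen only for illustration
df = pd.DataFrame({"Name": ["Rick", "Morty"], "Dimension": [137, 132]})

# Each column is a Series with its own dtype
name_col = df["Name"]
print(type(name_col).__name__)
print(df.dtypes)

# Dict-like behavior: iterating over a dataframe yields its column labels
column_labels = list(df)
print(column_labels)

# The equivalent numpy array must use one dtype for all cells,
# so the numbers are stored as strings
arr = np.array([["Rick", 137], ["Morty", 132]])
print(arr.dtype.kind)
```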
Basic operations (1)

Create a dataframe from a dictionary

In [ ]: import pandas as pd
import numpy as np # This is not mandatory for using Pandas
import matplotlib.pyplot as plt # For plotting later

In [ ]: d = {'Name': ["Rick", "Morty"], 'Dimension': [137, 132]}
df = pd.DataFrame(data=d)
print('"Raw" display. This is how dataframes are usually displayed on the console')
print(df)

In [ ]: print('Formatted display (only in Jupyter Notebooks)')
display(df)
Create a dataframe from a numpy array

In [ ]: df = pd.DataFrame(np.array([["Rick", 136, 1, True], ["Morty", 134, 3, True],
["Summer", 130, 7, False]]),
columns=['Name', 'Dimension', 'Encounters', 'Is original']) # Column names
display(df)

In [ ]: print(df.columns)

In [ ]: print(df["Name"])

Kahoot! (1): What will be printed?

In [ ]: print(df["Dimension"] + df["Encounters"])


In [ ]: df["Dimension"].dtype

In [ ]: df2 = pd.DataFrame(np.array([[136, 1], [134, 3], [130, 7]]),
columns=['Dimension', 'Encounters']) # Column names

Kahoot! (2): What will be printed?

In [ ]: df2["Dimension"].dtype

In [ ]: df2 = pd.DataFrame(np.array([[136, 1], [134, 3], [130, 7]]),


columns=['Dimension', 'Encounters']) # Column names

display(df2["Dimension"] + df2["Encounters"])

Make sure your dataframe types are correct!
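One possible way to repair the types, sketched on a small array like the one above. Both astype() and pd.to_numeric() are standard pandas calls; which one fits best depends on how dirty the column is, so treat this as an illustration rather than the only approach.

```python
import numpy as np
import pandas as pd

# Mixing strings and numbers forces numpy to store every cell as a string
df = pd.DataFrame(np.array([["Rick", 136, 1], ["Morty", 134, 3]]),
                  columns=["Name", "Dimension", "Encounters"])

# Convert the numeric columns explicitly
df["Dimension"] = df["Dimension"].astype(int)
df["Encounters"] = pd.to_numeric(df["Encounters"])

# The sum is now arithmetic instead of string concatenation
total = df["Dimension"] + df["Encounters"]
print(total.tolist())   # [137, 137]
```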


iloc and loc
In [ ]: df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
columns=['a', 'b', 'c'],index=[3, 4, 5])
display(df)

In [ ]: display(df.iloc[:2])

In [ ]: display(df.loc[:3])

Do not get confused between `df[]`, `df.loc[]` and `df.iloc[]`!
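A short side-by-side sketch of the three accessors on the same dataframe as above. The key points: df[...] with a label selects a column, iloc works with integer positions (end excluded), and loc works with labels (end included).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                  columns=["a", "b", "c"], index=[3, 4, 5])

col = df["a"]               # df[...] with a label selects a column
by_position = df.iloc[:2]   # positions 0 and 1; the end is excluded
by_label = df.loc[:4]       # labels up to and including 4
cell = df.loc[5, "c"]       # one cell: row label 5, column "c"

print(by_position.index.tolist())   # [3, 4]
print(by_label.index.tolist())      # [3, 4]
print(cell)                         # 9
```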


Pandas' classes

Kahoot! (3): What will be printed?

In [ ]: print(type(df))
print(type(df.iloc[0]))
print(type(df.iloc[:,0]))
Example: Countries of the world dataset

Analysis steps

1. Install pandas with Anaconda as described here (via PyCharm)
(http://courses.cs.tau.ac.il/pyProg/2122a/resources/pycharm%20tutorial.pdf) or
here (on Windows)
(https://docs.anaconda.com/anaconda/navigator/tutorials/pandas/)
2. Get basic information
3. Data cleaning – fill missing data
4. Basic data analysis
5. Data Visualization
1. Get basic information (and some more basic operations)

Read a csv

In [ ]: df = pd.read_csv("files/countries-of-the-world.csv")

Write dataframe to csv

In [ ]: df.to_csv("files/countries-of-the-world_out.csv")
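A small sketch of one detail worth knowing when writing a CSV: by default to_csv() also writes the row index, which comes back as an extra unnamed column when the file is read again. The example uses an in-memory buffer and toy values instead of a real file, purely for illustration.

```python
import io
import pandas as pd

# Toy values, chosen only for illustration
df = pd.DataFrame({"Country": ["A", "B"], "Population": [10, 20]})

# Default: the row index is written as an extra, unnamed first column
with_index = pd.read_csv(io.StringIO(df.to_csv()))

# index=False writes only the data columns
without_index = pd.read_csv(io.StringIO(df.to_csv(index=False)))

print(with_index.columns.tolist())     # ['Unnamed: 0', 'Country', 'Population']
print(without_index.columns.tolist())  # ['Country', 'Population']
```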

Print the top 3 rows using the head() function

In [ ]: display(df.head(3))

Note the NaN in the second row!


Print the last 5 rows using the tail() function

In [ ]: display(df.tail(5))

Print 5 random rows using the sample() function

In [ ]: display(df.sample(5))
Get statistics

In [ ]: df.info()

In [ ]: display(df.dtypes)
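For the numeric summaries mentioned at the start (average, median, max, min of each column), describe() is one convenient option. The sketch below uses toy numbers, not the countries dataset; the "50%" row of the result is the median.

```python
import pandas as pd

# Toy values, chosen only for illustration
df = pd.DataFrame({"Population": [100, 200, 600],
                   "Area": [10.0, 20.0, 30.0]})

# Summary statistics for every numeric column
stats = df.describe()
print(stats.loc[["mean", "50%", "min", "max"]])

# Individual statistics are also available as Series methods
median_pop = df["Population"].median()
print(median_pop)   # 200.0
```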
2. Data cleaning

Fill missing values

Option 1: replace NaN values with 0 (or any other constant value)

In [ ]: df = pd.read_csv("files/countries-of-the-world.csv").loc[[0,1,2,3],
["Country","Region","Population","Area"]]
display(df)
df = df.fillna(0)
display(df)
Option 2: replace NaN values with the average of the column

In [ ]: df = pd.read_csv("files/countries-of-the-world.csv").loc[[0,1,2,3],
["Country","Region","Population","Area"]]
display(df)
df['Population'] = df['Population'].fillna(df['Population'].mean())
print(df['Population'].mean())
display(df)
Option 3: drop all rows that have any NaN value

In [ ]: df = pd.read_csv("files/countries-of-the-world.csv").loc[[0,1,2,3],
["Country","Region","Population","Area"]]
display(df)
df = df.dropna()
display(df)
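Before picking one of the three options, it can help to see how much data is actually missing. A minimal sketch on toy values (not the countries dataset): count the NaN values per column and list the rows that contain at least one.

```python
import numpy as np
import pandas as pd

# Toy values, chosen only for illustration
df = pd.DataFrame({"Country": ["A", "B", "C"],
                   "Population": [100.0, np.nan, 300.0],
                   "Area": [10.0, 20.0, np.nan]})

# Number of missing values in each column
missing_per_column = df.isna().sum()
print(missing_per_column)

# Rows that contain at least one missing value
rows_with_missing = df[df.isna().any(axis=1)]
print(rows_with_missing)
```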
Convert Square Miles to Square Kilometers

Formula: 1 Square Mile = 2.59 Square Kilometers

In [ ]: df = pd.read_csv("files/countries-of-the-world.csv")
# Apply this lambda function to every cell in the Area column
df['Area km'] = df['Area'].apply(lambda x: x*2.59)
display(df.loc[:,["Area km","Area"]])
Add a new column

In [ ]: df = pd.read_csv("files/countries-of-the-world.csv").loc[[0,1,2,3],
["Country","Region","Population","Area", "Birthrate", "Deathrate"]]
display(df)

df["Growing Rate"] = df.apply(lambda row: row["Birthrate"] - row["Deathrate"],
axis=1)
display(df)
A detailed (and longer) version

In [ ]: def get_growing_rate(row):
    return row["Birthrate"] - row["Deathrate"]

In [ ]: df["Growing Rate"] = df.apply(get_growing_rate, axis=1)
display(df)
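For simple arithmetic like this, the same column can also be computed without apply(), by operating on whole columns at once. The sketch below uses toy rates (not values from the dataset) and checks that both approaches agree.

```python
import numpy as np
import pandas as pd

# Toy values, chosen only for illustration
df = pd.DataFrame({"Birthrate": [40.0, 15.0], "Deathrate": [20.0, 5.0]})

# Row-by-row, as in the cells above
row_wise = df.apply(lambda row: row["Birthrate"] - row["Deathrate"], axis=1)

# Column arithmetic: pandas subtracts element by element, aligned on the index
vectorized = df["Birthrate"] - df["Deathrate"]

print(np.allclose(row_wise, vectorized))
print(vectorized.tolist())
```

The column form is usually shorter and faster on large tables; apply() remains useful when the per-row logic does not map onto column operations.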
Add a new country (row) to the dataframe

Kahoot! (4): What will be printed?

In [ ]: df = pd.read_csv("files/countries-of-the-world.csv").loc[[0,1,2,3],
["Country","Region","Population","Area"]]
israel = {"Country":"Israel", "Region":"ASIA","Population": 8000000}
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
df = pd.concat([df, pd.DataFrame([israel])], ignore_index=True)
display(df.iloc[4].loc['Area'])

In [ ]: display(df)
NaNs

In [ ]: nan_val=df.iloc[4].loc['Area']
print(nan_val)

Kahoot! (5): What will be printed?

In [ ]: print(np.nan=="nan")
print(np.nan=="NaN")
print(nan_val==np.nan)
print(np.isnan(nan_val))
print(pd.isnull(nan_val))
print(np.isnan(np.nan))
print(pd.isnull(np.nan))
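A short sketch of the practical consequence, on a toy Series: NaN compares unequal to everything, including itself, so an equality test never finds missing values. Use isna() (or its alias isnull()) instead.

```python
import numpy as np
import pandas as pd

# NaN is not equal to anything, not even to itself
nan_equals_itself = (np.nan == np.nan)
print(nan_equals_itself)        # False

# Toy values, chosen only for illustration
s = pd.Series([1.0, np.nan, 3.0])

matches_by_equality = int((s == np.nan).sum())
matches_by_isna = int(s.isna().sum())
print(matches_by_equality)      # 0: equality never matches NaN
print(matches_by_isna)          # 1: isna() finds the missing value
```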
Delete the Area column

In [ ]: df = pd.read_csv("files/countries-of-the-world.csv").loc[[0,1,2,3],
["Country","Region","Population","Area"]]
display(df)

# axis=1 means columns. Use df.drop(["A", "B"], axis=1) to drop both the A and B columns
df = df.drop("Area", axis=1)

display(df)
Delete the country Angola

In [ ]: df = pd.read_csv("files/countries-of-the-world.csv").loc[[0,1,2,3],
["Country","Region","Population","Area"]]
df = df[df.Country != "Angola"]
display(df)
# This is equivalent
# df = df[df['Country'] != "Angola"]
Leave only the countries Angola and Afghanistan

In [ ]: df = pd.read_csv("files/countries-of-the-world.csv").loc[[0,1,2,3],
["Country","Region","Population","Area"]]
display(df)

countries = ["Angola","Afghanistan"]
df = df[df.Country.isin(countries)]
display(df)
What will happen if we add ~ before the isin() condition?

Kahoot! (6): Which countries will be printed?

In [ ]: df = pd.read_csv("files/countries-of-the-world.csv").loc[[0,1,2,3],
["Country","Region","Population","Area"]]
display(df)

countries = ["Angola","Afghanistan"]
df = df[~df.Country.isin(countries)] # ~ is for not.
display(df.loc[:,"Country"])
Join Tables

Given a new table with the same column names, merge the two tables into a single
table

In [ ]: df1 = pd.read_csv("files/countries-of-the-world.csv").iloc[:3]


df2 = pd.read_csv("files/countries-of-the-world.csv").iloc[3:6]
display(df1)
display(df2)
frames = [df1, df2]
result = pd.concat(frames)
display(result)
Inner join – consider only the intersection of the tables

How many rows are in the intersection?

Outer join – consider the union of the tables, filling missing values with NaN.

How many rows are in the union?


In [ ]: df1 = pd.read_csv("files/ex1.csv")
display(df1)
df2 = pd.read_csv("files/ex2.csv")
display(df2)

In [ ]: pd.concat([df1,df2], axis=1)


In [ ]: df1 = pd.read_csv("files/ex1.csv")
display(df1)
df2 = pd.read_csv("files/ex2.csv")
display(df2)

In [ ]: inner = pd.merge(df1, df2, on='Product', how = 'inner')


display(inner)

Kahoot! (7): How many rows in the printed table?

In [ ]: outer = pd.merge(df1, df2, on='Product', how = 'outer')


display(outer)
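To see where each row of a join came from, merge() accepts indicator=True, which adds a _merge column. The sketch below uses two small made-up tables (hypothetical products, prices and stock, not the contents of ex1.csv and ex2.csv).

```python
import pandas as pd

# Hypothetical tables, made up for illustration
left = pd.DataFrame({"Product": ["apple", "banana", "cherry"],
                     "Price": [1, 2, 3]})
right = pd.DataFrame({"Product": ["banana", "cherry", "date"],
                      "Stock": [20, 30, 40]})

# Inner join: only products that appear in both tables
inner = pd.merge(left, right, on="Product", how="inner")

# Outer join: all products; _merge says which table(s) each row came from
outer = pd.merge(left, right, on="Product", how="outer", indicator=True)

print(len(inner), len(outer))   # 2 4
print(outer)
```

Rows marked left_only or right_only are the ones whose missing columns were filled with NaN.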
3. Data analysis

Which country has the largest population?

1. Find the label of the row with the maximum value in the population column
(idxmax())
2. Get the country name of the row with the obtained label (loc)

In [ ]: label_max_pop = df["Population"].idxmax()
display(df.loc[label_max_pop, "Country"])
How many countries have more than 1M people?

1. Select all rows with Population > 1,000,000


2. Count the number of selected rows (use len() to get the num of rows)

In [ ]: print(len(df.loc[df['Population'] > 1000000]))

In [ ]: print(len(df[df['Population'] > 1000000]))

In [ ]: print((df['Population'] > 1000000).sum())


How many countries in Oceania have more than 1M people?

Select all countries from Oceania, with Population > 1,000,000


Count the number of selected rows (use len() to get the num of rows)

In [ ]: df = pd.read_csv("files/countries-of-the-world.csv").loc[:,["Country","Region",
"Population","Area"]]
df = df[(df['Population'] > 1000000) & (df['Region'] == "OCEANIA")]
display(df)
print(len(df))
Get all countries in Oceania with Deathrate > 7

In [ ]: df = pd.read_csv("files/countries-of-the-world.csv")
df = df[(df['Region'] == "OCEANIA") & (df['Deathrate'] > 7)]
display(df)
Sort the countries according to the population size

In [ ]: df = pd.read_csv("files/countries-of-the-world.csv")
df = df.sort_values(['Population'], ascending=True)
display(df)
How to sort countries with the same/NaN population values?

Break ties according to area size

In [ ]: df = pd.read_csv("files/countries-of-the-world.csv")
df = df.sort_values(['Population', 'Area'], ascending=True)
display(df)
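A sketch of how ties and NaN values behave when sorting, on toy values. Ties in the first column are broken by the second one, and rows with NaN in the sort key are placed last by default; the na_position parameter of sort_values() moves them to the front.

```python
import numpy as np
import pandas as pd

# Toy values, chosen only for illustration
df = pd.DataFrame({"Country": ["A", "B", "C", "D"],
                   "Population": [200.0, 100.0, np.nan, 100.0],
                   "Area": [5, 9, 1, 3]})

# B and D tie on Population, so Area decides; the NaN row goes last
nan_last = df.sort_values(["Population", "Area"])
print(nan_last["Country"].tolist())    # ['D', 'B', 'A', 'C']

# Same ordering, but with the NaN row first
nan_first = df.sort_values(["Population", "Area"], na_position="first")
print(nan_first["Country"].tolist())   # ['C', 'D', 'B', 'A']
```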
groupby - extra

What is the average population in each region?

Groupby region
Get the mean of the population column in every group

In [ ]: display(df.groupby(['Region'])['Population'].mean())
Which country in each region has the largest population?

1. Group by region
2. Get the country with the maximum population in every group

In [ ]: regions = df.groupby("Region")
for name, group in regions:
    label_max = group["Population"].idxmax()
    print(name, df.loc[label_max, "Country"])
Better yet:

In [ ]: display(df.loc[df.groupby(["Region"])['Population'].idxmax()])
Print the highest mean Deathrate among all regions

Compute the mean deathrate per region


Get the maximum value

In [ ]: df = pd.read_csv("files/countries-of-the-world.csv")
regions = df.groupby(['Region'])
print(regions['Deathrate'].mean().max())

display(df)
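The cell above prints only the value of the highest mean. As a small extension, idxmax() on the grouped result also tells which region it belongs to. The regions and rates below are made up for illustration.

```python
import pandas as pd

# Hypothetical regions and rates, made up for illustration
df = pd.DataFrame({"Region": ["X", "X", "Y", "Y"],
                   "Country": ["A", "B", "C", "D"],
                   "Deathrate": [4.0, 6.0, 10.0, 12.0]})

# Mean death rate per region: a Series indexed by region name
mean_by_region = df.groupby("Region")["Deathrate"].mean()

highest_mean = mean_by_region.max()        # the value
highest_region = mean_by_region.idxmax()   # the region it belongs to

print(highest_region, highest_mean)        # Y 11.0
```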
4. Data visualization
Plot a histogram of the GDP column: create a matplotlib figure from a DataFrame

In [ ]: df = pd.read_csv("files/countries-of-the-world.csv")
axarr = df.hist(column='GDP ($ per capita)', bins=10, grid=False, color='#86bf91')
for ax in axarr.flatten():
    ax.set_xlabel("GDP")
    ax.set_ylabel("Count")

plt.show()
Yet another (more useful) way: pass a Series object to matplotlib

In [ ]: df = pd.read_csv("files/countries-of-the-world.csv")
fig, ax = plt.subplots(1,1,figsize=(10,10))
ax.hist(df.loc[:,'GDP ($ per capita)'], bins=10, color='#86bf91',
label="GDP ($ per capita)")
ax.set_xlabel("GDP")
ax.set_ylabel("Count")
ax.legend()
plt.show()
Plot a histogram of the Birthrate and Deathrate columns: create a matplotlib figure from a
DataFrame

In [ ]: df = pd.read_csv("files/countries-of-the-world.csv")
df = df[["Birthrate", "Deathrate"]]
ax = df.plot.hist(bins=12, alpha=0.5) # alpha for transparent colors
plt.show()
Yet another (more useful) way: pass a Series object to matplotlib

In [ ]: df = pd.read_csv("files/countries-of-the-world.csv")
fig, ax = plt.subplots(1,1,figsize=(10,10))
# alpha for transparent colors
ax.hist(df["Birthrate"], bins=12, alpha=0.5, label='Birthrate')
ax.hist(df["Deathrate"], bins=12, alpha=0.5, label='Deathrate')
ax.legend()
plt.show()
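One thing to watch when calling ax.hist() once per column: with bins=12, each call picks its own bin edges from its own data range, so the bars of the two histograms may not line up. A possible remedy, sketched on toy values, is to compute one set of edges and pass it to both calls. The choice of 7 bins here is arbitrary.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Toy values, chosen only for illustration
df = pd.DataFrame({"Birthrate": [10.0, 20.0, 30.0, 40.0],
                   "Deathrate": [5.0, 8.0, 12.0, np.nan]})

# Common bin edges spanning both columns (pandas min/max skip NaN)
values = df[["Birthrate", "Deathrate"]]
edges = np.linspace(values.min().min(), values.max().max(), 8)   # 7 bins

fig, ax = plt.subplots()
for column in ["Birthrate", "Deathrate"]:
    # Drop NaN values before plotting and reuse the same edges
    ax.hist(df[column].dropna(), bins=edges, alpha=0.5, label=column)
ax.legend()
# Call plt.show() to display the figure
```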
Create a boxplot of the Infant mortality, Birthrate and Deathrate columns: create a matplotlib
figure from a DataFrame

In [ ]: df = pd.read_csv("files/countries-of-the-world.csv")
columns = ['Infant mortality (per 1000 births)','Birthrate','Deathrate']
boxplot = df.boxplot(column=columns)

plt.show()
Yet another (useful) way: pass a Series object to matplotlib

In [ ]: df = pd.read_csv("files/countries-of-the-world.csv")
fig, ax = plt.subplots(1,1,figsize=(10,10))
columns = ['Infant mortality (per 1000 births)', 'Birthrate','Deathrate']
ax.boxplot(df.loc[:,columns].dropna(axis=0), labels=columns)

plt.show()
Pandas summary

We saw how to:


Create a dataframe (from a file, dict, numpy array)
Get basic information (head(), tail(), dtypes, info())
Clean the data (add/remove columns/rows, fillna())
Merge tables (concat(), merge())
Analyze the dataset (select rows based on a condition, groupby(), mean(),
idxmax(), etc.)
Visualize the data (hist(), boxplot())

For more info, visit the pandas documentation website (https://pandas.pydata.org/pandas-docs/stable/).
The end! (except for the exam practice session 🤓)
