Rec 11
Data fetching
Data cleaning
Data analysis
Data visualization
What is the Pandas package used for?
Calculate statistics and answer questions about (mostly tabular) data
e.g., What's the average, median, max, or min of each column?
Clean the data by doing things like removing missing values and filtering rows or
columns by some criteria
Visualize the data with help from Matplotlib. Plot bars, lines, histograms, bubbles,
and more.
Store the cleaned, transformed data back into a CSV, another file, or a database
Dataframe
In [ ]: print(df.columns)
In [ ]: print(df["Name"])
In [ ]: df2["Dimension"].dtype
display(df2["Dimension"] + df2["Encounters"])
In [ ]: display(df.iloc[:2])
In [ ]: display(df.loc[:3])
In [ ]: print(type(df))
print(type(df.iloc[0]))
print(type(df.iloc[:,0]))
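The difference between positional slicing (iloc) and label slicing (loc) is easy to miss in the cells above, so here is a minimal sketch. The toy DataFrame and its values are invented for illustration; only the column names "Name" and "Encounters" come from the cells above.

```python
import pandas as pd

# Hypothetical toy data, not the dataset used in the recitation.
df = pd.DataFrame({
    "Name": ["a", "b", "c", "d", "e"],
    "Encounters": [3, 5, 1, 2, 4],
})

first_two_by_position = df.iloc[:2]  # positions 0 and 1 (end excluded)
up_to_label_three = df.loc[:3]       # labels 0, 1, 2 and 3 (end included)

print(len(first_two_by_position))    # 2
print(len(up_to_label_three))        # 4

row = df.iloc[0]     # a single row is returned as a Series
col = df.iloc[:, 0]  # a single column is also a Series
print(type(df).__name__, type(row).__name__, type(col).__name__)
```

With the default integer index, the labels happen to coincide with the positions, which is why the two slices look so similar.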
Example: Countries of the world dataset
Analysis steps
Read a csv
In [ ]: import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("files/countries-of-the-world.csv")
In [ ]: df.to_csv("files/countries-of-the-world_out.csv")
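One detail worth knowing when writing a CSV: by default to_csv also writes the row index, which comes back as an extra column when the file is read again. The sketch below uses an in-memory buffer and invented toy data instead of the real file, so it can run anywhere.

```python
import io
import pandas as pd

# Hypothetical toy data standing in for the countries table.
df = pd.DataFrame({"Country": ["A", "B"], "Population": [10, 20]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False writes only the data columns.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))  # includes an "Unnamed: 0" column
print(list(reloaded_clean.columns))       # ['Country', 'Population']
```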
In [ ]: display(df.head(3))
In [ ]: display(df.tail(5))
In [ ]: display(df.sample(5))
Get statistics
In [ ]: df.info()
In [ ]: display(df.dtypes)
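Besides info() and dtypes, the questions listed earlier (average, median, max, or min of each column) can be answered directly. This is a minimal sketch on invented toy data; the column names mirror the dataset, but the values are made up.

```python
import pandas as pd

# Hypothetical toy data; None becomes NaN in the float column.
df = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "Population": [100, 200, 300, 1000],
    "Area": [10.0, 20.0, None, 40.0],
})

numeric = df[["Population", "Area"]]

# One row per statistic, one column per numeric column; NaN is skipped.
summary = numeric.agg(["mean", "median", "min", "max"])
print(summary)

# describe() gives count, mean, std, min, quartiles and max in one call.
print(df.describe())
```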
2. Data cleaning
Fill missing values
Option 1: replace nan values with 0 (or any other constant value)
In [ ]: df = pd.read_csv("files/countries-of-the-world.csv").loc[
    [0, 1, 2, 3], ["Country", "Region", "Population", "Area"]]
display(df)
df = df.fillna(0)
display(df)
Option 2: replace nan values with the average of the column
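In [ ]: df = pd.read_csv("files/countries-of-the-world.csv").loc[
    [0, 1, 2, 3], ["Country", "Region", "Population", "Area"]]
display(df)
# Assign the result back: fillna(inplace=True) on a selected column may not update df
df['Population'] = df['Population'].fillna(df['Population'].mean())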
print(df['Population'].mean())
display(df)
Option 3: drop all rows that have any NaN value
In [ ]: df = pd.read_csv("files/countries-of-the-world.csv").loc[
    [0, 1, 2, 3], ["Country", "Region", "Population", "Area"]]
display(df)
df = df.dropna()
display(df)
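The three options affect later statistics differently. The following sketch compares them side by side on a small invented table (the values are not from the dataset), so the effect on the column mean is visible.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing population value.
df = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "Population": [100.0, np.nan, 300.0, 200.0],
})

# Option 1: constant fill. The zero pulls the mean down.
filled_zero = df.fillna(0)

# Option 2: mean fill. The column mean stays unchanged.
filled_mean = df.assign(
    Population=df["Population"].fillna(df["Population"].mean())
)

# Option 3: drop incomplete rows. The table gets shorter.
dropped = df.dropna()

print(filled_zero["Population"].mean())  # 150.0
print(filled_mean["Population"].mean())  # 200.0
print(len(dropped))                      # 3
```

Which option is appropriate depends on the analysis; none of them is a universal default.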
Convert square miles to square kilometers
In [ ]: df = pd.read_csv("files/countries-of-the-world.csv")
# Apply this lambda function to every cell in the Area column (1 sq mile is about 2.59 sq km)
df['Area km'] = df['Area'].apply(lambda x: x*2.59)
display(df.loc[:,["Area km","Area"]])
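For a simple arithmetic conversion, apply with a lambda is not strictly needed: multiplying the Series directly gives the same result and is usually faster. A small sketch, assuming an approximate factor of 2.59 square kilometers per square mile and invented area values:

```python
import pandas as pd

SQ_MILE_TO_SQ_KM = 2.59  # approximate conversion factor (assumption)

# Hypothetical areas in square miles.
area = pd.Series([1.0, 10.0, 250.0], name="Area")

via_apply = area.apply(lambda x: x * SQ_MILE_TO_SQ_KM)  # element by element
vectorized = area * SQ_MILE_TO_SQ_KM                    # whole column at once

print(via_apply.equals(vectorized))  # True
```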
Add a new column
In [ ]: df = pd.read_csv("files/countries-of-the-world.csv").loc[
    [0, 1, 2, 3], ["Country", "Region", "Population", "Area", "Birthrate", "Deathrate"]]
display(df)
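A new column is typically created by assigning to a column name that does not exist yet. The sketch below derives one from Birthrate and Deathrate; the column name "Natural growth" and the values are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy data with the two rate columns.
df = pd.DataFrame({
    "Country": ["A", "B"],
    "Birthrate": [46.6, 15.11],
    "Deathrate": [20.34, 5.22],
})

# Assigning to a new key adds the column; the arithmetic is element-wise.
df["Natural growth"] = df["Birthrate"] - df["Deathrate"]
print(df)
```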
In [ ]: df = pd.read_csv("files/countries-of-the-world.csv").loc[
    [0, 1, 2, 3], ["Country", "Region", "Population", "Area"]]
israel = {"Country":"Israel", "Region":"ASIA","Population": 8000000}
# DataFrame.append was removed in pandas 2.0; pd.concat adds the row instead
df = pd.concat([df, pd.DataFrame([israel])], ignore_index=True)
display(df.iloc[4].loc['Area'])
In [ ]: display(df)
NaNs
In [ ]: nan_val=df.iloc[4].loc['Area']
print(nan_val)
In [ ]: print(np.nan=="nan")
print(np.nan=="NaN")
print(nan_val==np.nan)
print(np.isnan(nan_val))
print(pd.isnull(nan_val))
print(np.isnan(np.nan))
print(pd.isnull(np.nan))
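The comparisons above behave as they do because NaN is a floating-point value that is not equal to anything, including itself. A short sketch of the consequences, using only NumPy and pandas calls:

```python
import numpy as np
import pandas as pd

nan_val = np.nan

print(nan_val == np.nan)   # False: NaN never compares equal
print(nan_val != nan_val)  # True: the classic self-inequality test
print(np.isnan(nan_val))   # True
print(pd.isnull(nan_val))  # True

# pd.isnull also accepts None; np.isnan(None) would raise a TypeError.
print(pd.isnull(None))     # True

# On a Series, isna() marks the missing entries so they can be counted.
s = pd.Series([1.0, np.nan, 3.0])
missing_count = int(s.isna().sum())
print(missing_count)       # 1
```

In practice this means missing values should be detected with isna or isnull, not with ==.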
Delete the Area column
In [ ]: df = pd.read_csv("files/countries-of-the-world.csv").loc[
    [0, 1, 2, 3], ["Country", "Region", "Population", "Area"]]
display(df)
df = df.drop("Area", axis=1)
# axis=1 selects columns. Use df.drop(["A", "B"], axis=1) to drop both the A and B columns
display(df)
Delete the country Angola
In [ ]: df = pd.read_csv("files/countries-of-the-world.csv").loc[
    [0, 1, 2, 3], ["Country", "Region", "Population", "Area"]]
df = df[df.Country != "Angola"]
display(df)
# This is equivalent
# df = df[df['Country'] != "Angola"]
Keep only the countries Angola and Afghanistan
In [ ]: df = pd.read_csv("files/countries-of-the-world.csv").loc[
    [0, 1, 2, 3], ["Country", "Region", "Population", "Area"]]
display(df)
countries = ["Angola","Afghanistan"]
df = df[df.Country.isin(countries)]
display(df)
What will happen if we add ~ before isin ?
In [ ]: df = pd.read_csv("files/countries-of-the-world.csv").loc[
    [0, 1, 2, 3], ["Country", "Region", "Population", "Area"]]
display(df)
countries = ["Angola","Afghanistan"]
df = df[~df.Country.isin(countries)] # ~ is for not.
display(df.loc[:,"Country"])
Join Tables
Given a new table with the same column names, merge the two tables into a single
table.
Outer join: take the union of the two tables and fill missing values with NaN.
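No code accompanies this step, so the following is only one possible sketch. It assumes the tables are stacked with pd.concat, where join="outer" keeps the union of the columns; the tables and their values are invented, and the second one deliberately lacks the Area column to show the NaN fill.

```python
import pandas as pd

# Hypothetical tables; the second has no "Area" column.
first = pd.DataFrame({
    "Country": ["A", "B"],
    "Population": [10, 20],
    "Area": [1.5, 2.5],
})
second = pd.DataFrame({
    "Country": ["C"],
    "Population": [30],
})

# join="outer" (the default) keeps every column found in either table.
combined = pd.concat([first, second], join="outer", ignore_index=True)
print(combined)
```

If the tables should instead be matched row by row on a key such as Country, pd.merge with how="outer" would be the closer fit.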
1. Find the label of the row with the maximum value in the population column
(idxmax())
2. Get the country name of the row with the obtained label (loc)
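The two steps above can be sketched as follows. The toy table uses invented values and a non-default index, to make clear that idxmax returns a row label rather than a position.

```python
import pandas as pd

# Hypothetical toy data with explicit row labels.
df = pd.DataFrame(
    {"Country": ["A", "B", "C"], "Population": [10, 50, 30]},
    index=[100, 101, 102],
)

# Step 1: label of the row holding the maximum population.
label = df["Population"].idxmax()

# Step 2: look up the country name at that label.
name = df.loc[label, "Country"]

print(label, name)  # 101 B
```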
In [ ]: df = pd.read_csv("files/countries-of-the-world.csv").loc[
    :, ["Country", "Region", "Population", "Area"]]
df = df[(df['Population'] > 1000000) & (df['Region'] == "OCEANIA")]
display(df)
print(len(df))
Get all countries in Oceania with Deathrate > 7
In [ ]: df = pd.read_csv("files/countries-of-the-world.csv")
df = df[(df['Region'] == "OCEANIA") & (df['Deathrate'] > 7)]
display(df)
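Filters like the one above are built from boolean masks. The sketch below, on invented data, names the mask explicitly; the parentheses around each comparison are required because & binds more tightly than == and >.

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Region": ["OCEANIA", "OCEANIA", "ASIA"],
    "Deathrate": [7.5, 4.0, 9.0],
})

# Each comparison yields a boolean Series; & combines them element-wise.
mask = (df["Region"] == "OCEANIA") & (df["Deathrate"] > 7)
selected = df[mask]

print(list(mask))                 # [True, False, False]
print(list(selected["Country"]))  # ['A']
```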
Sort the countries according to the population size
In [ ]: df = pd.read_csv("files/countries-of-the-world.csv")
df = df.sort_values(['Population'], ascending=True)
display(df)
How to sort countries with the same/NaN population values?
In [ ]: df = pd.read_csv("files/countries-of-the-world.csv")
df = df.sort_values(['Population', 'Area'], ascending=True)
display(df)
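To make the tie and NaN behaviour concrete, here is a small sketch with invented values. Ties in the first key are broken by the second key, and rows with NaN in the sort key are placed last unless na_position says otherwise.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: B and C tie on Population, D has no value.
df = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "Population": [20.0, 10.0, 10.0, np.nan],
    "Area": [5, 9, 3, 1],
})

# The tie between B and C is broken by Area; NaN goes last by default.
ordered = df.sort_values(["Population", "Area"], ascending=True)
print(list(ordered["Country"]))    # ['C', 'B', 'A', 'D']

# na_position="first" moves rows with NaN in the sort key to the top.
nan_first = df.sort_values(["Population", "Area"], na_position="first")
print(list(nan_first["Country"]))  # ['D', 'C', 'B', 'A']
```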
groupby - extra
Groupby region
Get the mean of the population column in every group
In [ ]: display(df.groupby(['Region'])['Population'].mean())
Which country in each region has the largest population?
1. Group by region
2. Get the country with the maximum population in every group
In [ ]: display(df.loc[df.groupby(["Region"])['Population'].idxmax()])
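The one-liner above combines several steps. This sketch unpacks them on an invented table with two hypothetical regions, X and Y.

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "Region": ["X", "X", "Y", "Y"],
    "Population": [10, 30, 50, 20],
})

groups = df.groupby("Region")["Population"]

# Mean population per region.
mean_population = groups.mean()
print(mean_population)

# idxmax returns, per region, the row label of the largest population.
labels = groups.idxmax()

# loc then pulls the full rows for those labels.
largest = df.loc[labels, ["Region", "Country", "Population"]]
print(largest)
```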
Print the highest mean Deathrate among all regions
In [ ]: df = pd.read_csv("files/countries-of-the-world.csv")
regions = df.groupby(['Region'])
print(regions['Deathrate'].mean().max())
display(df)
4. Data visualization
Plot a histogram of the GDP column: create a matplotlib figure from a DataFrame
In [ ]: df = pd.read_csv("files/countries-of-the-world.csv")
axarr = df.hist(column='GDP ($ per capita)', bins=10, grid=False, color='#86bf91')
for ax in axarr.flatten():
    ax.set_xlabel("GDP")
    ax.set_ylabel("Count")
plt.show()
Yet another (more useful) way: pass a Series object to matplotlib
In [ ]: df = pd.read_csv("files/countries-of-the-world.csv")
fig, ax = plt.subplots(1,1,figsize=(10,10))
ax.hist(df.loc[:,'GDP ($ per capita)'], bins=10, color='#86bf91',
        label="GDP ($ per capita)")
ax.set_xlabel("GDP")
ax.set_ylabel("Count")
ax.legend()
plt.show()
Plot a histogram of the Birthrate and Deathrate columns: create a matplotlib figure from
a DataFrame
In [ ]: df = pd.read_csv("files/countries-of-the-world.csv")
df = df[["Birthrate", "Deathrate"]]
ax = df.plot.hist(bins=12, alpha=0.5) # alpha for transparent colors
plt.show()
Yet another (more useful) way: pass a Series object to matplotlib
In [ ]: df = pd.read_csv("files/countries-of-the-world.csv")
fig, ax = plt.subplots(1,1,figsize=(10,10))
# alpha makes the colors transparent
ax.hist(df["Birthrate"], bins=12, alpha=0.5, label='Birthrate')
ax.hist(df["Deathrate"], bins=12, alpha=0.5, label='Deathrate')
ax.legend()
plt.show()
Create a boxplot of the Infant mortality, Birthrate and Deathrate columns: create a matplotlib
figure from a DataFrame
In [ ]: df = pd.read_csv("files/countries-of-the-world.csv")
columns = ['Infant mortality (per 1000 births)','Birthrate','Deathrate']
boxplot = df.boxplot(column=columns)
plt.show()
Yet another (useful) way: pass a Series object to matplotlib
In [ ]: df = pd.read_csv("files/countries-of-the-world.csv")
fig, ax = plt.subplots(1,1,figsize=(10,10))
columns = ['Infant mortality (per 1000 births)', 'Birthrate', 'Deathrate']
ax.boxplot(df.loc[:,columns].dropna(axis=0), labels=columns)
plt.show()
Pandas summary