Justenoughpython Pandas 220915 175329
DR. ALEX NG
Pandas
What's Pandas for?
For example, say you want to explore a dataset stored in a CSV on your
computer. Pandas will extract the data from that CSV into a DataFrame —
a table, basically — then let you do things like:
2. Clean the data by doing things like removing missing values and
filtering rows or columns by some criteria
3. Visualize the data with help from Matplotlib. Plot bars, lines,
histograms, bubbles, and more.
4. Store the cleaned, transformed data back into a CSV, another file, or a
database
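The steps above can be sketched end to end. This is only a minimal illustration: the CSV text, the column names, and the salary threshold are made up, and an in-memory buffer stands in for a file on disk.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on your computer.
csv_text = """name,dept,salary
Ann,Sales,5200
Bob,Sales,
Cho,IT,6100
Dee,IT,4800
"""

# Extract the data from the CSV into a DataFrame.
df = pd.read_csv(io.StringIO(csv_text))

# Clean: remove rows with missing values, then filter rows by a criterion.
clean = df.dropna()
clean = clean[clean["salary"] > 5000]

# Store the cleaned data back as CSV (pass a file path to write to disk).
out = io.StringIO()
clean.to_csv(out, index=False)
result_csv = out.getvalue()
print(result_csv)
```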
Pandas (Panel Data) – File manipulation
Installing pandas
pip install pandas
pip install tabulate
Import
import pandas as pd
Series / DataFrame
A Series is one-dimensional data
A DataFrame is two-dimensional data
You can specify the index labels
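A small sketch of both structures with explicit index labels. The values and the labels ("a", "r1", and so on) are invented for illustration.

```python
import pandas as pd

# A Series: one-dimensional data with explicit index labels.
s = pd.Series([10, 20, 30], index=["a", "b", "c"])

# A DataFrame: two-dimensional data; each column is a Series.
df = pd.DataFrame(
    {"rank": ["Prof", "AsstProf"], "salary": [120000, 80000]},
    index=["r1", "r2"],
)

print(s["b"])        # look up a value by its label
print(df.loc["r2"])  # one row, selected by its label
print(df.shape)      # (number of rows, number of columns)
```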
Try:
df.head(8)  Is this what you expected?
df.tail(8)  Is this what you expected?
What if there are a large number of rows?
Check number of records (rows)
Why add 1?
(because the end of a range is exclusive)
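A short sketch of head, tail, and the row count on a made-up five-row frame. The snippet that "Why add 1?" refers to is not in the text, so the last line shows one plausible reading only: numbering records from 1 with range(), which excludes its stop value.

```python
import pandas as pd

df = pd.DataFrame({"n": range(100, 105)})  # 5 rows, labels 0..4

first = df.head(8)  # asks for 8 rows, but only 5 exist, so all 5 come back
last = df.tail(2)   # the last 2 rows
n_rows = len(df)    # number of records; same as df.shape[0]

# range() stops before its second argument, hence the + 1 (assumed reading).
record_numbers = list(range(1, n_rows + 1))
```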
Column Selection
When selecting one column, it is possible to use single set of brackets, but
the resulting object will be a Series (not a DataFrame):
In [ ]: # Select column salary:
df['salary']
When we need to select more than one column and/or want the output to
be a DataFrame, we should use double brackets:
In [ ]: # Select columns rank and salary:
df[['rank','salary']]
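The difference between single and double brackets can be checked directly. The frame below is a made-up stand-in for the salary dataset used in the slides.

```python
import pandas as pd

# Hypothetical data with the column names used above.
df = pd.DataFrame(
    {
        "rank": ["Prof", "AssocProf", "AsstProf"],
        "sex": ["Male", "Female", "Female"],
        "salary": [120000, 95000, 80000],
    }
)

one = df["salary"]            # single brackets: a Series
two = df[["rank", "salary"]]  # double brackets: a DataFrame
single_as_frame = df[["salary"]]  # one column, still a DataFrame

print(type(one).__name__, type(two).__name__, type(single_as_frame).__name__)
```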
If we need to select a range of rows, we can specify the range using ":"
Notice that the first row has position 0, and the last value in the range is
omitted:
So for the range 0:10, the first 10 rows are returned, with positions starting
at 0 and ending at 9
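A minimal check of the 0:10 slice on a made-up 15-row frame:

```python
import pandas as pd

df = pd.DataFrame({"salary": range(1000, 1015)})  # 15 rows, positions 0..14

subset = df[0:10]  # rows at positions 0 through 9; position 10 is omitted

print(len(subset))
print(subset.index[0], subset.index[-1])
```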
If we need to select a range of rows using their labels, we can use the loc
method. Unlike positional slicing, a label range includes both of its ends:
In [ ]: # Select rows by their labels:
df_sub.loc[10:20,['rank','sex','salary']]
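A sketch of label-based selection. How df_sub was built is not shown in the slides; here it is assumed to be every second row of a made-up 30-row frame, so its labels are 0, 2, 4, and so on.

```python
import pandas as pd

# Hypothetical data; only the column names come from the slides.
df = pd.DataFrame(
    {
        "rank": ["Prof", "AssocProf", "AsstProf"] * 10,
        "sex": ["Female", "Male"] * 15,
        "salary": [50000 + 1000 * i for i in range(30)],
    }
)

# Assumed subset: every second row, keeping the original labels 0, 2, ..., 28.
df_sub = df.iloc[::2]

# loc works on labels, and both ends of the range (10 and 20) are included.
by_label = df_sub.loc[10:20, ["rank", "sex", "salary"]]
print(by_label)
```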
If we need to select a range of rows and/or columns using their positions,
we can use the iloc method:
In [ ]: # Select rows and columns by their positions:
df_sub.iloc[10:20,[0, 3, 4, 5]]
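A sketch of position-based selection on a made-up frame. The six column names "a" to "f" and the way df_sub is built (every second row of 40) are assumptions; the point is that iloc ignores labels and counts positions, with the end of the range excluded.

```python
import pandas as pd

# Hypothetical 40-row frame with six columns.
df = pd.DataFrame({name: range(40) for name in ["a", "b", "c", "d", "e", "f"]})

# Assumed subset: every second row, so labels are 0, 2, ..., 38 (20 rows).
df_sub = df.iloc[::2]

# Rows at positions 10..19 and columns at positions 0, 3, 4, 5.
by_pos = df_sub.iloc[10:20, [0, 3, 4, 5]]
print(by_pos)
```

The first returned row sits at position 10 but carries label 20, which is why loc and iloc can give different results on the same frame.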
Data Frames: method iloc (summary)
Mini-exercise 1
Try:
panda\ex1.py
Work with Excel / CSV
We can open a file using:
with open("omni2_1963.dat") as f:
    print(f.readline())  # print the first line of the file
Demo: panda\openfile.py
Dataset:
panda\omni2_1963.dat
Reading Excel
import pandas as pd
data = pd.read_excel("xxxx.xlsx")
data.head()  # or data.head(10)
data = pd.read_excel("xxxx.xlsx", sheet_name=0)
Reading CSV
data = pd.read_csv("xxxx.csv")
data = pd.read_csv("xxxx.csv", nrows=9)  # read only the first 9 rows
data.head()  # or data.head(10)
data = pd.read_csv("xxxx.csv", delimiter=",", encoding="utf-8")
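The delimiter and nrows options can be tried without a file by reading from an in-memory buffer. The semicolon-separated text below is invented for illustration.

```python
import io
import pandas as pd

# Hypothetical semicolon-separated content standing in for a CSV file.
text = "id;city;temp\n1;Oslo;4.5\n2;Rome;15.0\n3;Cairo;22.5\n"

# delimiter tells read_csv which character separates the fields.
data = pd.read_csv(io.StringIO(text), delimiter=";")

# nrows limits how many data rows are read.
first_two = pd.read_csv(io.StringIO(text), delimiter=";", nrows=2)

print(data)
print(first_two)
```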
Demo:
panda\openremote.py
Write the DataFrame to a CSV file using the pandas to_csv method
Demo:
panda\openremote.py
df.to_excel()
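A minimal round trip with to_csv, writing to a temporary directory so nothing is left behind. The data and the file name out.csv are made up. df.to_excel(path) is called the same way but needs an Excel writer package such as openpyxl installed, so it is not run here.

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({"model": ["A", "B"], "price": [10.5, 20.0]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "out.csv")
    # index=False leaves the row labels out of the file.
    df.to_csv(path, index=False)
    # Read the file back to confirm what was written.
    round_trip = pd.read_csv(path)

print(round_trip)
```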
Mini-exercise 2
1. From df, select the 'Manufacturer', 'Model' and 'Type' columns for
every 20th row, starting from the 1st row (row 0). The dataset is Cars93.csv
print(df) / print(df.to_markdown())
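One possible approach, sketched on a made-up frame because Cars93.csv is not included here: a slice with a step of 20 picks rows 0, 20, 40, and so on, and a list of names picks the columns. The 45 rows and the extra Price column are invented; with the real file you would start from pd.read_csv("Cars93.csv").

```python
import pandas as pd

# Hypothetical stand-in for Cars93.csv; only the column names are taken
# from the exercise text.
df = pd.DataFrame(
    {
        "Manufacturer": [f"Maker{i}" for i in range(45)],
        "Model": [f"Model{i}" for i in range(45)],
        "Type": ["Small", "Midsize", "Large"] * 15,
        "Price": [10.0 + i for i in range(45)],
    }
)

# Every 20th row starting at row 0, restricted to three columns.
result = df.loc[::20, ["Manufacturer", "Model", "Type"]]
print(result)
```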
Demo:
panda\ex2.py
Handle Missing Value
Dirty data
Check whether there are any missing values
Common operations
dropna()
fillna(0)
Replace all NaN values
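A small sketch of the two operations on made-up data: dropna() removes rows that contain a missing value, and fillna(0) replaces every NaN with 0.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 5.0, 6.0]})

dropped = data.dropna()  # keeps only rows with no missing values
filled = data.fillna(0)  # replaces all NaN values with 0

print(dropped)
print(filled)
```

Both methods return a new DataFrame and leave data unchanged.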
#method 1 - data.isna()
Returns a table of boolean values with the same shape as the
dataset; each cell is True where the value is null.
#method 2 - data.isna().any()
Returns one boolean value per column: True if that column
contains at least one null value.
#method 3 - data.isna().sum()
Returns the number of null values in each column.
#method 4 - data.isna().any().sum()
Returns a single number: how many columns contain at least one
null value. To test whether any null is present at all, use
data.isna().any().any().
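The four checks can be compared side by side on a small made-up frame with three missing values spread over two of its three columns.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    {
        "a": [1.0, np.nan, 3.0],
        "b": [np.nan, np.nan, 6.0],
        "c": [7, 8, 9],
    }
)

cell_mask = data.isna()                    # method 1: True/False per cell
col_has_na = data.isna().any()             # method 2: True/False per column
col_counts = data.isna().sum()             # method 3: null count per column
n_cols_with_na = data.isna().any().sum()   # method 4: columns having a null

any_at_all = data.isna().any().any()       # is any value missing at all?
total_missing = data.isna().sum().sum()    # total number of missing values

print(col_has_na)
print(col_counts)
print(n_cols_with_na, any_at_all, total_missing)
```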
Part 2 Handling NaN
Matplotlib – Visualizing your data
Graph Types
import matplotlib.pyplot as plt
y = [1, 5, 3, 5, 7, 8]
x = [1, 2, 3, 4, 5, 20]
plt.plot(x, y)
plt.show()
Demo:
Matplotlib\Plot1.py
Multiple lines
Demo:
Matplotlib\Plot2.py
Demo:
Matplotlib\Plot3.py
import matplotlib.pyplot as plt
X = [590,540,740,130,810,300,320,230,470,620,770,250]
Y = [32,36,39,52,61,72,77,75,68,57,48,48]
ylabel
xlabel
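A sketch of labelling the axes for the X and Y lists above. The slide does not say which chart type Plot3.py draws or what the values mean, so the scatter plot and the label text "X" and "Y" are assumptions. The Agg backend is selected only so the sketch runs without a display; in an interactive session, drop that line and call plt.show().

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; assumption for headless runs
import matplotlib.pyplot as plt

X = [590, 540, 740, 130, 810, 300, 320, 230, 470, 620, 770, 250]
Y = [32, 36, 39, 52, 61, 72, 77, 75, 68, 57, 48, 48]

plt.scatter(X, Y)  # assumed chart type
plt.xlabel("X")    # label for the horizontal axis
plt.ylabel("Y")    # label for the vertical axis

ax = plt.gca()     # the current axes, used here only to inspect the labels
```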
subplots()
Demo:
Matplotlib\Plot5.py
plt.subplot(1, 2, 1)  # 1 row, 2 columns, index 1
plt.subplot(1, 2, 2)  # 1 row, 2 columns, index 2
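A minimal sketch of two plots side by side using the (rows, columns, index) arguments. The data, titles, and figure size are made up, and the Agg backend is chosen only so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; assumption for headless runs
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]

plt.figure(figsize=(8, 3))

plt.subplot(1, 2, 1)              # 1 row, 2 columns, first panel
plt.plot(x, [v * 2 for v in x])
plt.title("index 1")

plt.subplot(1, 2, 2)              # 1 row, 2 columns, second panel
plt.plot(x, [v ** 2 for v in x])
plt.title("index 2")

plt.tight_layout()                # keep titles and tick labels from overlapping
n_axes = len(plt.gcf().axes)      # number of panels in the current figure
```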
Demo:
Matplotlib\Plot6.py
Open file and plot the graph
Demo:
Matplotlib\csvplot2.py
DataSet: Matplotlib\data.csv
Stock close price
Demo:
Matplotlib\stockplot.py
DataSet: matplotlib\0388.HK.csv
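A sketch of reading price data and plotting the close price. The layout of 0388.HK.csv is not shown in the slides, so the Date and Close column names are assumptions, and the three rows below are invented placeholder values rather than real quotes. With the real file, pass its path to read_csv instead of the in-memory buffer.

```python
import io
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; assumption for headless runs
import pandas as pd

# Placeholder rows; column names are assumed, values are not real prices.
csv_text = """Date,Close
2022-09-01,310.0
2022-09-02,305.4
2022-09-05,301.2
"""

# parse_dates turns the Date column into real dates for the x-axis.
prices = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

# DataFrame.plot draws with Matplotlib and returns the axes object.
ax = prices.plot(x="Date", y="Close", title="Close price")
ax.set_ylabel("Close")
```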
Demo:
Matplotlib\matplot1.py
Demo:
Matplotlib\matplot2.py
Seaborn
Better Visualization
Seaborn:
▪ based on matplotlib
Link:
https://seaborn.pydata.org/
Demo:
seaborn\sea1.py
Titanic dataset
Confusion matrix
Demo:
seaborn\sea2.py
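A sketch of drawing a confusion matrix as a Seaborn heatmap. The contents of sea2.py are not shown, so this is not the demo itself: the actual and predicted labels are invented, and pd.crosstab is used here as one simple way to count the combinations.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; assumption for headless runs
import pandas as pd
import seaborn as sns

# Hypothetical labels: 1 = positive class, 0 = negative class.
actual = pd.Series([1, 0, 1, 1, 0, 0, 1, 0], name="actual")
predicted = pd.Series([1, 0, 0, 1, 0, 1, 1, 0], name="predicted")

# Rows are actual classes, columns are predicted classes.
cm = pd.crosstab(actual, predicted)

# annot=True writes the counts in the cells; fmt="d" formats them as integers.
ax = sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
ax.set_title("Confusion matrix")
```

The diagonal cells hold the correct predictions; the off-diagonal cells hold the errors.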