Justenoughpython Pandas 220915 175329

The document provides an overview of using the Pandas library in Python for data manipulation, including reading and writing CSV and Excel files, handling missing values, and visualizing data with Matplotlib. It explains key concepts such as DataFrames and Series, data selection methods, and common operations for data cleaning and analysis. Additionally, it introduces Seaborn for enhanced data visualization.
PYTHON PROGRAMMING

JUST ENOUGH PYTHON II

DR. ALEX NG
Pandas

What's Pandas for?
For example, say you want to explore a dataset stored in a CSV on your
computer. Pandas will extract the data from that CSV into a DataFrame —
a table, basically — then let you do things like:

1. Calculate statistics and answer questions about the data, like:


What's the average, median, max, or min of each column?
Does column A correlate with column B?
What does the distribution of data in column C look like?

2. Clean the data by doing things like removing missing values and
filtering rows or columns by some criteria

3. Visualize the data with help from Matplotlib. Plot bars, lines,
histograms, bubbles, and more.

4. Store the cleaned, transformed data back into a CSV, other file or
database
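
A minimal sketch of steps 1, 2 and 4, assuming a tiny made-up CSV (the column names a, b, c and all values are hypothetical). The CSV text is read from memory here so the example runs without a file on disk; plotting (step 3) is left out.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on your computer.
csv_text = """a,b,c
1,2.0,10
2,4.1,
3,6.2,30
4,7.9,40
"""

df = pd.read_csv(io.StringIO(csv_text))

# 1. Statistics: per-column mean, and the correlation between a and b.
means = df.mean()
corr_ab = df["a"].corr(df["b"])

# 2. Cleaning: drop rows with a missing value, then filter rows by a condition.
clean = df.dropna()
filtered = clean[clean["a"] > 1]

# 4. Storing: with no path given, to_csv returns the CSV as a string.
out_csv = filtered.to_csv(index=False)

print(means)
print(corr_ab)
print(out_csv)
```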
Pandas (Panel Data) – File manipulation

Installing pandas
pip install pandas
pip install tabulate

Import
import pandas as pd
Series / DataFrame
A Series is one-dimensional data.
A DataFrame is two-dimensional data.
You can specify the index labels.

If you do not specify them, they are generated automatically.


Do you remember what this data type is?

If not specified, index will start from 0
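
A small sketch of the points above, with made-up values and labels: a Series with explicit index labels, a Series with the default index, and a DataFrame built from a dict.

```python
import pandas as pd

# Series with explicit index labels (values and labels are made up).
s_labeled = pd.Series([10, 20, 30], index=["a", "b", "c"])

# Series without an index argument: pandas generates 0, 1, 2, ...
s_default = pd.Series([10, 20, 30])

# DataFrame built from a dict of columns; the row index again starts at 0.
df = pd.DataFrame({"name": ["Ann", "Bob"], "age": [31, 25]})

print(s_labeled["b"])         # look up a value by its label
print(list(s_default.index))  # the auto-generated index
print(df.shape)               # (rows, columns)
```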


Do you remember what these two functions are?

What does df look like?

You can check the shape of the data frame.
df.head() shows first 5 rows

df.tail() shows last 5 rows

Try:
df.head(8) what do you expect?
df.tail(8) what do you expect?
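
To check your answer, here is a short sketch on a hypothetical 20-row frame (the column n is made up):

```python
import pandas as pd

# Hypothetical 20-row frame so that head/tail have something to cut.
df = pd.DataFrame({"n": range(20)})

first5 = df.head()    # default: first 5 rows
last5 = df.tail()     # default: last 5 rows
first8 = df.head(8)   # first 8 rows
last8 = df.tail(8)    # last 8 rows

print(first8)
print(last8)
```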
What if you have a large number of rows?
Check the number of records (rows)
Why add 1?
(because the end value of a range is exclusive)
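
A minimal sketch, assuming the "add 1" refers to numbering the rows from 1 with range(1, len(df) + 1); the data is made up.

```python
import pandas as pd

df = pd.DataFrame({"city": ["Hong Kong", "Tokyo", "Seoul"]})

# Number of records (rows); df.shape[0] gives the same value.
n_rows = len(df)

# range() excludes its end value, so the end must be n_rows + 1
# for the labels to run from 1 up to and including n_rows.
df.index = range(1, n_rows + 1)

print(n_rows)
print(df)
```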
Column Selection
When selecting one column, it is possible to use a single set of brackets, but
the resulting object will be a Series (not a DataFrame):

In [ ]: # Select column salary:
df['salary']

When we need to select more than one column and/or make the output
a DataFrame, we should use double brackets:
In [ ]: # Select columns rank and salary:
df[['rank','salary']]
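
A short sketch showing the difference in result type. The column names rank, sex and salary come from the slides; the rows are made up.

```python
import pandas as pd

# Made-up rows; only the column names come from the slides.
df = pd.DataFrame({
    "rank": ["Prof", "AsstProf"],
    "sex": ["Male", "Female"],
    "salary": [120000, 80000],
})

one = df["salary"]            # single brackets -> Series
two = df[["rank", "salary"]]  # double brackets -> DataFrame
as_frame = df[["salary"]]     # one column, but still a DataFrame

print(type(one).__name__, type(two).__name__, type(as_frame).__name__)
```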

If we need to select a range of rows, we can specify the range using ":"

In [ ]: # Select rows by their position:
df[10:20]

Notice that the first row has position 0, and the last value in the range is
omitted.
So for the range 0:10, the first 10 rows are returned, with positions starting
at 0 and ending at 9
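
A quick check of this rule on a hypothetical 30-row frame (the column value is made up):

```python
import pandas as pd

df = pd.DataFrame({"value": range(100, 130)})  # 30 made-up rows

rows_10_to_19 = df[10:20]  # positions 10..19; position 20 is omitted
first_10 = df[0:10]        # positions 0..9

print(list(rows_10_to_19.index))
print(list(first_10.index))
```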

If we need to select a range of rows using their labels, we can use the method
loc:
In [ ]: # Select rows by their labels:
df_sub.loc[10:20,['rank','sex','salary']]
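
A sketch of how label-based selection behaves, assuming df_sub is a filtered frame whose labels no longer match row positions. The data and the filter are made up. Note that loc includes the end label, unlike position-based slicing.

```python
import pandas as pd

# Made-up data: 30 rows with default labels 0..29.
df = pd.DataFrame({
    "rank": ["Prof"] * 30,
    "sex": ["Male", "Female"] * 15,
    "salary": range(100000, 130000, 1000),
})

# Hypothetical filter: keep even-labelled rows, so labels are 0, 2, 4, ..., 28.
df_sub = df[df.index % 2 == 0]

# loc works on labels and includes BOTH ends of the range.
by_label = df_sub.loc[10:20, ["rank", "sex", "salary"]]

# iloc works on positions and excludes the end of the range.
by_position = df_sub.iloc[10:20]

print(list(by_label.index))
print(list(by_position.index))
```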

If we need to select a range of rows and/or columns using their positions,
we can use the method iloc:
In [ ]: # Select rows and columns by their positions:
df_sub.iloc[10:20,[0, 3, 4, 5]]

Data Frames: method iloc (summary)

df.iloc[0]             # First row of a data frame
df.iloc[i]             # (i+1)th row
df.iloc[-1]            # Last row
df.iloc[:, 0]          # First column
df.iloc[:, -1]         # Last column
df.iloc[0:7]           # First 7 rows
df.iloc[:, 0:2]        # First 2 columns
df.iloc[1:3, 0:2]      # Second through third rows and first 2 columns
df.iloc[[0,5], [1,3]]  # 1st and 6th rows and 2nd and 4th columns
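
A runnable sketch of some of these calls on a made-up 8 x 5 frame. Each cell holds row * 10 + column, so you can read off where a value came from.

```python
import pandas as pd

# Made-up 8 x 5 frame; the value in row r, column c is r * 10 + c.
df = pd.DataFrame(
    [[r * 10 + c for c in range(5)] for r in range(8)],
    columns=list("abcde"),
)

first_row = df.iloc[0]            # Series holding the first row
last_row = df.iloc[-1]            # Series holding the last row
first_col = df.iloc[:, 0]         # Series holding column a
block = df.iloc[1:3, 0:2]         # rows at positions 1-2, columns a-b
picked = df.iloc[[0, 5], [1, 3]]  # rows 0 and 5, columns b and d

print(block)
print(picked)
```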

Mini-exercise 1
Try:
panda\ex1.py
Work with Excel / CSV
We can open a file using:

with open("omni2_1963.dat") as f:
    print(f.readline())

Demo: panda\openfile.py
Dataset:
panda\omni2_1963.dat
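
A self-contained sketch of the same pattern. Since the dataset may not be at hand, it first writes a small stand-in file; the file name sample.dat and its contents are made up and do not reflect the real omni2 format.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.dat")  # hypothetical stand-in file

    # Write two made-up lines so there is something to read.
    with open(path, "w") as f:
        f.write("1963 1 0\n1963 1 1\n")

    # The with-block closes the file automatically when it ends.
    with open(path) as f:
        first_line = f.readline()   # reads one line, including the newline
        remaining = f.readlines()   # reads the rest as a list of lines

print(first_line)
print(remaining)
```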
Reading Excel
import pandas as pd
data = pd.read_excel("xxxx.xlsx")
data.head()  # or data.head(10)
data = pd.read_excel("xxxx.xlsx", sheet_name=0)  # recent pandas versions do not accept an encoding argument here

Reading CSV
data = pd.read_csv("xxxx.csv")
data = pd.read_csv("xxxx.csv", nrows=9)  # read only the first 9 rows
data.head()  # or data.head(10)
data = pd.read_csv("xxxx.csv", delimiter=",", encoding="utf-8")  # set delimiter to the separator used in the file
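
A runnable sketch of nrows and delimiter. The CSV text is made up and read from memory with io.StringIO, so no file is needed; with a real file you would pass its path instead.

```python
import io
import pandas as pd

# Made-up data using ";" as the separator.
csv_text = "id;name;score\n1;Ann;90\n2;Bob;85\n3;Cy;70\n"

# delimiter tells pandas how the fields are separated;
# nrows limits how many data rows are read.
data = pd.read_csv(io.StringIO(csv_text), delimiter=";", nrows=2)

print(data)
```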

The file can be a remote file (pass its URL instead of a local path).

To read only certain columns, we can use the parameter usecols.

Demo:
panda\openremote.py
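
A small sketch of usecols on made-up data. It reads from memory here so it runs offline; read_csv also accepts a URL string in place of the path.

```python
import io
import pandas as pd

csv_text = "id,name,score\n1,Ann,90\n2,Bob,85\n"  # made-up data

# Only the listed columns are loaded; the id column is skipped.
data = pd.read_csv(io.StringIO(csv_text), usecols=["name", "score"])

print(data)
```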
Write the DataFrame to a CSV file using the pandas to_csv method.
Demo:
panda\openremote.py

To write an Excel file instead, use df.to_excel()
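
A minimal sketch of to_csv on made-up data. Called without a path it returns the CSV text, which keeps the example file-free; pass a file name such as "out.csv" (hypothetical) to write to disk.

```python
import io
import pandas as pd

df = pd.DataFrame({"name": ["Ann", "Bob"], "score": [90, 85]})  # made-up data

# index=False leaves the row labels out of the output.
csv_text = df.to_csv(index=False)

# Reading the text back gives the same table.
round_trip = pd.read_csv(io.StringIO(csv_text))

print(csv_text)
```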
Mini-exercise 2
1. From df, select the 'Manufacturer', 'Model' and 'Type' columns for
every 20th row, starting from the 1st row (row 0). Use Cars93.csv

print(df) / print(df.to_markdown())
Demo:
panda\ex2.py
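
One possible approach, sketched on a made-up stand-in frame rather than the real Cars93.csv (only the three column names come from the exercise). A slice with a step of 20 picks every 20th row.

```python
import pandas as pd

# Made-up stand-in for Cars93.csv with 93 rows.
df = pd.DataFrame({
    "Manufacturer": [f"Maker{i}" for i in range(93)],
    "Model": [f"Model{i}" for i in range(93)],
    "Type": ["Small", "Midsize", "Large"] * 31,
    "Price": [10.0 + i for i in range(93)],
})

# ::20 means "from the start to the end, in steps of 20".
subset = df.loc[::20, ["Manufacturer", "Model", "Type"]]

print(subset)
```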
Handle Missing Value
Dirty data

Check whether there are any missing values
Common operations

dropna

fillna(0)
Replace ALL NaN values

Replace NaN values in a single column


Lab1: panda\demopandas.py
Part 1: Check whether there are any null values.
# 4 ways to find null values in the dataset.

#method 1 - data.isna()
Returns a table of the same shape as the dataset, with True
wherever a value is null.

#method 2 - data.isna().any()
Returns one boolean per column: True if that column contains
at least one null value.

#method 3 - data.isna().sum()
Returns the number of null values in each column.

#method 4 - data.isna().any().sum()
Returns a single number: how many columns contain at least
one null value.
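
The four checks on a small made-up frame (column names a, b, c are hypothetical). A fifth line shows one way to get a single True/False answer for the whole dataset.

```python
import numpy as np
import pandas as pd

# Made-up data: a has 1 NaN, b has 2 NaNs, c has none.
data = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 6.0],
    "c": [7, 8, 9],
})

mask = data.isna()                        # method 1: True/False per cell
cols_with_na = data.isna().any()          # method 2: True/False per column
na_per_col = data.isna().sum()            # method 3: NaN count per column
n_cols_with_na = data.isna().any().sum()  # method 4: number of columns with NaN

# Single yes/no answer for the whole dataset.
has_any_na = data.isna().any().any()

print(na_per_col)
print(n_cols_with_na, has_any_na)
```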
Part 2 Handling NAN

#method 1 - fill the NaNs in this column with zeros.
fillna(0)

#method 2 - drop any columns with NaNs by using the axis=1 parameter
dropna(inplace=True, axis=1)

#method 3 - fill with a chosen value, such as the column mean
fillna(data['PetalWidthCm'].mean(),inplace=True)

#method 4 - remove duplicates
data.drop_duplicates(inplace=True)
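
A sketch of the four methods on made-up data. PetalWidthCm comes from the slides; the other column names and all values are hypothetical. Each call here returns a new frame instead of using inplace=True, so the results can be compared side by side.

```python
import numpy as np
import pandas as pd

# Made-up data; rows 0 and 1 are exact duplicates.
data = pd.DataFrame({
    "PetalLengthCm": [1.4, 1.4, np.nan, 4.7],
    "PetalWidthCm": [0.2, 0.2, 1.5, np.nan],
    "Species": ["setosa", "setosa", "versicolor", "versicolor"],
})

# Method 1: fill the NaNs in one column with zeros.
zero_filled = data.fillna({"PetalLengthCm": 0})

# Method 2: drop every column that contains a NaN.
no_na_cols = data.dropna(axis=1)

# Method 3: fill the NaNs in one column with that column's mean.
width_mean = data["PetalWidthCm"].mean()
mean_filled = data.assign(PetalWidthCm=data["PetalWidthCm"].fillna(width_mean))

# Method 4: remove duplicate rows.
deduped = data.drop_duplicates()

print(zero_filled)
print(no_na_cols)
print(mean_filled)
print(deduped)
```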

Matplotlib – Visualizing your data
Graph Types

import matplotlib.pyplot as plt
y = [1, 5, 3, 5, 7, 8]
x = [1, 2, 3, 4, 5, 20]
plt.plot(x, y)
plt.show()
Demo:
Matplotlib\Plot1.py
Multiple lines
Demo:
Matplotlib\Plot2.py
Demo:
Matplotlib\Plot3.py
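
A minimal sketch of drawing multiple lines on one graph: call plt.plot once per line. The data and labels are made up, and this is not necessarily what the demo files do. The Agg backend is selected only so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to see a window
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y_squares = [1, 4, 9, 16, 25]  # made-up data
y_linear = [1, 2, 3, 4, 5]     # made-up data

fig = plt.figure()
plt.plot(x, y_squares, label="squares")  # first line
plt.plot(x, y_linear, label="linear")    # second line on the same axes
plt.legend()                             # show the labels
ax = plt.gca()
# plt.show()  # use this in an interactive session
```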
import matplotlib.pyplot as plt
X = [590,540,740,130,810,300,320,230,470,620,770,250]
Y = [32,36,39,52,61,72,77,75,68,57,48,48]

plt.title('Relationship Between Temperature and Iced Coffee Sales')
plt.xlabel('Cups of Iced Coffee Sold')
plt.ylabel('Temperature in Fahrenheit')
plt.scatter(X,Y)
plt.show()
Demo:
Matplotlib\Plot4.py
[Figure: scatter plot, with the xlabel and ylabel marked on the axes]
subplots()
Demo:
Matplotlib\Plot5.py
plt.subplot(1, 2, 1)  # 1 row, 2 columns, index 1
plt.subplot(1, 2, 2)  # 1 row, 2 columns, index 2
Demo:
Matplotlib\Plot6.py
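
A short sketch of plt.subplot(rows, columns, index) with made-up data: one row, two columns, a line plot on the left and a scatter plot on the right. The Agg backend is selected only so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to see a window
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 4, 6]  # made-up data

fig = plt.figure()

plt.subplot(1, 2, 1)  # 1 row, 2 columns, first plot (left)
plt.plot(x, y)
plt.title("Line")

plt.subplot(1, 2, 2)  # 1 row, 2 columns, second plot (right)
plt.scatter(x, y)
plt.title("Scatter")
# plt.show()  # use this in an interactive session
```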
Open file and plot the graph
Demo:
Matplotlib\csvplot2.py
DataSet: Matplotlib\data.csv
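
A minimal sketch of reading a CSV with pandas and plotting two of its columns. The contents of data.csv are not shown in the slides, so the columns day and sales below are made up and read from memory.

```python
import io
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to see a window
import matplotlib.pyplot as plt
import pandas as pd

# Made-up CSV text standing in for the real data file.
csv_text = "day,sales\n1,10\n2,14\n3,9\n4,17\n"
data = pd.read_csv(io.StringIO(csv_text))

fig = plt.figure()
plt.plot(data["day"], data["sales"])  # x values, then y values
plt.xlabel("day")
plt.ylabel("sales")
ax = plt.gca()
# plt.show()  # use this in an interactive session
```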
Stock close price

Demo:
Matplotlib\stockplot.py
DataSet: matplotlib\0388.HK.csv
Demo:
Matplotlib\matplot1.py
Demo:
Matplotlib\matplot2.py
Seaborn
Better Visualization
Seaborn:
▪ based on matplotlib
▪ provides a high-level interface for drawing attractive statistical graphics
▪ similar (in style) to the popular ggplot2 library in R

Link:
https://seaborn.pydata.org/
Demo:
seaborn\sea1.py
Titanic dataset
Confusion matrix
Demo:
seaborn\sea2.py