CS2209 Python Pandas
• Pandas is an open source library built on top of NumPy
• It allows for fast analysis and data cleaning and preparation
• It excels in performance and productivity
• It also has built-in visualization features
• It can work with data from a wide variety of sources
• Install pandas from your command line or terminal using either:
– conda install pandas
– pip install pandas
• Series
• Dataframes
• Missing data
• GroupBy
• Merging, Joining, and Concatenation
• Operations
Series
Series
import numpy as np
import pandas as pd
labels = ['a','b','c']
my_data = [10,20,30]
arr = np.array(my_data)
d = {'a':10,'b':20,'c':30}
pd.Series(my_data)
pd.Series(data=my_data,index=labels)

pd.Series(my_data) creates a Pandas Series from my_data without specifying an index, so it uses the default integer index [0, 1, 2]:
0    10
1    20
2    30
dtype: int64

pd.Series(data=my_data,index=labels) creates a Pandas Series from my_data with the specified labels as the index:
a    10
b    20
c    30
dtype: int64
Series
import numpy as np
import pandas as pd
labels = ['a','b','c']
my_data = [10,20,30]
arr = np.array(my_data)
d = {'a':10,'b':20,'c':30}
pd.Series(arr,labels)
pd.Series(d)

pd.Series(arr,labels) creates a Pandas Series using arr (a NumPy array containing [10, 20, 30]) as the data, with labels (['a', 'b', 'c']) as the index:
a    10
b    20
c    30
dtype: int64

pd.Series(d) creates a Pandas Series directly from the dictionary d ({'a':10, 'b':20, 'c':30}). When a dictionary is used to create a Series, the dictionary keys become the index, and the dictionary values become the Series values:
a    10
b    20
c    30
dtype: int64
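As a quick check of the constructors shown so far, the following sketch builds the same Series from a list, a NumPy array, and a dictionary and compares the results. The names s_list, s_arr, and s_dict are illustrative and do not come from the slides.

```python
import numpy as np
import pandas as pd

labels = ['a', 'b', 'c']
my_data = [10, 20, 30]

# Same values and index, built from a list, a NumPy array, and a dict
s_list = pd.Series(data=my_data, index=labels)
s_arr = pd.Series(np.array(my_data), labels)
s_dict = pd.Series({'a': 10, 'b': 20, 'c': 30})

# Compare the index labels and the values of all three
same_index = list(s_list.index) == list(s_arr.index) == list(s_dict.index)
same_values = list(s_list) == list(s_arr) == list(s_dict)
print(same_index, same_values)
```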
Series
import numpy as np
import pandas as pd
labels = ['a','b','c']
my_data = [10,20,30]
arr = np.array(my_data)
d = {'a':10,'b':20,'c':30}
pd.Series(data=labels)

A Pandas Series is created in which the data is the labels list (['a', 'b', 'c']). Since no index is specified, Pandas will use the default integer index (0, 1, 2):
0    a
1    b
2    c
dtype: object
Series
import numpy as np
import pandas as pd
ser1 = pd.Series([1,2,3,4],['USA','Germany','USSR','Japan'])
ser2 = pd.Series([1,2,3,4],['USA','Germany','Italy','Japan'])
print(ser1)
print(ser1['Germany'],ser2['Italy'])

ser1 is a Pandas Series with values [1, 2, 3, 4] and corresponding index labels ['USA', 'Germany', 'USSR', 'Japan']. print(ser1) displays:
USA        1
Germany    2
USSR       3
Japan      4
dtype: int64

The second print looks up one value in each Series by its label and prints 2 3.
Series
import numpy as np
import pandas as pd
ser1 = pd.Series([1,2,3,4],['USA','Germany','USSR','Japan'])
ser2 = pd.Series([1,2,3,4],['USA','Germany','Italy','Japan'])
print(ser1+ser2)

Output:
Germany    4.0
Italy      NaN
Japan      8.0
USA        2.0
USSR       NaN
dtype: float64

ser1+ser2 performs element-wise addition between ser1 and ser2 based on their indices:
• Matching indices: if an index label exists in both Series (e.g., 'USA', 'Germany', 'Japan'), their values are added.
• Non-matching indices: if an index label exists in only one Series (e.g., 'USSR' in ser1 and 'Italy' in ser2), the result for that label is NaN.
• Because NaN is a floating-point value, the result has dtype float64 even though both inputs are int64.
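If NaN for non-matching labels is not wanted, one option is Series.add with fill_value, which treats a label missing from one Series as the given value. This is a sketch of an alternative and not something the slide itself uses; the names plain and filled are illustrative.

```python
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4], ['USA', 'Germany', 'USSR', 'Japan'])
ser2 = pd.Series([1, 2, 3, 4], ['USA', 'Germany', 'Italy', 'Japan'])

# Plain + aligns on labels; labels present in only one Series give NaN
plain = ser1 + ser2

# add(..., fill_value=0) substitutes 0 for the missing side instead
filled = ser1.add(ser2, fill_value=0)
print(filled)
```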
Data Frame
• A data frame represents a rectangular table of data and contains an ordered, named collection of columns, each of which can be a different value type (such as numeric, string, Boolean, etc.)
• A data frame has both a row index and a column index; it can be thought of as a dictionary of Series all sharing the same index
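The "dictionary of Series" view can be made concrete with a small sketch. The column names and values here are made up for illustration.

```python
import pandas as pd

# Two Series sharing the same index become two columns
age = pd.Series([25, 32], index=['ann', 'bob'])
city = pd.Series(['Oslo', 'Lima'], index=['ann', 'bob'])
people = pd.DataFrame({'age': age, 'city': city})

# Each column is itself a Series, and columns may have different dtypes
print(type(people['age']).__name__)
print(people.dtypes)
```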
Data Frames
import numpy as np
import pandas as pd
from numpy.random import randn
np.random.seed(101)
df = pd.DataFrame(randn(5,4),['A','B','C','D','E'],['W','X','Y','Z'])

• from numpy.random import randn imports randn to generate random numbers from a normal distribution.
• np.random.seed(101) sets a random seed with the value 101. Setting the seed ensures that the random numbers generated are the same each time the code is run, which makes the results reproducible.
• randn(5,4) generates a 5x4 matrix of random numbers from a normal distribution (mean 0, standard deviation 1).
• pd.DataFrame(...) creates a Pandas DataFrame named df with this 5x4 matrix, where the row labels (index) are ['A', 'B', 'C', 'D', 'E'] and the column labels are ['W', 'X', 'Y', 'Z'].

df:
          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509

The result is a bunch of Series that share a common index.
Data Frames
import numpy as np
import pandas as pd
from numpy.random import randn
np.random.seed(101)
df = pd.DataFrame(randn(5,4),['A','B','C','D','E'],['W','X','Y','Z'])
df['W']

df:
          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509

df['W']:
A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64
Data Frames
import numpy as np
import pandas as pd
from numpy.random import randn
np.random.seed(101)
df = pd.DataFrame(randn(5,4),['A','B','C','D','E'],['W','X','Y','Z'])
df[['X','Y']]

df:
          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509

df[['X','Y']] selects only the X and Y columns from the DataFrame df:
          X         Y
A  0.628133  0.907969
B -0.319318 -0.848077
C  0.740122  0.528813
D -0.758872 -0.933237
E  1.978757  2.605967
Data Frames
import numpy as np
import pandas as pd
from numpy.random import randn
np.random.seed(101)
df = pd.DataFrame(randn(5,4),['A','B','C','D','E'],['W','X','Y','Z'])
df['newCol']=df['X'] + df['Y']
print(df)

df before adding the column:
          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509

df['newCol']=df['X'] + df['Y'] adds a new column, 'newCol', to the DataFrame df. The values in 'newCol' are calculated by adding the values of columns 'X' and 'Y' for each row. print(df) then displays:
          W         X         Y         Z    newCol
A  2.706850  0.628133  0.907969  0.503826  1.536102
B  0.651118 -0.319318 -0.848077  0.605965 -1.167395
C -2.018168  0.740122  0.528813 -0.589001  1.268936
D  0.188695 -0.758872 -0.933237  0.955057 -1.692109
E  0.190794  1.978757  2.605967  0.683509  4.584725
Data Frames
import numpy as np
import pandas as pd
from numpy.random import randn
np.random.seed(101)
df = pd.DataFrame(randn(5,4),['A','B','C','D','E'],['W','X','Y','Z'])
df['newCol']=df['X'] + df['Y']
df.drop('newCol',axis=1,inplace=True)
print(df)

df before the drop:
          W         X         Y         Z    newCol
A  2.706850  0.628133  0.907969  0.503826  1.536102
B  0.651118 -0.319318 -0.848077  0.605965 -1.167395
C -2.018168  0.740122  0.528813 -0.589001  1.268936
D  0.188695 -0.758872 -0.933237  0.955057 -1.692109
E  0.190794  1.978757  2.605967  0.683509  4.584725

df.drop('newCol',axis=1,inplace=True) removes (drops) the 'newCol' column from the DataFrame. axis=1 specifies that a column is being dropped, not a row. inplace=True modifies the DataFrame df directly instead of returning a modified copy. print(df) then displays:
          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509
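An alternative to inplace=True is to keep the copy that drop returns and bind it to a name. The sketch below reuses the df from the slides; the names trimmed and no_row_a are illustrative.

```python
import numpy as np
import pandas as pd
from numpy.random import randn

np.random.seed(101)
df = pd.DataFrame(randn(5, 4), ['A', 'B', 'C', 'D', 'E'], ['W', 'X', 'Y', 'Z'])
df['newCol'] = df['X'] + df['Y']

# Without inplace=True, drop returns a new DataFrame and leaves df unchanged
trimmed = df.drop('newCol', axis=1)
print(list(df.columns))       # still includes 'newCol'
print(list(trimmed.columns))  # 'newCol' is gone

# axis=0 (the default) drops a row by its index label instead
no_row_a = df.drop('A', axis=0)
print(list(no_row_a.index))
```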
Data Frames: loc function
• In Pandas, the .loc[] indexer is used to access, filter, and modify rows and columns in a DataFrame based on labels (index names) rather than numerical positions
Data Frames: loc function
import numpy as np
import pandas as pd
from numpy.random import randn
np.random.seed(101)
df = pd.DataFrame(randn(5,4),['A','B','C','D','E'],['W','X','Y','Z'])
print(df.loc['B','Y'])
print(df.loc[['A','B'],['W','Y']])

df:
          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509

df.loc['B', 'Y'] accesses the value at the intersection of row 'B' and column 'Y'. The first print displays:
-0.8480769834036315

df.loc[['A','B'],['W','Y']] accesses a subset of the DataFrame df, specifically the values located at the intersection of rows 'A' and 'B' with columns 'W' and 'Y'. The second print displays:
          W         Y
A  2.706850  0.907969
B  0.651118 -0.848077
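For comparison, position-based selection uses .iloc[], which the slides do not cover. The sketch below, under the same df setup, reads one cell by label and by position and then assigns to it through .loc[]; the names by_label and by_position are illustrative.

```python
import numpy as np
import pandas as pd
from numpy.random import randn

np.random.seed(101)
df = pd.DataFrame(randn(5, 4), ['A', 'B', 'C', 'D', 'E'], ['W', 'X', 'Y', 'Z'])

# Row 'B' is at position 1 and column 'Y' is at position 2
by_label = df.loc['B', 'Y']
by_position = df.iloc[1, 2]
print(by_label == by_position)

# .loc[] can also assign: set one cell by its labels
df.loc['B', 'Y'] = 0.0
print(df.loc['B', 'Y'])
```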
Data Frames: Conditional Selection
import numpy as np
import pandas as pd
from numpy.random import randn
np.random.seed(101)
df = pd.DataFrame(randn(5,4),['A','B','C','D','E'],['W','X','Y','Z'])
booldf = df>0
print(booldf)
print(df[booldf])

booldf = df > 0 generates a boolean DataFrame booldf where each value is True if the corresponding value in df is greater than 0, and False otherwise:
       W      X      Y      Z
A   True   True   True   True
B   True  False  False   True
C  False   True   True  False
D   True  False  False   True
E   True   True   True   True

df[booldf] filters the DataFrame df using booldf, replacing all values where booldf is False with NaN (since they don't meet the condition > 0):
          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118       NaN       NaN  0.605965
C       NaN  0.740122  0.528813       NaN
D  0.188695       NaN       NaN  0.955057
E  0.190794  1.978757  2.605967  0.683509
Data Frames: Conditional Selection
import numpy as np
import pandas as pd
from numpy.random import randn
np.random.seed(101)
df = pd.DataFrame(randn(5,4),['A','B','C','D','E'],['W','X','Y','Z'])
resultdf = df[df['W']>0]
print(resultdf['X'])

resultdf = df[df['W'] > 0] creates a new DataFrame resultdf that only includes rows from df where the values in column 'W' are greater than 0:
          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509

resultdf['X'] accesses only the 'X' column of the filtered DataFrame resultdf, which print displays as:
A    0.628133
B   -0.319318
D   -0.758872
E    1.978757
Name: X, dtype: float64

The last two statements are equivalent to print(df[df['W']>0]['X'])
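The chained form df[df['W']>0]['X'] can also be written as a single .loc[] call that takes the row condition and the column label together. This is a sketch of an equivalent spelling, not the form used on the slide; the names chained and single are illustrative.

```python
import numpy as np
import pandas as pd
from numpy.random import randn

np.random.seed(101)
df = pd.DataFrame(randn(5, 4), ['A', 'B', 'C', 'D', 'E'], ['W', 'X', 'Y', 'Z'])

# Chained form from the slide: filter rows, then pick a column
chained = df[df['W'] > 0]['X']

# Single-step form: row mask and column label in one .loc[] call
single = df.loc[df['W'] > 0, 'X']
print(chained.equals(single))
```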
Data Frames: Multiple Conditions
import numpy as np
import pandas as pd
from numpy.random import randn
np.random.seed(101)
df = pd.DataFrame(randn(5,4),['A','B','C','D','E'],['W','X','Y','Z'])
print(df)
print(df[df['W']>0][['Y','X']])

print(df) displays:
          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509
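The slide title refers to multiple conditions, although the code shown filters on a single condition. A minimal sketch of combining conditions, assuming the same df, uses the element-wise operators & and |. Python's and/or keywords raise an error here because a Series has no single truth value. The thresholds and the names both and either are illustrative.

```python
import numpy as np
import pandas as pd
from numpy.random import randn

np.random.seed(101)
df = pd.DataFrame(randn(5, 4), ['A', 'B', 'C', 'D', 'E'], ['W', 'X', 'Y', 'Z'])

# Use & for "and", | for "or"; each condition needs its own parentheses
both = df[(df['W'] > 0) & (df['Y'] > 1)]
either = df[(df['W'] > 0) | (df['Y'] > 1)]
print(both)
print(either)
```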
Data Frames: new index
import numpy as np
import pandas as pd
from numpy.random import randn
np.random.seed(101)
df = pd.DataFrame(randn(5,4),['A','B','C','D','E'],['W','X','Y','Z'])
newind = 'CA NY WY OR CO'.split()
df['States']=newind
df.set_index('States', inplace=True)

Since no separator is passed to the split() method, it splits on whitespace, so newind is:
['CA', 'NY', 'WY', 'OR', 'CO']

df after df['States']=newind:
          W         X         Y         Z States
A  2.706850  0.628133  0.907969  0.503826     CA
B  0.651118 -0.319318 -0.848077  0.605965     NY
C -2.018168  0.740122  0.528813 -0.589001     WY
D  0.188695 -0.758872 -0.933237  0.955057     OR
E  0.190794  1.978757  2.605967  0.683509     CO

df after df.set_index('States', inplace=True):
               W         X         Y         Z
States
CA      2.706850  0.628133  0.907969  0.503826
NY      0.651118 -0.319318 -0.848077  0.605965
WY     -2.018168  0.740122  0.528813 -0.589001
OR      0.188695 -0.758872 -0.933237  0.955057
CO      0.190794  1.978757  2.605967  0.683509
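As with drop, set_index returns a new DataFrame unless inplace=True is passed, and reset_index moves an index back into an ordinary column. The sketch below assumes the same df; the names by_state and restored are illustrative.

```python
import numpy as np
import pandas as pd
from numpy.random import randn

np.random.seed(101)
df = pd.DataFrame(randn(5, 4), ['A', 'B', 'C', 'D', 'E'], ['W', 'X', 'Y', 'Z'])
df['States'] = 'CA NY WY OR CO'.split()

# set_index returns a new DataFrame; df keeps its original A-E index
by_state = df.set_index('States')
print(by_state.index.tolist())

# reset_index moves the index back into an ordinary column
restored = by_state.reset_index()
print(restored.columns.tolist())
```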
Data Frames: Multilevel index
import numpy as np
import pandas as pd
from numpy.random import randn
np.random.seed(101)
outside = ['G1','G1','G1','G2','G2','G2']
inside = [1,2,3,1,2,3]
hier_index = list(zip(outside,inside))

• outside is a list with group labels "G1" and "G2".
• inside is a list with numbers that will serve as the inner level of the index.
• zip(outside, inside) pairs each element from outside with the corresponding element in inside, creating tuples. list(zip(...)) converts the result into a list of tuples: [('G1', 1), ('G1', 2), ('G1', 3), ('G2', 1), ('G2', 2), ('G2', 3)]
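Because this particular index contains every combination of group and number, pd.MultiIndex.from_product offers another way to build it. This is a sketch of an alternative, not what the slides use; the name product_index is illustrative.

```python
import pandas as pd

outside = ['G1', 'G1', 'G1', 'G2', 'G2', 'G2']
inside = [1, 2, 3, 1, 2, 3]

# zip pairs the two lists element by element
hier_index = list(zip(outside, inside))
print(hier_index)

# from_product builds every (group, number) combination directly
product_index = pd.MultiIndex.from_product([['G1', 'G2'], [1, 2, 3]])
print(list(product_index) == hier_index)
```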
Data Frames: Multilevel index
import numpy as np
import pandas as pd
from numpy.random import randn
np.random.seed(101)
outside = ['G1','G1','G1','G2','G2','G2']
inside = [1,2,3,1,2,3]
hier_index = list(zip(outside,inside))
df = pd.DataFrame(randn(6,2),hier_index,['A','B'])
print(df)

Output:
                A         B
(G1, 1)  2.706850  0.628133
(G1, 2)  0.907969  0.503826
(G1, 3)  0.651118 -0.319318
(G2, 1) -0.848077  0.605965
(G2, 2) -2.018168  0.740122
(G2, 3)  0.528813 -0.589001
Data Frames: Multilevel index
import numpy as np
import pandas as pd
from numpy.random import randn
np.random.seed(101)
outside = ['G1','G1','G1','G2','G2','G2']
inside = [1,2,3,1,2,3]
hier_index = list(zip(outside,inside))
hier_index = pd.MultiIndex.from_tuples(hier_index)
df = pd.DataFrame(randn(6,2),hier_index,['A','B'])
df.loc['G1']
df.loc['G1'].loc[2]

df:
             A         B
G1 1  2.706850  0.628133
   2  0.907969  0.503826
   3  0.651118 -0.319318
G2 1 -0.848077  0.605965
   2 -2.018168  0.740122
   3  0.528813 -0.589001

df.loc['G1']:
          A         B
1  2.706850  0.628133
2  0.907969  0.503826
3  0.651118 -0.319318

df.loc['G1'].loc[2]:
A    0.907969
B    0.503826
Name: 2, dtype: float64
Data Frames: Multilevel index
import numpy as np
import pandas as pd
from numpy.random import randn
np.random.seed(101)
outside = ['G1','G1','G1','G2','G2','G2']
inside = [1,2,3,1,2,3]
hier_index = list(zip(outside,inside))
hier_index = pd.MultiIndex.from_tuples(hier_index)
df = pd.DataFrame(randn(6,2),hier_index,['A','B'])
df.index.names = ['Groups','Num']
df.loc['G2']
df.loc['G2'].loc[2]['A']

df before naming the index levels:
             A         B
G1 1  2.706850  0.628133
   2  0.907969  0.503826
   3  0.651118 -0.319318
G2 1 -0.848077  0.605965
   2 -2.018168  0.740122
   3  0.528813 -0.589001

df after df.index.names = ['Groups','Num']:
                   A         B
Groups Num
G1     1    2.706850  0.628133
       2    0.907969  0.503826
       3    0.651118 -0.319318
G2     1   -0.848077  0.605965
       2   -2.018168  0.740122
       3    0.528813 -0.589001

df.loc['G2']:
            A         B
Num
1   -0.848077  0.605965
2   -2.018168  0.740122
3    0.528813 -0.589001

df.loc['G2'].loc[2]['A']:
-2.018168244037392
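The chained lookup df.loc['G2'].loc[2]['A'] can also be written as one .loc[] call that takes the index tuple and the column label together. The sketch below assumes the same setup as the slide; the names chained and direct are illustrative.

```python
import numpy as np
import pandas as pd
from numpy.random import randn

np.random.seed(101)
outside = ['G1', 'G1', 'G1', 'G2', 'G2', 'G2']
inside = [1, 2, 3, 1, 2, 3]
hier_index = pd.MultiIndex.from_tuples(list(zip(outside, inside)))
df = pd.DataFrame(randn(6, 2), hier_index, ['A', 'B'])
df.index.names = ['Groups', 'Num']

# Chained form from the slide: outer level, then inner level, then column
chained = df.loc['G2'].loc[2]['A']

# One-step form: (outer, inner) tuple for the row, label for the column
direct = df.loc[('G2', 2), 'A']
print(chained == direct)
```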
Data Frames: Multilevel index (Cross Section)
import numpy as np
import pandas as pd
from numpy.random import randn
np.random.seed(101)
outside = ['G1','G1','G1','G2','G2','G2']
inside = [1,2,3,1,2,3]
hier_index = list(zip(outside,inside))
hier_index = pd.MultiIndex.from_tuples(hier_index)
df = pd.DataFrame(randn(6,2),hier_index,['A','B'])
df.index.names = ['Groups','Num']
df.xs('G1')
df.xs(1,level='Num')

df:
                   A         B
Groups Num
G1     1    2.706850  0.628133
       2    0.907969  0.503826
       3    0.651118 -0.319318
G2     1   -0.848077  0.605965
       2   -2.018168  0.740122
       3    0.528813 -0.589001

df.xs('G1'):
            A         B
Num
1    2.706850  0.628133
2    0.907969  0.503826
3    0.651118 -0.319318

df.xs(1,level='Num'):
               A         B
Groups
G1      2.706850  0.628133
G2     -0.848077  0.605965
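The two xs calls differ in which level they select on. The sketch below, under the same setup, checks that xs('G1') matches df.loc['G1'] and that xs(1, level='Num') returns one row per group; the names group_g1 and num_1 are illustrative.

```python
import numpy as np
import pandas as pd
from numpy.random import randn

np.random.seed(101)
outside = ['G1', 'G1', 'G1', 'G2', 'G2', 'G2']
inside = [1, 2, 3, 1, 2, 3]
hier_index = pd.MultiIndex.from_tuples(list(zip(outside, inside)))
df = pd.DataFrame(randn(6, 2), hier_index, ['A', 'B'])
df.index.names = ['Groups', 'Num']

# Selecting on the outer level gives the same rows as df.loc['G1']
group_g1 = df.xs('G1')
print(group_g1.equals(df.loc['G1']))

# Selecting on the inner level picks Num == 1 from every group
num_1 = df.xs(1, level='Num')
print(num_1.index.tolist())
```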
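Missing Data
import numpy as np
import pandas as pd
d = {'A':[1,2,np.nan],'B':[5,np.nan,np.nan],'C':[1,2,3]}
df = pd.DataFrame(d)
df.dropna()
df.dropna(axis=1)
df.dropna(thresh=2)

d:
{'A': [1, 2, nan], 'B': [5, nan, nan], 'C': [1, 2, 3]}

df:
     A    B  C
0  1.0  5.0  1
1  2.0  NaN  2
2  NaN  NaN  3

df.dropna() drops every row that contains a NaN:
     A    B  C
0  1.0  5.0  1

df.dropna(axis=1) drops every column that contains a NaN:
   C
0  1
1  2
2  3

df.dropna(thresh=2) keeps only the rows that have at least 2 non-NaN values:
     A    B  C
0  1.0  5.0  1
1  2.0  NaN  2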
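To see why thresh=2 keeps rows 0 and 1, it can help to count the non-missing values in each row. The sketch below uses the same df; notna() does not appear on the slide, and the names non_missing and kept are illustrative.

```python
import numpy as np
import pandas as pd

d = {'A': [1, 2, np.nan], 'B': [5, np.nan, np.nan], 'C': [1, 2, 3]}
df = pd.DataFrame(d)

# Count the non-missing values in each row
non_missing = df.notna().sum(axis=1)
print(non_missing.tolist())

# thresh=2 keeps rows with at least 2 non-missing values
kept = df.dropna(thresh=2)
print(kept.index.tolist())
```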