Python Unit-6: Pandas

PANDAS
• Pandas is an open-source Python library
• Provides high-performance data manipulation
• Flexible tool for data analysis
• Python with Pandas is used in a wide range of fields
– academics
– commercial domains including finance, economics, statistics, analytics, etc.
Why Pandas?
Most popular library in the scientific Python ecosystem for doing data analysis.
• Read and write many different data formats (CSV, Excel, JSON, SQL, etc.)
• Calculate across the ways data is organized (i.e. across rows and down columns)
• Select subsets of data and combine multiple datasets together
• Find and fill missing data
• Supports reshaping of data into different forms
• Supports advanced time series functionality
• Supports visualization by integrating matplotlib and other libraries
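A minimal sketch of two of these capabilities (selecting a subset and filling missing data); the city/temperature values are made up for illustration:

```python
import pandas as pd
import numpy as np

# A tiny DataFrame illustrating the bullet points above
df = pd.DataFrame({"city": ["Delhi", "Mumbai", "Pune"],
                   "temp": [31.0, np.nan, 27.5]})

subset = df[df["city"] != "Pune"]                 # select a subset of rows
filled = df.fillna({"temp": df["temp"].mean()})   # find and fill missing data

print(subset)
print(filled)
```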
Python Pandas Introduction
• Pandas is defined as an open-source library that provides high-performance data
manipulation in Python.
• The name Pandas is derived from "Panel Data", an econometrics term for multidimensional
data. It is used for data analysis in Python and was developed by Wes McKinney in 2008.
There are different tools available for fast data processing, such as NumPy, SciPy,
Cython, and Pandas, but we prefer Pandas because working with Pandas is fast and simple.
• Pandas is built on top of the NumPy package, which means NumPy is required for operating
Pandas.
Python Pandas Introduction
• Before Pandas, Python was capable of data preparation, but it provided only limited support for data
analysis. So Pandas came into the picture and enhanced the capabilities of data analysis. It can
perform the five significant steps required for processing and analysis of data irrespective of the origin of the data.
• It has a fast and efficient DataFrame object with default and customized indexing.
• Data Representation: It represents the data in a form that is suited for data analysis through its DataFrame
and Series.
• Clear code: The clear API of Pandas allows you to focus on the core part of the code.
1) Series
It is defined as a one-dimensional array that is capable of storing various data types. The
row labels of a Series are called the index. We can easily convert a list, tuple, or dictionary
into a Series using the Series() method. A Series cannot contain multiple columns.
import pandas as pd
import numpy as np
info = np.array(['P','a','n','d','a','s'])
a = pd.Series(info)
print(a)
Explanation: In this code, firstly, we have imported the pandas and numpy libraries
with the pd and np aliases. Then, we have taken a variable named "info" that consists
of an array of some values. We have passed the info variable to
the Series() method and assigned the result to a variable "a". The Series is printed by calling
print(a).
Python Pandas DataFrame
It is a widely used data structure of pandas and works with a two-dimensional array with
labeled axes (rows and columns). DataFrame is defined as a standard way to store data.
• The columns can be of heterogeneous types like int, bool, and so on.
• It can be seen as a dictionary of Series structures where both the rows and columns are indexed.
import pandas as pd
# a list of strings
x = ['Python', 'Pandas']
df = pd.DataFrame(x)
print(df)

Creating numpy arrays from lists:
import numpy as np
list1 = [1,2,3,4]
list2 = [[10,12],[20,34]]
array1 = np.array(list1)
array2 = np.array(list2)
print("Array:", array1)
print("Array:\n", array2)
Array Attributes

type and shape:
ar1=np.array([[2,4,6],[6,7,8]])
print(type(ar1))    # <class 'numpy.ndarray'>
print(ar1.shape)    # (2, 3)

itemsize and size:
print(ar1.itemsize) # length of each element in bytes, e.g. 4 for int32
print(ar1.size)     # 6

dtype and ndim:
print(ar1.dtype)    # int32 (platform dependent)
print(ar1.ndim)     # 2
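A runnable check of the attributes above (forcing dtype=np.int32 so that itemsize is 4 on every platform):

```python
import numpy as np

ar1 = np.array([[2, 4, 6], [6, 7, 8]], dtype=np.int32)

print(type(ar1))     # class of the object
print(ar1.shape)     # (rows, columns)
print(ar1.itemsize)  # bytes per element
print(ar1.size)      # total number of elements
print(ar1.dtype)     # element type
print(ar1.ndim)      # number of dimensions
```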
Ways to create numpy arrays
2) Using zeros()
– numpy.zeros(shape [, dtype=<datatype>][, order='C' or 'F'])
>>a2=np.zeros([2,3])
>>a2
array([[0., 0., 0.],
       [0., 0., 0.]])
3) Using ones()
– numpy.ones(shape [, dtype=<datatype>][, order='C' or 'F'])
>>a2=np.ones([2,3])
>>a2
array([[1., 1., 1.],
       [1., 1., 1.]])
Ways to create numpy arrays
4) Creating arrays with a numerical range using arange()
– arrayname=numpy.arange([start,] stop [,step] [,dtype])
>>ar=np.arange(5)
>>ar
array([0, 1, 2, 3, 4])
>>ar=np.arange(3,8,1.5,np.float64)
>>ar
array([3. , 4.5, 6. , 7.5])
5) Using linspace()
– arrayname=numpy.linspace(<start>,<stop>,<number of values to be generated>)
>>ar=np.linspace(3,10,4)
>>ar
array([ 3.        ,  5.33333333,  7.66666667, 10.        ])
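The difference between the two is worth checking: arange works on a half-open interval with a step, while linspace generates exactly n values including both endpoints.

```python
import numpy as np

# arange: half-open interval [start, stop) with a step
ar = np.arange(3, 8, 1.5)
print(ar)  # [3.  4.5 6.  7.5]

# linspace: exactly n evenly spaced values, endpoints included
ls = np.linspace(3, 10, 4)
print(ls)
```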
Pandas Data Structures
• Series
– 1-D array-like object containing an array of data and an associated array of data
labels (index).
• Dataframes
– 2-D labeled array-like pandas data structure that stores an ordered collection of
columns that can store data of different types.

Series:              Dataframe:
Index  Data             c1   c2
1      10.0         r1  10   3.5
3      20.0         r2  20   5.7
4      30.0         r3  30   7.9
Python Pandas Series
The Pandas Series can be defined as a one-dimensional array that is capable of storing various data types. We
can easily convert a list, tuple, or dictionary into a Series using the Series() method. The row labels of a Series are
called the index. A Series cannot contain multiple columns. It has the following parameter:
index: The value of the index should be unique and hashable. It must be of the same length as data. If we do
not pass an index, then by default range(n) is used.
import pandas as pd
x = pd.Series()
print (x)
Create a Series using various inputs: Array, Dict, Scalar value
Python Pandas Series
Creating Series from Array: Before creating a Series, firstly, we have to import the numpy module
and then use array() function in the program. If the data is ndarray, then the passed index must be of
the same length.
• If we do not pass an index, then by default index of range(n) is being passed where n defines
the length of an array, i.e., [0,1,2,....range(len(array))-1].
import pandas as pd
import numpy as np
info = np.array(['P','a','n','d','a','s'])
a = pd.Series(info)
print(a)
Python Pandas Series
Create a Series from dict: We can also create a Series from a dict. If a dictionary object is
passed as input and the index is not specified, then the dictionary keys are taken to construct the
index (in sorted order in older pandas versions; recent versions preserve insertion order).
• If an index is passed, then the values corresponding to each label in the index will be extracted from
the dictionary.
#import the pandas library
import pandas as pd
import numpy as np
info = {'x' : 0., 'y' : 1., 'z' : 2.}
a = pd.Series(info)
print (a)
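Extending the example above with an explicit index: values for matching labels are pulled out of the dict, and a label missing from the dict (here the made-up label 'w') becomes NaN.

```python
import pandas as pd
import numpy as np

info = {'x': 0., 'y': 1., 'z': 2.}

# 'y' and 'z' are looked up in the dict; 'w' has no entry -> NaN
a = pd.Series(info, index=['y', 'z', 'w'])
print(a)
```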
Python Pandas Series
Create a Series using Scalar: If we take a scalar value, then the index must be provided. The
scalar value will be repeated to match the length of the index.
import pandas as pd
x = pd.Series(4, index=['a', 'b', 'c'])
print (x)
Accessing an element by position:
x = pd.Series([1,2,3])
#retrieve the first element
print (x[0])
Series object attributes
• A Series attribute is defined as any information related to the Series object, such as size, datatype, etc.
Below are some of the attributes that you can use to get information about a Series object:
Attributes Description
Series.hasnans It returns True if there are any NaN values; otherwise, it returns False.
import numpy as np
import pandas as pd
a=pd.Series(data=[1,2,3,4])
b=pd.Series(data=[4.9,8.2,5.6],index=['x','y','z'])
print(a.shape)
print(b.shape)
Retrieving Dimension, Size and Number of bytes:
import numpy as np
import pandas as pd
a=pd.Series(data=[1,2,3,4])
b=pd.Series(data=[4.9,8.2,5.6],index=['x','y','z'])
print(a.ndim, b.ndim)
print(a.size, b.size)
print(a.nbytes, b.nbytes)
Checking Emptiness and Presence of NaNs
To check whether a Series object is empty, you can use the empty attribute. Similarly, to check if a Series
object contains NaN values or not, you can use the hasnans attribute.
import numpy as np
import pandas as pd
a=pd.Series(data=[1,2,3,np.nan])
b=pd.Series(data=[4.9,8.2,5.6],index=['x','y','z'])
c=pd.Series()
print(a.empty,b.empty,c.empty)
print(a.hasnans,b.hasnans,c.hasnans)
print(len(a),len(b))
print(a.count( ),b.count( ))
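Running the snippets above yields the following attribute values (a sketch; note that pd.Series() with no data warns in recent pandas unless a dtype is given, and that count() skips NaN while len() does not):

```python
import numpy as np
import pandas as pd

a = pd.Series(data=[1, 2, 3, np.nan])
b = pd.Series(data=[4.9, 8.2, 5.6], index=['x', 'y', 'z'])
c = pd.Series(dtype=float)          # explicit dtype avoids the empty-Series warning

print(a.shape, b.shape)             # (4,) (3,)
print(a.ndim, a.size, a.nbytes)     # 1 4 32 (four float64 values)
print(a.hasnans, b.hasnans)         # True False
print(c.empty)                      # True
print(len(a), a.count())            # 4 3 -> count() skips the NaN
```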
Creating Series objects
1) Empty Series object
import pandas as pd
<Series object>=pandas.Series()
>>ser=pd.Series()
>>ser
Series([], dtype: float64)
Creating Series objects
2) Non-empty Series object
<Series object>=pandas.Series(data [, index=idx][,dtype=<data type>])
Creating Series objects
Data as a Python sequence:
>>ser=pd.Series([10,25,34,41])
>>ser
0    10
1    25
2    34
3    41
dtype: int64
>>ser=pd.Series(range(20,50,8),index=[10,20,30,40])
>>ser
10    20
20    28
30    36
40    44
dtype: int64
Creating Series objects
Data as an ndarray:
>>ser=pd.Series(np.arange(5,16,3))
>>ser
0     5
1     8
2    11
3    14
dtype: int32
>>ser=pd.Series(np.arange(5,16,3),index=[10,20,30])
ValueError: Length of passed values is 4, index implies 3
Creating Series objects
Data as Python Dictionary:
>>ser=pd.Series({"Jan":31,"Feb":28,"Mar":31,"Apr":30})
>>ser
Jan 31
Feb 28
Mar 31
Apr 30
dtype: int64
Note: In older pandas versions the order of indexes may differ from the order of keys in the dictionary; recent versions preserve insertion order.
Creating Series objects
Data as a scalar value:
• Index must be provided
>>ser=pd.Series(10, index=["r1","r2"])
>>ser
r1    10
r2    10
dtype: int64
>>ser=pd.Series('UPES', index=range(10,40,10))
>>ser
10    UPES
20    UPES
30    UPES
dtype: object
Creating Series objects
Adding NaN Values in a series object:
• To fill missing data
>>ser=pd.Series([10, 25, np.nan, 56, np.nan, 80])
>>ser
0 10.0
1 25.0
2 NaN
3 56.0
4 NaN
5 80.0
dtype: float64
Creating Series objects
Using a mathematical function to create the data array in Series()
>>ar=np.arange(10,35,5)
>>ar
array([10, 15, 20, 25, 30])
>>ser=pd.Series(ar, ar**2)   # second positional argument is the index
>>ser
100    10
225    15
400    20
625    25
900    30
dtype: int32
>>print(pd.Series(data=(2*[10,20,30])))
0    10
1    20
2    30
3    10
4    20
5    30
dtype: int64
Creating Series objects
Repetitive Index:
>>ser=pd.Series(range(10,60,10),index=["r1","r2","r1","r4","r1"])
>>ser
r1 10
r2 20
r1 30
r4 40
r1 50
dtype: int64
>>ser["r1"]
r1 10
r1 30
r1 50
dtype: int64
Series Object Attributes
• Series.index
>>ser=pd.Series(range(10,60,10),index=["r1","r2","r1","r4","r2"])
>>ser.index
Index(['r1', 'r2', 'r1', 'r4', 'r2'], dtype='object')
• Series.values
>>ser=pd.Series(range(10,60,10),index=["r1","r2","r1","r4","r2"])
>>ser.values
array([10, 20, 30, 40, 50], dtype=int64)
Series Object Attributes
• Series.dtype (Returns the dtype object of the underlying data)
>>ser=pd.Series(range(10,60,10),index=["r1","r2","r1","r4","r2"])
>>ser.dtype
dtype('int64')
• Series.shape (Returns a tuple of the shape of the underlying data)
>>ser.shape
(5,)
Series Object Attributes
• Series.nbytes (Returns number of bytes in the underlying data)
>>ser=pd.Series(range(10,60,10),index=["r1","r2","r1","r4","r2"])
>>ser.nbytes
40
• Series.ndim (Returns number of dimensions of the underlying data)
>>ser.ndim
1
Series Object Attributes
• Series.size (Returns number of elements)
>>ser=pd.Series(range(10,60,10),index=["r1","r2","r1","r4","r2"])
>>ser.size
5
• Series.itemsize (Returns the size of the dtype; removed in recent pandas versions)
Series Object Attributes
• Series.hasnans (Returns True if any NaN value is found)
>>ser=pd.Series(range(10,60,10),index=["r1","r2","r1","r4","r2"])
>>ser.hasnans
False
• Series.empty (Returns True if the series is empty)
>>ser.empty
False
Accessing elements from Series
>>ser=pd.Series({"Jan":31,"Feb":28,"Mar":31,"Apr":30,"May":31,"Jun":30,"Jul":31,"Aug":31,
"Sept":30,"Oct":31,"Nov":30,"Dec":31})
>>ser
Jan     31
Feb     28
Mar     31
Apr     30
May     31
Jun     30
Jul     31
Aug     31
Sept    30
Oct     31
Nov     30
Dec     31
dtype: int64

Try Yourself:
>>ser[2]
>>ser[2:4]
>>ser[-9:-5]
>>ser[-5:-12]
>>ser[1:10:2]
>>ser[ : :-1]
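A sketch of the "Try Yourself" answers, written with .iloc for positional access (plain ser[2] still works here, but positional integer indexing on a labelled Series is deprecated in pandas 2.x):

```python
import pandas as pd

ser = pd.Series({"Jan": 31, "Feb": 28, "Mar": 31, "Apr": 30, "May": 31, "Jun": 30,
                 "Jul": 31, "Aug": 31, "Sept": 30, "Oct": 31, "Nov": 30, "Dec": 31})

print(ser.iloc[2])        # value at position 2 (Mar)
print(ser.iloc[2:4])      # Mar, Apr
print(ser.iloc[-9:-5])    # Apr, May, Jun, Jul
print(ser.iloc[-5:-12])   # empty slice (start is after stop)
print(ser.iloc[1:10:2])   # every second month from Feb to Oct
print(ser.iloc[::-1])     # whole series reversed
```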
Operations on Series Object
• Modifying Series Object
– <SeriesObject>[<index>]=<new data value>
– <SeriesObject>[start:stop]=<new data value>   #replace all the values in the slice
>>ser=pd.Series([10,20,30,40],index=range(4))
>>ser
0    10
1    20
2    30
3    40
dtype: int64
>>ser[2]=56
>>ser
0    10
1    20
2    56
3    40
dtype: int64
>>ser[2:4]=45
>>ser
0    10
1    20
2    45
3    45
dtype: int64
Operations on Series Object
• Modifying Series Indexes
– <Object>.index=<new index array>
>>ser
0    10
1    20
2    45
3    45
dtype: int64
>>ser.index=['a','b','c','d']   # assigns a new index of the same length
Operations on Series Object
• head() and tail() functions
– <pandas object>.head([n]) #To fetch first n rows from a pandas object, default is 5
– <pandas object>.tail([n]) #To fetch last n rows from a pandas object, default is 5
>>ser=pd.Series([10,20,30,40,50,60,70,80])
>>ser.tail()
3    40
4    50
5    60
6    70
7    80
dtype: int64
>>ser.head(3)
0    10
1    20
2    30
dtype: int64
Operations on Series Object
• Vectorized operations (the operation is applied to every element)
>>ser=pd.Series([10,20,30,40,50,60,70,80])
>>ser*2
0     20
1     40
2     60
3     80
4    100
5    120
6    140
7    160
dtype: int64
>>newser=ser**2
>>newser
0     100
1     400
2     900
3    1600
4    2500
5    3600
6    4900
7    6400
dtype: int64
>>ser>20
0    False
1    False
2     True
3     True
4     True
5     True
6     True
7     True
dtype: bool
Operations on Series Object
• Arithmetic operations on Series Object
– Operation is performed only on the matching indexes
– If the data items of the two matches are not compatible with the operation, the result
will be NaN
>>ser1            >>ser2            >>ser1+ser2
0    10           0    30           0     40.0
1    20           1    40           1     60.0
2    30           2    50           2     80.0
3    40           3    60           3    100.0
4    50           dtype: int64      4      NaN
dtype: int64                        dtype: float64
Operations on Series Object
>>ser1=pd.Series([10,20,30,40,50],['a','b','c','d','e'])
>>ser2=pd.Series((1,2,3,4,5,6),('a','b','e','g','d','h'))
>>ser1+ser2       >>ser1/ser2
a    11.0         a    10.000000
b    22.0         b    10.000000
c     NaN         c          NaN
d    45.0         d     8.000000
e    53.0         e    16.666667
g     NaN         g          NaN
h     NaN         h          NaN
dtype: float64    dtype: float64
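The index alignment above can be checked directly; labels present in only one of the two Series come out as NaN:

```python
import pandas as pd
import numpy as np

ser1 = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])
ser2 = pd.Series((1, 2, 3, 4, 5, 6), index=('a', 'b', 'e', 'g', 'd', 'h'))

total = ser1 + ser2   # aligned on labels, not positions
print(total)
```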
Operations on Series Object
• Filtering Entries
– <Series Object>[<Boolean Expression on Series Object>]
>>ser1=pd.Series([10,20,30,40,50],['a','b','c','d','e'])
>>ser1>20         >>ser1[ser1>20]
a    False        c    30
b    False        d    40
c     True        e    50
d     True        dtype: int64
e     True
dtype: bool
Operations on Series Object
• Reindexing (creating an object with a different order of the same indexes)
– <Series Object>=<object>.reindex(<new index sequence>)
• Dropping Entries
– <Series Object>.drop(<index to be removed>)
>>ser1=pd.Series([10,20,30,40,50],['a','b','c','d','e'])
>>ser1.drop('e')
a    10
b    20
c    30
d    40
dtype: int64
>>ser1=pd.Series([10,20,30,40],['a','b','c','d'])
>>ser2=ser1.reindex(['d','a','b','c'])
>>ser2
d    40
a    10
b    20
c    30
dtype: int64
Series Functions

Function                 Description
Series.map()             Map the values from two series that have a common column.
Series.std()             Calculate the standard deviation of the given set of numbers, DataFrame, column, or rows.
Series.to_frame()        Convert the series object to a dataframe.
Series.value_counts()    Returns a Series that contains counts of unique values.
Pandas Series.map()
The main task of map() is to map the values from two series that have a common column. To map
the two Series, the last column of the first Series should be the same as the index column of the second
Series, and the values should be unique.
Parameters
•arg: function, dict, or Series.
It refers to the mapping correspondence.
•na_action: {None, 'ignore'}, default None. If 'ignore', NaN values are propagated without passing them to the
mapping correspondence.
Returns
It returns a Pandas Series with the same index as the caller.
Pandas Series.map()
import pandas as pd
import numpy as np
a = pd.Series(['Java', 'C', 'C++', np.nan])
a.map({'Java': 'Core'})
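When mapping with a dict, values that have no entry in the dict become NaN; a quick check of the example above:

```python
import pandas as pd
import numpy as np

a = pd.Series(['Java', 'C', 'C++', np.nan])
mapped = a.map({'Java': 'Core'})  # only 'Java' has a mapping; the rest become NaN
print(mapped)
```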
Pandas Series.std()
The Pandas std() is defined as a function for calculating the standard deviation of a given set of
numbers, a DataFrame, columns, or rows.
The standard deviation is normalized by N-1 by default (sample standard deviation) and can be changed
using the ddof argument. Note that numpy's np.std(), used below, normalizes by N (ddof=0) by default.
Syntax:
Series.std(axis=None, skipna=None, level=None, ddof=1, numeric_only=None, **kwargs)
import pandas as pd
# calculate standard deviation
import numpy as np
print(np.std([4,7,2,1,6,3]))
print(np.std([6,9,15,2,-17,15,4]))
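A sketch of the ddof difference between np.std and Series.std on the first data set above:

```python
import numpy as np
import pandas as pd

data = [4, 7, 2, 1, 6, 3]

pop_std = np.std(data)              # ddof=0: divide by N (population std)
sample_std = pd.Series(data).std()  # ddof=1: divide by N-1 (sample std)

print(pop_std, sample_std)
print(np.std(data, ddof=1))         # matches Series.std()
```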
Numpy Arrays vs Pandas Series
• Vector operations on 2 ndarrays with different shapes will result
in an error.
• In ndarrays, the indexes are always numeric, starting from 0
onwards.
• But Series objects can have any type of indexes, including
numbers (not necessarily starting from 0), letters, labels,
strings, etc.
Checkpoint
Predict output:
a) pd.Series(((10,20),(30,40)))[0]
b) pd.Series(((10,20),(30,40))).size
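A way to check the predictions: a tuple of tuples becomes a 2-element object Series, with one whole tuple per element.

```python
import pandas as pd

ser = pd.Series(((10, 20), (30, 40)))
print(ser[0])    # the first element is the whole tuple (10, 20)
print(ser.size)  # 2 elements, not 4
```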
DataFrame Data Structure
• 2-D labelled array and ordered collection of columns
• two axes - row index (axis=0) & column index (axis=1)
• elements are identifiable with the combination of row index
(index) and column index (column name)
• Indexes can be numbers or letters or strings
• Columns can have data of different types
• Values can be changed, i.e. value mutable
• Rows/columns can be added or deleted, i.e. size mutable
Creating and Displaying a DataFrame
• import pandas as pd
• import numpy as np
• <dataframe object>=pd.DataFrame(<2-D data structure>, [columns=<column
sequence>], [index=<index sequence>])
• 2-D Structure can be passed as:
– 2-D dictionaries (dictionaries having lists or dictionaries or ndarrays or Series
objects etc.)
– 2-D ndarrays
– Series type object
– Another DataFrame object
Creating and Displaying a DataFrame
• Creating DataFrame object using 2-D dictionary having values as lists/ndarrays
>>d1={'students':["Deepak","Abhijit","Neha","Swati","Shivansh"] , "age":
[30,32,28,30,2], "Sport":["Cricket","Volleyball","Football","Kabaddi","Athletics"]}
>>pd.DataFrame(d1)
   students  age       Sport
0  Deepak     30     Cricket
1  Abhijit    32  Volleyball
2  Neha       28    Football
3  Swati      30     Kabaddi
4  Shivansh    2   Athletics
• Keys of the 2-D dictionary have become columns
• Index generated using range(n)
• Order of columns may not be preserved in older pandas versions (recent versions preserve dictionary insertion order)
Creating and Displaying a DataFrame
• Creating DataFrame object using 2-D dictionary having values as dictionary objects
>>> d1={"name":"Ravi","age":25,"marks":60}
>>> d2={"name":"Anil","age":23,"marks":75}
>>> d3={"name":"Asha","age":20,"marks":70}
>>> res={1:d1,2:d2,3:d3}
>>> pd.DataFrame(res)
          1     2     3
age      25    23    20
marks    60    75    70
name   Ravi  Anil  Asha
Creating and Displaying a DataFrame
• Using from_dict, specify orient='index' to create the DataFrame using dictionary
keys as rows:
>>> d1={"name":"Ravi","age":25,"marks":60}
>>> d2={"name":"Anil","age":23,"marks":75}
>>> d3={"name":"Asha","age":20,"marks":70}
>>> res={1:d1,2:d2,3:d3}
>>> pd.DataFrame.from_dict(res,orient='index')
   name  age  marks
1  Ravi   25     60
2  Anil   23     75
3  Asha   20     70
Creating and Displaying a DataFrame
• Creating DataFrame object using a 2-D ndarray
>>data = np.array([[10,15],[20,25],[30,35],[40,45]])
>>pd.DataFrame(data)
    0   1
0  10  15
1  20  25
2  30  35
3  40  45

Specifying own columns and indexes:
>>data = np.array([[10,15],[20,25],[30,35],[40,45]],dtype=float)
>>pd.DataFrame(data, columns=["c1","c2"], index=[1,2,3,4])
     c1    c2
1  10.0  15.0
2  20.0  25.0
3  30.0  35.0
4  40.0  45.0
Creating and Displaying a
DataFrame
• Creating DataFrame object from a 2D dictionary with values as Series Objects
>>population=pd.Series([7897667,4577637,6457324],index=["Delhi","Mumbai","Dehradun"])
>>avgincome=pd.Series([78976,45637,67324],index=["Delhi","Mumbai","Dehradun"])
>>percapita=avgincome/population
>>dict1={"population":population,"avg income":avgincome,"per capita":percapita}
>>pd.DataFrame(dict1)
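The dictionary-of-Series construction above aligns the three Series on their shared city index; a runnable sketch with a few checks:

```python
import pandas as pd

population = pd.Series([7897667, 4577637, 6457324], index=["Delhi", "Mumbai", "Dehradun"])
avgincome = pd.Series([78976, 45637, 67324], index=["Delhi", "Mumbai", "Dehradun"])
percapita = avgincome / population   # element-wise, aligned on the city labels

df = pd.DataFrame({"population": population, "avg income": avgincome, "per capita": percapita})
print(df)
```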
Creating and Displaying a
DataFrame
• Creating DataFrame object from another DataFrame object
population=pd.Series([7897667,4577637,6457324],index=["Delhi","Mumbai","Dehradun"])
avgincome=pd.Series([78976,45637,67324],index=["Delhi","Mumbai","Dehradun"])
percapita=avgincome/population
dict1={"population":population,"avg income":avgincome,"per capita":percapita}
dtf1=pd.DataFrame(dict1)
dtf2=pd.DataFrame(dtf1,columns=["avg income","population","per capita"])
print(dtf2)
DataFrame Attributes
>>dtf=pd.DataFrame(res)
          1     2     3
age      25    23    20
marks    60    75    70
name   Ravi  Anil  Asha
• shape (returns a tuple (number of rows, number of columns))
>>dtf.shape
(3, 3)
DataFrame Attributes
• empty
>>dtf.empty
False
• len (to find number of rows)
>>len(dtf)
3
• dtypes
>>dtf.dtypes
1    object
2    object
3    object
dtype: object
• Transposing a DataFrame
>>dtf.T
   age marks  name
1   25    60  Ravi
2   23    75  Anil
3   20    70  Asha
DataFrame Attributes
• count (to count non-NaN values)
Selecting or Accessing Data
• Selecting a Column
– <DataFrame object>[<Column name>]
OR
– <DataFrame object>.<column name>
>>dtf
   students  age       Sport
0  Deepak     30     Cricket
1  Abhijit    32  Volleyball
2  Neha       28    Football
3  Swati      30     Kabaddi
4  Shivansh    2   Athletics
>>dtf['students']                      >>dtf.students
0    Deepak                            0    Deepak
1    Abhijit                           1    Abhijit
2    Neha                              2    Neha
3    Swati                             3    Swati
4    Shivansh                          4    Shivansh
Name: students, dtype: object          Name: students, dtype: object
Selecting or Accessing Data
• Selecting multiple Columns
– <DataFrame object>[[<col name>,<col name>…]]
>>dtf[['students','age']]
   students  age
0  Deepak     30
1  Abhijit    32
2  Neha       28
3  Swati      30
4  Shivansh    2
Selecting or Accessing Data
• Selecting a subset from a dataframe using row/column names
– <DataFrame object>.loc[<start row>:<end row>,<start column>:<end column>]
>>dtf.loc[0 , : ]   #To access one row
students    Deepak
age             30
Sport      Cricket
Name: 0, dtype: object
>>dtf.loc[0:2 , : ]   #to access multiple rows
   students  age       Sport
0  Deepak     30     Cricket
1  Abhijit    32  Volleyball
2  Neha       28    Football
>>dtf.loc[: ,"age":"Sport"]   #to access selective columns
   age       Sport
0   30     Cricket
1   32  Volleyball
2   28    Football
3   30     Kabaddi
4    2   Athletics
>>dtf.loc[2:4,"age":"Sport"]   #selective rows & columns
   age      Sport
2   28   Football
3   30    Kabaddi
4    2  Athletics
Note: with .loc the end index is included.
Selecting or Accessing Data
Obtaining a subset/slice using row/column numeric index/position
<DF object>.iloc[<start row index>:<end row index>,<start column index>:<end column index>]
>>dtf.iloc[0:2,1:3]
   age       Sport
0   30     Cricket
1   32  Volleyball
Note: with .iloc the end index is excluded.
>>dtf.iloc[2:4]
   students  age     Sport
2  Neha       28  Football
3  Swati      30   Kabaddi
Selecting or Accessing Data
• Selecting/Accessing Individual values
<DF object>.<column>[<row name or row numeric index>]
>>dtf.students[4]
'Shivansh'
>>dtf.students[[4,3]]
4    Shivansh
3    Swati
Name: students, dtype: object
<DF object>.at[<row label>,<col label>]
<DF object>.iat[<row index no>,<col index number>]
>>dtf.at[2,"students"]
'Neha'
>>dtf.iat[2,2]
'Football'
Assigning/Modifying Data Values in Dataframes
• To change or add a column
<DF object>[<column name>]=<new value>
>>dtf1=pd.DataFrame(dtf)
>>dtf1["country"]="india"   # no column named country, therefore a new column is added
>>dtf1
   students  age       Sport country
0  Deepak     30     Cricket   india
1  Abhijit    32  Volleyball   india
2  Neha       28    Football   india
3  Swati      30     Kabaddi   india
4  Shivansh    2   Athletics   india
>>dtf1["country"]="Bharat"   # column already exists, so all its values are replaced
>>dtf1
   students  age       Sport country
0  Deepak     30     Cricket  Bharat
1  Abhijit    32  Volleyball  Bharat
2  Neha       28    Football  Bharat
3  Swati      30     Kabaddi  Bharat
4  Shivansh    2   Athletics  Bharat
Assigning/Modifying Data Values in Dataframes
• To change or add a row
<DF object>.loc[<row label>]=<new value>   # .at needs both a row and a column label
>>dtf1.loc[6]=20   # every column of row 6 gets the value 20
>>dtf1.loc[5]=["Ravi",56,"Badminton","Bharat"]
>>dtf1
   students   age       Sport country
0  Deepak    30.0     Cricket  Bharat
1  Abhijit   32.0  Volleyball  Bharat
2  Neha      28.0    Football  Bharat
3  Swati     30.0     Kabaddi  Bharat
4  Shivansh   2.0   Athletics  Bharat
6  20        20.0          20      20
5  Ravi      56.0   Badminton  Bharat
• Add or + / Sub or -   (using dtf1 and dtf2 as defined under "Binary operations in a DataFrame" below)
>>dtf1+dtf2   (same as dtf1.add(dtf2))
    0   1   2
0  11  22  33
1  44  55  66
2  77  88  99
>>dtf2-dtf1   (same as dtf2.sub(dtf1))
    0   1   2
0   9  18  27
1  36  45  54
2  63  72  81
>>dtf1.radd(dtf2)   # dtf2 + dtf1, check with strings
    0   1   2
0  11  22  33
1  44  55  66
2  77  88  99
>>dtf2.rsub(dtf1)   # dtf1 - dtf2
     0    1    2
0   -9  -18  -27
1  -36  -45  -54
2  -63  -72  -81
Binary operations in a DataFrame
dtf1=pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]])
dtf2=pd.DataFrame([[10,20,30],[40,50,60],[70,80,90]])
• Div or /
>>dtf1/dtf2   (same as dtf1.div(dtf2))
     0    1    2
0  0.1  0.1  0.1
1  0.1  0.1  0.1
2  0.1  0.1  0.1
>>dtf1.rdiv(dtf2)   # dtf2 / dtf1
      0     1     2
0  10.0  10.0  10.0
1  10.0  10.0  10.0
2  10.0  10.0  10.0
• Mul or *
>>dtf1*dtf2   (same as dtf1.mul(dtf2))
     0    1    2
0   10   40   90
1  160  250  360
2  490  640  810
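A runnable check of these element-wise operators:

```python
import pandas as pd

dtf1 = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
dtf2 = pd.DataFrame([[10, 20, 30], [40, 50, 60], [70, 80, 90]])

added = dtf1 + dtf2        # same as dtf1.add(dtf2)
product = dtf1.mul(dtf2)   # same as dtf1 * dtf2
ratio = dtf1.rdiv(dtf2)    # same as dtf2 / dtf1

print(added)
print(product)
print(ratio)
```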
Essential Functions
• Inspection Functions info() and describe()
>>dtf1
   0  1  2
0  1  2  3
1  4  5  6
2  7  8  9
>>dtf1.describe()
         0    1    2
count  3.0  3.0  3.0   # count of non-NA values
mean   4.0  5.0  6.0   # mean of values in column
std    3.0  3.0  3.0   # std. deviation
min    1.0  2.0  3.0   # min value
25%    2.5  3.5  4.5   # 25th percentile
50%    4.0  5.0  6.0
75%    5.5  6.5  7.5
max    7.0  8.0  9.0   # max value
>>dtf1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
0    3 non-null int64
1    3 non-null int64
2    3 non-null int64
dtypes: int64(3)
memory usage: 112.0 bytes
Essential Functions
• describe() function in case of string columns
>>dtf
          1     2     3
age      25    23    20
marks    60    75    70
name   Ravi  Anil  Asha
>>dtf.describe()
         1   2   3
count    3   3   3   # non-NA entries in the column
unique   3   3   3   # unique entries in the column
top     25  23  20   # most common entry (arbitrary here, since all entries are unique)
freq     1   1   1   # frequency of the most common element
Essential Functions
• head() and tail()
– <DF object>.head([n]) #rows from top, default value 5
– <DF object>.tail([n]) #rows from bottom, default value 5
>>dtf1
      0     1     2
0   1.0   2.0   3.0
1   4.0   5.0   6.0
2   7.0   8.0   9.0
4  10.0  11.0  12.0
5  13.0  14.0  15.0
6  16.0  17.0  18.0
>>dtf1.head(4)
      0     1     2
0   1.0   2.0   3.0
1   4.0   5.0   6.0
2   7.0   8.0   9.0
4  10.0  11.0  12.0
>>dtf1.head(7).tail(2)
      0     1     2
5  13.0  14.0  15.0
6  16.0  17.0  18.0
Essential Functions
• Cumulative Calculation Functions
– <DF>.cumsum([axis=None])   # default is down the rows (axis=0)
– Calculates the cumulative sum: each value is replaced by the sum of itself and all prior values along the axis.
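A small sketch of cumsum along both axes:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

down = df.cumsum()           # default axis=0: running sum down each column
across = df.cumsum(axis=1)   # running sum across each row

print(down)
print(across)
```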
Matching and Broadcasting Operations
• Matching: When performing arithmetic operations, data is aligned on
the basis of matching indexes and then the arithmetic is performed; for non-
overlapping indexes the operation results in NaN.
• Broadcasting: The smaller object is broadcast across the size of the larger
object so that their shapes are compatible.
>>dtf1
      0     1     2
0   1.0   2.0   3.0
1   4.0   5.0   6.0
2   7.0   8.0   9.0
4  10.0  11.0  12.0
5  13.0  14.0  15.0
6  16.0  17.0  18.0
>>dtf1*4   # the scalar 4 is broadcast to the shape of dtf1
      0     1     2
0   4.0   8.0  12.0
1  16.0  20.0  24.0
2  28.0  32.0  36.0
4  40.0  44.0  48.0
5  52.0  56.0  60.0
6  64.0  68.0  72.0
Handling Missing Data
• Missing values are values that cannot contribute to any
computation (NULL or NaN or None).
• Sources of Missing Values
– User forgot to fill in a field.
– Data was lost while transferring manually from a legacy
database.
– There was a programming error.
Handling Missing Data
Creating data with missing values:
df = pd.DataFrame(np.random.randn(5,3), index=['a', 'c', 'e', 'f','h'],\
columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df)
        one       two     three
a  0.297695 -1.122686  0.707862
b       NaN       NaN       NaN
c  0.992830 -0.421574  0.922447
d       NaN       NaN       NaN
(rows e to h continue; the labels b, d, and g added by reindex get NaN rows)
Handling Missing Data
Check for missing values: isnull() and notnull() functions
print(df.isnull())          print(df.notnull())         print(df['two'].notnull())
     one    two  three           one    two  three      a     True
a  False  False  False      a   True   True   True      b    False
b   True   True   True      b  False  False  False      c     True
c  False  False  False      c   True   True   True      d    False
d   True   True   True      d  False  False  False      e     True
e  False  False  False      e   True   True   True      f     True
f  False  False  False      f   True   True   True      g    False
g   True   True   True      g  False  False  False      h     True
h  False  False  False      h   True   True   True      Name: two, dtype: bool
Handling Missing Data
• Calculations with Missing Data
– When summing data, NaN is treated as zero (sum() skips NaN by default)
print(df['two'].sum())
-0.5127836471871685
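The skipna behaviour can be checked on a small Series with made-up values:

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, 2.0])

print(s.sum())              # NaN skipped -> 3.0
print(s.sum(skipna=False))  # nan, once NaNs are no longer skipped
```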
Handling Missing Data
• Filling Missing Data
– fillna(<n>): fills all NaN values with the value n
– fillna(<dictionary having fill values for each column>)
print(df.fillna(7))
        one       two     three
a  0.210141  0.802803  0.765983
b  7.000000  7.000000  7.000000
c -1.079648 -1.270144 -2.003383
d  7.000000  7.000000  7.000000
e -0.377301  0.183492  0.347551
f  1.457437  0.167696  1.045180
g  7.000000  7.000000  7.000000
h  0.957371 -0.000443  0.648315
print(df.fillna({"one":10,"two":20,"three":30}))
         one        two      three
a   1.567348   0.343046  -2.076060
b  10.000000  20.000000  30.000000
c   0.460282  -0.398845   1.390478
d  10.000000  20.000000  30.000000
e   1.245430   1.633978   1.818819
f   1.643829  -0.756501  -0.378391
g  10.000000  20.000000  30.000000
h  -0.282579   1.307354   0.440571
Comparisons of Pandas objects
• np.nan==np.nan -> False
df1=pd.DataFrame([(1,2,3),(4,5,6),(7,8,9)],columns=["A","B","C"])
print(df1)
   A  B  C
0  1  2  3
1  4  5  6
2  7  8  9
df2=pd.DataFrame([(11,22,np.nan),(44,55,np.nan),(77,88,99)],columns=["A","B","C"])
print(df2)
    A   B     C
0  11  22   NaN
1  44  55   NaN
2  77  88  99.0
print(df1+df2==df1.add(df2))   # element-wise; positions holding NaN compare as False
print((df1+df2).equals(df1.add(df2)))   # Returns True; equals() treats NaNs in the same location as equal
Combining DataFrames
• Using concat: (concatenates all the rows)
one = pd.DataFrame({
    'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
    'subject_id':['sub1','sub2','sub4','sub6','sub5'],
    'Marks_scored':[98,90,87,69,78]},
    index=[1,2,3,4,5])
two = pd.DataFrame({
    'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
    'subject_id':['sub2','sub4','sub3','sub6','sub5'],
    'Marks_scored':[89,80,79,97,88]},
    index=[1,2,3,4,5])
print(pd.concat([one,two],axis=0))
     Name subject_id  Marks_scored
1    Alex       sub1            98
2     Amy       sub2            90
3   Allen       sub4            87
4   Alice       sub6            69
5  Ayoung       sub5            78
1   Billy       sub2            89
2   Brian       sub4            80
3    Bran       sub3            79
4   Bryce       sub6            97
5   Betty       sub5            88
Deepak Sharma, Asst. Professor, UPES Dehradun
Combining DataFrames
• Using merge: (two tables with some common values)
left = pd.DataFrame({'id':[1,2,3,4,5],
    'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
    'subject_id':['sub1','sub2','sub4','sub6','sub5']})
right = pd.DataFrame({'id':[1,2,3,4,5],
    'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
    'subject_id':['sub2','sub4','sub3','sub6','sub5']})
print(left)                          print(right)
   id    Name subject_id               id   Name subject_id
0   1    Alex       sub1            0   1  Billy       sub2
1   2     Amy       sub2            1   2  Brian       sub4
2   3   Allen       sub4            2   3   Bran       sub3
3   4   Alice       sub6            3   4  Bryce       sub6
4   5  Ayoung       sub5            4   5  Betty       sub5
Combining DataFrames
• Merge Two DataFrames on a Key
print(pd.merge(left,right,on='id'))
   id  Name_x subject_id_x Name_y subject_id_y
0   1    Alex         sub1  Billy         sub2
1   2     Amy         sub2  Brian         sub4
2   3   Allen         sub4   Bran         sub3
3   4   Alice         sub6  Bryce         sub6
4   5  Ayoung         sub5  Betty         sub5
105
Combining
• Merge Using 'how' Argument
DataFrames id Name subject_id id Name subject_id
0 1 Alex sub1 0 1 Billy sub2
1 2 Amy sub2 1 2 Brian sub4
Merge Method SQL Equivalent Description 2 3 Allen sub4 2 3 Bran sub3
3 4 Alice sub6 3 4 Bryce sub6
left LEFT OUTER JOIN Use keys from left 4 5 Ayoung sub5 4 5 Betty sub5
object
right RIGHT OUTER Use keys from print (pd.merge(left, right, on='subject_id',
JOIN right object how='left‘))
Name_x id_x subject_id Name_y id_y
outer FULL OUTER JOIN Use union of keys 0 Alex 1 sub1 NaN NaN
inner INNER JOIN Use intersection 1 Amy 2 sub2 Billy 1.0
of keys 2 Allen 3 sub4 Brian 2.0
3 Alice 4 sub6 Bryce 4.0
4 5 sub5 Betty 5.0
Ayoung
106
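A runnable check of how='left': every key from the left frame survives, and keys with no match on the right get NaN in the right-hand columns.

```python
import pandas as pd

left = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                     'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
                     'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']})
right = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                      'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
                      'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']})

# keep all left keys; unmatched right side becomes NaN
merged = pd.merge(left, right, on='subject_id', how='left')
print(merged)
```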