Python Unit-6: Pandas

PANDAS
• Pandas is an open-source Python library
• Provides high-performance data manipulation
• Flexible tool for data analysis
• Python with Pandas is used in a wide range of fields
– academics
– commercial domains including finance, economics, statistics, analytics, etc.
Why Pandas?
Most popular library in the scientific Python ecosystem for doing data analysis.
• Read and write many different data formats (CSV, Excel, JSON, SQL, etc.)
• Calculate across the ways data is organized (i.e. across rows and down columns)
• Select subsets of data and combine multiple datasets together
• Find and fill missing data
• Supports reshaping of data into different forms
• Supports advanced time series functionality
• Supports visualization by integrating matplotlib and other libraries
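A minimal sketch of two of these capabilities (selecting a subset and filling missing data); the city/temperature values are made up for illustration:

```python
import pandas as pd
import numpy as np

# A tiny DataFrame illustrating the bullet points above
df = pd.DataFrame({"city": ["Delhi", "Mumbai", "Pune"],
                   "temp": [31.0, np.nan, 27.5]})

subset = df[df["city"] != "Pune"]                 # select a subset of rows
filled = df.fillna({"temp": df["temp"].mean()})   # find and fill missing data

print(subset)
print(filled)
```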
Python Pandas Introduction
• Pandas is defined as an open-source library that provides high-performance data
manipulation in Python.
• The name Pandas is derived from "Panel Data", an econometrics term for multidimensional
data. It is used for data analysis in Python and was developed by Wes McKinney in 2008.
There are different tools available for fast data processing, such as NumPy, SciPy,
Cython, and Pandas, but we prefer Pandas because working with Pandas is fast and simple.
• Pandas is built on top of the NumPy package, which means NumPy is required for operating
Pandas.
Python Pandas Introduction
• Before Pandas, Python was capable of data preparation, but it provided only limited support for data
analysis. So Pandas came into the picture and enhanced the capabilities of data analysis. It can
perform the five significant steps required for processing and analysis of data irrespective of the origin of the data.
• It has a fast and efficient DataFrame object with default and customized indexing.
• Data Representation: It represents the data in a form that is suited for data analysis through its DataFrame
and Series.
• Clear code: The clear API of Pandas allows you to focus on the core part of the code.
1) Series
It is defined as a one-dimensional array that is capable of storing various data types. The
row labels of a Series are called the index. We can easily convert a list, tuple, or dictionary
into a Series using the Series() method. A Series cannot contain multiple columns.
import pandas as pd
import numpy as np
info = np.array(['P','a','n','d','a','s'])
a = pd.Series(info)
print(a)
Explanation: In this code, firstly, we have imported the pandas and numpy libraries
with the pd and np aliases. Then, we have taken a variable named "info" that consists
of an array of some values. We have passed the info variable to
the Series() method and assigned the result to a variable "a". The Series is printed by calling
print(a).
Python Pandas DataFrame
It is a widely used data structure of pandas and works with a two-dimensional array with
labeled axes (rows and columns). DataFrame is defined as a standard way to store data.
• The columns can be of heterogeneous types like int, bool, and so on.
• It can be seen as a dictionary of Series structures where both the rows and columns are indexed.
import pandas as pd
# a list of strings
x = ['Python', 'Pandas']
df = pd.DataFrame(x)
print(df)

Creating numpy arrays from lists:
import numpy as np
list1 = [1,2,3,4]
list2 = [[10,12],[20,34]]
array1 = np.array(list1)
array2 = np.array(list2)
print("Array:", array1)
print("Array:\n", array2)
Array Attributes

type and shape:
ar1=np.array([[2,4,6],[6,7,8]])
print(type(ar1))    # <class 'numpy.ndarray'>
print(ar1.shape)    # (2, 3)

itemsize and size:
print(ar1.itemsize) # length of each element in bytes, e.g. 4 for int32
print(ar1.size)     # 6

dtype and ndim:
print(ar1.dtype)    # int32 (platform dependent)
print(ar1.ndim)     # 2
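A runnable check of the attributes above (forcing dtype=np.int32 so that itemsize is 4 on every platform):

```python
import numpy as np

ar1 = np.array([[2, 4, 6], [6, 7, 8]], dtype=np.int32)

print(type(ar1))     # class of the object
print(ar1.shape)     # (rows, columns)
print(ar1.itemsize)  # bytes per element
print(ar1.size)      # total number of elements
print(ar1.dtype)     # element type
print(ar1.ndim)      # number of dimensions
```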
Ways to create numpy arrays
2) Using zeros()
– numpy.zeros(shape [, dtype=<datatype>][, order='C' or 'F'])
>>a2=np.zeros([2,3])
>>a2
array([[0., 0., 0.],
       [0., 0., 0.]])
3) Using ones()
– numpy.ones(shape [, dtype=<datatype>][, order='C' or 'F'])
>>a2=np.ones([2,3])
>>a2
array([[1., 1., 1.],
       [1., 1., 1.]])
Ways to create numpy arrays
4) Creating arrays with a numerical range using arange()
– arrayname=numpy.arange([start,] stop [,step] [,dtype])
>>ar=np.arange(5)
>>ar
array([0, 1, 2, 3, 4])
>>ar=np.arange(3,8,1.5,np.float64)
>>ar
array([3. , 4.5, 6. , 7.5])
5) Using linspace()
– arrayname=numpy.linspace(<start>,<stop>,<number of values to be generated>)
>>ar=np.linspace(3,10,4)
>>ar
array([ 3.        ,  5.33333333,  7.66666667, 10.        ])
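The difference between the two is worth checking: arange works on a half-open interval with a step, while linspace generates exactly n values including both endpoints.

```python
import numpy as np

# arange: half-open interval [start, stop) with a step
ar = np.arange(3, 8, 1.5)
print(ar)  # [3.  4.5 6.  7.5]

# linspace: exactly n evenly spaced values, endpoints included
ls = np.linspace(3, 10, 4)
print(ls)
```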
Pandas Data Structures
• Series
– 1-D array-like object containing an array of data and an associated array of data
labels (index).
• Dataframes
– 2-D labeled array-like pandas data structure that stores an ordered collection of
columns that can store data of different types.

Series:              Dataframe:
Index  Data             c1   c2
1      10.0         r1  10   3.5
3      20.0         r2  20   5.7
4      30.0         r3  30   7.9
Python Pandas Series
The Pandas Series can be defined as a one-dimensional array that is capable of storing various data types. We
can easily convert a list, tuple, or dictionary into a Series using the Series() method. The row labels of a Series are
called the index. A Series cannot contain multiple columns. It has the following parameter:
index: The value of the index should be unique and hashable. It must be of the same length as data. If we do
not pass an index, then by default range(n) is used.
import pandas as pd
x = pd.Series()
print (x)
Create a Series using various inputs: Array, Dict, Scalar value
Python Pandas Series
Creating Series from Array: Before creating a Series, firstly, we have to import the numpy module
and then use array() function in the program. If the data is ndarray, then the passed index must be of
the same length.
• If we do not pass an index, then by default index of range(n) is being passed where n defines
the length of an array, i.e., [0,1,2,....range(len(array))-1].
import pandas as pd
import numpy as np
info = np.array(['P','a','n','d','a','s'])
a = pd.Series(info)
print(a)
Python Pandas Series
Create a Series from dict: We can also create a Series from a dict. If a dictionary object is
passed as input and the index is not specified, then the dictionary keys are taken to construct the
index (in sorted order in older pandas versions; recent versions preserve insertion order).
• If an index is passed, then the values corresponding to each label in the index will be extracted from
the dictionary.
#import the pandas library
import pandas as pd
import numpy as np
info = {'x' : 0., 'y' : 1., 'z' : 2.}
a = pd.Series(info)
print (a)
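Extending the example above with an explicit index: values for matching labels are pulled out of the dict, and a label missing from the dict (here the made-up label 'w') becomes NaN.

```python
import pandas as pd
import numpy as np

info = {'x': 0., 'y': 1., 'z': 2.}

# 'y' and 'z' are looked up in the dict; 'w' has no entry -> NaN
a = pd.Series(info, index=['y', 'z', 'w'])
print(a)
```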
Python Pandas Series
Create a Series using Scalar: If we take a scalar value, then the index must be provided. The
scalar value will be repeated to match the length of the index.
import pandas as pd
x = pd.Series(4, index=['a', 'b', 'c'])
print (x)
Accessing an element by position:
x = pd.Series([1,2,3])
#retrieve the first element
print (x[0])
Series object attributes
• A Series attribute is defined as any information related to the Series object, such as size, datatype, etc.
Below are some of the attributes that you can use to get information about a Series object:
Attributes Description
Series.hasnans It returns True if there are any NaN values; otherwise, it returns False.
import numpy as np
import pandas as pd
a=pd.Series(data=[1,2,3,4])
b=pd.Series(data=[4.9,8.2,5.6],index=['x','y','z'])
print(a.shape)
print(b.shape)
Retrieving Dimension, Size and Number of bytes:
import numpy as np
import pandas as pd
a=pd.Series(data=[1,2,3,4])
b=pd.Series(data=[4.9,8.2,5.6],index=['x','y','z'])
print(a.ndim, b.ndim)
print(a.size, b.size)
print(a.nbytes, b.nbytes)
Checking Emptiness and Presence of NaNs
To check whether a Series object is empty, you can use the empty attribute. Similarly, to check if a Series
object contains NaN values or not, you can use the hasnans attribute.
import numpy as np
import pandas as pd
a=pd.Series(data=[1,2,3,np.nan])
b=pd.Series(data=[4.9,8.2,5.6],index=['x','y','z'])
c=pd.Series()
print(a.empty,b.empty,c.empty)
print(a.hasnans,b.hasnans,c.hasnans)
print(len(a),len(b))
print(a.count( ),b.count( ))
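Running the snippets above yields the following attribute values (a sketch; note that pd.Series() with no data warns in recent pandas unless a dtype is given, and that count() skips NaN while len() does not):

```python
import numpy as np
import pandas as pd

a = pd.Series(data=[1, 2, 3, np.nan])
b = pd.Series(data=[4.9, 8.2, 5.6], index=['x', 'y', 'z'])
c = pd.Series(dtype=float)          # explicit dtype avoids the empty-Series warning

print(a.shape, b.shape)             # (4,) (3,)
print(a.ndim, a.size, a.nbytes)     # 1 4 32 (four float64 values)
print(a.hasnans, b.hasnans)         # True False
print(c.empty)                      # True
print(len(a), a.count())            # 4 3 -> count() skips the NaN
```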
Creating Series objects
1) Empty Series object
import pandas as pd
<Series object>=pandas.Series()
>>ser=pd.Series()
>>ser
Series([], dtype: float64)
Creating Series objects
2) Non-empty Series object
<Series object>=pandas.Series(data [, index=idx][,dtype=<data type>])
Creating Series objects
Data as a Python sequence:
>>ser=pd.Series([10,25,34,41])
>>ser
0    10
1    25
2    34
3    41
dtype: int64
>>ser=pd.Series(range(20,50,8),index=[10,20,30,40])
>>ser
10    20
20    28
30    36
40    44
dtype: int64
Creating Series objects
Data as an ndarray:
>>ser=pd.Series(np.arange(5,16,3))
>>ser
0     5
1     8
2    11
3    14
dtype: int32
>>ser=pd.Series(np.arange(5,16,3),index=[10,20,30])
ValueError: Length of passed values is 4, index implies 3
Creating Series objects
Data as Python Dictionary:
>>ser=pd.Series({"Jan":31,"Feb":28,"Mar":31,"Apr":30})
>>ser
Jan 31
Feb 28
Mar 31
Apr 30
dtype: int64
Note: In older pandas versions the order of indexes may differ from the order of keys in the dictionary; recent versions preserve insertion order.
Creating Series objects
Data as a scalar value:
• Index must be provided
>>ser=pd.Series(10, index=["r1","r2"])
>>ser
r1    10
r2    10
dtype: int64
>>ser=pd.Series('UPES', index=range(10,40,10))
>>ser
10    UPES
20    UPES
30    UPES
dtype: object
Creating Series objects
Adding NaN Values in a series object:
• To fill missing data
>>ser=pd.Series([10, 25, np.nan, 56, np.nan, 80])
>>ser
0 10.0
1 25.0
2 NaN
3 56.0
4 NaN
5 80.0
dtype: float64
Creating Series objects
Using a mathematical function to create the data array in Series()
>>ar=np.arange(10,35,5)
>>ar
array([10, 15, 20, 25, 30])
>>ser=pd.Series(ar, ar**2)   # second positional argument is the index
>>ser
100    10
225    15
400    20
625    25
900    30
dtype: int32
>>print(pd.Series(data=(2*[10,20,30])))
0    10
1    20
2    30
3    10
4    20
5    30
dtype: int64
Creating Series objects
Repetitive Index:
>>ser=pd.Series(range(10,60,10),index=["r1","r2","r1","r4","r1"])
>>ser
r1 10
r2 20
r1 30
r4 40
r1 50
dtype: int64
>>ser["r1"]
r1 10
r1 30
r1 50
dtype: int64
Series Object Attributes
• Series.index
>>ser=pd.Series(range(10,60,10),index=["r1","r2","r1","r4","r2"])
>>ser.index
Index(['r1', 'r2', 'r1', 'r4', 'r2'], dtype='object')
• Series.values
>>ser=pd.Series(range(10,60,10),index=["r1","r2","r1","r4","r2"])
>>ser.values
array([10, 20, 30, 40, 50], dtype=int64)
Series Object Attributes
• Series.dtype (Returns the dtype object of the underlying data)
>>ser=pd.Series(range(10,60,10),index=["r1","r2","r1","r4","r2"])
>>ser.dtype
dtype('int64')
• Series.shape (Returns a tuple of the shape of the underlying data)
>>ser.shape
(5,)
Series Object Attributes
• Series.nbytes (Returns number of bytes in the underlying data)
>>ser=pd.Series(range(10,60,10),index=["r1","r2","r1","r4","r2"])
>>ser.nbytes
40
• Series.ndim (Returns number of dimensions of the underlying data)
>>ser.ndim
1
Series Object Attributes
• Series.size (Returns number of elements)
>>ser=pd.Series(range(10,60,10),index=["r1","r2","r1","r4","r2"])
>>ser.size
5
• Series.itemsize (Returns the size of the dtype; removed in recent pandas versions)
Series Object Attributes
• Series.hasnans (Returns True if any NaN value is found)
>>ser=pd.Series(range(10,60,10),index=["r1","r2","r1","r4","r2"])
>>ser.hasnans
False
• Series.empty (Returns True if the series is empty)
>>ser.empty
False
Accessing elements from Series
>>ser=pd.Series({"Jan":31,"Feb":28,"Mar":31,"Apr":30,"May":31,"Jun":30,"Jul":31,"Aug":31,
"Sept":30,"Oct":31,"Nov":30,"Dec":31})
>>ser
Jan     31
Feb     28
Mar     31
Apr     30
May     31
Jun     30
Jul     31
Aug     31
Sept    30
Oct     31
Nov     30
Dec     31
dtype: int64

Try Yourself:
>>ser[2]
>>ser[2:4]
>>ser[-9:-5]
>>ser[-5:-12]
>>ser[1:10:2]
>>ser[ : :-1]
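A sketch of the "Try Yourself" answers, written with .iloc for positional access (plain ser[2] still works here, but positional integer indexing on a labelled Series is deprecated in pandas 2.x):

```python
import pandas as pd

ser = pd.Series({"Jan": 31, "Feb": 28, "Mar": 31, "Apr": 30, "May": 31, "Jun": 30,
                 "Jul": 31, "Aug": 31, "Sept": 30, "Oct": 31, "Nov": 30, "Dec": 31})

print(ser.iloc[2])        # value at position 2 (Mar)
print(ser.iloc[2:4])      # Mar, Apr
print(ser.iloc[-9:-5])    # Apr, May, Jun, Jul
print(ser.iloc[-5:-12])   # empty slice (start is after stop)
print(ser.iloc[1:10:2])   # every second month from Feb to Oct
print(ser.iloc[::-1])     # whole series reversed
```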
Operations on Series Object
• Modifying Series Object
– <SeriesObject>[<index>]=<new data value>
– <SeriesObject>[start:stop]=<new data value>   #replace all the values in the slice
>>ser=pd.Series([10,20,30,40],index=range(4))
>>ser
0    10
1    20
2    30
3    40
dtype: int64
>>ser[2]=56
>>ser
0    10
1    20
2    56
3    40
dtype: int64
>>ser[2:4]=45
>>ser
0    10
1    20
2    45
3    45
dtype: int64
Operations on Series Object
• Modifying Series Indexes
– <Object>.index=<new index array>
>>ser
0    10
1    20
2    45
3    45
dtype: int64
>>ser.index=['a','b','c','d']   # assigns a new index of the same length
Operations on Series Object
• head() and tail() functions
– <pandas object>.head([n]) #To fetch first n rows from a pandas object, default is 5
– <pandas object>.tail([n]) #To fetch last n rows from a pandas object, default is 5
>>ser=pd.Series([10,20,30,40,50,60,70,80])
>>ser.tail()
3    40
4    50
5    60
6    70
7    80
dtype: int64
>>ser.head(3)
0    10
1    20
2    30
dtype: int64
Operations on Series Object
• Vectorized operations (the operation is applied to every element)
>>ser=pd.Series([10,20,30,40,50,60,70,80])
>>ser*2
0     20
1     40
2     60
3     80
4    100
5    120
6    140
7    160
dtype: int64
>>newser=ser**2
>>newser
0     100
1     400
2     900
3    1600
4    2500
5    3600
6    4900
7    6400
dtype: int64
>>ser>20
0    False
1    False
2     True
3     True
4     True
5     True
6     True
7     True
dtype: bool
Operations on Series Object
• Arithmetic operations on Series Object
– Operation is performed only on the matching indexes
– If the data items of the two matches are not compatible with the operation, the result
will be NaN
>>ser1            >>ser2            >>ser1+ser2
0    10           0    30           0     40.0
1    20           1    40           1     60.0
2    30           2    50           2     80.0
3    40           3    60           3    100.0
4    50           dtype: int64      4      NaN
dtype: int64                        dtype: float64
Operations on Series Object
>>ser1=pd.Series([10,20,30,40,50],['a','b','c','d','e'])
>>ser2=pd.Series((1,2,3,4,5,6),('a','b','e','g','d','h'))
>>ser1+ser2       >>ser1/ser2
a    11.0         a    10.000000
b    22.0         b    10.000000
c     NaN         c          NaN
d    45.0         d     8.000000
e    53.0         e    16.666667
g     NaN         g          NaN
h     NaN         h          NaN
dtype: float64    dtype: float64
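The index alignment above can be checked directly; labels present in only one of the two Series come out as NaN:

```python
import pandas as pd
import numpy as np

ser1 = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])
ser2 = pd.Series((1, 2, 3, 4, 5, 6), index=('a', 'b', 'e', 'g', 'd', 'h'))

total = ser1 + ser2   # aligned on labels, not positions
print(total)
```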
Operations on Series Object
• Filtering Entries
– <Series Object>[<Boolean Expression on Series Object>]
>>ser1=pd.Series([10,20,30,40,50],['a','b','c','d','e'])
>>ser1>20         >>ser1[ser1>20]
a    False        c    30
b    False        d    40
c     True        e    50
d     True        dtype: int64
e     True
dtype: bool
Operations on Series Object
• Reindexing (creating an object with a different order of the same indexes)
– <Series Object>=<object>.reindex(<new index sequence>)
• Dropping Entries
– <Series Object>.drop(<index to be removed>)
>>ser1=pd.Series([10,20,30,40,50],['a','b','c','d','e'])
>>ser1.drop('e')
a    10
b    20
c    30
d    40
dtype: int64
>>ser1=pd.Series([10,20,30,40],['a','b','c','d'])
>>ser2=ser1.reindex(['d','a','b','c'])
>>ser2
d    40
a    10
b    20
c    30
dtype: int64
Series Functions

Function                 Description
Series.map()             Map the values from two series that have a common column.
Series.std()             Calculate the standard deviation of the given set of numbers, DataFrame, column, or rows.
Series.to_frame()        Convert the series object to a dataframe.
Series.value_counts()    Returns a Series that contains counts of unique values.
Pandas Series.map()
The main task of map() is to map the values from two series that have a common column. To map
the two Series, the last column of the first Series should be the same as the index column of the second
Series, and the values should be unique.
Parameters
•arg: function, dict, or Series.
It refers to the mapping correspondence.
•na_action: {None, 'ignore'}, default None. If 'ignore', NaN values are propagated without passing them to the
mapping correspondence.
Returns
It returns a Pandas Series with the same index as the caller.
Pandas Series.map()
import pandas as pd
import numpy as np
a = pd.Series(['Java', 'C', 'C++', np.nan])
a.map({'Java': 'Core'})
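When mapping with a dict, values that have no entry in the dict become NaN; a quick check of the example above:

```python
import pandas as pd
import numpy as np

a = pd.Series(['Java', 'C', 'C++', np.nan])
mapped = a.map({'Java': 'Core'})  # only 'Java' has a mapping; the rest become NaN
print(mapped)
```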
Pandas Series.std()
The Pandas std() is defined as a function for calculating the standard deviation of a given set of
numbers, a DataFrame, columns, or rows.
The standard deviation is normalized by N-1 by default (sample standard deviation) and can be changed
using the ddof argument. Note that numpy's np.std(), used below, normalizes by N (ddof=0) by default.
Syntax:
Series.std(axis=None, skipna=None, level=None, ddof=1, numeric_only=None, **kwargs)
import pandas as pd
# calculate standard deviation
import numpy as np
print(np.std([4,7,2,1,6,3]))
print(np.std([6,9,15,2,-17,15,4]))
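A sketch of the ddof difference between np.std and Series.std on the first data set above:

```python
import numpy as np
import pandas as pd

data = [4, 7, 2, 1, 6, 3]

pop_std = np.std(data)              # ddof=0: divide by N (population std)
sample_std = pd.Series(data).std()  # ddof=1: divide by N-1 (sample std)

print(pop_std, sample_std)
print(np.std(data, ddof=1))         # matches Series.std()
```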
Numpy Arrays vs Pandas Series
• Vector operations on 2 ndarrays with different shapes will result
in an error.
• In ndarrays, the indexes are always numeric, starting from 0
onwards.
• But Series objects can have any type of indexes, including
numbers (not necessarily starting from 0), letters, labels,
strings, etc.
Checkpoint
Predict output:
a) pd.Series(((10,20),(30,40)))[0]
b) pd.Series(((10,20),(30,40))).size
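A way to check the predictions: a tuple of tuples becomes a 2-element object Series, with one whole tuple per element.

```python
import pandas as pd

ser = pd.Series(((10, 20), (30, 40)))
print(ser[0])    # the first element is the whole tuple (10, 20)
print(ser.size)  # 2 elements, not 4
```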
DataFrame Data Structure
• 2-D labelled array and ordered collection of columns
• two axes - row index (axis=0) & column index (axis=1)
• elements are identifiable with the combination of row index
(index) and column index (column name)
• Indexes can be numbers or letters or strings
• Columns can have data of different types
• Values can be changed, i.e. value mutable
• Rows/columns can be added or deleted, i.e. size mutable
Creating and Displaying a DataFrame
• import pandas as pd
• import numpy as np
• <dataframe object>=pd.DataFrame(<2-D data structure>, [columns=<column
sequence>], [index=<index sequence>])
• 2-D Structure can be passed as:
– 2-D dictionaries (dictionaries having lists or dictionaries or ndarrays or Series
objects etc.)
– 2-D ndarrays
– Series type object
– Another DataFrame object
Creating and Displaying a DataFrame
• Creating DataFrame object using 2-D dictionary having values as lists/ndarrays
>>d1={'students':["Deepak","Abhijit","Neha","Swati","Shivansh"] , "age":
[30,32,28,30,2], "Sport":["Cricket","Volleyball","Football","Kabaddi","Athletics"]}
>>pd.DataFrame(d1)
   students  age       Sport
0  Deepak     30     Cricket
1  Abhijit    32  Volleyball
2  Neha       28    Football
3  Swati      30     Kabaddi
4  Shivansh    2   Athletics
• Keys of the 2-D dictionary have become columns
• Index generated using range(n)
• Order of columns may not be preserved in older pandas versions (recent versions preserve dictionary insertion order)
Creating and Displaying a DataFrame
• Creating DataFrame object using 2-D dictionary having values as dictionary objects
>>> d1={"name":"Ravi","age":25,"marks":60}
>>> d2={"name":"Anil","age":23,"marks":75}
>>> d3={"name":"Asha","age":20,"marks":70}
>>> res={1:d1,2:d2,3:d3}
>>> pd.DataFrame(res)
          1     2     3
age      25    23    20
marks    60    75    70
name   Ravi  Anil  Asha
Creating and Displaying a DataFrame
• Using from_dict, specify orient='index' to create the DataFrame using dictionary
keys as rows:
>>> d1={"name":"Ravi","age":25,"marks":60}
>>> d2={"name":"Anil","age":23,"marks":75}
>>> d3={"name":"Asha","age":20,"marks":70}
>>> res={1:d1,2:d2,3:d3}
>>> pd.DataFrame.from_dict(res,orient='index')
   name  age  marks
1  Ravi   25     60
2  Anil   23     75
3  Asha   20     70
Creating and Displaying a DataFrame
• Creating DataFrame object using a 2-D ndarray
>>data = np.array([[10,15],[20,25],[30,35],[40,45]])
>>pd.DataFrame(data)
    0   1
0  10  15
1  20  25
2  30  35
3  40  45

Specifying own columns and indexes:
>>data = np.array([[10,15],[20,25],[30,35],[40,45]],dtype=float)
>>pd.DataFrame(data, columns=["c1","c2"], index=[1,2,3,4])
     c1    c2
1  10.0  15.0
2  20.0  25.0
3  30.0  35.0
4  40.0  45.0
Creating and Displaying a
DataFrame
• Creating DataFrame object from a 2D dictionary with values as Series Objects
>>population=pd.Series([7897667,4577637,6457324],index=["Delhi","Mumbai","Dehradun"])
>>avgincome=pd.Series([78976,45637,67324],index=["Delhi","Mumbai","Dehradun"])
>>percapita=avgincome/population
>>dict1={"population":population,"avg income":avgincome,"per capita":percapita}
>>pd.DataFrame(dict1)
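The dictionary-of-Series construction above aligns the three Series on their shared city index; a runnable sketch with a few checks:

```python
import pandas as pd

population = pd.Series([7897667, 4577637, 6457324], index=["Delhi", "Mumbai", "Dehradun"])
avgincome = pd.Series([78976, 45637, 67324], index=["Delhi", "Mumbai", "Dehradun"])
percapita = avgincome / population   # element-wise, aligned on the city labels

df = pd.DataFrame({"population": population, "avg income": avgincome, "per capita": percapita})
print(df)
```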
Creating and Displaying a
DataFrame
• Creating DataFrame object from another DataFrame object
population=pd.Series([7897667,4577637,6457324],index=["Delhi","Mumbai","Dehradun"])
avgincome=pd.Series([78976,45637,67324],index=["Delhi","Mumbai","Dehradun"])
percapita=avgincome/population
dict1={"population":population,"avg income":avgincome,"per capita":percapita}
dtf1=pd.DataFrame(dict1)
dtf2=pd.DataFrame(dtf1,columns=["avg income","population","per capita"])
print(dtf2)
DataFrame Attributes
>>dtf=pd.DataFrame(res)
          1     2     3
age      25    23    20
marks    60    75    70
name   Ravi  Anil  Asha
• shape (returns a tuple (number of rows, number of columns))
>>dtf.shape
(3, 3)
DataFrame Attributes
• empty
>>dtf.empty
False
• len (to find number of rows)
>>len(dtf)
3
• dtypes
>>dtf.dtypes
1    object
2    object
3    object
dtype: object
• Transposing a DataFrame
>>dtf.T
   age marks  name
1   25    60  Ravi
2   23    75  Anil
3   20    70  Asha
DataFrame Attributes
• count (to count non-NaN values)
Selecting or Accessing Data
• Selecting a Column
– <DataFrame object>[<Column name>]
OR
– <DataFrame object>.<column name>
>>dtf
   students  age       Sport
0  Deepak     30     Cricket
1  Abhijit    32  Volleyball
2  Neha       28    Football
3  Swati      30     Kabaddi
4  Shivansh    2   Athletics
>>dtf['students']                      >>dtf.students
0    Deepak                            0    Deepak
1    Abhijit                           1    Abhijit
2    Neha                              2    Neha
3    Swati                             3    Swati
4    Shivansh                          4    Shivansh
Name: students, dtype: object          Name: students, dtype: object
Selecting or Accessing Data
• Selecting multiple Columns
– <DataFrame object>[[<col name>,<col name>…]]
>>dtf[['students','age']]
   students  age
0  Deepak     30
1  Abhijit    32
2  Neha       28
3  Swati      30
4  Shivansh    2
Selecting or Accessing Data
• Selecting a subset from a dataframe using row/column names
– <DataFrame object>.loc[<start row>:<end row>,<start column>:<end column>]
>>dtf.loc[0 , : ]   #To access one row
students    Deepak
age             30
Sport      Cricket
Name: 0, dtype: object
>>dtf.loc[0:2 , : ]   #to access multiple rows
   students  age       Sport
0  Deepak     30     Cricket
1  Abhijit    32  Volleyball
2  Neha       28    Football
>>dtf.loc[: ,"age":"Sport"]   #to access selective columns
   age       Sport
0   30     Cricket
1   32  Volleyball
2   28    Football
3   30     Kabaddi
4    2   Athletics
>>dtf.loc[2:4,"age":"Sport"]   #selective rows & columns
   age      Sport
2   28   Football
3   30    Kabaddi
4    2  Athletics
Note: with .loc the end index is included.
Selecting or Accessing Data
Obtaining a subset/slice using row/column numeric index/position
<DF object>.iloc[<start row index>:<end row index>,<start column index>:<end column index>]
>>dtf.iloc[0:2,1:3]
   age       Sport
0   30     Cricket
1   32  Volleyball
Note: with .iloc the end index is excluded.
>>dtf.iloc[2:4]
   students  age     Sport
2  Neha       28  Football
3  Swati      30   Kabaddi
Selecting or Accessing Data
• Selecting/Accessing Individual values
<DF object>.<column>[<row name or row numeric index>]
>>dtf.students[4]
'Shivansh'
>>dtf.students[[4,3]]
4    Shivansh
3    Swati
Name: students, dtype: object
<DF object>.at[<row label>,<col label>]
<DF object>.iat[<row index no>,<col index number>]
>>dtf.at[2,"students"]
'Neha'
>>dtf.iat[2,2]
'Football'
Assigning/Modifying Data Values in Dataframes
• To change or add a column
<DF object>[<column name>]=<new value>
>>dtf1=pd.DataFrame(dtf)
>>dtf1["country"]="india"   # no column named country, therefore a new column is added
>>dtf1
   students  age       Sport country
0  Deepak     30     Cricket   india
1  Abhijit    32  Volleyball   india
2  Neha       28    Football   india
3  Swati      30     Kabaddi   india
4  Shivansh    2   Athletics   india
>>dtf1["country"]="Bharat"   # column already exists, so all its values are replaced
>>dtf1
   students  age       Sport country
0  Deepak     30     Cricket  Bharat
1  Abhijit    32  Volleyball  Bharat
2  Neha       28    Football  Bharat
3  Swati      30     Kabaddi  Bharat
4  Shivansh    2   Athletics  Bharat
Assigning/Modifying Data Values in Dataframes
• To change or add a row
<DF object>.loc[<row label>]=<new value>   # .at needs both a row and a column label
>>dtf1.loc[6]=20   # every column of row 6 gets the value 20
>>dtf1.loc[5]=["Ravi",56,"Badminton","Bharat"]
>>dtf1
   students   age       Sport country
0  Deepak    30.0     Cricket  Bharat
1  Abhijit   32.0  Volleyball  Bharat
2  Neha      28.0    Football  Bharat
3  Swati     30.0     Kabaddi  Bharat
4  Shivansh   2.0   Athletics  Bharat
6  20        20.0          20      20
5  Ravi      56.0   Badminton  Bharat
• Add or + / Sub or -   (using dtf1 and dtf2 as defined under "Binary operations in a DataFrame" below)
>>dtf1+dtf2   (same as dtf1.add(dtf2))
    0   1   2
0  11  22  33
1  44  55  66
2  77  88  99
>>dtf2-dtf1   (same as dtf2.sub(dtf1))
    0   1   2
0   9  18  27
1  36  45  54
2  63  72  81
>>dtf1.radd(dtf2)   # dtf2 + dtf1, check with strings
    0   1   2
0  11  22  33
1  44  55  66
2  77  88  99
>>dtf2.rsub(dtf1)   # dtf1 - dtf2
     0    1    2
0   -9  -18  -27
1  -36  -45  -54
2  -63  -72  -81
Binary operations in a DataFrame
dtf1=pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]])
dtf2=pd.DataFrame([[10,20,30],[40,50,60],[70,80,90]])
• Div or /
>>dtf1/dtf2   (same as dtf1.div(dtf2))
     0    1    2
0  0.1  0.1  0.1
1  0.1  0.1  0.1
2  0.1  0.1  0.1
>>dtf1.rdiv(dtf2)   # dtf2 / dtf1
      0     1     2
0  10.0  10.0  10.0
1  10.0  10.0  10.0
2  10.0  10.0  10.0
• Mul or *
>>dtf1*dtf2   (same as dtf1.mul(dtf2))
     0    1    2
0   10   40   90
1  160  250  360
2  490  640  810
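A runnable check of these element-wise operators:

```python
import pandas as pd

dtf1 = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
dtf2 = pd.DataFrame([[10, 20, 30], [40, 50, 60], [70, 80, 90]])

added = dtf1 + dtf2        # same as dtf1.add(dtf2)
product = dtf1.mul(dtf2)   # same as dtf1 * dtf2
ratio = dtf1.rdiv(dtf2)    # same as dtf2 / dtf1

print(added)
print(product)
print(ratio)
```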
Essential Functions
• Inspection Functions info() and describe()
>>dtf1
   0  1  2
0  1  2  3
1  4  5  6
2  7  8  9
>>dtf1.describe()
         0    1    2
count  3.0  3.0  3.0   # count of non-NA values
mean   4.0  5.0  6.0   # mean of values in column
std    3.0  3.0  3.0   # std. deviation
min    1.0  2.0  3.0   # min value
25%    2.5  3.5  4.5   # 25th percentile
50%    4.0  5.0  6.0
75%    5.5  6.5  7.5
max    7.0  8.0  9.0   # max value
>>dtf1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
0    3 non-null int64
1    3 non-null int64
2    3 non-null int64
dtypes: int64(3)
memory usage: 112.0 bytes
Essential Functions
• describe() function in case of string columns
>>dtf
          1     2     3
age      25    23    20
marks    60    75    70
name   Ravi  Anil  Asha
>>dtf.describe()
         1   2   3
count    3   3   3   # non-NA entries in the column
unique   3   3   3   # unique entries in the column
top     25  23  20   # most common entry (arbitrary here, since all entries are unique)
freq     1   1   1   # frequency of the most common element
Essential Functions
• head() and tail()
– <DF object>.head([n]) #rows from top, default value 5
– <DF object>.tail([n]) #rows from bottom, default value 5
>>dtf1
      0     1     2
0   1.0   2.0   3.0
1   4.0   5.0   6.0
2   7.0   8.0   9.0
4  10.0  11.0  12.0
5  13.0  14.0  15.0
6  16.0  17.0  18.0
>>dtf1.head(4)
      0     1     2
0   1.0   2.0   3.0
1   4.0   5.0   6.0
2   7.0   8.0   9.0
4  10.0  11.0  12.0
>>dtf1.head(7).tail(2)
      0     1     2
5  13.0  14.0  15.0
6  16.0  17.0  18.0
Essential Functions
• Cumulative Calculation Functions
– <DF>.cumsum([axis=None])   # default is down the rows (axis=0)
– Calculates the cumulative sum: each value is replaced by the sum of itself and all prior values along the axis.
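A small sketch of cumsum along both axes:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

down = df.cumsum()           # default axis=0: running sum down each column
across = df.cumsum(axis=1)   # running sum across each row

print(down)
print(across)
```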
Matching and Broadcasting Operations
• Matching: When performing arithmetic operations, data is aligned on
the basis of matching indexes and then the arithmetic is performed; for non-
overlapping indexes the operation results in NaN.
• Broadcasting: The smaller object is broadcast across the size of the larger
object so that their shapes are compatible.
>>dtf1
      0     1     2
0   1.0   2.0   3.0
1   4.0   5.0   6.0
2   7.0   8.0   9.0
4  10.0  11.0  12.0
5  13.0  14.0  15.0
6  16.0  17.0  18.0
>>dtf1*4   # the scalar 4 is broadcast to the shape of dtf1
      0     1     2
0   4.0   8.0  12.0
1  16.0  20.0  24.0
2  28.0  32.0  36.0
4  40.0  44.0  48.0
5  52.0  56.0  60.0
6  64.0  68.0  72.0
Handling Missing Data
• Missing values are values that cannot contribute to any
computation (NULL or NaN or None).
• Sources of Missing Values
– User forgot to fill in a field.
– Data was lost while transferring manually from a legacy
database.
– There was a programming error.
Handling Missing Data
Creating data with missing values:
df = pd.DataFrame(np.random.randn(5,3), index=['a', 'c', 'e', 'f','h'],\
columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df)
        one       two     three
a  0.297695 -1.122686  0.707862
b       NaN       NaN       NaN
c  0.992830 -0.421574  0.922447
d       NaN       NaN       NaN
(rows e to h continue; the labels b, d, and g added by reindex get NaN rows)
Handling Missing Data
Check for missing values: isnull() and notnull() functions
print(df.isnull())          print(df.notnull())         print(df['two'].notnull())
     one    two  three           one    two  three      a     True
a  False  False  False      a   True   True   True      b    False
b   True   True   True      b  False  False  False      c     True
c  False  False  False      c   True   True   True      d    False
d   True   True   True      d  False  False  False      e     True
e  False  False  False      e   True   True   True      f     True
f  False  False  False      f   True   True   True      g    False
g   True   True   True      g  False  False  False      h     True
h  False  False  False      h   True   True   True      Name: two, dtype: bool
Handling Missing Data
• Calculations with Missing Data
– When summing data, NaN is treated as zero (sum() skips NaN by default)
print(df['two'].sum())
-0.5127836471871685
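The skipna behaviour can be checked on a small Series with made-up values:

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, 2.0])

print(s.sum())              # NaN skipped -> 3.0
print(s.sum(skipna=False))  # nan, once NaNs are no longer skipped
```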
Handling Missing Data
• Filling Missing Data
– fillna(<n>): fills all NaN values with the value n
– fillna(<dictionary having fill values for each column>)
print(df.fillna(7))
        one       two     three
a  0.210141  0.802803  0.765983
b  7.000000  7.000000  7.000000
c -1.079648 -1.270144 -2.003383
d  7.000000  7.000000  7.000000
e -0.377301  0.183492  0.347551
f  1.457437  0.167696  1.045180
g  7.000000  7.000000  7.000000
h  0.957371 -0.000443  0.648315
print(df.fillna({"one":10,"two":20,"three":30}))
         one        two      three
a   1.567348   0.343046  -2.076060
b  10.000000  20.000000  30.000000
c   0.460282  -0.398845   1.390478
d  10.000000  20.000000  30.000000
e   1.245430   1.633978   1.818819
f   1.643829  -0.756501  -0.378391
g  10.000000  20.000000  30.000000
h  -0.282579   1.307354   0.440571
Comparisons of Pandas objects
• np.nan==np.nan -> False
df1=pd.DataFrame([(1,2,3),(4,5,6),(7,8,9)],columns=["A","B","C"])
print(df1)
   A  B  C
0  1  2  3
1  4  5  6
2  7  8  9
df2=pd.DataFrame([(11,22,np.nan),(44,55,np.nan),(77,88,99)],columns=["A","B","C"])
print(df2)
    A   B     C
0  11  22   NaN
1  44  55   NaN
2  77  88  99.0
print(df1+df2==df1.add(df2))   # element-wise; positions holding NaN compare as False
print((df1+df2).equals(df1.add(df2)))   # Returns True; equals() treats NaNs in the same location as equal
Combining DataFrames
• Using concat: (concatenates all the rows)
one = pd.DataFrame({
    'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
    'subject_id':['sub1','sub2','sub4','sub6','sub5'],
    'Marks_scored':[98,90,87,69,78]},
    index=[1,2,3,4,5])
two = pd.DataFrame({
    'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
    'subject_id':['sub2','sub4','sub3','sub6','sub5'],
    'Marks_scored':[89,80,79,97,88]},
    index=[1,2,3,4,5])
print(pd.concat([one,two],axis=0))
     Name subject_id  Marks_scored
1    Alex       sub1            98
2     Amy       sub2            90
3   Allen       sub4            87
4   Alice       sub6            69
5  Ayoung       sub5            78
1   Billy       sub2            89
2   Brian       sub4            80
3    Bran       sub3            79
4   Bryce       sub6            97
5   Betty       sub5            88
Deepak Sharma, Asst. Professor, UPES Dehradun
Combining DataFrames
• Using merge: (two tables with some common values)
left = pd.DataFrame({'id':[1,2,3,4,5],
    'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
    'subject_id':['sub1','sub2','sub4','sub6','sub5']})
right = pd.DataFrame({'id':[1,2,3,4,5],
    'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
    'subject_id':['sub2','sub4','sub3','sub6','sub5']})
print(left)                          print(right)
   id    Name subject_id               id   Name subject_id
0   1    Alex       sub1            0   1  Billy       sub2
1   2     Amy       sub2            1   2  Brian       sub4
2   3   Allen       sub4            2   3   Bran       sub3
3   4   Alice       sub6            3   4  Bryce       sub6
4   5  Ayoung       sub5            4   5  Betty       sub5
Combining DataFrames
• Merge Two DataFrames on a Key
print(pd.merge(left,right,on='id'))
   id  Name_x subject_id_x Name_y subject_id_y
0   1    Alex         sub1  Billy         sub2
1   2     Amy         sub2  Brian         sub4
2   3   Allen         sub4   Bran         sub3
3   4   Alice         sub6  Bryce         sub6
4   5  Ayoung         sub5  Betty         sub5
105
Combining
• Merge Using 'how' Argument
DataFrames id Name subject_id id Name subject_id
0 1 Alex sub1 0 1 Billy sub2
1 2 Amy sub2 1 2 Brian sub4
Merge Method SQL Equivalent Description 2 3 Allen sub4 2 3 Bran sub3
3 4 Alice sub6 3 4 Bryce sub6
left LEFT OUTER JOIN Use keys from left 4 5 Ayoung sub5 4 5 Betty sub5
object
right RIGHT OUTER Use keys from print (pd.merge(left, right, on='subject_id',
JOIN right object how='left‘))
Name_x id_x subject_id Name_y id_y
outer FULL OUTER JOIN Use union of keys 0 Alex 1 sub1 NaN NaN
inner INNER JOIN Use intersection 1 Amy 2 sub2 Billy 1.0
of keys 2 Allen 3 sub4 Brian 2.0
3 Alice 4 sub6 Bryce 4.0
4 5 sub5 Betty 5.0
Ayoung
106
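A runnable check of how='left': every key from the left frame survives, and keys with no match on the right get NaN in the right-hand columns.

```python
import pandas as pd

left = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                     'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
                     'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']})
right = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                      'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
                      'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']})

# keep all left keys; unmatched right side becomes NaN
merged = pd.merge(left, right, on='subject_id', how='left')
print(merged)
```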