Data Science - Unit-3-Part-2
UNIT-3 Part-2
Data manipulation with Pandas
Syllabus: Data manipulation with Pandas – data indexing and selection – operating on data
– missing data – hierarchical indexing – combining datasets – aggregation and grouping –
pivot tables.
Pandas
Pandas is a newer package built on top of NumPy, and provides an
efficient implementation of a DataFrame.
DataFrames are essentially multidimensional arrays with attached row
and column labels, and often with heterogeneous types and/or missing
data.
As well as offering a convenient storage interface for labeled data, Pandas
implements a number of powerful data operations familiar to users of
both database frameworks and spreadsheet programs.
Pandas Objects
(Fundamental Pandas Data Structures)
Three fundamental Pandas data structures are:
Series
DataFrame
Index.
The Pandas Series Object
A Pandas Series is a one-dimensional array of indexed data.
Example: import pandas as pd
data = pd.Series([0.25, 0.5, 0.75, 1.0])
print(data)
Output:
0 0.25
1 0.50
2 0.75
3 1.00
The Series wraps both a sequence of values and a sequence of indices,
which we can access with the values and index attributes. The values are
returned as a NumPy array, and the index is an array-like object of type pd.Index.
Example:
print(data.values)
print(data.index)
Output:
[0.25 0.5 0.75 1. ]
RangeIndex(start=0, stop=4, step=1)
Prepared By: MD SHAKEEL AHMED, Associate Professor, Dept. Of IT, VVIT, Guntur Page 1
4-2 B. Tech IT Regulation: R19 Data Science: UNIT-3 Part-2
Output:
0 10
1 20
2 30
3 40
4 50
0 10
1 20
2 30
3 40
4 50
0 10
a 10
b 10
c 10
d 10
e 10
1st 10
2nd 20
3rd 30
4th 40
5th 50
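The program that produced the Series outputs above is not shown. The sketch below is one way to build Series of that shape, not the original code: from a list, from a NumPy array, from a scalar repeated over an index, with an explicit label index, and from a dictionary. All variable names are illustrative.

```python
import numpy as np
import pandas as pd

# From a list: a default integer index (0, 1, 2, ...) is attached.
from_list = pd.Series([10, 20, 30, 40, 50])

# From a NumPy array: behaves the same way as a list.
from_array = pd.Series(np.array([10, 20, 30, 40, 50]))

# From a scalar: the value is repeated once for every index label.
from_scalar = pd.Series(10, index=['a', 'b', 'c', 'd', 'e'])

# From a list with an explicit, non-integer index.
labelled = pd.Series([10, 20, 30, 40, 50],
                     index=['1st', '2nd', '3rd', '4th', '5th'])

# From a dictionary: the keys become the index.
from_dict = pd.Series({'1st': 10, '2nd': 20, '3rd': 30})

print(from_list)
print(from_scalar)
print(labelled['3rd'])   # label-based access to a single value
```

Unlike a NumPy array, whose index is implicit, a Series index is explicit and can hold any label type.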
The Pandas DataFrame Object
The DataFrame can be thought of either as a generalization of a NumPy
array, or as a specialization of a Python dictionary.
A DataFrame is an analog of a two-dimensional array with both flexible
row indices and flexible column names.
col1 col2
0 10 20
1 30 40
2 50 60
col1 col2
row1 10 20
row2 30 40
row3 50 60
Constructing DataFrame objects
A Pandas DataFrame can be constructed in a variety of ways.
From a single Series object
From List of Dicts
From a dictionary of Series objects
From a two-dimensional NumPy array
From a NumPy structured array
From a single Series object:
A DataFrame is a collection of Series objects, and a single column
DataFrame can be constructed from a single Series:
Example:
import pandas as pd
markslist = {'kumar':89,'Rao':78,'Ali':67,'Singh':96}
marks = pd.Series(markslist)
df= pd.DataFrame(marks,columns=['Marks'])
print(df)
Output:
Marks
kumar 89
Rao 78
Ali 67
Singh 96
From List of Dicts:
Any list of dictionaries can be made into a DataFrame.
Example:
import pandas as pd
import numpy as np
data = [{'a':i,'b':2*i} for i in range(3)]
print(pd.DataFrame(data))
#alternate way of defining
l1={'a':0,'b':0}
l2={'a':1,'b':2}
l3={'a':2,'b':4}
data = [l1,l2,l3]
print('\n',pd.DataFrame(data))
Output:
a b
0 0 0
1 1 2
2 2 4
a b
0 0 0
1 1 2
2 2 4
From a dictionary of Series objects:
A DataFrame can be constructed from a dictionary of Series objects
Example:
import pandas as pd
markslist = {'kumar':89,'Rao':78,'Ali':67,'Singh':96}
ageslist = {'kumar':21,'Rao':22,'Ali':19,'Singh':20}
marks = pd.Series(markslist)
ages = pd.Series(ageslist)
df = pd.DataFrame({'marks': marks,'ages': ages})
print(df)
Output:
marks ages
kumar 89 21
Rao 78 22
Ali 67 19
Singh 96 20
From a two-dimensional NumPy array:
Given a two-dimensional array of data, we can create a DataFrame with
any specified column and index names. If either is omitted, a default
integer index is used for it.
Example:
import pandas as pd
import numpy as np
df=pd.DataFrame(np.arange(1,7,1).reshape(3,2),
columns=['col1', 'col2'],
index=['row1', 'row2', 'row3'])
print(df)
Output:
col1 col2
row1 1 2
row2 3 4
row3 5 6
From a NumPy structured array.
A Pandas DataFrame operates much like a structured array, and can be
created directly from one:
Example:
import numpy as np
import pandas as pd
sa = np.zeros(3, dtype=[('A', 'i8'), ('B', 'f8')])
print(pd.DataFrame(sa))
Output:
A B
0 0 0.0
1 0 0.0
2 0 0.0
Pandas Index Object
Both the Series and DataFrame objects contain an explicit index using
which we reference and modify data.
This Index object is an interesting structure in itself, and it can be thought
of either as an immutable array or as an ordered set.
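The two views of an Index, as an immutable array and as an ordered set, can be illustrated with a small sketch. The index values and variable names here are illustrative and not taken from the notes.

```python
import pandas as pd

ind_a = pd.Index([1, 3, 5, 7, 9])
ind_b = pd.Index([2, 3, 5, 7, 11])

# Array-like behaviour: indexing and slicing work as for NumPy arrays.
second = ind_a[1]
every_other = ind_a[::2]

# Immutability: assigning to an element raises TypeError.
immutable = False
try:
    ind_a[1] = 0
except TypeError:
    immutable = True

# Set-like behaviour: intersection, union and symmetric difference.
common = ind_a.intersection(ind_b)
combined = ind_a.union(ind_b)
only_one = ind_a.symmetric_difference(ind_b)

print(second, list(every_other), immutable)
print(list(common), list(combined), list(only_one))
```

Immutability makes it safe to share one Index between several Series or DataFrames.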
Example:
import pandas as pd
rind = pd.Index(['row1','row2','row3','row4'])
cind =pd.Index(['col1'])
ser = pd.Series([100,200,300,400],index=rind)
df = pd.DataFrame(ser,columns=cind)
print(df)
Output:
col1
row1 100
row2 200
row3 300
row4 400
import pandas as pd
rind = pd.Index(['row1','row2','row3','row4'])
ser1 = pd.Series([10,20,30,40],index=rind)
ser2 = pd.Series([50,60,70,80],index=rind)
frame={'col1':ser1,'col2':ser2}
df = pd.DataFrame(frame)
print(df)
Output:
col1 col2
row1 10 50
row2 20 60
row3 30 70
row4 40 80
Operating on Data in Pandas
Pandas inherits much of its element-wise functionality from NumPy and its
universal functions (ufuncs). Pandas can therefore perform fast element-wise operations,
both with basic arithmetic (addition, subtraction, multiplication, etc.) and
with more sophisticated operations (trigonometric functions, exponential
and logarithmic functions, etc.).
For unary operations like negation and trigonometric functions, these
ufuncs will preserve index and column labels in the output.
For binary operations such as addition and multiplication, Pandas will
automatically align indices when passing the objects to the ufunc.
Universal functions work on Series and DataFrames in two ways:
Index preservation
Index alignment
Index Preservation
Because Pandas is designed to work with NumPy, any NumPy ufunc will work on
Pandas Series and DataFrame objects.
We can use all the arithmetic and special universal functions of NumPy on
Pandas objects. The index is preserved in the output, as shown below.
import pandas as pd
import numpy as np
ser = pd.Series([10,20,30,40])
df = pd.DataFrame(np.arange(1,13,1).reshape(3,4),columns=['A', 'B', 'C',
'D'])
print(df)
print(np.add(ser,5)) # the indices preserved for series
print(np.add(df,10)) # the indices preserved for DataFrame
Index Alignment in series
Pandas aligns indices in the process of performing a binary operation. This
is very convenient when we are working with incomplete data.
Suppose we are combining two different data sources: the indices are
aligned automatically, and any label present in only one source gets NaN in the result.
Example:
import numpy as np
import pandas as pd
A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])
print(A + B)
print(A.add(B)) #equivalent to A+B
# fill_value replaces any element missing from A or B before the addition
print(A.add(B,fill_value=0))
Index Alignment in DataFrame
A similar type of alignment takes place for both columns and indices when we
are performing operations on DataFrames.
Example:
import numpy as np
import pandas as pd
A = pd.DataFrame(np.arange(1,5,1).reshape(2,2),columns =list('AB'))
B = pd.DataFrame(np.arange(1,10,1).reshape(3,3),columns =list('BAC'))
print(A)
print(B)
print(A+B)
print(A.add(B,fill_value=0))
fill = A.stack().mean()
print(A.add(B,fill_value=fill))
Output:
A B
0 1 2
1 3 4
B ... C
0 1 ... 3
1 4 ... 6
2 7 ... 9
[3 rows x 3 columns]
A ... C
0 3.0 ... NaN
1 8.0 ... NaN
2 NaN ... NaN
[3 rows x 3 columns]
A ... C
0 3.0 ... 3.0
1 8.0 ... 6.0
2 8.0 ... 9.0
[3 rows x 3 columns]
A ... C
0 3.0 ... 5.5
1 8.0 ... 8.5
2 10.5 ... 11.5
[3 rows x 3 columns]
Operations between DataFrame and Series
When we are performing operations between a DataFrame and a Series,
the index and column alignment is similarly maintained.
Operations between a DataFrame and a Series are similar to operations
between a two-dimensional and one-dimensional NumPy array.
Example:
import numpy as np
import pandas as pd
ser = pd.Series([10,20])
df = pd.DataFrame([[100,200],[300,400]])
print(ser)
print(df)
print(df.subtract(ser))
print(df.subtract(ser,axis=0))
Output:
0 10
1 20
0 1
0 100 200
1 300 400
0 1
0 90 180
1 290 380
0 1
0 90 190
1 280 380
Data Selection in DataFrame
DataFrame as a dictionary
Example1:
import pandas as pd
ser1 = pd.Series([10,20,30,40],index = ['row1','row2','row3','row4'])
ser2 = pd.Series([50,60,70,80],index = ['row1','row2','row3','row4'])
data = pd.DataFrame({'col1':ser1,'col2':ser2})
print(data)
print(data['col1']) # dict style
print(data.col1) # attribute style
data['sum'] = data['col1']+data['col2']
print(data)
Output:
col1 col2
row1 10 50
row2 20 60
row3 30 70
row4 40 80
row1 10
row2 20
row3 30
row4 40
row1 10
row2 20
row3 30
row4 40
[4 rows x 3 columns]
Example2:
import pandas as pd
markslist = {'kumar':89,'Rao':78,'Ali':67,'Singh':96}
ageslist = {'kumar':21,'Rao':22,'Ali':19,'Singh':20}
marks = pd.Series(markslist)
ages = pd.Series(ageslist)
data = pd.DataFrame({'marks': marks,'ages': ages})
print(data)
print(data['marks'])
print(data.marks)
data['ratio'] = data['marks'] / data['ages']
print(data)
Output:
marks ages
kumar 89 21
Rao 78 22
Ali 67 19
Singh 96 20
kumar 89
Rao 78
Ali 67
Singh 96
kumar 89
Rao 78
Ali 67
Singh 96
col1
row1 10
row2 20
row3 30
col1
row1 10
row2 20
row3 30
Example2:
import pandas as pd
markslist = {'kumar':89,'Rao':78,'Ali':67,'Singh':96}
ageslist = {'kumar':21,'Rao':22,'Ali':19,'Singh':20}
marks = pd.Series(markslist)
ages = pd.Series(ageslist)
data = pd.DataFrame({'marks': marks,'ages': ages})
print(data)
print(data.values)
print(data.T)
Output:
marks ages
kumar 89 21
Rao 78 22
Ali 67 19
Singh 96 20
[[89 21]
[78 22]
[67 19]
[96 20]]
kumar ... Singh
marks 89 ... 96
ages 21 ... 20
[2 rows x 4 columns]
Handling Missing Data
A number of schemes have been developed to indicate the presence of
missing data in a table or DataFrame.
Generally, they revolve around one of two strategies: using a mask that
globally indicates missing values, or choosing a sentinel value that
indicates a missing entry.
In the masking approach, the mask might be an entirely separate Boolean
array, or it may involve appropriation of one bit in the data representation
to locally indicate the null status of a value.
In the sentinel approach, the sentinel value could be some data-specific
convention, such as indicating a missing integer value with –9999 or
some rare bit pattern, or it could be a more global convention, such as
indicating a missing floating-point value with NaN (Not a Number), a
special value which is part of the IEEE floating-point specification.
Example:
import numpy as np
import pandas as pd
arr1 =np.array([1,2,3,4])
print(arr1)
print(arr1.sum())
arr2 =np.array([1,None,3,4])
print(arr2)
#print(arr2.sum())  # left commented out: summing an object array that contains None raises TypeError
arr3 =np.array([1,np.nan,3,4])
print(arr3)
print(arr3.sum())
print(np.nansum(arr3))
Output:
[1 2 3 4]
10
[1 None 3 4]
[ 1. nan 3. 4.]
nan
8.0
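The two strategies described in this section, masking and sentinel values, can be contrasted in a short sketch. It assumes a hypothetical data convention in which -9999 marks a missing reading. NumPy's masked-array module (np.ma) is used here only to illustrate the masking approach; it is not how Pandas stores missing data.

```python
import numpy as np

# Hypothetical raw data in which -9999 is a data-specific sentinel.
values = np.array([1.0, -9999.0, 3.0, 4.0])

# Masking approach: a separate Boolean array flags the missing entries.
mask = values == -9999.0
masked = np.ma.masked_array(values, mask=mask)
masked_total = masked.sum()          # masked entries are skipped

# Sentinel approach: replace the flagged entries with the NaN sentinel.
with_nan = np.where(mask, np.nan, values)
plain_total = with_nan.sum()         # NaN propagates through arithmetic
nan_aware_total = np.nansum(with_nan)

print(masked_total, plain_total, nan_aware_total)
```

The mask costs one extra Boolean array but works for any dtype. The NaN sentinel needs no extra storage but is available only for floating-point data.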
Missing Data in Pandas
The way in which Pandas handles missing values is constrained by its
reliance on the NumPy package, which does not have a built-in notion of NA values for
non-floating-point data types.
NumPy supports fourteen basic integer types once we account for
available precisions, signedness, and endianness of the encoding.
Reserving a specific bit pattern in all available NumPy types would lead
to an unwieldy amount of overhead in special-casing various operations
for various types, likely even requiring a new fork of the NumPy
package.
Pandas chose to use sentinels for missing data, and further chose to use
two already-existing Python null values: the special floating-point NaN
value, and the Python None object.
This choice has some side effects, as we will see, but in practice ends up
being a good compromise in most cases of interest.
None: Pythonic missing data
The first sentinel value used by Pandas is None, a Python singleton object
that is often used for missing data in Python code. Because None is a
Python object, it cannot be used in any arbitrary NumPy/Pandas array,
but only in arrays with data type 'object' (i.e., arrays of Python objects).
This dtype=object means that the best common type representation
NumPy could infer for the contents of the array is that they are Python
objects.
NaN: Missing numerical data
NaN is a special floating-point value recognized by all systems that use
the standard IEEE floating-point representation.
Output:
0 1.0
1 NaN
2 2.0
3 NaN
0 1
0 1.0 NaN
1 3.0 NaN
2 NaN 6.0
3 NaN 8.0
Operating on Null Values
There are several useful methods for detecting, removing, and replacing
null values in Pandas data structures.
They are:
isnull() - Generate a Boolean mask indicating missing values
notnull() - Opposite of isnull()
dropna() - Return a filtered version of the data
fillna() - Return a copy of the data with missing values filled or
imputed
Detecting null values
Pandas data structures have two useful methods for detecting null data: isnull()
and notnull().
Example:
import numpy as np
import pandas as pd
ser = pd.Series([1,np.nan,'hello',None])
df = pd.DataFrame([[np.nan,10,'hai'],[20,30,'wow']])
print(ser)
print(ser.isnull())
print(ser.notnull())
print(df)
print(df.isnull())
print(df.notnull())
0 1
1 NaN
2 hello
3 None
0 False
1 True
2 False
3 True
0 True
1 False
2 True
3 False
0 ... 2
0 NaN ... hai
1 20.0 ... wow
[2 rows x 3 columns]
0 ... 2
0 True ... False
1 False ... False
[2 rows x 3 columns]
0 ... 2
0 False ... True
1 True ... True
[2 rows x 3 columns]
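The Boolean masks returned by isnull() and notnull() can be used directly as an index, or summed to count missing entries. The sketch below reuses the Series and DataFrame from the example above; the result names are illustrative.

```python
import numpy as np
import pandas as pd

ser = pd.Series([1, np.nan, 'hello', None])
df = pd.DataFrame([[np.nan, 10, 'hai'], [20, 30, 'wow']])

# Use the mask as an index to keep only the valid entries.
valid = ser[ser.notnull()]

# True counts as 1, so summing a mask counts the nulls per column.
nulls_per_column = df.isnull().sum()
total_nulls = int(nulls_per_column.sum())

# Select the rows that contain at least one null value.
rows_with_null = df[df.isnull().any(axis=1)]

print(valid)
print(nulls_per_column)
print(total_nulls)
print(rows_with_null)
```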
print(ser)
print(df)
print(ser.dropna())
print(df.dropna())
print(df.dropna(axis =1))
print(df.dropna(axis ='columns')) #equivalent to axis =1
0 1
1 NaN
2 hello
3 None
0 ... 2
0 NaN ... hai
1 20.0 ... wow
[2 rows x 3 columns]
0 1
2 hello
0 ... 2
1 20.0 ... wow
[1 rows x 3 columns]
1 2
0 10 hai
1 30 wow
1 2
0 10 hai
1 30 wow
Example:
import numpy as np
import pandas as pd
df = pd.DataFrame([[np.nan,10,'hai',None],[20,30,'wow',None]])
print(df)
print(df.dropna())
print(df.dropna(axis =1))
print(df.dropna(axis ='columns')) #equivalent to axis =1
print(df.dropna(axis ='columns',how='all'))
print(df.dropna(axis ='columns',thresh=2))
Output:
0 ... 3
0 NaN ... None
1 20.0 ... None
[2 rows x 4 columns]
Empty DataFrame
Columns: [0, 1, 2, 3]
Index: []
1 2
0 10 hai
1 30 wow
1 2
0 10 hai
1 30 wow
0 ... 2
0 NaN ... hai
1 20.0 ... wow
[2 rows x 3 columns]
1 2
0 10 hai
1 30 wow
a 1.0
b NaN
c 2.0
d NaN
e 3.0
a 1.0
b 0.0
c 2.0
d 0.0
e 3.0
a 1.0
b 1.0
c 2.0
d 2.0
e 3.0
a 1.0
b 2.0
c 2.0
d 3.0
e 3.0
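The program for the Series outputs above is not shown. A minimal sketch that gives results of the same shape, assuming a five-element Series with two missing entries, is given below: filling with a constant, forward fill, and backward fill. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Assumed input: values 1, 2, 3 with gaps at labels 'b' and 'd'.
data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))

zero_filled = data.fillna(0)   # replace every null with a constant
forward = data.ffill()         # propagate the previous valid value forward
backward = data.bfill()        # propagate the next valid value backward

print(data)
print(zero_filled)
print(forward)
print(backward)
```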
Filling null values in DataFrame
Example
import numpy as np
import pandas as pd
df = pd.DataFrame([[1, np.nan, 2,None],
[2, 3, 5, None],
[np.nan, 4, 6, None]])
print(df)
# fillna(method=...) is deprecated in recent pandas releases;
# ffill() and bfill() perform the same forward and backward fills.
print(df.ffill(axis=1))
print(df.bfill(axis=1))
print(df.ffill(axis=0))
print(df.bfill(axis=0))
Output:
0 ... 3
0 1.0 ... None
1 2.0 ... None
2 NaN ... None
0 ... 3
0 ... 3
0 1.0 ... NaN
1 2.0 ... NaN
2 4.0 ... NaN
0 ... 3
0 1.0 ... None
1 2.0 ... None
2 2.0 ... None
0 ... 3
0 1.0 ... None
1 2.0 ... None
2 NaN ... None
Hierarchical Indexing
Hierarchical indexing (also known as multi-indexing) is used to
incorporate multiple index levels within a single index.
In this way, higher-dimensional data can be compactly represented within
the familiar one-dimensional Series and two-dimensional DataFrame
objects.
A Multiply Indexed Series: Here we represent two-dimensional data
within a one-dimensional Series.
Example:
import numpy as np
import pandas as pd
ser = pd.Series([10,20,30,40,50,60],index = [[1,1,1,2,2,2,],
['a','b','c','a','b','c']])
print(ser)
ser.index.names = ['ind1','ind2']
print(ser)
Output:
1 a 10
b 20
c 30
2 a 40
b 50
c 60
ind1 ind2
1 a 10
b 20
c 30
2 a 40
b 50
c 60
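Once a Series carries a multi-level index, data can be selected by outer level, by full key, or by inner level, and the Series can be converted to an ordinary two-dimensional DataFrame. The sketch below reuses the Series from the example above; the result names are illustrative.

```python
import pandas as pd

ser = pd.Series([10, 20, 30, 40, 50, 60],
                index=[[1, 1, 1, 2, 2, 2],
                       ['a', 'b', 'c', 'a', 'b', 'c']])
ser.index.names = ['ind1', 'ind2']

# Partial indexing on the outer level returns a smaller Series.
outer_1 = ser.loc[1]

# A full key (one label per level) returns a single value.
one_value = ser.loc[(1, 'a')]

# Cross-section on the inner level: every entry whose ind2 is 'a'.
inner_a = ser.xs('a', level='ind2')

# unstack() moves the inner level into columns, giving a DataFrame;
# stack() reverses the operation.
table = ser.unstack()
restored = table.stack()

print(outer_1)
print(one_value)
print(inner_a)
print(table)
```

The unstack() result shows that a multiply indexed Series and a DataFrame can hold the same two-dimensional data.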
A Multiply Indexed DataFrame:
Example:
import numpy as np
import pandas as pd
data = [[25,24],[28,26],[29,28],[27,26],[30,29],[28,27]]
ind = [['1201','1201','1264','1264','12C7','12C7'],
['mid1','mid2','mid1','mid2','mid1','mid2']]
col = ['DS','DO']
df = pd.DataFrame(data,index=ind,columns=col)
print(df)
df.index.names =['rollNo','mid']
print(df)
Output:
DS DO
1201 mid1 25 24
mid2 28 26
1264 mid1 29 28
mid2 27 26
12C7 mid1 30 29
mid2 28 27
DS DO
rollNo mid
1201 mid1 25 24
mid2 28 26
1264 mid1 29 28
mid2 27 26
12C7 mid1 30 29
mid2 28 27
Example:
Python program to create the following table of data:
Dept Other
DS DO MOB EPC
1201 mid1 25 24 23 15
mid2 28 26 23 21
1264 mid1 29 28 27 26
mid2 27 26 24 25
12C7 mid1 30 29 28 27
mid2 28 27 25 26
Program:
import numpy as np
import pandas as pd
data = [[25,24,23,15],[28,26,23,21],[29,28,27,26],[27,26,24,25],[30,29,28,27],
[28,27,25,26]]
ind = [['1201','1201','1264','1264','12C7','12C7'],
['mid1','mid2','mid1','mid2','mid1','mid2']]
col = [['Dept','Dept','Other','Other'],['DS','DO','MOB','EPC']]
df = pd.DataFrame(data,index=ind,columns=col)
print(df.to_string())
Output:
Dept Other
DS DO MOB EPC
1201 mid1 25 24 23 15
mid2 28 26 23 21
1264 mid1 29 28 27 26
mid2 27 26 24 25
12C7 mid1 30 29 28 27
mid2 28 27 25 26
Example:
Python program to create the following table:
Type Dept Other
Sub DS DO MOB EPC
RollNo Mid
1201 mid1 25 24 23 15
mid2 28 26 23 21
1264 mid1 29 28 27 26
mid2 27 26 24 25
12C7 mid1 30 29 28 27
mid2 28 27 25 26
Program:
import numpy as np
import pandas as pd
data = [[25,24,23,15],[28,26,23,21],[29,28,27,26],[27,26,24,25],[30,29,28,27],
[28,27,25,26]]
ind = [['1201','1201','1264','1264','12C7','12C7'],
['mid1','mid2','mid1','mid2','mid1','mid2']]
col = [['Dept','Dept','Other','Other'],['DS','DO','MOB','EPC']]
df = pd.DataFrame(data,index=ind,columns=col)
df.index.names =['RollNo','Mid']
df.columns.names =['Type','Sub']
print(df.to_string())
Output:
Type Dept Other
Sub DS DO MOB EPC
RollNo Mid
1201 mid1 25 24 23 15
mid2 28 26 23 21
1264 mid1 29 28 27 26
mid2 27 26 24 25
12C7 mid1 30 29 28 27
mid2 28 27 25 26
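The examples above build the multi-level index implicitly from a list of lists. A MultiIndex can also be constructed explicitly with the class-method constructors of pd.MultiIndex. The sketch below shows three of them producing the same index and reuses the roll numbers and the DS/DO marks from the examples above; the variable names are illustrative.

```python
import pandas as pd

rolls = ['1201', '1264', '12C7']
mids = ['mid1', 'mid2']
names = ['RollNo', 'Mid']

# From parallel arrays, one array per level.
from_arrays = pd.MultiIndex.from_arrays(
    [['1201', '1201', '1264', '1264', '12C7', '12C7'],
     ['mid1', 'mid2', 'mid1', 'mid2', 'mid1', 'mid2']], names=names)

# From a list of tuples, one tuple per row.
from_tuples = pd.MultiIndex.from_tuples(
    [(r, m) for r in rolls for m in mids], names=names)

# From the Cartesian product of the level values.
from_product = pd.MultiIndex.from_product([rolls, mids], names=names)

data = [[25, 24], [28, 26], [29, 28], [27, 26], [30, 29], [28, 27]]
df = pd.DataFrame(data, index=from_product, columns=['DS', 'DO'])

print(df)
print(df.loc['1264'])                   # all rows for one roll number
print(df.loc[('12C7', 'mid1'), 'DS'])   # a single mark
```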
Combining Datasets
Some of the most interesting studies of data come from combining
different data sources.
These operations can involve anything from very straightforward
concatenation of two different datasets, to more complicated database-
style joins and merges that correctly handle any overlaps between the
dataset.
These operations can be:
simple concatenation of Series and DataFrames with the pd.concat
function
in-memory merges and joins implemented in Pandas.
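The second kind of operation, a database-style merge, can be sketched with pd.merge(). The two small tables below are made up for illustration; the how parameter selects an inner, left, right, or outer join on the shared key column.

```python
import pandas as pd

# Hypothetical tables sharing the key column 'name'.
marks = pd.DataFrame({'name': ['kumar', 'Rao', 'Ali'],
                      'marks': [89, 78, 67]})
ages = pd.DataFrame({'name': ['Rao', 'Ali', 'Singh'],
                     'age': [22, 19, 20]})

inner = pd.merge(marks, ages, on='name')                # keys in both tables
left = pd.merge(marks, ages, on='name', how='left')     # all keys of marks
right = pd.merge(marks, ages, on='name', how='right')   # all keys of ages
outer = pd.merge(marks, ages, on='name', how='outer')   # keys of either table

print(inner)
print(left)
print(right)
print(outer)
```

Rows with no match on the other side get NaN in the columns that come from that side.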
Simple Concatenation with pd.concat
Pandas has a function, pd.concat(), which has a similar syntax to
np.concatenate() but offers a number of additional options.
pd.concat() can be used for a simple concatenation of Series or DataFrame
objects, just as np.concatenate() can be used for simple concatenations of
arrays.
import pandas as pd
import numpy as np
ser1 = pd.Series(['A', 'B', 'C'], index=[1, 2, 3])
ser2 = pd.Series(['D', 'E', 'F'], index=[4, 5, 6])
print(pd.concat([ser1, ser2]))
Output:
1 A
2 B
3 C
4 D
5 E
6 F
Concatenation in data frame:
import pandas as pd
import numpy as np
df1 =pd.DataFrame([[10,20],[30,40]],index=[1,2],columns=['A','B'])
df2 =pd.DataFrame([[50,60],[70,80]],index=[1,2],columns=['A','B'])
print(df1); print(df2); print(pd.concat([df1, df2]))
Output:
A B
1 10 20
2 30 40
A B
1 50 60
2 70 80
A B
1 10 20
2 30 40
1 50 60
2 70 80
By default, the concatenation takes place row-wise within the DataFrame
(i.e., axis=0). Like np.concatenate, pd.concat allows specification of an
axis along which concatenation will take place.
Example:
import pandas as pd
import numpy as np
df1 =pd.DataFrame([[10,20],[30,40]],index=[1,2],columns=['A','B'])
df2 =pd.DataFrame([[50,60],[70,80]],index=[1,2],columns=['C','D'])
print(df1); print(df2);
print(pd.concat([df1, df2],axis=1).to_string())
Output:
A B
1 10 20
2 30 40
C D
1 50 60
2 70 80
A B C D
1 10 20 50 60
2 30 40 70 80
By default, the entries for which no data is available are filled with NA
values. To change this, we can set the join parameter of pd.concat().
(Older Pandas versions also accepted a join_axes parameter; it was removed in
Pandas 1.0.) By default, the join
is a union of the input columns (join='outer'), but we can change this to
an intersection of the columns using join='inner':
Example:
import pandas as pd
import numpy as np
df1 =pd.DataFrame([[1,2,3],[4,5,6]],index=[1,2],columns=['A','B','C'])
df2 =pd.DataFrame([[7,8,9],[10,11,12]],index=[1,2],columns=['B','C','D'])
print(df1.to_string()); print(df2.to_string())
print(pd.concat([df1, df2]).to_string())
print(pd.concat([df1, df2],join='inner'))
Output:
A B C
1 1 2 3
2 4 5 6
B C D
1 7 8 9
2 10 11 12
A B C D
1 1.0 2 3 NaN
2 4.0 5 6 NaN
1 NaN 7 8 9.0
2 NaN 10 11 12.0
B C
1 2 3
2 5 6
1 7 8
2 10 11
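The concatenated results above contain repeated index labels (1, 2, 1, 2), because pd.concat() preserves the original indices. A minimal sketch of three pd.concat() options for dealing with this follows. It reuses the df1/df2 frames from the earlier example; the result names are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame([[10, 20], [30, 40]], index=[1, 2], columns=['A', 'B'])
df2 = pd.DataFrame([[50, 60], [70, 80]], index=[1, 2], columns=['A', 'B'])

# verify_integrity=True raises ValueError when index labels repeat.
duplicates_found = False
try:
    pd.concat([df1, df2], verify_integrity=True)
except ValueError:
    duplicates_found = True

# ignore_index=True discards the old labels and renumbers from 0.
renumbered = pd.concat([df1, df2], ignore_index=True)

# keys=... adds an outer index level naming the source of each row.
keyed = pd.concat([df1, df2], keys=['first', 'second'])

print(duplicates_found)
print(renumbered)
print(keyed)
print(keyed.loc['second'])
```

The keys option produces a hierarchically indexed DataFrame, so the rows of each source can still be recovered after concatenation.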
The append() method
print(ser.mean())
Output:
150
30.0
For a DataFrame, by default the aggregates return results within each column.
By specifying the axis argument, we can instead aggregate within each row.
Example:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':np.arange(1,6),
'B':np.arange(10,60,10)})
print(df.sum())
print(df.mean())
print(df.sum(axis ='columns'))
print(df.mean(axis = 'columns'))
Output:
A 15
B 150
dtype: int64
A 3.0
B 30.0
dtype: float64
0 11
1 22
2 33
3 44
4 55
dtype: int64
0 5.5
1 11.0
2 16.5
3 22.0
4 27.5
dtype: float64
Pandas Series and DataFrames include all of the common aggregates. In
addition, there is a convenience method describe() that computes several
common aggregates for each column and returns the result.
Example:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':np.arange(1,6),
'B':np.arange(10,60,10)})
print(df.describe())
Output:
A B
count 5.000000 5.000000
mean 3.000000 30.000000
std 1.581139 15.811388
min 1.000000 10.000000
25% 2.000000 20.000000
50% 3.000000 30.000000
75% 4.000000 40.000000
max 5.000000 50.000000
df = pd.DataFrame({'key':['A','B','C','A','B','C'],
'data':np.arange(1,7)},columns=['key','data'])
print(df)
print(df.groupby('key').sum())
Output:
key data
0 A 1
1 B 2
2 C 3
3 A 4
4 B 5
5 C 6
data
key
A 5
B 7
C 9
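Summing is only one of the operations available on a grouped object. The sketch below applies several aggregates at once with agg(), keeps only some groups with filter(), and computes a per-group transformation with transform(). It reuses the key/data frame from the example above; the threshold and result names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': np.arange(1, 7)})
grouped = df.groupby('key')['data']

# Aggregate: several summary statistics per group in one call.
summary = grouped.agg(['min', 'max', 'mean'])

# Filter: keep only the rows of groups whose sum exceeds 5.
big_groups = df.groupby('key').filter(lambda g: g['data'].sum() > 5)

# Transform: subtract each group's mean; the result keeps the original shape.
centered = grouped.transform(lambda x: x - x.mean())

print(summary)
print(big_groups)
print(centered)
```

All three follow the split-apply-combine pattern: the data is split by key, a function is applied to each group, and the pieces are combined into one result.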
Pivot Tables
A pivot table is an operation similar to GroupBy that is commonly seen in
spreadsheets and other programs that operate on tabular data.
The pivot table takes simple column-wise data as input, and groups the
entries into a two-dimensional table that provides a multidimensional
summarization of the data.
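A minimal sketch of this idea, using a small made-up table of marks in long (column-wise) form, follows. It builds the same two-dimensional summary first with groupby() and unstack(), and then with the DataFrame.pivot_table() method. The column names and values are illustrative.

```python
import pandas as pd

# Hypothetical column-wise data: one row per (rollNo, mid) observation.
df = pd.DataFrame({
    'rollNo': ['1201', '1201', '1264', '1264', '12C7', '12C7'],
    'mid':    ['mid1', 'mid2', 'mid1', 'mid2', 'mid1', 'mid2'],
    'DS':     [25, 28, 29, 27, 30, 28],
})

# GroupBy route: group on two keys, aggregate, then unstack the inner key.
by_group = df.groupby(['rollNo', 'mid'])['DS'].mean().unstack()

# Pivot-table route: the same result in a single call.
table = df.pivot_table(values='DS', index='rollNo',
                       columns='mid', aggfunc='mean')

# margins=True adds an 'All' row and column holding the overall means.
with_totals = df.pivot_table(values='DS', index='rollNo',
                             columns='mid', aggfunc='mean', margins=True)

print(by_group)
print(table)
print(with_totals)
```

The index and columns arguments choose the two grouping keys, and aggfunc chooses how the values that fall into each cell are combined.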
Tutorial Questions:
1. Explain the fundamental data objects in pandas and how each is constructed.
2. Briefly explain hierarchical indexing with examples.
3. What is a pivot table? Explain it clearly.
4. Demonstrate data indexing and selection in Pandas Series and DataFrame objects.
5. Write a short note on operating on data in Pandas.
6. Demonstrate different methods of constructing a MultiIndex.
7. How is missing data handled in pandas?
8. Illustrate different approaches to combining data from multiple sources in pandas.
9. Explore aggregation and grouping in Pandas.
10. Briefly explore and demonstrate different methods for operating on null values.
Assignment Questions:
1. Write a Python program to illustrate different ways of creating a pandas Series.
2. Write a Python program to illustrate different ways of creating a pandas DataFrame.
3. Write a Python program to illustrate detecting null values in a pandas DataFrame.
4. Write a Python program to illustrate dropping null values in a pandas DataFrame.
5. Write a Python program to illustrate filling null values in a pandas DataFrame.
6. Write a Python program to illustrate different ways of creating a pandas MultiIndex.
7. Write a Python program to illustrate indexing, slicing, Boolean indexing and fancy
indexing in a MultiIndex.
8. Write a Python program to illustrate merging two data sets with joins (inner, left and
right) in pandas.
9. Write a Python program to illustrate the GroupBy operation of pandas.
10. Write a Python program to illustrate a pivot table in pandas.