Python GTU Study Material Presentations Unit-3 20112020032538AM
Unit-03
Capturing, Preparing
and Working with data
We are supposed to close the file once we are done using it, using the close()
method in Python.
closefile.py
1 f = open('college.txt')
2 data = f.read()
3 print(data)
4 f.close()
Prof. Arjun V. Bala #3150713 (PDS) Unit 03 – Capturing, Preparing and Working with Data 5
Handling errors using “with” keyword
It is possible that we have a typo in the filename, or that the file we specified has been
moved/deleted; in such cases there will be an error while running the program.
To handle such situations we can open the file using the with keyword.
fileusingwith.py
1 with open('college.txt') as f :
2 data = f.read()
3 print(data)
When we open a file using with, we need not close the file; it is closed automatically when the block ends.
Example : Write file in Python
write() method will write the specified data to the file.
writedemo.py
1 with open('college.txt','a') as f :
2 f.write('Hello world')
If we open the file with ‘w’ mode it will overwrite the data in the existing file, or will create a
new file if the file does not exist.
If we open the file with ‘a’ mode it will append the data at the end of the existing file, or will
create a new file if the file does not exist.
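The difference between the two modes can be sketched as below (demo.txt is an illustrative file name, not from the slides):

```python
# 'w' truncates the file, 'a' appends to it
with open('demo.txt', 'w') as f:
    f.write('first\n')      # creates/overwrites the file
with open('demo.txt', 'a') as f:
    f.write('second\n')     # appends at the end
with open('demo.txt', 'w') as f:
    f.write('third\n')      # 'w' again: previous content is gone
with open('demo.txt') as f:
    print(f.read())         # only 'third' remains
```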
Reading CSV files without any library functions
A comma-separated values (CSV) file is a delimited text file that uses a comma to separate values.
Each line of the file is a data record; each record consists of one or more fields, separated by commas.
Example : Book1.csv
studentname,enrollment,cpi
abcd,123456,8.5
bcde,456789,2.5
cdef,321654,7.6
readlines.py
1 with open('Book1.csv') as f :
2     rows = f.readlines()
3 isFirstLine = True
4 for r in rows :
5     if isFirstLine :
6         isFirstLine = False
7         continue
8     cols = r.split(',')
9     print('Student Name = ', cols[0], end=" ")
10     print('\tEn. No. = ', cols[1], end=" ")
11     print('\tCPI = \t', cols[2])
We can use Microsoft Excel to access CSV files.
In the later sessions we will access CSV files using the pandas library.
Python for Data Science (PDS) (3150713)
Unit-03.01
Let's Learn
NumPy
NumPy Array
The most important object defined in NumPy is an N-dimensional array type called ndarray.
It describes a collection of items of the same type; items in the collection can be accessed
using a zero-based index.
An instance of the ndarray class can be constructed in many different ways; a basic ndarray
can be created as below.
syntax
import numpy as np
a= np.array(list | tuple | set | dict)
numpyarray.py Output
1 import numpy as np <class 'numpy.ndarray'>
2 a= np.array(['darshan','Institute','rajkot']) ['darshan' 'Institute' 'rajkot']
3 print(type(a))
4 print(a)
NumPy Array (Cont.)
arange(start,end,step) function will create NumPy array starting from start till end (not
included) with specified steps.
numpyarange.py Output
1 import numpy as np [0 1 2 3 4 5 6 7 8 9]
2 b = np.arange(0,10,1)
3 print(b)
zeros(n) function will return NumPy array of given shape, filled with zeros.
numpyzeros.py Output
1 import numpy as np [0. 0. 0.]
2 c = np.zeros(3)
3 print(c) [[0. 0. 0.] [0. 0. 0.] [0. 0. 0.]]
4 c1 = np.zeros((3,3)) #have to give as tuple
5 print(c1)
ones(n) function will return NumPy array of given shape, filled with ones.
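For completeness, a minimal sketch of ones, mirroring the zeros example above:

```python
import numpy as np

d = np.ones(3)          # 1-D array of three ones
d1 = np.ones((2, 3))    # 2-D array; shape is given as a tuple, like zeros
print(d)                # [1. 1. 1.]
print(d1.shape)         # (2, 3)
```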
NumPy Array (Cont.)
eye(n) function will create 2-D NumPy array with ones on the diagonal and zeros elsewhere.
numpyeye.py Output
1 import numpy as np [[1. 0. 0.]
2 b = np.eye(3) [0. 1. 0.]
3 print(b) [0. 0. 1.]]
linspace(start,stop,num) function will return evenly spaced numbers over a specified interval.
numpylinspace.py Output
1 import numpy as np [0. 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8
2 c = np.linspace(0,1,11) 0.9 1. ]
3 print(c)
Note: in the arange function we give start, stop & step, whereas in the linspace function we
give start, stop & the number of elements we want.
Array Shape in NumPy
We can grab the shape of ndarray using its shape property.
numpyarange.py Output
1 import numpy as np (3,3)
2 b = np.zeros((3,3))
3 print(b.shape)
NumPy Random
rand(p1,p2,…,pn) function will create an n-dimensional array with random data drawn from a
uniform distribution over [0, 1); if we do not specify any parameter it will return a single random float.
numpyrand.py Output
1 import numpy as np 0.23937253208490505
2 r1 = np.random.rand()
3 print(r1) [[0.58924723 0.09677878]
4 r2 = np.random.rand(3,2) # no tuple [0.97945337 0.76537675]
5 print(r2) [0.73097381 0.51277276]]
randint(low,high,size) function will create a one-dimensional array with size random integers
between low (inclusive) and high (exclusive).
numpyrandint.py Output
1 import numpy as np [78 78 17 98 19 26 81 67 23 24]
2 r3 = np.random.randint(1,100,10)
3 print(r3)
We can reshape the array into any compatible shape using the reshape method.
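The reshape method mentioned above can be sketched like this (the total element count must stay the same):

```python
import numpy as np

r = np.random.randint(1, 100, 10)  # 10 random integers, 1-D
m = r.reshape(2, 5)                # same 10 elements viewed as 2 rows x 5 columns
print(m.shape)                     # (2, 5)
```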
NumPy Random (Cont.)
randn(p1,p2,…,pn) function will create an n-dimensional array with random data drawn from the
standard normal distribution; if we do not specify any parameter it will return a single random float.
numpyrandn.py Output
1 import numpy as np -0.15359861758111037
2 r1 = np.random.randn()
3 print(r1) [[ 0.40967905 -0.21974532]
4 r2 = np.random.randn(3,2) # no tuple [-0.90341482 -0.69779498]
5 print(r2) [ 0.99444948 -1.45308348]]
Note: the rand function generates random numbers using a uniform distribution, whereas the
randn function generates random numbers using the standard normal distribution.
We are going to learn the difference using a visualization technique (as data scientists, we
have to use visualization techniques to convince the audience).
Visualizing the difference between rand & randn
We are going to use the matplotlib library to visualize the difference.
You need not worry if you do not follow the matplotlib syntax yet; we are going to learn it in detail in Unit-4.
matplotdemo.py
1 import numpy as np
2 from matplotlib import pyplot as plt
3 %matplotlib inline
4 samplesize = 100000
5 uniform = np.random.rand(samplesize)
6 normal = np.random.randn(samplesize)
7 plt.hist(uniform,bins=100)
8 plt.title('rand: uniform')
9 plt.show()
10 plt.hist(normal,bins=100)
11 plt.title('randn: normal')
12 plt.show()
Aggregations
min() function will return the minimum value from the ndarray; there are two ways in which we
can use the min function, examples of both are given below.
numpymin.py Output
1 import numpy as np Min way1 = 1
2 l = [1,5,3,8,2,3,6,7,5,2,9,11,2,5,3,4,8,9,3,1,9,3] Min way2 = 1
3 a = np.array(l)
4 print('Min way1 = ',a.min())
5 print('Min way2 = ',np.min(a))
max() function will return the maximum value from the ndarray; there are two ways in which we
can use the max function, examples of both are given below.
numpymax.py Output
1 import numpy as np Max way1 = 11
2 l = [1,5,3,8,2,3,6,7,5,2,9,11,2,5,3,4,8,9,3,1,9,3] Max way2 = 11
3 a = np.array(l)
4 print('Max way1 = ',a.max())
5 print('Max way2 = ',np.max(a))
Aggregations (Cont.)
NumPy supports many aggregation functions such as min, max, argmin, argmax, sum, mean, std,
etc.
numpymin.py
1 import numpy as np
2 l = [7,5,3,1,8,2,3,6,11,5,2,9,10,2,5,3,7,8,9,3,1,9,3]
3 a = np.array(l)
4 print('Min = ',a.min())
5 print('ArgMin = ',a.argmin())
6 print('Max = ',a.max())
7 print('ArgMax = ',a.argmax())
8 print('Sum = ',a.sum())
9 print('Mean = ',a.mean())
10 print('Std = ',a.std())
Output
Min = 1
ArgMin = 3
Max = 11
ArgMax = 8
Sum = 122
Mean = 5.304347826086956
Std = 3.042235771223635
Using axis argument with aggregate functions
When we apply an aggregate function to a multidimensional ndarray without an axis argument,
it is applied over all the elements of the array.
numpyaxis.py Output
1 import numpy as np sum = 45
2 array2d = np.array([[1,2,3],[4,5,6],[7,8,9]])
3 print('sum = ',array2d.sum())
If we want to get sum of rows or cols we can use axis argument with the aggregate functions.
numpyaxis.py Output
1 import numpy as np sum (cols) = [12 15 18]
2 array2d = np.array([[1,2,3],[4,5,6],[7,8,9]]) sum (rows) = [6 15 24]
3 print('sum (cols)= ',array2d.sum(axis=0)) #Vertical
4 print('sum (rows)= ',array2d.sum(axis=1)) #Horizontal
Single V/S Double bracket notations
There are two ways in which you can access an element of a multi-dimensional array; examples of
both methods are given below.
numpybrackets.py
1 import numpy as np
2 arr = np.array([['a','b','c'],['d','e','f'],['g','h','i']])
3 print('double = ',arr[2][1]) # double bracket notation
4 print('single = ',arr[2,1]) # single bracket notation
Output
double = h
single = h
Both methods are valid and give exactly the same answer, but single bracket notation is
recommended: double bracket notation first creates a temporary sub-array of the third row and
then fetches the second column from it.
Single bracket notation is also easier to read and write while programming.
Slicing ndarray
Slicing in python means taking elements from one given index to another given index.
Similar to Python List, we can use same syntax array[start:end:step] to slice ndarray.
Default start is 0
Default end is length of the array
Default step is 1
numpyslice1d.py
1 import numpy as np
2 arr = np.array(['a','b','c','d','e','f','g','h'])
3 print(arr[2:5])
4 print(arr[:5])
5 print(arr[5:])
6 print(arr[2:7:2])
7 print(arr[::-1])
Output
['c' 'd' 'e']
['a' 'b' 'c' 'd' 'e']
['f' 'g' 'h']
['c' 'e' 'g']
['h' 'g' 'f' 'e' 'd' 'c' 'b' 'a']
Array Slicing Example
a =
    C-0 C-1 C-2 C-3 C-4
R-0   1   2   3   4   5
R-1   6   7   8   9  10
R-2  11  12  13  14  15
R-3  16  17  18  19  20
R-4  21  22  23  24  25
Example :
a[2][3]    = 14
a[2,3]     = 14
a[2]       = [11 12 13 14 15]
a[0:2]     = [[1 2 3 4 5] [6 7 8 9 10]]
a[0:2:2]   = [[1 2 3 4 5]]
a[::-1]    = [[21 22 23 24 25] [16 17 18 19 20] [11 12 13 14 15] [6 7 8 9 10] [1 2 3 4 5]]
a[1:3,1:3] = [[7 8] [12 13]]
a[3:,:3]   = [[16 17 18] [21 22 23]]
a[:,::-1]  = [[5 4 3 2 1] [10 9 8 7 6] [15 14 13 12 11] [20 19 18 17 16] [25 24 23 22 21]]
Slicing multi-dimensional array
Slicing a multi-dimensional array is the same as slicing a single-dimensional array, with the
help of the single bracket notation we learned earlier; let us see an example.
numpyslice2d.py
1 import numpy as np
2 arr = np.array([['a','b','c'],['d','e','f'],['g','h','i']])
3 print(arr[0:2 , 0:2]) #first two rows and cols
4 print(arr[::-1]) #reversed rows
5 print(arr[: , ::-1]) #reversed cols
6 print(arr[::-1,::-1]) #complete reverse
Output
[['a' 'b']
 ['d' 'e']]
[['g' 'h' 'i']
 ['d' 'e' 'f']
 ['a' 'b' 'c']]
[['c' 'b' 'a']
 ['f' 'e' 'd']
 ['i' 'h' 'g']]
[['i' 'h' 'g']
 ['f' 'e' 'd']
 ['c' 'b' 'a']]
Warning : Array Slicing is mutable !
When we slice an array and apply some operation on the slice, it will also change the original
array, as slicing does not create a copy of the array (it creates a view).
Example,
numpyslice1d.py Output
1 import numpy as np Original Array = [2 2 2 4 5]
2 arr = np.array([1,2,3,4,5]) Sliced Array = [2 2 2]
3 arrsliced = arr[0:3]
4
5 arrsliced[:] = 2 # Broadcasting
6
7 print('Original Array = ', arr)
8 print('Sliced Array = ',arrsliced)
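If we want to avoid this behaviour, we can take an explicit copy of the slice using the copy() method; a minimal sketch:

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])
arrsliced = arr[0:3].copy()  # copy() gives an independent array, not a view
arrsliced[:] = 2             # broadcasting changes only the copy
print(arr)                   # original stays [1 2 3 4 5]
print(arrsliced)             # [2 2 2]
```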
NumPy Arithmetic Operations
numpyop.py
1 import numpy as np
2 arr1 = np.array([[1,2,3],[1,2,3],[1,2,3]])
3 arr2 = np.array([[4,5,6],[4,5,6],[4,5,6]])
4 arradd1 = arr1 + 2 # addition of matrix with scalar
5 arradd2 = arr1 + arr2 # addition of two matrices
6 print('Addition Scalar = ', arradd1)
7 print('Addition Matrix = ', arradd2)
8 arrsub1 = arr1 - 2 # subtraction of matrix with scalar
9 arrsub2 = arr1 - arr2 # subtraction of two matrices
10 print('Subtraction Scalar = ', arrsub1)
11 print('Subtraction Matrix = ', arrsub2)
12 arrdiv1 = arr1 / 2 # division of matrix by scalar
13 arrdiv2 = arr1 / arr2 # element-wise division of two matrices
14 print('Division Scalar = ', arrdiv1)
15 print('Division Matrix = ', arrdiv2)
Output
Addition Scalar = [[3 4 5] [3 4 5] [3 4 5]]
Addition Matrix = [[5 7 9] [5 7 9] [5 7 9]]
Subtraction Scalar = [[-1 0 1] [-1 0 1] [-1 0 1]]
Subtraction Matrix = [[-3 -3 -3] [-3 -3 -3] [-3 -3 -3]]
Division Scalar = [[0.5 1. 1.5] [0.5 1. 1.5] [0.5 1. 1.5]]
Division Matrix = [[0.25 0.4 0.5 ] [0.25 0.4 0.5 ] [0.25 0.4 0.5 ]]
NumPy Arithmetic Operations (Cont.)
numpyop.py
1 import numpy as np
2 # arr1, arr2 as defined on the previous slide
3 arrmul1 = arr1 * 2 # multiply matrix with scalar
4 arrmul2 = arr1 * arr2 # element-wise multiply (note: this is not matrix multiplication)
5 print('Multiply Scalar = ', arrmul1)
6 print('Multiply Matrix = ', arrmul2)
7 # In order to do matrix multiplication
8 arrmatmul = np.matmul(arr1,arr2)
9 print('Matrix Multiplication = ',arrmatmul)
10 # OR
11 arrdot = arr1.dot(arr2)
12 print('Dot = ',arrdot)
13 # OR (Python 3.5+)
14 arrpy3dot5plus = arr1 @ arr2
15 print('Python 3.5+ support = ',arrpy3dot5plus)
Output
Multiply Scalar = [[2 4 6] [2 4 6] [2 4 6]]
Multiply Matrix = [[ 4 10 18] [ 4 10 18] [ 4 10 18]]
Matrix Multiplication = [[24 30 36] [24 30 36] [24 30 36]]
Dot = [[24 30 36] [24 30 36] [24 30 36]]
Python 3.5+ support = [[24 30 36] [24 30 36] [24 30 36]]
Sorting Array
The np.sort() function returns a sorted copy of the input array, whereas arr.sort() sorts the array in place.
syntax
import numpy as np
# arr = our ndarray
np.sort(arr,axis,kind,order)
# OR arr.sort()
Parameters
arr = array to sort
axis = axis along which to sort (default=-1, the last axis)
kind = sorting algorithm to use ('quicksort' <- default, 'mergesort', 'heapsort')
order = field(s) to sort on (for arrays with multiple fields)
Example :
numpysort.py
1 import numpy as np
2 arr = np.array(['Darshan','Rajkot','Institute','of','Engineering'])
3 print("Before Sorting = ", arr)
4 arr.sort() # or np.sort(arr)
5 print("After Sorting = ",arr)
Output
Before Sorting = ['Darshan' 'Rajkot' 'Institute' 'of' 'Engineering']
After Sorting = ['Darshan' 'Engineering' 'Institute' 'Rajkot' 'of']
Sort Array Example
numpysort2.py
1 import numpy as np
2 dt = np.dtype([('name', 'S10'),('age', int)])
3 arr2 = np.array([('Darshan',200),('ABC',300),('XYZ',100)],dtype=dt)
4 arr2.sort(order='name')
5 print(arr2)
Output
[(b'ABC', 300) (b'Darshan', 200) (b'XYZ', 100)]
Conditional Selection
Similar to arithmetic operations, when we apply a comparison operator to a NumPy array, it is
applied to each element in the array and a new boolean NumPy array is created with values
True or False.
numpycond1.py Output
1 import numpy as np [25 17 24 15 17 97 42 10 67
2 arr = np.random.randint(1,100,10) 22]
3 print(arr) [False False False False
4 boolArr = arr > 50 False True False False True
5 print(boolArr) False]
numpycond2.py Output
1 import numpy as np All = [31 94 25 70 23 9 11
2 arr = np.random.randint(1,100,10) 77 48 11]
3 print("All = ",arr) Filtered = [94 70 77]
4 boolArr = arr > 50
5 print("Filtered = ", arr[boolArr])
Python for Data Science (PDS) (3150713)
Unit-03.02
Let's Learn
Pandas
Outline
Series
Data Frames
Accessing text, CSV, Excel files using pandas
Accessing SQL Database
Missing Data
Group By
Merging, Joining & Concatenating
Operations
Series
A Series is a one-dimensional labelled array.
It supports both integer- and label-based indexing, but the index must be of a hashable type.
If we do not specify an index, it will assign a zero-based integer index.
syntax Parameters
import pandas as pd data = array like Iterable
s = pd.Series(data,index,dtype,copy=False) index = array like index
dtype = data-type
copy = bool, default is False
pandasSeries.py Output
1 import pandas as pd 0 1
2 s = pd.Series([1, 3, 5, 7, 9, 11]) 1 3
3 print(s) 2 5
3 7
4 9
5 11
dtype: int64
Series (Cont.)
We can then access the elements inside Series just like array using square brackets notation.
pdSeriesEle.py Output
1 import pandas as pd S[0] = 1
2 s = pd.Series([1, 3, 5, 7, 9, 11]) Sum = 4
3 print("S[0] = ", s[0])
4 b = s[0] + s[1]
5 print("Sum = ", b)
Series (Cont.)
We can specify index to Series with the help of index parameter
pdSeriesdtype.py Output
pdSeriesdtype.py
1 import pandas as pd
2 i = ['name','address','phone','email','website']
3 d = ['darshan','rj','123','[email protected]','darshan.ac.in']
4 s = pd.Series(data=d,index=i)
5 print(s)
Output
name darshan
address rj
phone 123
email [email protected]
website darshan.ac.in
dtype: object
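Since this Series is label-indexed, elements can also be grabbed by label as well as by position; a small sketch (the data here is a shortened illustrative version of the Series above):

```python
import pandas as pd

i = ['name', 'address', 'phone']
d = ['darshan', 'rj', '123']
s = pd.Series(data=d, index=i)
print(s['name'])    # access by label -> darshan
print(s.iloc[0])    # access by position -> darshan
```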
Creating Time Series
We can use some of pandas inbuilt date functions to create a time series.
pdTimeSeries.py
1 import numpy as np
2 import pandas as pd
3 dates = pd.to_datetime("27th of July, 2020")
4 i = dates + pd.to_timedelta(np.arange(5), unit='D')
5 d = [50,53,25,70,60]
6 time_series = pd.Series(data=d,index=i)
7 print(time_series)
Output
2020-07-27 50
2020-07-28 53
2020-07-29 25
2020-07-30 70
2020-07-31 60
dtype: int64
Data Frames
A DataFrame is a two-dimensional data structure, i.e. data is aligned in a tabular format in rows
and columns.
A DataFrame also contains labelled axes for rows and columns.
Features of Data Frame :
It is size-mutable
Has labelled axes
Columns can be of different data types
We can perform arithmetic operations on rows and columns.
Structure :
PDS Algo SE INS
101
102
103
….
160
Data Frames (Cont.)
Syntax :
syntax Parameters
import pandas as pd data = array like Iterable
df = pd.DataFrame(data,index,columns,dtype,copy=False) index = array like row index
columns = array like col index
dtype = data-type
copy = bool, default is False
Example :
pdDataFrame.py
1 import numpy as np
2 import pandas as pd
3 randArr = np.random.randint(0,100,20).reshape(5,4)
4 df = pd.DataFrame(randArr,np.arange(101,106,1),['PDS','Algo','SE','INS'])
5 print(df)
Output
PDS Algo SE INS
101 0 23 93 46
102 85 47 31 12
103 35 34 6 89
104 66 83 70 50
105 65 88 87 87
Data Frames (Cont.)
Grabbing the column
dfGrabCol.py
1 import numpy as np
2 import pandas as pd
3 randArr = np.random.randint(0,100,20).reshape(5,4)
4 df = pd.DataFrame(randArr,np.arange(101,106,1),['PDS','Algo','SE','INS'])
5 print(df['PDS'])
Output
101 0
102 85
103 35
104 66
105 65
Name: PDS, dtype: int32
Grabbing the multiple column
dfGrabMulCol.py
1 print(df[['PDS','SE']])
Output
PDS SE
101 0 93
102 85 31
103 35 6
104 66 70
105 65 87
Data Frames (Cont.)
Grabbing a row
dfGrabRow.py
1 print(df.loc[101]) # label-based; df.iloc[0] is the positional equivalent
Deleting a row
dfDelRow.py
1 df.drop(103,inplace=True) # the row labels are integers, so 103 (not '103')
2 print(df)
Output
PDS Algo SE INS
101 0 23 93 46
102 85 47 31 12
104 66 83 70 50
105 65 88 87 87
Data Frames (Cont.)
Creating new column
dfCreateCol.py
1 df['total'] = df['PDS'] + df['Algo'] + df['SE'] + df['INS']
2 print(df)
Output
PDS Algo SE INS total
101 0 23 93 46 162
102 85 47 31 12 175
103 35 34 6 89 164
104 66 83 70 50 269
105 65 88 87 87 327
Deleting Column and Row
dfDelCol.py
1 df.drop('total',axis=1,inplace=True)
2 print(df)
Output
PDS Algo SE INS
101 0 23 93 46
102 85 47 31 12
103 35 34 6 89
104 66 83 70 50
105 65 88 87 87
Data Frames (Cont.)
Getting Subset of Data Frame
dfGrabSubSet.py
1 print(df.loc[[101,104], ['PDS','INS']])
Output
PDS INS
101 0 46
104 66 50
Conditional Selection
Similar to NumPy we can do conditional selection in pandas.
dfCondSel.py
1 import numpy as np
2 import pandas as pd
3 np.random.seed(121)
4 randArr = np.random.randint(0,100,20).reshape(5,4)
5 df = pd.DataFrame(randArr,np.arange(101,106,1),['PDS','Algo','SE','INS'])
6 print(df)
7 print(df>50)
Output
PDS Algo SE INS
101 66 85 8 95
102 65 52 83 96
103 46 34 52 60
104 54 3 94 52
105 57 75 88 39
PDS Algo SE INS
101 True True False True
102 True True True True
103 False False True True
104 True False True True
105 True True True False
Note : we have used the np.random.seed() method and set the seed to 121, so that the random
numbers you generate match the ones shown here.
Conditional Selection (Cont.)
We can then use this boolean DataFrame to get associated values.
dfCondSel.py Output
1 dfBool = df > 50 PDS Algo SE INS
2 print(df[dfBool]) 101 66 85 NaN 95
102 65 52 83 96
Note : It will set NaN (Not a Number) in case of False 103 NaN NaN 52 60
104 54 NaN 94 52
105 57 75 88 NaN
Setting/Resetting index
In our previous examples our index did not have a name; if we want to give a name to our
index, we can set it using the DataFrame.index.name property.
dfIndexName.py
1 df.index.name = 'RollNo' # index.name is a property, not a method
2 print(df)
Output
PDS Algo SE INS
RollNo
101 66 85 8 95
102 65 52 83 96
103 46 34 52 60
104 54 3 94 52
105 57 75 88 39
Note: Our index now has a name.
Setting/Resetting index (Cont.)
set_index(new_index)
dfSetIndex.py
1 df.set_index('PDS') # use inplace=True to persist the change
Output
    Algo SE INS
PDS
66    85  8  95
65    52 83  96
46    34 52  60
54     3 94  52
57    75 88  39
Note: We have PDS as our index now.
reset_index()
dfResetIndex.py
1 df.reset_index()
Output
  RollNo PDS Algo SE INS
0 101 66 85 8 95
1 102 65 52 83 96
2 103 46 34 52 60
3 104 54 3 94 52
4 105 57 75 88 39
Note: Our RollNo index becomes a regular column, and we now have a zero-based
numeric index.
Multi-Index DataFrame
Hierarchical indexes (AKA multiindexes) help us to organize, find, and aggregate information
faster at almost no cost.
Example where we need Hierarchical indexes
Numeric Index/Single Index :
  Col     Dep Sem RN  S1 S2 S3
0 ABC     CE  5   101 50 60 70
1 ABC     CE  5   102 48 70 25
2 ABC     CE  7   101 58 59 51
3 ABC     ME  5   101 30 35 39
4 ABC     ME  5   102 50 90 48
5 Darshan CE  5   101 88 99 77
6 Darshan CE  5   102 99 84 76
7 Darshan CE  7   101 88 77 99
8 Darshan ME  5   101 44 88 99
Multi Index :
                 RN  S1 S2 S3
Col     Dep Sem
ABC     CE  5    101 50 60 70
            5    102 48 70 25
            7    101 58 59 51
        ME  5    101 30 35 39
            5    102 50 90 48
Darshan CE  5    101 88 99 77
            5    102 99 84 76
            7    101 88 77 99
        ME  5    101 44 88 99
Multi-Index DataFrame (Cont.)
Creating a multi-index is as simple as creating a single index using the set_index method; the
only difference is that for a multi-index we need to provide a list of columns instead of a
single string index. Let us see an example:
dfMultiIndex.py
1 dfMulti = pd.read_csv('MultiIndexDemo.csv')
2 dfMulti.set_index(['Col','Dep','Sem'], inplace=True)
3 print(dfMulti)
Output
                RN  S1 S2 S3
Col     Dep Sem
ABC     CE  5   101 50 60 70
            5   102 48 70 25
            7   101 58 59 51
        ME  5   101 30 35 39
            5   102 50 90 48
Darshan CE  5   101 88 99 77
            5   102 99 84 76
            7   101 88 77 99
        ME  5   101 44 88 99
Multi-Index DataFrame (Cont.)
Now we have multi-indexed DataFrame from which we can access data using multiple index
For Example Output (Darshan)
Sub DataFrame for all the students of Darshan RN S1 S2 S3
dfGrabDarshanStu.py Dep Sem
1 print(dfMulti.loc['Darshan']) CE 5 101 88 99 77
5 102 99 84 76
7 101 88 77 99
ME 5 101 44 88 99
Output (Darshan->CE)
RN S1 S2 S3
Sem
5 101 88 99 77
dfGrabDarshanCEStu.py 5 102 99 84 76
Sub DataFrame for Computer Engineering
7 101 88 77 99
1 print(dfMulti.loc['Darshan','CE'])
students from Darshan
Reading in Multiindexed DataFrame directly from CSV
read_csv function of pandas provides an easy way to create a multi-indexed DataFrame directly
while reading the CSV file.
dfMultiIndexCSV.py
1 dfMultiCSV = pd.read_csv('MultiIndexDemo.csv', index_col=[0,1,2])
2 # for a multi-index in columns we can use the header parameter
3 print(dfMultiCSV)
Output
                RN  S1 S2 S3
Col     Dep Sem
ABC     CE  5   101 50 60 70
            5   102 48 70 25
            7   101 58 59 51
        ME  5   101 30 35 39
            5   102 50 90 48
Darshan CE  5   101 88 99 77
            5   102 99 84 76
            7   101 88 77 99
        ME  5   101 44 88 99
Cross Sections in DataFrame
The xs() function is used to get a cross-section from the Series/DataFrame.
This method takes a key argument to select data at a particular level of a MultiIndex.
Parameters :
key : label
axis : axis to retrieve the cross-section from
level : level of the key
drop_level : False if you want to preserve the level
Syntax :
syntax
DataFrame.xs(key, axis=0, level=None, drop_level=True)
Example :
dfXsDemo.py
1 dfMultiCSV = pd.read_csv('MultiIndexDemo.csv', index_col=[0,1,2])
2 print(dfMultiCSV)
3 print(dfMultiCSV.xs('CE',axis=0,level='Dep'))
Output (of the xs call; the first print shows the full DataFrame as on the previous slide)
            RN  S1 S2 S3
Col     Sem
ABC     5   101 50 60 70
        5   102 48 70 25
        7   101 58 59 51
Darshan 5   101 88 99 77
        5   102 99 84 76
        7   101 88 77 99
Dealing with Missing Data
There are many methods by which we can deal with missing data; some of the most common
are listed below:
dropna, will drop (delete) the missing data (rows/cols)
fillna, will fill specified values in place of missing data
interpolate, will interpolate missing data and fill the interpolated value in place of missing data
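A minimal sketch of the three methods on a small hand-made DataFrame (the data here is illustrative, not from the slides):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                   'B': [4.0, 5.0, np.nan]})
print(df.dropna())        # keeps only the rows with no NaN
print(df.fillna(0))       # every NaN replaced with 0
print(df.interpolate())   # NaN in 'A' becomes 2.0 (linear interpolation)
```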
Groupby in Pandas
Any groupby operation involves one of the following operations on the original object:
Splitting the object
Applying a function
Combining the results
In many situations, we split the data into sets and apply some functionality on each subset.
We can perform the following operations:
Aggregation − computing a summary statistic
Transformation − perform some group-specific operation
Filtration − discarding the data with some condition
Basic ways to use the groupby method:
df.groupby('key')
df.groupby(['key1','key2'])
df.groupby(key,axis=1)
Example data :
College Enno CPI
Darshan 123 8.9
Darshan 124 9.2
Darshan 125 7.8
Darshan 128 8.7
ABC 211 5.6
ABC 212 6.2
ABC 215 3.2
ABC 218 4.2
XYZ 312 5.2
XYZ 315 6.5
XYZ 315 5.8
Result of grouping by College and taking the mean :
College Mean CPI
Darshan 8.65
ABC 4.8
XYZ 5.83
Groupby in Pandas (Cont.)
Example : Listing all the groups
dfGroup.py Output
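A minimal sketch of listing groups, using an illustrative DataFrame (the groups attribute maps each group key to the row labels of that group):

```python
import pandas as pd

# Illustrative data (not from the slides' CSV files)
df = pd.DataFrame({'College': ['Darshan', 'Darshan', 'ABC'],
                   'CPI': [8.9, 9.2, 5.6]})

g = df.groupby('College')
print(g.groups)   # mapping: group key -> row labels in that group
print(len(g))     # number of groups
```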
Groupby in Pandas (Cont.)
Example : Group by multiple columns
dfGroupMul.py Output
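A minimal sketch of grouping by multiple columns, using an illustrative DataFrame (each group key becomes a (College, Dep) tuple and the result is multi-indexed):

```python
import pandas as pd

# Illustrative data (not from the slides' CSV files)
df = pd.DataFrame({'College': ['Darshan', 'Darshan', 'ABC'],
                   'Dep':     ['CE', 'ME', 'CE'],
                   'CPI':     [8.9, 9.2, 5.6]})

meanCPI = df.groupby(['College', 'Dep']).mean()  # multi-indexed result
print(meanCPI)
```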
Groupby in Pandas (Cont.)
Example : Iterating through groups
dfGroupIter.py
1 dfIPL = pd.read_csv('IPLDataSet.csv')
2 groupIPL = dfIPL.groupby('Year')
3 for name,group in groupIPL :
4     print(name)
5     print(group)
Output
2014
   Team Rank Year Points
0  Riders 1 2014 876
2  Devils 2 2014 863
4  Kings 3 2014 741
9  Royals 4 2014 701
2015
   Team Rank Year Points
1  Riders 2 2015 789
3  Devils 3 2015 673
5  kings 4 2015 812
10 Royals 1 2015 804
2016
   Team Rank Year Points
6  Kings 1 2016 756
8  Riders 2 2016 694
2017
   Team Rank Year Points
7  Kings 1 2017 788
11 Riders 2 2017 690
Groupby in Pandas (Cont.)
Example : Aggregating groups
dfGroupAgg.py
1 dfSales = pd.read_csv('SalesDataSet.csv')
2 print(dfSales.groupby(['YEAR_ID']).count()['QUANTITYORDERED'])
3 print(dfSales.groupby(['YEAR_ID']).sum()['QUANTITYORDERED'])
4 print(dfSales.groupby(['YEAR_ID']).mean()['QUANTITYORDERED'])
Output
YEAR_ID
2003 1000
2004 1345
2005 478
Name: QUANTITYORDERED, dtype: int64
YEAR_ID
2003 34612
2004 46824
2005 17631
Name: QUANTITYORDERED, dtype: int64
YEAR_ID
2003 34.612000
2004 34.813383
2005 36.884937
Name: QUANTITYORDERED, dtype: float64
Groupby in Pandas (Cont.)
Example : Describe details
dfGroupDesc.py
1 dfIPL = pd.read_csv('IPLDataSet.csv')
2 print(dfIPL.groupby('Year').describe()['Points'])
Output
      count   mean        std    min    25%    50%     75%    max
Year
2014    4.0 795.25  87.439026  701.0  731.0  802.0  866.25  876.0
2015    4.0 769.50  65.035888  673.0  760.0  796.5  806.00  812.0
2016    2.0 725.00  43.840620  694.0  709.5  725.0  740.50  756.0
2017    2.0 739.00  69.296465  690.0  714.5  739.0  763.50  788.0
Concatenation in Pandas
Concatenation basically glues together DataFrames.
Keep in mind that dimensions should match along the axis you are concatenating on.
You can use pd.concat and pass in a list of DataFrames to concatenate together:
dfConcat.py Output
1 dfCX = pd.read_csv('CX_Marks.csv',index_col=0) PDS Algo SE
2 dfCY = pd.read_csv('CY_Marks.csv',index_col=0) 101 50 55 60
3 dfCZ = pd.read_csv('CZ_Marks.csv',index_col=0) 102 70 80 61
4 dfAllStudent = pd.concat([dfCX,dfCY,dfCZ]) 103 55 89 70
5 print(dfAllStudent) 104 58 96 85
201 77 96 63
202 44 78 32
Note : We can use axis=1 parameter to concat columns. 203 55 85 21
204 69 66 54
301 11 75 88
302 22 48 77
303 33 59 68
304 44 55 62
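The axis=1 variant mentioned in the note can be sketched with two small illustrative frames (column-wise gluing aligns rows by index):

```python
import pandas as pd

df1 = pd.DataFrame({'PDS': [50, 70]}, index=[101, 102])
df2 = pd.DataFrame({'INS': [55, 66]}, index=[101, 102])

# axis=1 glues the frames side by side, matching on the row index
dfWide = pd.concat([df1, df2], axis=1)
print(dfWide)
```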
Join in Pandas
df.join() method will efficiently join multiple DataFrame objects by index (or by a specified
column).
Some important parameters :
dfOther : right DataFrame
on (not recommended) : specify the column on which we want to join (default is index)
how : how to handle the operation of the two objects.
left: use the calling frame's index (default).
right: use dfOther's index.
outer: form the union of the calling frame's index (or column if on is specified) with the other's index, and sort it lexicographically.
inner: form the intersection of the calling frame's index (or column if on is specified) with the other's index, preserving the order of the calling frame's index.
Join in Pandas (Example)
dfJoin.py
1 dfINS = pd.read_csv('INS_Marks.csv',index_col=0)
2 dfLeftJoin = dfAllStudent.join(dfINS)
3 print(dfLeftJoin)
4 dfRightJoin = dfAllStudent.join(dfINS,how='right')
5 print(dfRightJoin)
Output - 1
     PDS  Algo  SE   INS
101   50    55  60  55.0
102   70    80  61  66.0
103   55    89  70  77.0
104   58    96  85  88.0
201   77    96  63  66.0
202   44    78  32   NaN
203   55    85  21  78.0
204   69    66  54  85.0
301   11    75  88  11.0
302   22    48  77  22.0
303   33    59  68  33.0
304   44    55  62  44.0
Output - 2
     PDS  Algo  SE  INS
301   11    75  88   11
302   22    48  77   22
303   33    59  68   33
304   44    55  62   44
101   50    55  60   55
102   70    80  61   66
103   55    89  70   77
104   58    96  85   88
201   77    96  63   66
203   55    85  21   78
204   69    66  54   85
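A minimal self-contained sketch of the same left/right join behaviour, using small hypothetical stand-ins for the marks files (note how the student with no INS mark becomes NaN in the left join and disappears in the right join):

```python
import pandas as pd

# Hypothetical stand-ins for the marks frames on the slide.
dfAllStudent = pd.DataFrame({'PDS': [50, 70, 44], 'Algo': [55, 80, 78]},
                            index=[101, 102, 202])
dfINS = pd.DataFrame({'INS': [55, 66]}, index=[101, 102])

# Left join (default): keeps every row of the calling frame;
# roll 202 has no INS mark, so its INS cell is NaN.
dfLeft = dfAllStudent.join(dfINS)

# Right join: keeps only the rows whose index appears in dfINS.
dfRight = dfAllStudent.join(dfINS, how='right')
print(dfLeft)
```

The NaN also explains why the INS column turns into floats in the left join: integer columns cannot hold NaN, so pandas upcasts.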
Merge in Pandas
Merge DataFrame or named Series objects with a database-style join.
Similar to the join method, but used when we want to join/merge on columns instead of the index.
Some important parameters :
dfOther : the right DataFrame.
on : the column on which to join.
left_on : the column of the left DataFrame.
right_on : the column of the right DataFrame.
how : how to handle the operation of the two objects.
left : use only keys from the left frame.
right : use only keys from the right frame.
outer : use the union of keys from both frames, sorted lexicographically.
inner : use the intersection of keys from both frames, preserving the order of the left keys (default for merge).
Merge in Pandas (Example)
dfMerge.py
1 m1 = pd.read_csv('Merge1.csv')
2 print(m1)
3 m2 = pd.read_csv('Merge2.csv')
4 print(m2)
5 m3 = m1.merge(m2,on='EnNo')
6 print(m3)
Output
   RollNo      EnNo Name
0     101  11112222  Abc
1     102  11113333  Xyz
2     103  22224444  Def

       EnNo  PDS  INS
0  11112222   50   60
1  11113333   60   70
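The same merge can be reproduced without the CSV files by constructing the two frames directly (values copied from the slide's output); roll 103 has no marks row, so the default inner merge drops it:

```python
import pandas as pd

# Stand-ins for Merge1.csv / Merge2.csv, built from the slide's output.
m1 = pd.DataFrame({'RollNo': [101, 102, 103],
                   'EnNo': [11112222, 11113333, 22224444],
                   'Name': ['Abc', 'Xyz', 'Def']})
m2 = pd.DataFrame({'EnNo': [11112222, 11113333],
                   'PDS': [50, 60],
                   'INS': [60, 70]})

# Database-style inner join on the EnNo column:
# only EnNo values present in BOTH frames survive.
m3 = m1.merge(m2, on='EnNo')
print(m3)
```

The result has the left frame's columns first, followed by the right frame's non-key columns.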
Read CSV in Pandas
read_csv() is used to read a Comma Separated Values (CSV) file into a pandas DataFrame.
Some important parameters :
filePath : str, path object, or file-like object.
sep : separator (default is a comma).
header : row number(s) to use as the column names.
index_col : index column(s) of the data frame.
readCSV.py
1 dfINS = pd.read_csv('Marks.csv',index_col=0,header=0)
2 print(dfINS)
Output
     PDS  Algo  SE   INS
101   50    55  60  55.0
102   70    80  61  66.0
103   55    89  70  77.0
104   58    96  85  88.0
201   77    96  63  66.0
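Because read_csv accepts any file-like object, the header/index_col parameters can be demonstrated without a file on disk by wrapping a CSV string in io.StringIO (the data below is hypothetical):

```python
import io
import pandas as pd

# In-memory CSV text standing in for Marks.csv.
csvText = io.StringIO(
    "RollNo,PDS,Algo\n"
    "101,50,55\n"
    "102,70,80\n"
)

# header=0: the first line supplies the column names;
# index_col=0: the RollNo column becomes the DataFrame index.
df = pd.read_csv(csvText, index_col=0, header=0)
print(df)
```

The resulting frame is indexed by roll number, so rows can be fetched with df.loc[101].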
Read Excel in Pandas
read_excel() reads an Excel file into a pandas DataFrame.
Supports xls, xlsx, xlsm, xlsb, odf, ods and odt file extensions read from a local filesystem or URL. Supports an option to read a single sheet or a list of sheets.
Some important parameters :
excelFile : str, bytes, ExcelFile, xlrd.Book, path object, or file-like object.
sheet_name : sheet number (integer) or the name of the sheet; can also be a list of sheets.
index_col : index column of the data frame.
Read from MySQL Database
We need two libraries for this :
conda install sqlalchemy
conda install pymysql
After installing both libraries, import create_engine from sqlalchemy and import pymysql.
importsForDB.py
1 from sqlalchemy import create_engine
2 import pymysql
Then create a database connection string and create an engine from it.
createEngine.py
1 db_connection_str = 'mysql+pymysql://username:password@host/dbname'
2 db_connection = create_engine(db_connection_str)
Read from MySQL Database (Cont.)
After getting the engine, we can run any SQL query using the pd.read_sql method.
read_sql is a generic method which can be used to read from any SQL database (MySQL, MSSQL, Oracle, etc.).
readSQLDemo.py
1 df = pd.read_sql('SELECT * FROM cities', con=db_connection)
2 print(df)
Output
CityID CityName CityDescription CityCode
0 1 Rajkot Rajkot Description here RJT
1 2 Ahemdabad Ahemdabad Description here ADI
2 3 Surat Surat Description here SRT
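No MySQL server is available here, so the read_sql pattern can be sketched with the built-in sqlite3 module instead (pd.read_sql accepts a plain DBAPI connection for SQLite; the cities data below is hypothetical):

```python
import sqlite3
import pandas as pd

# In-memory SQLite database standing in for the MySQL engine.
con = sqlite3.connect(':memory:')
con.execute("CREATE TABLE cities (CityID INTEGER, CityName TEXT, CityCode TEXT)")
con.executemany("INSERT INTO cities VALUES (?, ?, ?)",
                [(1, 'Rajkot', 'RJT'), (2, 'Surat', 'SRT')])

# Same call shape as the MySQL example: any SELECT becomes a DataFrame.
df = pd.read_sql('SELECT * FROM cities', con=con)
print(df)
```

With a real MySQL server, only the connection object changes: pass the SQLAlchemy engine from the previous slide as con instead.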
Web Scraping using Beautiful Soup
Beautiful Soup is a library that makes it easy to scrape information from web pages.
It sits atop an HTML or XML parser, providing Pythonic idioms for iterating, searching, and modifying the parse tree.
webScrap.py
1 import requests
2 import bs4
3 req = requests.get('https://www.darshan.ac.in/DIET/CE/Faculty')
4 soup = bs4.BeautifulSoup(req.text,'lxml')
5 allFaculty = soup.select('body > main > section:nth-child(5) > div > div > div.col-lg-8.col-xl-9 > div > div')
6 for fac in allFaculty :
7     allSpans = fac.select('h2>a')
8     print(allSpans[0].text.strip())
Output
Dr. Gopi Sanghani
Dr. Nilesh Gambhava
Dr. Pradyumansinh Jadeja
Prof. Hardik Doshi
Prof. Maulik Trivedi
Prof. Dixita Kagathara
Prof. Firoz Sherasiya
Prof. Rupesh Vaishnav
Prof. Swati Sharma
Prof. Arjun Bala
Prof. Mayur Padia
…..
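The selector logic can be tried without a network request by parsing a small in-memory page (the HTML below is a made-up stand-in for the faculty listing; html.parser is used here to avoid the extra lxml dependency):

```python
import bs4

# Hypothetical fragment mimicking the structure the slide's selector targets.
html = """
<div class="faculty"><h2><a> Dr. Gopi Sanghani </a></h2></div>
<div class="faculty"><h2><a> Prof. Arjun Bala </a></h2></div>
"""

# Parse, then use a CSS selector to pull the anchor inside each h2,
# exactly as the slide does with fac.select('h2>a').
soup = bs4.BeautifulSoup(html, 'html.parser')
names = [a.text.strip() for a in soup.select('div.faculty h2 > a')]
print(names)
```

On a live page, the same pattern applies after fetching the HTML with requests.get and feeding req.text to BeautifulSoup.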