
Python GTU Study Material Presentations Unit-3 20112020032538AM

This document provides an overview of capturing, preparing, and working with data in Python. It discusses basic file input/output operations in Python using open(), read(), write(), and close() methods. It also compares NumPy and Pandas, indicating that NumPy is best for performance while Pandas provides richer functionality and easier coding. The document then focuses on NumPy, introducing the Python library for scientific computing with multi-dimensional arrays and matrices.


Python for Data Science (PDS) (3150713)

Unit-03
Capturing, Preparing
and Working with data

Prof. Arjun V. Bala


Computer Engineering Department
Darshan Institute of Engineering & Technology, Rajkot
[email protected]
9624822202
 Outline
Basic File IO in Python


NumPy V/S Pandas (what to use?)
NumPy
Pandas
Accessing text, CSV, Excel files using pandas
Accessing SQL Database
Web Scraping using BeautifulSoup
Basic IO operations in Python
 Before we can read or write a file, we have to open it using Python's built-in open() function.
syntax
fileobject = open(filename [, accessmode][, buffering])

 filename is the name of the file we want to open.
 accessmode determines the mode in which the file has to be opened (list of possible values given below).
 If buffering is set to 0, no buffering will happen; if set to 1, line buffering will happen; if an integer greater than 1 is given, it is used as the buffer size; and if a negative value is given, the system default buffering behaviour is used. (In Python 3, unbuffered mode is allowed only for files opened in binary mode.)

M     Description
r     Read only (default)
rb    Read only, in binary format
r+    Read and write both
rb+   Read and write both, in binary format
w     Write only (creates the file if it does not exist)
wb    Write only, in binary format (creates the file if it does not exist)
w+    Read and write both (creates the file if it does not exist)
wb+   Read and write both, in binary format (creates the file if it does not exist)
a     Append; if the file does not exist, creates it for writing
ab    Append in binary format; if the file does not exist, creates it for writing
a+    Append; if the file does not exist, creates it for read & write both
ab+   Append in binary format; if the file does not exist, creates it for read & write both
Prof. Arjun V. Bala #3150713 (PDS)  Unit 03 – Capturing, Preparing and Working with Data 3
Example : Read file in Python
 read(size) will read up to the specified number of bytes from the file; if we don't specify size it will return the whole file.
readfile.py college.txt
1 f = open('college.txt') Darshan Institute of Engineering and Technology - Rajkot
2 data = f.read() At Hadala, Rajkot - Morbi Highway,
3 print(data) Gujarat-363650, INDIA

 readlines() method will return a list of lines from the file.


readlines.py OUTPUT
1 f = open('college.txt') ['Darshan Institute of Engineering and Technology -
2 lines = f.readlines() Rajkot\n', 'At Hadala, Rajkot - Morbi Highway,\n',
3 print(lines) 'Gujarat-363650, INDIA']

 We can use a for loop to get each line separately,


readlinesfor.py OUTPUT
1 f = open('college.txt') Darshan Institute of Engineering and Technology - Rajkot
2 lines = f.readlines()
3 for l in lines : At Hadala, Rajkot - Morbi Highway,
4     print(l)
Gujarat-363650, INDIA
How to write path?
 We can specify a relative path in the argument to the open method; alternatively we can also specify an absolute path.
 To specify an absolute path,
 In Windows, f = open('D:\\folder\\subfolder\\filename.txt')
 In macOS & Linux, f = open('/user/folder/subfolder/filename.txt')

 We are supposed to close the file once we are done using it, by calling the close() method.
closefile.py
1 f = open('college.txt')
2 data = f.read()
3 print(data)
4 f.close()

Handling errors using “with” keyword
 It is possible that we have a typo in the filename, or that the file we specified has been moved/deleted; in such cases there will be an error while running the program.
 To handle such situations we can use the new syntax of opening the file using the with keyword.
fileusingwith.py
1 with open('college.txt') as f :
2     data = f.read()
3     print(data)

 When we open a file using with, we do not need to close it; it is closed automatically when the with block exits.
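Note that with only guarantees the file is closed; it does not by itself handle a missing file. A minimal sketch combining try/except with the with statement (the path below is a hypothetical stand-in built in the temp directory so the example is self-contained):

```python
import os
import tempfile

# Hypothetical path that is assumed NOT to exist
path = os.path.join(tempfile.gettempdir(), 'no_such_college_file.txt')
if os.path.exists(path):
    os.remove(path)  # make sure the demo file really is absent

try:
    with open(path) as f:   # closed automatically when the block exits
        data = f.read()
except FileNotFoundError:
    data = ''               # fall back to empty content on a missing file
    print('file not found:', path)
```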

Example : Write file in Python
 write() method will write the specified data to the file.
writedemo.py
1 with open('college.txt','a') as f :
2     f.write('Hello world')

 If we open a file in 'w' mode it will overwrite the existing file, or create a new file if it does not exist.
 If we open a file in 'a' mode it will append the data at the end of the existing file, or create a new file if it does not exist.
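The difference between 'w' and 'a' can be seen in a small sketch (a temporary file is used here so the example is self-contained):

```python
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), 'mode_demo.txt')

with open(path, 'w') as f:   # 'w' creates the file (or truncates an existing one)
    f.write('first\n')
with open(path, 'a') as f:   # 'a' keeps the old content and appends at the end
    f.write('second\n')

with open(path) as f:
    appended = f.read()      # 'first\nsecond\n'

with open(path, 'w') as f:   # opening with 'w' again discards everything
    f.write('third\n')

with open(path) as f:
    overwritten = f.read()   # only 'third\n' remains

print(appended)
print(overwritten)
os.remove(path)
```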

Reading CSV files without any library functions
 A comma-separated values (CSV) file is a delimited text file that uses a comma to separate values.
 Each line of the file is a data record; each record consists of a number of fields, separated by commas.
 We can use Microsoft Excel to access CSV files.
 In the later sessions we will access CSV files using different libraries, but we can also access CSV files without any library. (Not recommended)
 Example : Book1.csv
studentname,enrollment,cpi
abcd,123456,8.5
bcde,456789,2.5
cdef,321654,7.6
readcsv.py
1 with open('Book1.csv') as f :
2     rows = f.readlines()
3 isFirstLine = True
4 for r in rows :
5     if isFirstLine :
6         isFirstLine = False
7         continue
8     cols = r.split(',')
9     print('Student Name = ', cols[0], end=" ")
10    print('\tEn. No. = ', cols[1], end=" ")
11    print('\tCPI = \t', cols[2])
NumPy v/s Pandas
 Developers built pandas on top of NumPy; as a result, every task we perform using pandas also goes through NumPy.
 To obtain the benefits of pandas, we pay a performance penalty that some testers say is up to 100 times slower than NumPy for similar tasks.
 Nowadays computer hardware is powerful enough to take care of the performance issue, but when speed of execution is essential, NumPy is always the best choice.
 We can use pandas to make writing code easier and faster; pandas will reduce potential coding errors.
 Pandas provides rich time-series functionality, data alignment, NA-friendly statistics, groupby, merge, etc.; if we used NumPy alone we would have to implement all these methods manually.
 So,
 if we want performance we should use NumPy,
 if we want ease of coding we should use pandas.
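A rough timing sketch of this trade-off; the exact numbers are machine-dependent (and the 100x figure above is only a reported worst case), so no particular ratio should be expected:

```python
import timeit
import numpy as np
import pandas as pd

arr = np.arange(1_000_000)
ser = pd.Series(arr)

# Time the same aggregation through both libraries
t_np = timeit.timeit(arr.sum, number=100)
t_pd = timeit.timeit(ser.sum, number=100)
print(f'NumPy sum : {t_np:.4f} s')
print(f'pandas sum: {t_pd:.4f} s')
```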

Python for Data Science (PDS) (3150713)

Unit-03.01
Let's Learn
NumPy

NumPy
 NumPy (Numerical Python) is a Python library to manipulate arrays.
 Almost all the data-science libraries in Python rely on NumPy as one of their main building blocks.
 NumPy provides functions for domains like linear algebra, Fourier transforms, etc.
 NumPy is incredibly fast as it has bindings to C libraries.
 Install :
 conda install numpy
OR  pip install numpy

NumPy Array
 The most important object defined in NumPy is an N-dimensional array type called ndarray.
 It describes a collection of items of the same type; items in the collection can be accessed using a zero-based index.
 An instance of ndarray class can be constructed in many different ways, the basic ndarray can
be created as below.
syntax
import numpy as np
a= np.array(list | tuple | set | dict)

numpyarray.py Output
1 import numpy as np <class 'numpy.ndarray'>
2 a= np.array(['darshan','Insitute','rajkot']) ['darshan' 'Insitute' 'rajkot']
3 print(type(a))
4 print(a)

NumPy Array (Cont.)
 arange(start,end,step) function will create NumPy array starting from start till end (not
included) with specified steps.
numpyarange.py Output
1 import numpy as np [0 1 2 3 4 5 6 7 8 9]
2 b = np.arange(0,10,1)
3 print(b)

 zeros(n) function will return NumPy array of given shape, filled with zeros.
numpyzeros.py Output
1 import numpy as np [0. 0. 0.]
2 c = np.zeros(3)
3 print(c) [[0. 0. 0.] [0. 0. 0.] [0. 0. 0.]]
4 c1 = np.zeros((3,3)) #have to give as tuple
5 print(c1)

 ones(n) function will return NumPy array of given shape, filled with ones.
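For example, analogous to zeros():

```python
import numpy as np

d = np.ones(3)        # 1-D array of three ones
d1 = np.ones((3, 3))  # shape given as a tuple, like zeros()
print(d)              # [1. 1. 1.]
print(d1)
```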

NumPy Array (Cont.)
 eye(n) function will create 2-D NumPy array with ones on the diagonal and zeros elsewhere.
numpyeye.py Output
1 import numpy as np [[1. 0. 0.]
2 b = np.eye(3) [0. 1. 0.]
3 print(b) [0. 0. 1.]]

 linspace(start,stop,num) function will return evenly spaced numbers over a specified interval.
numpylinspace.py Output
1 import numpy as np [0. 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8
2 c = np.linspace(0,1,11) 0.9 1. ]
3 print(c)

 Note: in the arange function we give start, stop & step, whereas in the linspace function we give start, stop & the number of elements we want.
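The note above can be seen side by side:

```python
import numpy as np

a = np.arange(0, 1, 0.25)   # start, stop (excluded), step
b = np.linspace(0, 1, 5)    # start, stop (included), number of elements
print(a)                    # [0.   0.25 0.5  0.75]
print(b)                    # [0.   0.25 0.5  0.75 1.  ]
```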

Array Shape in NumPy
 We can grab the shape of ndarray using its shape property.
numpyarange.py Output
1 import numpy as np (3,3)
2 b = np.zeros((3,3))
3 print(b.shape)

 We can also reshape the array using reshape method of ndarray.


numpyarange.py Output
1 import numpy as np [[29 55]
2 re1 = np.random.randint(1,100,10) [44 50]
3 re2 = re1.reshape(5,2) [25 53]
4 print(re2) [59 6]
[93 7]]
 Note: the total number of elements must be preserved; the product of the rows and cols of the new shape must equal the number of elements in the old array.
 Example : here we have an old one-dimensional array of 10 elements and the reshaped shape is (5,2), so 5 * 2 = 10, which means it is a valid reshape.
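An invalid reshape raises a ValueError, which makes the rule easy to check:

```python
import numpy as np

re1 = np.arange(10)          # 10 elements
re2 = re1.reshape(5, 2)      # 5 * 2 = 10 -> valid
print(re2.shape)             # (5, 2)

try:
    re1.reshape(3, 4)        # 3 * 4 = 12 != 10 -> invalid
except ValueError as e:
    print('invalid reshape:', e)
```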

NumPy Random
 rand(p1,p2,...,pn) function will create an n-dimensional array with random data using a uniform
distribution; if we do not specify any parameter it will return a random float number.
numpyrand.py Output
1 import numpy as np 0.23937253208490505
2 r1 = np.random.rand()
3 print(r1) [[0.58924723 0.09677878]
4 r2 = np.random.rand(3,2) # no tuple [0.97945337 0.76537675]
5 print(r2) [0.73097381 0.51277276]]
 randint(low,high,num) function will create a one-dimensional array with num random integers
between low (inclusive) and high (exclusive).
numpyrandint.py Output
1 import numpy as np [78 78 17 98 19 26 81 67 23 24]
2 r3 = np.random.randint(1,100,10)
3 print(r3)

 We can reshape the array in any shape using reshape method, which we learned in previous
slide.
NumPy Random (Cont.)
 randn(p1,p2,...,pn) function will create an n-dimensional array with random data using the standard
normal distribution; if we do not specify any parameter it will return a random float number.
numpyrandn.py Output
1 import numpy as np -0.15359861758111037
2 r1 = np.random.randn()
3 print(r1) [[ 0.40967905 -0.21974532]
4 r2 = np.random.randn(3,2) # no tuple [-0.90341482 -0.69779498]
5 print(r2) [ 0.99444948 -1.45308348]]

 Note: rand function will generate random number using uniform distribution, whereas randn
function will generate random number using standard normal distribution.
 We are going to learn the difference using a visualization technique (as a data scientist, we have
to use visualization techniques to convince the audience).

Visualizing the difference between rand & randn
 We are going to use matplotlib library to visualize the difference.
 You need not worry if you do not get the matplotlib syntax yet; we are going to learn it in detail in Unit-4.
matplotdemo.py
1 import numpy as np
2 from matplotlib import pyplot as plt
3 %matplotlib inline
4 samplesize = 100000
5 uniform = np.random.rand(samplesize)
6 normal = np.random.randn(samplesize)
7 plt.hist(uniform,bins=100)
8 plt.title('rand: uniform')
9 plt.show()
10 plt.hist(normal,bins=100)
11 plt.title('randn: normal')
12 plt.show()

Aggregations
 min() function will return the minimum value from the ndarray; there are two ways in which we
can use the min function, examples of both ways are given below.
numpymin.py Output
1 import numpy as np Min way1 = 1
2 l = [1,5,3,8,2,3,6,7,5,2,9,11,2,5,3,4,8,9,3,1,9,3] Min way2 = 1
3 a = np.array(l)
4 print('Min way1 = ',a.min())
5 print('Min way2 = ',np.min(a))
 max() function will return the maximum value from the ndarray; there are two ways in which we
can use the max function, examples of both ways are given below.
numpymax.py Output
1 import numpy as np Max way1 = 11
2 l = [1,5,3,8,2,3,6,7,5,2,9,11,2,5,3,4,8,9,3,1,9,3] Max way2 = 11
3 a = np.array(l)
4 print('Max way1 = ',a.max())
5 print('Max way2 = ',np.max(a))

Aggregations (Cont.)
 NumPy support many aggregation functions such as min, max, argmin, argmax, sum, mean, std,
etc…
numpyagg.py
1 import numpy as np
2 l = [7,5,3,1,8,2,3,6,11,5,2,9,10,2,5,3,7,8,9,3,1,9,3]
3 a = np.array(l)
4 print('Min = ',a.min())
5 print('ArgMin = ',a.argmin())
6 print('Max = ',a.max())
7 print('ArgMax = ',a.argmax())
8 print('Sum = ',a.sum())
9 print('Mean = ',a.mean())
10 print('Std = ',a.std())
Output
Min = 1
ArgMin = 3
Max = 11
ArgMax = 8
Sum = 122
Mean = 5.304347826086956
Std = 3.042235771223635

Using axis argument with aggregate functions
 When we apply an aggregate function to a multidimensional ndarray without any arguments, it
aggregates over all the elements of all its dimensions (axes).
numpyaxis.py Output
1 import numpy as np sum = 45
2 array2d = np.array([[1,2,3],[4,5,6],[7,8,9]])
3 print('sum = ',array2d.sum())

 If we want to get sum of rows or cols we can use axis argument with the aggregate functions.
numpyaxis.py Output
1 import numpy as np sum (cols) = [12 15 18]
2 array2d = np.array([[1,2,3],[4,5,6],[7,8,9]]) sum (rows) = [6 15 24]
3 print('sum (cols)= ',array2d.sum(axis=0)) #Vertical
4 print('sum (rows)= ',array2d.sum(axis=1)) #Horizontal

Single V/S Double bracket notations
 There are two ways in which you can access element of multi-dimensional array, example of
both the method is given below
numpybrackets.py Output
1 arr = np.array([['a','b','c'],['d','e','f'], double = h
2 ['g','h','i']]) single = h
3 print('double = ',arr[2][1]) # double bracket notation
4 print('single = ',arr[2,1]) # single bracket notation
 Both methods are valid and provide exactly the same answer, but single bracket notation is
recommended, as double bracket notation first creates a temporary sub-array of the third row and
then fetches the second column from it.
 Single bracket notation is also easier to read and write while programming.

Slicing ndarray
 Slicing in python means taking elements from one given index to another given index.
 Similar to Python List, we can use same syntax array[start:end:step] to slice ndarray.
 Default start is 0
 Default end is length of the array
 Default step is 1

numpyslice1d.py
1 import numpy as np
2 arr = np.array(['a','b','c','d','e','f','g','h'])
3 print(arr[2:5])
4 print(arr[:5])
5 print(arr[5:])
6 print(arr[2:7:2])
7 print(arr[::-1])
Output
['c' 'd' 'e']
['a' 'b' 'c' 'd' 'e']
['f' 'g' 'h']
['c' 'e' 'g']
['h' 'g' 'f' 'e' 'd' 'c' 'b' 'a']

Array Slicing Example
 Example : (rows R-0 to R-4, columns C-0 to C-4)
a =
[[ 1  2  3  4  5]
 [ 6  7  8  9 10]
 [11 12 13 14 15]
 [16 17 18 19 20]
 [21 22 23 24 25]]
 a[2][3] =
 a[2,3] =
 a[2] =
 a[0:2] =
 a[0:2:2] =
 a[::-1] =
 a[1:3,1:3] =
 a[3:,:3] =
 a[:,::-1] =
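The grid above can be rebuilt with arange/reshape, and each slicing expression evaluated directly (a quick way to check your answers to the exercise):

```python
import numpy as np

a = np.arange(1, 26).reshape(5, 5)   # the 5x5 grid of 1..25 from the slide

print(a[2][3])       # 14 (row 2, then column 3)
print(a[2, 3])       # 14 (same element, single-bracket form)
print(a[2])          # [11 12 13 14 15]
print(a[0:2])        # rows 0 and 1
print(a[0:2:2])      # rows 0..1 with step 2 -> only row 0
print(a[::-1])       # rows in reverse order
print(a[1:3, 1:3])   # [[ 7  8] [12 13]]
print(a[3:, :3])     # [[16 17 18] [21 22 23]]
print(a[:, ::-1])    # columns in reverse order
```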

Slicing multi-dimensional array
 Slicing a multi-dimensional array works the same way as a single-dimensional array, with the help
of the single bracket notation we learned earlier; let's see an example.
numpyslice2d.py
1 arr = np.array([['a','b','c'],['d','e','f'],['g','h','i']])
2 print(arr[0:2 , 0:2]) #first two rows and cols
3 print(arr[::-1])      #reversed rows
4 print(arr[: , ::-1])  #reversed cols
5 print(arr[::-1,::-1]) #complete reverse
Output
[['a' 'b']
 ['d' 'e']]
[['g' 'h' 'i']
 ['d' 'e' 'f']
 ['a' 'b' 'c']]
[['c' 'b' 'a']
 ['f' 'e' 'd']
 ['i' 'h' 'g']]
[['i' 'h' 'g']
 ['f' 'e' 'd']
 ['c' 'b' 'a']]

Warning : Array Slicing is mutable !
 When we slice an array and apply some operation to the slice, it will also change the original
array: slicing does not create a copy, it creates a view of the original array.
 Example,
numpyslice1d.py Output
1 import numpy as np Original Array = [2 2 2 4 5]
2 arr = np.array([1,2,3,4,5]) Sliced Array = [2 2 2]
3 arrsliced = arr[0:3]
4
5 arrsliced[:] = 2 # Broadcasting
6
7 print('Original Array = ', arr)
8 print('Sliced Array = ',arrsliced)
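To keep the original array unchanged, take an explicit copy of the slice with the copy() method:

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])
arrsliced = arr[0:3].copy()   # copy() detaches the slice from the original
arrsliced[:] = 2              # broadcasting changes only the copy

print('Original Array = ', arr)        # [1 2 3 4 5] -- unchanged
print('Sliced Array   = ', arrsliced)  # [2 2 2]
```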

NumPy Arithmetic Operations
numpyop.py
1 import numpy as np
2 arr1 = np.array([[1,2,3],[1,2,3],[1,2,3]])
3 arr2 = np.array([[4,5,6],[4,5,6],[4,5,6]])
4 arradd1 = arr1 + 2    # addition of matrix and scalar
5 arradd2 = arr1 + arr2 # addition of two matrices
6 print('Addition Scalar = ', arradd1)
7 print('Addition Matrix = ', arradd2)
8 arrsub1 = arr1 - 2    # subtraction of scalar from matrix
9 arrsub2 = arr1 - arr2 # subtraction of two matrices
10 print('Subtraction Scalar = ', arrsub1)
11 print('Subtraction Matrix = ', arrsub2)
12 arrdiv1 = arr1 / 2    # division of matrix by scalar
13 arrdiv2 = arr1 / arr2 # element-wise division of two matrices
14 print('Division Scalar = ', arrdiv1)
15 print('Division Matrix = ', arrdiv2)
Output
Addition Scalar = [[3 4 5]
 [3 4 5]
 [3 4 5]]
Addition Matrix = [[5 7 9]
 [5 7 9]
 [5 7 9]]
Subtraction Scalar = [[-1 0 1]
 [-1 0 1]
 [-1 0 1]]
Subtraction Matrix = [[-3 -3 -3]
 [-3 -3 -3]
 [-3 -3 -3]]
Division Scalar = [[0.5 1.  1.5]
 [0.5 1.  1.5]
 [0.5 1.  1.5]]
Division Matrix = [[0.25 0.4  0.5 ]
 [0.25 0.4  0.5 ]
 [0.25 0.4  0.5 ]]
NumPy Arithmetic Operations (Cont.)
numpyop2.py
1 import numpy as np
2 arrmul1 = arr1 * 2    # multiply matrix by scalar
3 arrmul2 = arr1 * arr2 # element-wise product (NOT matrix multiplication)
4 print('Multiply Scalar = ', arrmul1)
5 print('Multiply Matrix = ', arrmul2)
6 # In order to do matrix multiplication
7 arrmatmul = np.matmul(arr1,arr2)
8 print('Matrix Multiplication = ',arrmatmul)
9 # OR
10 arrdot = arr1.dot(arr2)
11 print('Dot = ',arrdot)
12 # OR (Python 3.5+)
13 arrpy3dot5plus = arr1 @ arr2
14 print('Python 3.5+ support = ',arrpy3dot5plus)
Output
Multiply Scalar = [[2 4 6]
 [2 4 6]
 [2 4 6]]
Multiply Matrix = [[ 4 10 18]
 [ 4 10 18]
 [ 4 10 18]]
Matrix Multiplication = [[24 30 36]
 [24 30 36]
 [24 30 36]]
Dot = [[24 30 36]
 [24 30 36]
 [24 30 36]]
Python 3.5+ support = [[24 30 36]
 [24 30 36]
 [24 30 36]]

Sorting Array
 np.sort() returns a sorted copy of the input array, while the arr.sort() method sorts the array in place.
syntax
import numpy as np
# arr = our ndarray
np.sort(arr,axis,kind,order)
# OR arr.sort()
Parameters
arr = array to sort
axis = axis to sort along (default -1, i.e. the last axis)
kind = sorting algorithm ('quicksort' <- default, 'mergesort', 'heapsort')
order = field(s) on which to sort (for structured arrays with multiple fields)
 Example :
numpysort.py
1 import numpy as np
2 arr = np.array(['Darshan','Rajkot','Insitute','of','Engineering'])
3 print("Before Sorting = ", arr)
4 arr.sort() # or arr = np.sort(arr)
5 print("After Sorting = ",arr)
Output
Before Sorting = ['Darshan' 'Rajkot' 'Insitute' 'of' 'Engineering']
After Sorting = ['Darshan' 'Engineering' 'Insitute' 'Rajkot' 'of']
(Uppercase letters sort before lowercase ones, which is why 'of' comes last.)
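For a 2-D array, the axis parameter decides what gets sorted; a small sketch:

```python
import numpy as np

m = np.array([[3, 1],
              [2, 4]])
col_sorted = np.sort(m, axis=0)      # sort down each column
row_sorted = np.sort(m, axis=1)      # sort along each row (same as the default axis=-1)
flat_sorted = np.sort(m, axis=None)  # flatten first, then sort
print(col_sorted)    # [[2 1] [3 4]]
print(row_sorted)    # [[1 3] [2 4]]
print(flat_sorted)   # [1 2 3 4]
```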
Sort Array Example
numpysort2.py
1 import numpy as np
2 dt = np.dtype([('name', 'S10'),('age', int)])
3 arr2 = np.array([('Darshan',200),('ABC',300),('XYZ',100)],dtype=dt)
4 arr2.sort(order='name')
5 print(arr2)
Output
[(b'ABC', 300) (b'Darshan', 200) (b'XYZ', 100)]

Conditional Selection
 Similar to arithmetic operations, when we apply any comparison operator to a NumPy array, it is
applied to each element in the array, and a new boolean NumPy array is created with values
True or False.
numpycond1.py Output
1 import numpy as np [25 17 24 15 17 97 42 10 67
2 arr = np.random.randint(1,100,10) 22]
3 print(arr) [False False False False
4 boolArr = arr > 50 False True False False True
5 print(boolArr) False]
numpycond2.py Output
1 import numpy as np All = [31 94 25 70 23 9 11
2 arr = np.random.randint(1,100,10) 77 48 11]
3 print("All = ",arr) Filtered = [94 70 77]
4 boolArr = arr > 50
5 print("Filtered = ", arr[boolArr])
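Conditions can also be combined element-wise with & (and) and | (or); note the parentheses, which are required because of operator precedence:

```python
import numpy as np

arr = np.array([25, 17, 24, 15, 17, 97, 42, 10, 67, 22])
mask = (arr > 20) & (arr < 50)   # True only where both conditions hold
print(arr[mask])                 # [25 24 42 22]
```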

Python for Data Science (PDS) (3150713)

Unit-03.02
Let's Learn
Pandas

Pandas
 Pandas is an open source library built on top of NumPy.
 It allows for fast data cleaning, preparation and analysis.
 It excels in performance and productivity.
 It also has built-in visualization features.
 It can work with data from a wide variety of sources.
 Install :
 conda install pandas
OR  pip install pandas

 Outline

Series
Data Frames
Accessing text, CSV, Excel files using pandas
Accessing SQL Database
Missing Data
Group By
Merging, Joining & Concatenating
Operations
Series
 Series is a one-dimensional* array with axis labels.
 It supports both integer- and label-based indexing, but the index must be of a hashable type.
 If we do not specify an index, it will assign a zero-based integer index.
syntax Parameters
import pandas as pd data = array like Iterable
s = pd.Series(data,index,dtype,copy=False) index = array like index
dtype = data-type
copy = bool, default is False

pandasSeries.py Output
1 import pandas as pd 0 1
2 s = pd.Series([1, 3, 5, 7, 9, 11]) 1 3
3 print(s) 2 5
3 7
4 9
5 11
dtype: int64
Series (Cont.)
 We can then access the elements inside Series just like array using square brackets notation.
pdSeriesEle.py Output
1 import pandas as pd S[0] = 1
2 s = pd.Series([1, 3, 5, 7, 9, 11]) Sum = 4
3 print("S[0] = ", s[0])
4 b = s[0] + s[1]
5 print("Sum = ", b)

 We can specify the data type of Series using dtype parameter


pdSeriesdtype.py Output
1 import pandas as pd S[0] = 1
2 s = pd.Series([1, 3, 5, 7, 9, 11], dtype='str') Sum = 13
3 print("S[0] = ", s[0])
4 b = s[0] + s[1]
5 print("Sum = ", b)

Series (Cont.)
 We can specify index to Series with the help of index parameter
pdSeriesIndex.py
1 import pandas as pd
2 i = ['name','address','phone','email','website']
3 d = ['darshan','rj','123','[email protected]','darshan.ac.in']
4 s = pd.Series(data=d,index=i)
5 print(s)
Output
name                darshan
address                  rj
phone                   123
email      [email protected]
website       darshan.ac.in
dtype: object

Creating Time Series
 We can use some of pandas inbuilt date functions to create a time series.
pdSeriesEle.py Output
1 import numpy as np 2020-07-27 50
2 import pandas as pd 2020-07-28 53
3 dates = pd.to_datetime("27th of July, 2020") 2020-07-29 25
4 i = dates + pd.to_timedelta(np.arange(5), 2020-07-30 70
unit='D') 2020-07-31 60
5 d = [50,53,25,70,60] dtype: int64
6 time_series = pd.Series(data=d,index=i)
7 print(time_series)

Data Frames
 Data frames are two-dimensional data structures, i.e. data is aligned in a tabular format in rows
and columns.
 Data frame also contains labelled axes on rows and columns.
 Features of Data Frame :
 It is size-mutable
 Has labelled axes
 Columns can be of different data types
 We can perform arithmetic operations on rows and columns.
 Structure :
      PDS  Algo  SE  INS
101
102
103
….
160
Data Frames (Cont.)
 Syntax :
syntax Parameters
import pandas as pd data = array like Iterable
df = pd.DataFrame(data,index,columns,dtype,copy=False) index = array like row index
columns = array like col index
dtype = data-type
copy = bool, default is False
 Example :
pdDataFrame.py
1 import numpy as np
2 import pandas as pd
3 randArr = np.random.randint(0,100,20).reshape(5,4)
4 df = pd.DataFrame(randArr,np.arange(101,106,1),['PDS','Algo','SE','INS'])
5 print(df)
Output
     PDS  Algo  SE  INS
101    0    23  93   46
102   85    47  31   12
103   35    34   6   89
104   66    83  70   50
105   65    88  87   87

Data Frames (Cont.)
 Grabbing the column
dfGrabCol.py
1 import numpy as np
2 import pandas as pd
3 randArr = np.random.randint(0,100,20).reshape(5,4)
4 df = pd.DataFrame(randArr,np.arange(101,106,1),['PDS','Algo','SE','INS'])
5 print(df['PDS'])
Output
101     0
102    85
103    35
104    66
105    65
Name: PDS, dtype: int32
 Grabbing multiple columns (note the double brackets: we pass a list of column names)
dfGrabMulCol.py
1 print(df[['PDS', 'SE']])
Output
     PDS  SE
101    0  93
102   85  31
103   35   6
104   66  70
105   65  87
Data Frames (Cont.)
 Grabbing a row
dfGrabRow.py
1 print(df.loc[101]) # using labels
2 #OR
3 print(df.iloc[0])  # using zero based index
Output
PDS      0
Algo    23
SE      93
INS     46
Name: 101, dtype: int32
 Grabbing Single Value
dfGrabSingle.py
1 print(df.loc[101, 'PDS']) # using labels
Output
0

 Deleting Row (the row labels are integers, so we pass 103 rather than '103')
dfDelRow.py
1 df.drop(103,inplace=True)
2 print(df)
Output
     PDS  Algo  SE  INS
101    0    23  93   46
102   85    47  31   12
104   66    83  70   50
105   65    88  87   87

Data Frames (Cont.)
 Creating new column
dfCreateCol.py
1 df['total'] = df['PDS'] + df['Algo'] + df['SE'] + df['INS']
2 print(df)
Output
     PDS  Algo  SE  INS  total
101    0    23  93   46    162
102   85    47  31   12    175
103   35    34   6   89    164
104   66    83  70   50    269
105   65    88  87   87    327

 Deleting Column
dfDelCol.py
1 df.drop('total',axis=1,inplace=True)
2 print(df)
Output
     PDS  Algo  SE  INS
101    0    23  93   46
102   85    47  31   12
103   35    34   6   89
104   66    83  70   50
105   65    88  87   87

Data Frames (Cont.)
 Getting Subset of Data Frame
dfGrabSubSet.py
1 print(df.loc[[101,104], ['PDS','INS']])
Output
     PDS  INS
101    0   46
104   66   50

 Selecting all cols except one
dfGrabExcept.py
1 print(df.loc[:, df.columns != 'Algo' ])
Output
     PDS  SE  INS
101    0  93   46
102   85  31   12
103   35   6   89
104   66  70   50
105   65  87   87

Conditional Selection
 Similar to NumPy we can do conditional selection in pandas.
dfCondSel.py
1 import numpy as np
2 import pandas as pd
3 np.random.seed(121)
4 randArr = np.random.randint(0,100,20).reshape(5,4)
5 df = pd.DataFrame(randArr,np.arange(101,106,1),['PDS','Algo','SE','INS'])
6 print(df)
7 print(df>50)
Output
     PDS  Algo  SE  INS
101   66    85   8   95
102   65    52  83   96
103   46    34  52   60
104   54     3  94   52
105   57    75  88   39
       PDS   Algo     SE    INS
101   True   True  False   True
102   True   True   True   True
103  False  False   True   True
104   True  False   True   True
105   True   True   True  False
 Note: we have used the np.random.seed() method and set the seed to 121, so that the random
numbers you generate match the ones shown here.

Conditional Selection (Cont.)
 We can then use this boolean DataFrame to get associated values.
dfCondSelBool.py
1 dfBool = df > 50
2 print(df[dfBool])
Output
     PDS  Algo  SE  INS
101   66    85 NaN   95
102   65    52  83   96
103  NaN   NaN  52   60
104   54   NaN  94   52
105   57    75  88  NaN
 Note : It will set NaN (Not a Number) where the condition is False.

 We can apply a condition on a specific column.
dfCondSelCol.py
1 dfBool = df['PDS'] > 50
2 print(df[dfBool])
Output
     PDS  Algo  SE  INS
101   66    85   8   95
102   65    52  83   96
104   54     3  94   52
105   57    75  88   39

Setting/Resetting index
 In our previous examples our index did not have a name; if we want to give a name to our index,
we can set it using the DataFrame.index.name property.
dfIndexName.py
1 df.index.name = 'RollNo'
2 print(df)
Output
        PDS  Algo  SE  INS
RollNo
101      66    85   8   95
102      65    52  83   96
103      46    34  52   60
104      54     3  94   52
105      57    75  88   39
Note: our index now has a name.

 We can use pandas built-in methods to set or reset the index:
 df.set_index('NewColumn',inplace=True) will set the new column as the index,
 df.reset_index() will reset the index to a zero-based numeric index.

Setting/Resetting index (Cont.)
 set_index(new_index)
dfSetIndex.py
1 df.set_index('PDS') #inplace=True
Output
     Algo  SE  INS
PDS
66     85   8   95
65     52  83   96
46     34  52   60
54      3  94   52
57     75  88   39
Note: PDS is our index now.

 reset_index()
dfResetIndex.py
1 df.reset_index()
Output
   RollNo  PDS  Algo  SE  INS
0     101   66    85   8   95
1     102   65    52  83   96
2     103   46    34  52   60
3     104   54     3  94   52
4     105   57    75  88   39
Note: RollNo (the old index) becomes a new column, and we now have a zero-based numeric index.

Multi-Index DataFrame
 Hierarchical indexes (AKA multi-indexes) help us to organize, find, and aggregate information faster, at almost no cost.
 Example where we need Hierarchical indexes
Numeric Index / Single Index :
   Col      Dep  Sem   RN  S1  S2  S3
0  ABC      CE   5    101  50  60  70
1  ABC      CE   5    102  48  70  25
2  ABC      CE   7    101  58  59  51
3  ABC      ME   5    101  30  35  39
4  ABC      ME   5    102  50  90  48
5  Darshan  CE   5    101  88  99  77
6  Darshan  CE   5    102  99  84  76
7  Darshan  CE   7    101  88  77  99
8  Darshan  ME   5    101  44  88  99
Multi Index :
                  RN  S1  S2  S3
Col     Dep Sem
ABC     CE  5    101  50  60  70
            5    102  48  70  25
            7    101  58  59  51
        ME  5    101  30  35  39
            5    102  50  90  48
Darshan CE  5    101  88  99  77
            5    102  99  84  76
            7    101  88  77  99
        ME  5    101  44  88  99
Multi-Index DataFrame (Cont.)
 Creating multi-indexes is as simple as creating a single index using the set_index method; the only difference is that for multi-indexes we need to provide a list of indexes instead of a single string index. Let's see an example:
dfMultiIndex.py
1 dfMulti = pd.read_csv('MultiIndexDemo.csv')
2 dfMulti.set_index(['Col','Dep','Sem'], inplace=True)
3 print(dfMulti)
Output
                  RN  S1  S2  S3
Col     Dep Sem
ABC     CE  5    101  50  60  70
            5    102  48  70  25
            7    101  58  59  51
        ME  5    101  30  35  39
            5    102  50  90  48
Darshan CE  5    101  88  99  77
            5    102  99  84  76
            7    101  88  77  99
        ME  5    101  44  88  99
Multi-Index DataFrame (Cont.)
 Now we have multi-indexed DataFrame from which we can access data using multiple index
 For Example :
 Sub DataFrame for all the students of Darshan
dfGrabDarshanStu.py
1 print(dfMulti.loc['Darshan'])
Output (Darshan)
         RN  S1  S2  S3
Dep Sem
CE  5   101  88  99  77
    5   102  99  84  76
    7   101  88  77  99
ME  5   101  44  88  99
 Sub DataFrame for Computer Engineering students from Darshan
dfGrabDarshanCEStu.py
1 print(dfMulti.loc['Darshan','CE'])
Output (Darshan->CE)
     RN  S1  S2  S3
Sem
5   101  88  99  77
5   102  99  84  76
7   101  88  77  99
Reading in Multiindexed DataFrame directly from CSV
 The read_csv function of pandas provides an easy way to create a multi-indexed DataFrame directly while reading the CSV file.
dfMultiIndex.py
1 dfMultiCSV = pd.read_csv('MultiIndexDemo.csv', index_col=[0,1,2])
  # for multi-index in columns we can use the header parameter
2 print(dfMultiCSV)
Output
                  RN  S1  S2  S3
Col     Dep Sem
ABC     CE  5    101  50  60  70
            5    102  48  70  25
            7    101  58  59  51
        ME  5    101  30  35  39
            5    102  50  90  48
Darshan CE  5    101  88  99  77
            5    102  99  84  76
            7    101  88  77  99
        ME  5    101  44  88  99
Cross Sections in DataFrame
 The xs() function is used to get a cross-section from the Series/DataFrame.
 This method takes a key argument to select data at a particular level of a MultiIndex.
 Parameters :
 key : label
 axis : axis to retrieve the cross-section from
 level : level of the key
 drop_level : False if you want to preserve the level
 Syntax :
syntax
1 DataFrame.xs(key, axis=0, level=None, drop_level=True)
 Example :
dfMultiIndex.py
1 dfMultiCSV = pd.read_csv('MultiIndexDemo.csv', index_col=[0,1,2])
2 print(dfMultiCSV)
3 print(dfMultiCSV.xs('CE',axis=0,level='Dep'))
Output (full DataFrame)
                  RN  S1  S2  S3
Col     Dep Sem
ABC     CE  5    101  50  60  70
            5    102  48  70  25
            7    101  58  59  51
        ME  5    101  30  35  39
            5    102  50  90  48
Darshan CE  5    101  88  99  77
            5    102  99  84  76
            7    101  88  77  99
        ME  5    101  44  88  99
Output (xs('CE', level='Dep'))
              RN  S1  S2  S3
Col     Sem
ABC     5    101  50  60  70
        5    102  48  70  25
        7    101  58  59  51
Darshan 5    101  88  99  77
        5    102  99  84  76
        7    101  88  77  99
Dealing with Missing Data
 There are many methods by which we can deal with missing data; some of the most common are listed below,
 dropna, will drop (delete) the missing data (rows/cols)
 fillna, will fill specified values in place of missing data
 interpolate, will interpolate missing data and fill the interpolated value in place of missing data.
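 A minimal sketch of all three methods on a made-up DataFrame (the values are chosen only for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan, 3.0, 4.0],
                   'B': [10.0, 20.0, np.nan, 40.0]})

print(df.dropna())       # keeps only rows 0 and 3 (rows without any NaN)
print(df.fillna(0))      # every NaN replaced by 0
print(df.interpolate())  # NaN filled with linearly interpolated values:
                         # A[1] becomes 2.0, B[2] becomes 30.0
```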
Groupby in Pandas
 Any groupby operation involves one of the following operations on the original object:
 Splitting the object
 Applying a function
 Combining the results
 In many situations, we split the data into sets and we apply some functionality on each subset.
 We can perform the following operations:
 Aggregation − computing a summary statistic
 Transformation − perform some group-specific operation
 Filtration − discarding the data with some condition
 Basic ways to use the groupby method:
 df.groupby('key')
 df.groupby(['key1','key2'])
 df.groupby(key,axis=1)
 Example data (CPI per student, and the mean CPI per College after a groupby) :
College  Enno  CPI
Darshan  123   8.9
Darshan  124   9.2
Darshan  125   7.8
Darshan  128   8.7
ABC      211   5.6
ABC      212   6.2
ABC      215   3.2
ABC      218   4.2
XYZ      312   5.2
XYZ      315   6.5
XYZ      315   5.8
College  Mean CPI
Darshan  8.65
ABC      4.8
XYZ      5.83
Groupby in Pandas (Cont.)
 Example : Listing all the groups
dfGroup.py
1 dfIPL = pd.read_csv('IPLDataSet.csv')
2 print(dfIPL.groupby('Year').groups)
Output
{2014: Int64Index([0, 2, 4, 9], dtype='int64'),
 2015: Int64Index([1, 3, 5, 10], dtype='int64'),
 2016: Int64Index([6, 8], dtype='int64'),
 2017: Int64Index([7, 11], dtype='int64')}
Groupby in Pandas (Cont.)
 Example : Group by multiple columns
dfGroupMul.py
1 dfIPL = pd.read_csv('IPLDataSet.csv')
2 print(dfIPL.groupby(['Year','Team']).groups)
Output
{(2014, 'Devils'): Int64Index([2], dtype='int64'),
 (2014, 'Kings'): Int64Index([4], dtype='int64'),
 (2014, 'Riders'): Int64Index([0], dtype='int64'),
 ………
 (2016, 'Riders'): Int64Index([8], dtype='int64'),
 (2017, 'Kings'): Int64Index([7], dtype='int64'),
 (2017, 'Riders'): Int64Index([11], dtype='int64')}
Groupby in Pandas (Cont.)
 Example : Iterating through groups
dfGroupIter.py
1 dfIPL = pd.read_csv('IPLDataSet.csv')
2 groupIPL = dfIPL.groupby('Year')
3 for name,group in groupIPL :
4     print(name)
5     print(group)
Output
2014
     Team  Rank  Year  Points
0  Riders     1  2014     876
2  Devils     2  2014     863
4   Kings     3  2014     741
9  Royals     4  2014     701
2015
      Team  Rank  Year  Points
1   Riders     2  2015     789
3   Devils     3  2015     673
5    kings     4  2015     812
10  Royals     1  2015     804
2016
     Team  Rank  Year  Points
6   Kings     1  2016     756
8  Riders     2  2016     694
2017
      Team  Rank  Year  Points
7    Kings     1  2017     788
11  Riders     2  2017     690
Groupby in Pandas (Cont.)
 Example : Aggregating groups
dfGroupAgg.py
1 dfSales = pd.read_csv('SalesDataSet.csv')
2 print(dfSales.groupby(['YEAR_ID']).count()['QUANTITYORDERED'])
3 print(dfSales.groupby(['YEAR_ID']).sum()['QUANTITYORDERED'])
4 print(dfSales.groupby(['YEAR_ID']).mean()['QUANTITYORDERED'])
Output
YEAR_ID
2003    1000
2004    1345
2005     478
Name: QUANTITYORDERED, dtype: int64
YEAR_ID
2003    34612
2004    46824
2005    17631
Name: QUANTITYORDERED, dtype: int64
YEAR_ID
2003    34.612000
2004    34.813383
2005    36.884937
Name: QUANTITYORDERED, dtype: float64
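 Several aggregations can also be computed in one go with the agg() method; a small sketch on an inline DataFrame (the column names here are made up for illustration, not taken from SalesDataSet.csv):

```python
import pandas as pd

df = pd.DataFrame({'Year': [2003, 2003, 2004, 2004],
                   'Qty':  [10, 20, 30, 40]})

# count, sum and mean of Qty per Year in a single call
summary = df.groupby('Year')['Qty'].agg(['count', 'sum', 'mean'])
print(summary)
#       count  sum  mean
# Year
# 2003      2   30  15.0
# 2004      2   70  35.0
```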
Groupby in Pandas (Cont.)
 Example : Describe details
dfGroupDesc.py
1 dfIPL = pd.read_csv('IPLDataSet.csv')
2 print(dfIPL.groupby('Year').describe()['Points'])
Output
      count    mean        std    min    25%    50%     75%    max
Year
2014    4.0  795.25  87.439026  701.0  731.0  802.0  866.25  876.0
2015    4.0  769.50  65.035888  673.0  760.0  796.5  806.00  812.0
2016    2.0  725.00  43.840620  694.0  709.5  725.0  740.50  756.0
2017    2.0  739.00  69.296465  690.0  714.5  739.0  763.50  788.0
Concatenation in Pandas
 Concatenation basically glues together DataFrames.
 Keep in mind that dimensions should match along the axis you are concatenating on.
 You can use pd.concat and pass in a list of DataFrames to concatenate together:
dfConcat.py
1 dfCX = pd.read_csv('CX_Marks.csv',index_col=0)
2 dfCY = pd.read_csv('CY_Marks.csv',index_col=0)
3 dfCZ = pd.read_csv('CZ_Marks.csv',index_col=0)
4 dfAllStudent = pd.concat([dfCX,dfCY,dfCZ])
5 print(dfAllStudent)
Output
     PDS  Algo  SE
101   50    55  60
102   70    80  61
103   55    89  70
104   58    96  85
201   77    96  63
202   44    78  32
203   55    85  21
204   69    66  54
301   11    75  88
302   22    48  77
303   33    59  68
304   44    55  62
 Note : We can use the axis=1 parameter to concat columns.
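 A small sketch of column-wise concatenation with axis=1, using two inline DataFrames (hypothetical marks, not the CX/CY/CZ files); the rows are aligned on the index:

```python
import pandas as pd

df1 = pd.DataFrame({'PDS': [50, 70]}, index=[101, 102])
df2 = pd.DataFrame({'INS': [60, 61]}, index=[101, 102])

# axis=1 glues the columns side by side instead of stacking rows
wide = pd.concat([df1, df2], axis=1)
print(wide)
#      PDS  INS
# 101   50   60
# 102   70   61
```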
Join in Pandas
 The df.join() method will efficiently join multiple DataFrame objects by index (or by a specified column).
 Some of the important parameters :
 dfOther : the right DataFrame
 on (not recommended) : specify the column on which we want to join (default is the index)
 how : how to handle the operation of the two objects.
 left : use the calling frame’s index (default).
 right : use dfOther’s index.
 outer : form the union of the calling frame’s index (or column if on is specified) with the other’s index, and sort it lexicographically.
 inner : form the intersection of the calling frame’s index (or column if on is specified) with the other’s index, preserving the order of the calling frame’s index.
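 The effect of the how parameter can be sketched with two tiny inline frames (the values are made up for illustration):

```python
import pandas as pd

left = pd.DataFrame({'PDS': [50, 70]}, index=[101, 102])
right = pd.DataFrame({'INS': [60, 62]}, index=[102, 103])

print(left.join(right))               # left (default): keeps 101, 102; INS is NaN for 101
print(left.join(right, how='inner'))  # intersection: only 102
print(left.join(right, how='outer'))  # union: 101, 102, 103 (sorted)
```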
Join in Pandas (Example)
dfJoin.py
1 dfINS = pd.read_csv('INS_Marks.csv',index_col=0)
2 dfLeftJoin = dfAllStudent.join(dfINS)
3 print(dfLeftJoin)
4 dfRightJoin = dfAllStudent.join(dfINS,how='right')
5 print(dfRightJoin)
Output - 1 (left join)
     PDS  Algo  SE   INS
101   50    55  60  55.0
102   70    80  61  66.0
103   55    89  70  77.0
104   58    96  85  88.0
201   77    96  63  66.0
202   44    78  32   NaN
203   55    85  21  78.0
204   69    66  54  85.0
301   11    75  88  11.0
302   22    48  77  22.0
303   33    59  68  33.0
304   44    55  62  44.0
Output - 2 (right join)
     PDS  Algo  SE  INS
301   11    75  88   11
302   22    48  77   22
303   33    59  68   33
304   44    55  62   44
101   50    55  60   55
102   70    80  61   66
103   55    89  70   77
104   58    96  85   88
201   77    96  63   66
203   55    85  21   78
204   69    66  54   85
Merge in Pandas
 Merge DataFrame or named Series objects with a database-style join.
 Similar to join method, but used when we want to join/merge with the columns instead of index.
 Some of the important parameters :
 dfOther : the right DataFrame
 on : specify the column on which we want to join (default is the columns common to both frames)
 left_on : specify the column of the left DataFrame
 right_on : specify the column of the right DataFrame
 how : how to handle the operation of the two objects (note that, unlike join, the default for merge is inner).
 left : use keys from the left frame only.
 right : use keys from the right frame only.
 outer : use the union of keys from both frames, sorted lexicographically.
 inner : use the intersection of keys from both frames, preserving the order of the left keys.
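 When the key column is named differently in the two frames, left_on/right_on can be used; a minimal sketch with made-up inline frames (these are not the Merge1/Merge2 files used below):

```python
import pandas as pd

students = pd.DataFrame({'EnNo': [1, 2], 'Name': ['Abc', 'Xyz']})
marks = pd.DataFrame({'Enrollment': [1, 2], 'PDS': [50, 60]})

# the join key has a different name on each side
merged = students.merge(marks, left_on='EnNo', right_on='Enrollment')
print(merged)
#    EnNo Name  Enrollment  PDS
# 0     1  Abc           1   50
# 1     2  Xyz           2   60
```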
Merge in Pandas (Example)
dfMerge.py
1 m1 = pd.read_csv('Merge1.csv')
2 print(m1)
3 m2 = pd.read_csv('Merge2.csv')
4 print(m2)
5 m3 = m1.merge(m2,on='EnNo')
6 print(m3)
Output
   RollNo      EnNo Name
0     101  11112222  Abc
1     102  11113333  Xyz
2     103  22224444  Def

       EnNo  PDS  INS
0  11112222   50   60
1  11113333   60   70

   RollNo      EnNo Name  PDS  INS
0     101  11112222  Abc   50   60
1     102  11113333  Xyz   60   70
Read CSV in Pandas
 read_csv() is used to read a Comma Separated Values (CSV) file into a pandas DataFrame.
 Some of the important parameters :
 filePath : str, path object, or file-like object
 sep : separator (default is comma)
 header : row number(s) to use as the column names.
 index_col : index column(s) of the data frame.
readCSV.py
1 dfINS = pd.read_csv('Marks.csv',index_col=0,header=0)
2 print(dfINS)
Output
     PDS  Algo  SE   INS
101   50    55  60  55.0
102   70    80  61  66.0
103   55    89  70  77.0
104   58    96  85  88.0
201   77    96  63  66.0
Read Excel in Pandas
 Read an Excel file into a pandas DataFrame.
 Supports xls, xlsx, xlsm, xlsb, odf, ods and odt file extensions read from a local filesystem or
URL. Supports an option to read a single sheet or a list of sheets.
 Some of the important parameters :
 excelFile : str, bytes, ExcelFile, xlrd.Book, path object, or file-like object
 sheet_name : sheet number (integer) or the name of the sheet; can also be a list of sheets.
 index_col : index column of the data frame.
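 A minimal sketch (the file name Marks.xlsx and the sheet name SemV are hypothetical; the example first writes a tiny workbook so that it is self-contained — reading/writing .xlsx files requires the openpyxl engine):

```python
import pandas as pd

# Build a tiny workbook first (hypothetical data, just for the demo)
pd.DataFrame({'RollNo': [101, 102], 'PDS': [50, 70]}) \
  .to_excel('Marks.xlsx', sheet_name='SemV', index=False)

# sheet_name accepts an integer position, a sheet name, or a list of sheets
df = pd.read_excel('Marks.xlsx', sheet_name='SemV', index_col=0)
print(df)
#         PDS
# RollNo
# 101      50
# 102      70
```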
Read from MySQL Database
 We need two libraries for that,
 conda install sqlalchemy
 conda install pymysql
 After installing both the libraries, import create_engine from sqlalchemy and import
pymysql
importsForDB.py
1 from sqlalchemy import create_engine
2 import pymysql
 Then, create a database connection string and create engine using it.
createEngine.py
1 db_connection_str = 'mysql+pymysql://username:password@host/dbname'
2 db_connection = create_engine(db_connection_str)
Read from MySQL Database (Cont.)
 After getting the engine, we can fire any sql query using pd.read_sql method.
 read_sql is a generic method which can be used to read from any SQL database (MySQL, MSSQL, Oracle, etc.)
readSQLDemo.py
1 df = pd.read_sql('SELECT * FROM cities', con=db_connection)
2 print(df)
Output
CityID CityName CityDescription CityCode
0 1 Rajkot Rajkot Description here RJT
1 2 Ahemdabad Ahemdabad Description here ADI
2 3 Surat Surat Description here SRT
Web Scraping using Beautiful Soup
 Beautiful Soup is a library that makes it easy to scrape information from web pages.
 It sits atop an HTML or XML parser, providing Pythonic idioms for iterating, searching, and modifying the parse tree.
webScrap.py
1 import requests
2 import bs4
3 req = requests.get('https://www.darshan.ac.in/DIET/CE/Faculty')
4 soup = bs4.BeautifulSoup(req.text,'lxml')
5 allFaculty = soup.select('body > main > section:nth-child(5) > div > div > div.col-lg-8.col-xl-9 > div > div')
6 for fac in allFaculty :
7     allSpans = fac.select('h2>a')
8     print(allSpans[0].text.strip())
Output
Dr. Gopi Sanghani
Dr. Nilesh Gambhava
Dr. Pradyumansinh Jadeja
Prof. Hardik Doshi
Prof. Maulik Trivedi
Prof. Dixita Kagathara
Prof. Firoz Sherasiya
Prof. Rupesh Vaishnav
Prof. Swati Sharma
Prof. Arjun Bala
Prof. Mayur Padia
…..